What is the efficient way of pulling data from S3 among boto3, Athena and AWS command line utils?

Can someone please let me know the efficient way of pulling data from S3? Basically, I want to pull out data for a given time range, apply some filters over the data (JSON), and store it in a DB. I am new to AWS, and after a little research I found that I can do it via the boto3 API, Athena queries, or the AWS CLI, but I need some advice on which one to go with.

amazon-web-services amazon-s3

asked Nov 26 '18 at 17:27
chidori

  • One file or lots of files? How big are the files (and how many rows)? How often will you be doing it?

    – John Rotenstein
    Nov 26 '18 at 18:08

  • @JohnRotenstein The folders are named date-wise, with each directory having around 10 files, which are compressed. When extracted they come to around 300 MB each (~2 lakh records). At least for now I am thinking of pulling the data once a day, maybe.

    – chidori
    Nov 27 '18 at 10:51

2 Answers
If you are looking for the simplest and most straightforward solution, I would recommend the AWS CLI. It's perfect for running commands to download a file, list a bucket, etc. from the command line or a shell script.

If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature-rich IMO and much cleaner than running shell commands in your application.

If the application that is pulling the data is written in Python, then I definitely recommend boto3. Make sure to read up on the difference between a boto3 client and a resource.
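
To make the boto3 suggestion concrete, here is a minimal sketch that lists one day's prefix, downloads the compressed objects, and filters the JSON records in memory. The bucket name, the date-based prefix, the gzip/JSON-lines layout, and the "status" filter field are illustrative assumptions, not details from the question.

    # Minimal sketch, assuming objects laid out as s3://my-bucket/2018-11-26/<file>.gz
    # with one JSON record per line. The equivalent bulk download with the AWS CLI
    # would be: aws s3 cp s3://my-bucket/2018-11-26/ ./data/ --recursive
    import gzip
    import json

    import boto3

    s3 = boto3.client("s3")

    def records_for_day(bucket, day_prefix):
        """Yield parsed JSON records from every object under one day's prefix."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=day_prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                for line in gzip.decompress(body).splitlines():
                    yield json.loads(line)

    # Apply whatever filter you need, then hand the rows to your DB loader.
    matching = [r for r in records_for_day("my-bucket", "2018-11-26/")
                if r.get("status") == "ERROR"]  # hypothetical filter field

For a daily batch of roughly 10 compressed files this keeps everything in one Python process; the client interface used here is the lower-level option, while the resource interface wraps the same operations in an object style.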






answered Nov 26 '18 at 17:45
PrestonM

Some options:

• Download and process: Launch a temporary EC2 instance, have a script download the files of interest (e.g. one day's files), and use a Python program to process the data. This gives you full control over what is happening.

• Amazon S3 Select: This is a simple way to extract data from CSV files, but it only operates on a single file at a time.

• Amazon Athena: Provides an SQL interface to query across multiple files using Presto. Serverless, fast. Charged based on the amount of data read from disk (so it is cheaper on compressed data).

• Amazon EMR: A Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.

Based on your description (10 files, 300 MB, 200k records) I would recommend starting with Amazon Athena, since it provides a friendly SQL interface across many data files. Start by running queries across one file (this makes testing faster) and, once you have the desired results, run them across all the data files.
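
To make the Athena suggestion concrete, here is a rough sketch of driving a query from Python with boto3. It assumes an external table has already been created over the JSON files (for example with the org.openx.data.jsonserde.JsonSerDe); the database, table, column names, filter values, and results bucket are hypothetical placeholders.

    # Rough sketch: run an Athena query for one day's data and page through the rows.
    # "mydb", "logs", "event_time", "status" and the results bucket are placeholders.
    import time

    import boto3

    athena = boto3.client("athena")

    QUERY = """
    SELECT *
    FROM logs
    WHERE event_time BETWEEN timestamp '2018-11-26 00:00:00'
                         AND timestamp '2018-11-26 23:59:59'
      AND status = 'ERROR'
    """

    start = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "mydb"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = start["QueryExecutionId"]

    # Poll until the query finishes.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        # The first row of the first page is the column headers.
        pages = athena.get_paginator("get_query_results").paginate(QueryExecutionId=query_id)
        for page in pages:
            for row in page["ResultSet"]["Rows"]:
                print([col.get("VarCharValue") for col in row["Data"]])

Athena also writes the full result set as a CSV file to the OutputLocation bucket, so for larger extracts you can skip paging through the API and load that file into your database instead.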






answered Nov 27 '18 at 16:23
John Rotenstein