What is the efficient way of pulling data from S3: boto3, Athena, or the AWS command-line utilities?
Can someone please let me know the most efficient way of pulling data from S3? Basically, I want to pull out data for a given time range, apply some filters over the data (JSON), and store it in a DB. I am new to AWS, and after a little research I found that I can do it via the boto3 API, Athena queries, or the AWS CLI, but I need some advice on which one to go with.
amazon-web-services amazon-s3
asked Nov 26 '18 at 17:27 by chidori
One file or lots of files? How big are the files (and how many rows)? How often will you be doing it?
– John Rotenstein
Nov 26 '18 at 18:08
@JohnRotenstein The folders are named date-wise, with each directory containing around 10 compressed files. When extracted they come to around 300 MB each (~2 lakh, i.e. ~200,000, records). At least for now I am thinking of pulling the data once a day, maybe.
– chidori
Nov 27 '18 at 10:51
2 Answers
If you are looking for the simplest and most straightforward solution, I would recommend the AWS CLI. It's perfect for running commands to download a file, list a bucket, etc. from the command line or a shell script.
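For example, assuming the date-prefixed folder layout described in the comments (the bucket name here is hypothetical): `aws s3 ls s3://your-bucket/` lists the bucket's top-level prefixes, and `aws s3 sync s3://your-bucket/2018-11-26/ ./data/` pulls down one day's folder of files for local processing.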
If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature-rich, IMO, and much cleaner than running shell commands from your application.
If the application that is pulling the data is written in Python, then I definitely recommend boto3. Make sure to read up on the difference between a boto3 client and a resource.
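As a minimal sketch of the boto3 route, assuming the layout from the comments (gzip-compressed, line-delimited JSON files under date-named prefixes; the bucket name, prefix, and filter field are all hypothetical):

```python
import gzip
import json

import boto3

# Low-level client; boto3.resource("s3") is the higher-level alternative.
s3 = boto3.client("s3")

bucket = "your-bucket"   # hypothetical
prefix = "2018-11-26/"   # one day's folder

# Paginate in case a day's folder ever holds more than 1000 objects.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        text = gzip.decompress(body).decode("utf-8")
        for line in text.splitlines():
            record = json.loads(line)
            if record.get("status") == "ERROR":  # hypothetical filter
                pass  # insert the record into your DB here
```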
answered Nov 26 '18 at 17:45 by PrestonM
Some options:
Download and process: Launch a temporary EC2 instance, have a script download the files of interest (e.g. one day's files), and use a Python program to process the data. This gives you full control over what is happening.
Amazon S3 Select: A simple way to run SQL against a single object (it supports CSV and line-delimited JSON), but it only operates on one file at a time (see the sketch after this list).
Amazon Athena: Provides an SQL interface to query across multiple files, using Presto under the hood. Serverless and fast. Charged based on the amount of data scanned, so it is cheaper on compressed data.
Amazon EMR: A managed Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.
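Here is a minimal S3 Select sketch via boto3, assuming gzip-compressed, line-delimited JSON as described in the comments (bucket, key, and field names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="your-bucket",                # hypothetical
    Key="2018-11-26/part-0001.json.gz",  # hypothetical
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s WHERE s.status = 'ERROR'",
    InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"},
    OutputSerialization={"JSON": {}},
)

# The response payload is an event stream; matching records arrive in chunks.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```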
Based on your description (10 files of ~300 MB and ~200k records each, per day), I would recommend starting with Amazon Athena, since it provides a friendly SQL interface across many data files. Start by running queries against one file (this makes testing faster), and once you have the desired results, run them across all the data files.
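A minimal sketch of kicking off such a query from Python with boto3; it assumes an Athena table has already been defined over the JSON files (e.g. via CREATE EXTERNAL TABLE with the JSON SerDe), and the database, table, column, and output-location names are all hypothetical:

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT *
        FROM my_database.my_events   -- hypothetical table over the JSON files
        WHERE event_time BETWEEN timestamp '2018-11-26 00:00:00'
                             AND timestamp '2018-11-27 00:00:00'
    """,
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)

query_id = response["QueryExecutionId"]
# Poll athena.get_query_execution(QueryExecutionId=query_id) until the
# state is SUCCEEDED, then fetch rows with athena.get_query_results().
```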
answered Nov 27 '18 at 16:23 by John Rotenstein