What is the efficient way of pulling data from S3 among boto3, Athena and AWS command line utils?

Can someone please let me know the efficient way of pulling data from S3? Basically, I want to pull out data for a given time range, apply some filters over the data (JSON), and store it in a DB. I am new to AWS, and after a little research I found that I can do it via the boto3 API, Athena queries, or the AWS CLI, but I need some advice on which one to go with.

amazon-web-services amazon-s3

asked Nov 26 '18 at 17:27
chidori

  • One file or lots of files? How big are the files (and how many rows)? How often will you be doing it?

    – John Rotenstein
    Nov 26 '18 at 18:08

  • @JohnRotenstein The folders are named date-wise, with each directory having around 10 files, which are compressed. When extracted they come to around 300 MB each (~2 lakh records). At least for now I am thinking of pulling the data once a day, maybe.

    – chidori
    Nov 27 '18 at 10:51

2 Answers
If you are looking for the simplest and most straightforward solution, I would recommend the AWS CLI. It's perfect for running commands to download a file, list a bucket, etc. from the command line or a shell script.

If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature-rich IMO and much cleaner than running shell commands in your application.

If the application that is pulling the data is written in Python, then I definitely recommend boto3. Make sure to read up on the difference between a boto3 client and a resource.
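
To make the boto3 suggestion concrete, here is a minimal sketch that lists one day's prefix, downloads the compressed objects, and filters the JSON records in memory. The bucket name, the date-based prefix, the gzip/JSON-lines layout, and the "status" filter field are illustrative assumptions, not details from the question.

    # Minimal sketch, assuming objects laid out as s3://my-bucket/2018-11-26/<file>.gz
    # with one JSON record per line. The equivalent bulk download with the AWS CLI
    # would be: aws s3 cp s3://my-bucket/2018-11-26/ ./data/ --recursive
    import gzip
    import json

    import boto3

    s3 = boto3.client("s3")

    def records_for_day(bucket, day_prefix):
        """Yield parsed JSON records from every object under one day's prefix."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=day_prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                for line in gzip.decompress(body).splitlines():
                    yield json.loads(line)

    # Apply whatever filter you need, then hand the rows to your DB loader.
    matching = [r for r in records_for_day("my-bucket", "2018-11-26/")
                if r.get("status") == "ERROR"]  # hypothetical filter field

For a daily batch of roughly 10 compressed files this keeps everything in one Python process; the client interface used here is the lower-level option, while the resource interface wraps the same operations in an object style.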






answered Nov 26 '18 at 17:45
PrestonM

Some options:

• Download and process: Launch a temporary EC2 instance, have a script download the files of interest (e.g. one day's files), and use a Python program to process the data. This gives you full control over what is happening.

• Amazon S3 Select: This is a simple way to extract data from CSV files, but it only operates on a single file at a time.

• Amazon Athena: Provides an SQL interface to query across multiple files using Presto. Serverless, fast. Charged based on the amount of data read from disk (so it is cheaper on compressed data).

• Amazon EMR: A Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.

Based on your description (10 files, 300 MB, 200k records) I would recommend starting with Amazon Athena, since it provides a friendly SQL interface across many data files. Start by running queries across one file (this makes testing faster) and, once you have the desired results, run them across all the data files.
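
To make the Athena suggestion concrete, here is a rough sketch of driving a query from Python with boto3. It assumes an external table has already been created over the JSON files (for example with the org.openx.data.jsonserde.JsonSerDe); the database, table, column names, filter values, and results bucket are hypothetical placeholders.

    # Rough sketch: run an Athena query for one day's data and page through the rows.
    # "mydb", "logs", "event_time", "status" and the results bucket are placeholders.
    import time

    import boto3

    athena = boto3.client("athena")

    QUERY = """
    SELECT *
    FROM logs
    WHERE event_time BETWEEN timestamp '2018-11-26 00:00:00'
                         AND timestamp '2018-11-26 23:59:59'
      AND status = 'ERROR'
    """

    start = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "mydb"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = start["QueryExecutionId"]

    # Poll until the query finishes.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        # The first row of the first page is the column headers.
        pages = athena.get_paginator("get_query_results").paginate(QueryExecutionId=query_id)
        for page in pages:
            for row in page["ResultSet"]["Rows"]:
                print([col.get("VarCharValue") for col in row["Data"]])

Athena also writes the full result set as a CSV file to the OutputLocation bucket, so for larger extracts you can skip paging through the API and load that file into your database instead.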






answered Nov 27 '18 at 16:23
John Rotenstein