Reading the data written to S3 by an Amazon Kinesis Firehose stream



I am writing records to a Kinesis Firehose stream that are eventually written to an S3 file by Amazon Kinesis Firehose.



My record object looks like:



ItemPurchase {
    String personId,
    String itemId
}


The data written to S3 looks like:



{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}


NO COMMA SEPARATION.



NO STARTING BRACKET, as in a JSON array:



[


NO ENDING BRACKET, as in a JSON array:



]


I want to read this data and get a list of ItemPurchase objects.



List<ItemPurchase> purchases = getPurchasesFromS3(IOUtils.toString(s3ObjectContent));


What is the correct way to read this data?
json amazon-s3 amazon-kinesis amazon-kinesis-firehose

asked Dec 26 '15 at 3:48 by learner_21, edited May 21 '18 at 5:42 by John Rotenstein

7 Answers


















5














I also had the same problem; here is how I solved it:

1. Replace "}{" with "}\n{".

2. Split lines by "\n".

    input_json_rdd.map(lambda x: re.sub("}{", "}\n{", x, flags=re.UNICODE)) \
                  .flatMap(lambda line: line.split("\n"))

A nested JSON object has several "}"s, so splitting the line by "}" doesn't solve the problem.






answered Feb 15 '17 at 22:03 by Xuehua Jiang (edited Feb 15 '17 at 22:09)


























I considered doing something like this, but I think that if one of the strings inside the JSON object happens to include a }{ then this technique will break. Maybe if you go through each character, toggle a boolean if you hit a " (to indicate either entering or leaving a string), count the levels of objects you are in (increment on seeing { outside a string, decrement on seeing } outside a string), then consider the end of the object as when your level counter hits 0 again.

– Krenair, Mar 9 '18 at 16:14

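Since the question asks for Java, here is a minimal Java sketch of the same replace-and-split idea, using Jackson for the per-line parsing and a hand-rolled, Jackson-friendly ItemPurchase POJO (public fields, no-arg constructor). It carries the same caveat as the comment above: it breaks if "}{" ever occurs inside a string value.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.ArrayList;
    import java.util.List;

    public class ReplaceAndSplit {
        // Jackson-friendly version of the question's record
        public static class ItemPurchase {
            public String personId;
            public String itemId;
        }

        public static void main(String[] args) throws Exception {
            String content = "{\"personId\":\"p-111\",\"itemId\":\"i-111\"}"
                           + "{\"personId\":\"p-222\",\"itemId\":\"i-222\"}";
            ObjectMapper mapper = new ObjectMapper();
            List<ItemPurchase> purchases = new ArrayList<>();
            // Insert a newline between records, then parse one record per line
            for (String line : content.replace("}{", "}\n{").split("\n")) {
                purchases.add(mapper.readValue(line, ItemPurchase.class));
            }
            System.out.println(purchases.size()); // 2
        }
    }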




















3














I've had the same issue.

It would have been better if AWS allowed us to set a delimiter, but we can do it on our own.

In my use case, I've been listening on a stream of tweets, and once I receive a new tweet I immediately put it to Firehose.

This, of course, resulted in a one-line file which could not be parsed.

So, to solve this, I concatenated the tweet's JSON with a \n. This, in turn, let me use packages that can output lines when reading stream contents, and parse the file easily.

Hope this helps you.






answered Jul 15 '16 at 22:35 by johni

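A minimal sketch of this write-time fix with the AWS SDK for Java v1; the stream name, and Jackson for serialization, are assumptions for illustration (ItemPurchase as in the earlier sketch):

    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
    import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
    import com.amazonaws.services.kinesisfirehose.model.Record;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class FirehoseWriter {
        private final AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();
        private final ObjectMapper mapper = new ObjectMapper();

        public void write(ItemPurchase purchase) throws Exception {
            // Append "\n" so each record lands on its own line in the S3 object
            String json = mapper.writeValueAsString(purchase) + "\n";
            firehose.putRecord(new PutRecordRequest()
                    .withDeliveryStreamName("item-purchases") // assumed stream name
                    .withRecord(new Record().withData(
                            ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8)))));
        }
    }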






























2














It boggles my mind that Amazon Firehose dumps JSON messages to S3 in this manner, and doesn't allow you to set a delimiter or anything.

Ultimately, the trick I found to deal with the problem was to process the text file using the JSON raw_decode method.

This will allow you to read a bunch of concatenated JSON records without any delimiters between them.

Python code:

    import json

    decoder = json.JSONDecoder()

    with open('giant_kinesis_s3_text_file_with_concatenated_json_blobs.txt', 'r') as content_file:
        content = content_file.read()

    content_length = len(content)
    decode_index = 0

    while decode_index < content_length:
        try:
            obj, decode_index = decoder.raw_decode(content, decode_index)
            print("File index:", decode_index)
            print(obj)
        except json.JSONDecodeError as e:
            print("JSONDecodeError:", e)
            # Scan forward and keep trying to decode
            decode_index += 1





answered Mar 21 '18 at 22:39 by Tom Chapin (edited Mar 21 '18 at 22:54)

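The question itself asks for Java; Jackson's MappingIterator gives the same incremental behaviour as raw_decode for concatenated root-level JSON values. A minimal sketch, assuming a Jackson-friendly ItemPurchase POJO as in the earlier sketch:

    import com.fasterxml.jackson.databind.MappingIterator;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.List;

    public class ConcatenatedJsonReader {
        public static List<ItemPurchase> getPurchasesFromS3(String content) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // readValues() iterates over a sequence of concatenated root-level
            // JSON values, advancing through the input much like raw_decode above
            MappingIterator<ItemPurchase> it = mapper.readerFor(ItemPurchase.class).readValues(content);
            return it.readAll();
        }
    }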
































2














If the input source for the Firehose is an Analytics application, this concatenated JSON without a delimiter is a known issue, as cited here. You should have a Lambda function, as here, that outputs JSON objects on multiple lines.






answered Nov 26 '18 at 19:35 by user2661738 (edited Nov 26 '18 at 20:10)

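A minimal Java sketch of such a transformation Lambda, following the documented Firehose data-transformation contract (base64-encoded records in; records with recordId, result, and data out). The POJO shapes here are hand-rolled for illustration:

    import java.util.ArrayList;
    import java.util.Base64;
    import java.util.List;

    // Hand-rolled shapes mirroring the Firehose transformation event/response JSON;
    // Lambda's default POJO serialization can populate public fields like these.
    public class NewlineTransformer {
        public static class FirehoseRecord { public String recordId; public String data; }
        public static class Event { public List<FirehoseRecord> records; }
        public static class ResultRecord { public String recordId; public String result; public String data; }
        public static class Response { public List<ResultRecord> records; }

        public Response handleRequest(Event event) {
            Response response = new Response();
            response.records = new ArrayList<>();
            for (FirehoseRecord rec : event.records) {
                String json = new String(Base64.getDecoder().decode(rec.data));
                ResultRecord out = new ResultRecord();
                out.recordId = rec.recordId;
                out.result = "Ok"; // other options: Dropped, ProcessingFailed
                // Re-encode with a trailing newline so records land line-delimited in S3
                out.data = Base64.getEncoder().encodeToString((json + "\n").getBytes());
                response.records.add(out);
            }
            return response;
        }
    }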
































0














If there's a way to change the way the data is written, separate all the records with a newline. That way you can read the data simply, line by line. If not, then build a Scanner object which takes "}" as the delimiter and use the scanner to read. That would do the job.






answered May 19 '16 at 8:41 by psychorama

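A minimal Java sketch of the Scanner idea (ItemPurchase as in the earlier sketch). It works for flat records like ItemPurchase, but, as the comment on the first answer notes, nested objects contain inner "}" characters and would break it:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    public class ScannerReader {
        public static List<ItemPurchase> read(String content) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            List<ItemPurchase> purchases = new ArrayList<>();
            Scanner scanner = new Scanner(content).useDelimiter("\\}");
            while (scanner.hasNext()) {
                // Scanner consumes the delimiter, so restore the closing brace before parsing
                purchases.add(mapper.readValue(scanner.next() + "}", ItemPurchase.class));
            }
            return purchases;
        }
    }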






























0














I think the best way to tackle this is to first create a properly formatted JSON file containing well-separated JSON objects. In my case, I added ',' to the events that were pushed into the Firehose. Then, after a file is saved in S3, all the files will contain JSON objects separated by some delimiter (a comma, in our case). Another thing that must be added are '[' and ']' at the beginning and end of the file. Then you have a proper JSON file containing multiple JSON objects, and parsing them becomes possible.






answered Jan 29 '18 at 15:47 by sanjiv upadhyaya

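On the read side, this then becomes a single parse. A minimal Jackson sketch, assuming the records were already comma-separated at write time as described (ItemPurchase as in the earlier sketch):

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.List;

    public class ArrayWrapReader {
        public static List<ItemPurchase> read(String commaSeparatedContent) throws Exception {
            // Wrap the already comma-separated objects in brackets to form a valid JSON array
            String array = "[" + commaSeparatedContent + "]";
            return new ObjectMapper().readValue(array, new TypeReference<List<ItemPurchase>>() {});
        }
    }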






























0














You can find each valid JSON object by counting the braces. Assuming the file starts with a {, this Python snippet should work:

    import json

    def read_block(stream):
        open_brackets = 0
        block = ''
        while True:
            c = stream.read(1)
            if not c:
                break

            if c == '{':
                open_brackets += 1
            elif c == '}':
                open_brackets -= 1

            block += c

            if open_brackets == 0:
                yield block
                block = ''


    if __name__ == "__main__":
        with open('firehose_json_blob', 'r') as f:
            for block in read_block(f):
                record = json.loads(block)
                print(record)





answered Aug 15 '18 at 14:45 by Rafael Barbosa