How to read a large file block by block, branching on each block's header?












I have a large file that I want to read block by block, matching on the headers.
For example, the file looks like this:



@header1
a b c 1 2 3
c d e 2 3 4
q w e 3 4 5


@header2
e 89 78 56
s 68 77 26
...


I wrote a script like this:



with open("filename") as f:
    line = f.readline()
    if line.split()[0] == "@header1":
        list1.append(f.readline().split()[0])
        list2.append(f.readline().split()[1])
        ...
    elif line.split()[0] == "@header2":
        list6.append(f.readline().split()[0])
        list7.append(f.readline().split()[1])
        ...


But it seems to only read the first header and never reads in the second block. Also, there are some empty lines in between the blocks. How can I read a block when a line matches certain strings, and skip those empty lines?



I know that in C this would be a switch statement. How do I do something similar in Python?










asked Nov 28 '18 at 0:20 by fish_bu (edited Nov 28 '18 at 0:50 by martineau)
  • You need to add more details. Are these multiple space-separated file segments inside one file? Are the @header... lines guaranteed to be numbered sequentially and contiguously? If @header1 occurs all on its own, why do you test line.split()[0]=="@header2" rather than simply line == "@header2"? Or just line.startswith('@header'), which should capture them all and doesn't even need a regex? – smci, Nov 28 '18 at 0:30













  • Ultimately I expect you want to read the space-separated row contents (within each section, according to its header), so you'll want to wrap a reader object. Or write a generator to yield each chunk of rows separately, so you can then pass it into a reader object. – smci, Nov 28 '18 at 1:17











  • "Also, there are some empty lines in between those blocks." So you're guaranteed that empty lines can only occur outside sections, not inside them? – smci, Nov 28 '18 at 1:18
















4 Answers































IMO, your misconception is about how CSV-like files can be read. At least I doubt that C's switch would help here any more than what can be done with if clauses.



However, please understand that you have to iterate through your file line by line. That is, there is nothing that can deal with whole blocks if you do not know their length beforehand.



So your algorithm is something like:



for every line in the file:
    is it a header?
        then prepare for this specific header
    is it an empty line?
        then skip it
    is it data?
        then append it according to the preparation above



In code this could be something like:



block_ctr = -1
block_data = []
with open(filename) as f:
    for line in f:
        if line.strip():  # skip empty lines (a bare '\n' would otherwise count as non-empty)
            if line.startswith('@header'):
                block_ctr += 1
                block_data.append([])  # start a new, empty block
            else:
                block_data[block_ctr].append(line.split())
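
For the sample file in the question, this leaves block_data as a list of blocks, each block being a list of split rows. A minimal sketch of inspecting the result (assuming the loop above has already run on that sample file):

# Hypothetical check of the structure built above:
# block_data[0] holds the rows under @header1, block_data[1] those under @header2.
print(len(block_data))   # -> 2
print(block_data[0][0])  # -> ['a', 'b', 'c', '1', '2', '3']
print(block_data[1][1])  # -> ['s', '68', '77', '26']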





answered Nov 28 '18 at 0:34 by SpghttCd (edited Nov 28 '18 at 7:48)
  • It lends itself to a generator approach, see my answer. – smci, Nov 28 '18 at 13:08

































I don't know what you want to achieve exactly, but maybe something like this:



with open(filename) as f:
    for line in f:
        if line.startswith('@'):
            print('header')
            # do something with the header here
        else:
            print('regular line')
            # do something with the line here
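
Building on this skeleton, a sketch of the switch-like dispatch the question asks about could key a dict by the header text (the literal file name "filename" and the blocks dict are illustration only, not part of the original answer):

blocks = {}        # header text -> list of split data rows
current = None
with open("filename") as f:
    for line in f:
        line = line.strip()
        if not line:                 # skip the empty lines between blocks
            continue
        if line.startswith('@'):
            current = line           # e.g. '@header1'
            blocks.setdefault(current, [])
        elif current is not None:
            blocks[current].append(line.split())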





answered Nov 28 '18 at 0:33 by Akarius














Attached at bottom is a solution using a Python generator split_into_chunks(f) to extract each section (as a list of strings), squelch empty lines, and detect missing @headers and EOF. The generator approach is really neat because it allows you to further wrap e.g. a CSV reader object which handles space-separated values (e.g. pandas read_csv):



with open('your.ssv') as f:
    for chunk in split_into_chunks(f):
        # Do stuff on chunk. Presumably, wrap a reader e.g. pandas read_csv
        print(chunk)


Code is below. I also parameterized the value demarcator='@header' for you. Note that we have to iterate with line = inputstream.readline() inside while line, instead of the usual for line in f, since when we see the @header of the next section we need to push it back with seek()/tell(); see this and this for an explanation why. And if you want to modify the generator to yield the chunk header and body separately (e.g. as a list of two items), that's trivial.



def split_into_chunks(inputstream, demarcator='@header'):
    """Utility generator to get sections from a file, demarcated by '@header'"""

    while True:
        chunk = []

        line = inputstream.readline()
        # At EOF?
        if not line:
            break
        # Expect that each chunk starts with one header line
        if not line.startswith(demarcator):
            raise RuntimeError(f"Bad chunk, missing {demarcator}")

        chunk.append(line.rstrip('\n'))

        # Can't use `for line in inputstream:` since we may need to push back
        while line:
            # Remember our file-pointer position in case we need to push back a header row
            last_pos = inputstream.tell()
            line = inputstream.readline()

            # Saw the next chunk's header line? Push it back, then yield the current chunk
            if line.startswith(demarcator):
                inputstream.seek(last_pos)
                break

            # Ignore blank or whitespace-only lines
            if line.strip():
                chunk.append(line.rstrip('\n'))

        yield chunk


with open('your.ssv') as f:
    for chunk in split_into_chunks(f):
        # Do stuff on chunk. Presumably, wrap it with a reader which handles space-separated values, e.g. pandas read_csv
        print(chunk)
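
As a follow-up, one way to feed each chunk's rows to a whitespace-aware reader, as suggested above, is sketched below (it assumes pandas is installed and that chunk[0] is the header line while the remaining items are data rows):

import io
import pandas as pd

with open('your.ssv') as f:
    for chunk in split_into_chunks(f):
        header, body = chunk[0], chunk[1:]
        # Parse the block's rows as whitespace-separated values
        df = pd.read_csv(io.StringIO('\n'.join(body)),
                         delim_whitespace=True, header=None)
        print(header)
        print(df)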





answered Nov 28 '18 at 12:46 by smci














I saw another post similar to this question and copied the idea here. I agree that SpghttCd's answer is right, although I have not tried it.



with open(filename) as f:
    # find each line number that contains a header
    for i, line in enumerate(f, 1):
        if 'some_header' in line:
            index1 = i
        elif 'another_header' in line:
            index2 = i
        ...

with open(filename) as f:
    # read the first block:
    for i in range(int(index1)):
        line = f.readline()
    for i in range('the block size'):
        'read, split and store'
    f.seek(0)
    # read the second block, third and ...
    ...
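
A runnable sketch of this two-pass idea might look as follows (the file name, the @header prefix, and the blocks dict are assumptions for illustration, not part of the original answer):

import itertools

filename = "filename"                # hypothetical input file

# Pass 1: record the (1-based) line number of each header
header_lines = {}
with open(filename) as f:
    for i, line in enumerate(f, 1):
        if line.startswith('@header'):
            header_lines[line.strip()] = i

# Pass 2: re-read the file once per block, skipping straight past its header
blocks = {}
for header, lineno in header_lines.items():
    rows = []
    with open(filename) as f:
        for line in itertools.islice(f, lineno, None):   # lines after the header
            if line.startswith('@header'):
                break                                    # next block starts
            if line.strip():                             # skip blank separator lines
                rows.append(line.split())
    blocks[header] = rows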





answered Nov 29 '18 at 19:53 by fish_bu