How to parse very big files in Python?












1














I have a very big TSV file (1.5 GB) that I want to parse. I'm using the following function:



def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = {}
    for row in eval:
        ids = row.split("\t")
        if ids[0] not in evalIDs.keys():
            evalIDs[ids[0]] = []
        evalIDs[ids[0]].append(ids[1])
    eval.close()

    return evalIDs


It has been running for more than 10 hours and it is still going. I don't know how to speed this step up, or whether there is another way to parse such a file.










python bigdata














edited Nov 23 at 6:48 – Erik Šťastný
asked Nov 23 at 6:31 – bib




  • Maybe duplicated with stackoverflow.com/questions/17444679/reading-a-huge-csv-file
    – Guillaume Jacquenot
    Nov 23 at 6:36










  • what operating system?
    – Patrick Artner
    Nov 23 at 6:39










  • I'm working on a Linux server with 130 GB of RAM.
    – bib
    Nov 23 at 6:41










  • @GuillaumeJacquenot nope, not at all. OP isn't reading the lines first.
    – Jean-François Fabre
    Nov 23 at 7:01










  • @bib Stepping back a little, what do you plan to use dictionary eval for?
    – Noufal Ibrahim
    Nov 23 at 7:59



















4 Answers


















0














Maybe you can make it somewhat faster; change this:



if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])


to



evalIDs.setdefault(ids[0], []).append(ids[1])


The first solution searches the "evalIDs" dictionary three times; setdefault needs only one lookup.
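
For reference, a minimal sketch of the whole function with this change applied (assuming, as in the question, a plain two-column TSV; the rstrip call is an extra detail not in the original code):

def readEvalFileAsDictInverse(evalFile):
    evalIDs = {}
    with open(evalFile, "r") as handle:
        for row in handle:
            ids = row.rstrip("\n").split("\t")
            # setdefault returns the existing list for this key,
            # or inserts a new empty list and returns it
            evalIDs.setdefault(ids[0], []).append(ids[1])
    return evalIDs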






answered Nov 23 at 6:45 – kantal





















  • setdefault is slower than a defaultdict. timeit.timeit(lambda: d.setdefault('x', []).append(1)) reports 0.4583683079981711 and timeit.timeit(lambda: c['x'].append(1)) reports 0.28720847200020216, where d is {} and c is collections.defaultdict(list). All the other answers recommend it. Why have you selected this as the correct one? The solution here is inferior to the others mentioned.
    – Noufal Ibrahim
    Nov 26 at 4:45










  • I can't measure significant difference (Python 3.7.1), but the OP should measure it.
    – kantal
    Nov 26 at 7:21










  • The defaultdict is roughly twice as fast as the setdefault in my measurement (Python 3.5.3), and I think that's reasonable given how setdefault evaluates its arguments each time you call it (a new empty list is created each time).
    – Noufal Ibrahim
    Nov 26 at 9:13
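
For reference, a minimal, self-contained version of the benchmark discussed in the comments above (timings vary by machine and Python version):

import timeit
from collections import defaultdict

d = {}                 # plain dict used with setdefault
c = defaultdict(list)  # defaultdict(list)

# setdefault builds a fresh empty list on every call, even when the key already exists
t_setdefault = timeit.timeit(lambda: d.setdefault('x', []).append(1))
# defaultdict only creates a new list when the key is missing
t_defaultdict = timeit.timeit(lambda: c['x'].append(1))

print(t_setdefault, t_defaultdict)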



















3














several issues here:




  • testing for keys with if ids[0] not in evalIDs.keys() takes forever in Python 2, because keys() is a list. .keys() is rarely useful anyway. A better way is simply if ids[0] not in evalIDs, but, but...

  • why not use a collections.defaultdict instead?

  • why not use the csv module?

  • overriding the eval built-in (well, not really an issue seeing how dangerous it is)


my proposal:



import csv, collections

def readEvalFileAsDictInverse(evalFile):
    with open(evalFile, "r") as handle:
        evalIDs = collections.defaultdict(list)
        cr = csv.reader(handle, delimiter='\t')
        for ids in cr:
            evalIDs[ids[0]].append(ids[1])
    return evalIDs


the magic evalIDs[ids[0]].append(ids[1]) creates a list if one doesn't already exist. It's also portable and very fast whatever the Python version, and saves an if.



I don't think it could be faster with default libraries, but a pandas solution probably would.
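
Since the answer suggests pandas might be faster, here is a rough sketch of what that could look like (this assumes pandas is installed; it reads the first two tab-separated columns into a DataFrame and groups the second column by the first):

import pandas as pd

def readEvalFileAsDictInverse(evalFile):
    # read only the first two tab-separated columns, as strings, no header row
    df = pd.read_csv(evalFile, sep="\t", header=None, usecols=[0, 1], dtype=str)
    # group column 1 by column 0 and collect a list of values per key
    return df.groupby(0)[1].apply(list).to_dict()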






edited Nov 23 at 8:49, answered Nov 23 at 6:51 – Jean-François Fabre































    2














    Some suggestions:



    Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault().



    dict.setdefault() will create the default value every time, that's a time burner - defaultdict(list) does not - it is optimized:



from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = defaultdict(list)
    for row in eval:
        ids = row.split("\t")
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs




    If your keys are valid file names, you might want to investigate awk for much more performance than doing this in Python.



    Something along the lines of



    awk -F $'\t' '{print > $1}' file1


    will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here.) You would need to grab the created files with os.walk or similar means; a small sketch of that follows below. Each line inside the files will still be tab-separated and contain the ID in front.
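
    A minimal sketch of that collection step, assuming the awk command above wrote one file per key into a directory out_dir (the directory and function name here are made up for illustration), and that each line is still tab-separated with the ID in front:

import os

def load_awk_split_files(out_dir):
    data = {}
    for root, _dirs, names in os.walk(out_dir):
        for name in names:
            with open(os.path.join(root, name)) as handle:
                # keep the second tab-separated field of every line
                data[name] = [line.rstrip("\n").split("\t")[1] for line in handle]
    return data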





    If your keys are not valid filenames in their own right, consider storing the different lines in different files and only keeping a key-to-filename dictionary around.



    After splitting the data, load the files as lists again:



    Create testfile:



    with open ("file.txt","w") as w:

    w.write("""
    1ttatati
    2tyippti
    3turksti
    1tTTtatati
    2tYYyippti
    3tUUurksti
    1ttttttttatati
    2tyyyyyyyippti
    3tuuuuuuurksti

    """)


    Code:



# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename-characters, make it a valid name"""
    return k  # assuming k is a valid file name, else modify it

evalFile = "file.txt"
files = {}
with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")  # omit ,*rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))

        # this will open and close files _a lot_; you might want to keep file
        # handles in your dict instead - but that depends on the key/data/lines
        # ratio in your data - if you have few keys, file handles ought to be
        # better, if you have many it does not matter
        with open(fn, "a") as f:
            f.write(value + "\n")

# create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)


    Output:



    # for my data: loaded from files called '1', '2' and '3'
    {'1': ['tata', 'TTtata', 'tttttttata'],
    '2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
    '3': ['urks', 'UUurks', 'uuuuuuurks']}





    edited Nov 23 at 7:56, answered Nov 23 at 6:52 – Patrick Artner



















    • why create a defaultdict and test the keys?
      – Jean-François Fabre
      Nov 23 at 6:59












    • @Jean-FrançoisFabre copy & paste error - thanks for pointing out
      – Patrick Artner
      Nov 23 at 7:13



















    1















    1. Change evalIDs to a collections.defaultdict(list). You can avoid the if that checks whether a key is there.

    2. Consider splitting the file externally using split(1), or even inside Python using a read offset. Then use multiprocessing.Pool to parallelise the loading (a rough sketch of this idea follows below).
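
    A rough, self-contained sketch of the second suggestion, assuming the big file has already been split into several chunk files (e.g. with split(1)); the chunk names and function names are made up for illustration:

from collections import defaultdict
from multiprocessing import Pool

def parse_chunk(path):
    # parse one chunk file into a plain dict of lists
    chunk_ids = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            ids = line.rstrip("\n").split("\t")
            chunk_ids[ids[0]].append(ids[1])
    return dict(chunk_ids)

def parse_parallel(chunk_paths):
    evalIDs = defaultdict(list)
    with Pool() as pool:
        # each worker parses one chunk; results are merged in the parent process
        for partial in pool.map(parse_chunk, chunk_paths):
            for key, values in partial.items():
                evalIDs[key].extend(values)
    return evalIDs

# usage (hypothetical chunk names produced by: split -l 1000000 eval.tsv chunk_)
# result = parse_parallel(["chunk_aa", "chunk_ab", "chunk_ac"])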






    answered Nov 23 at 6:53 – Noufal Ibrahim





















    • can you please give more details? I don't have any idea about multiprocessing.Pool
      – bib
      Nov 23 at 6:56










    • parallelizing the loading won't do any good if I/O is the bottleneck
      – Jean-François Fabre
      Nov 23 at 6:57










    • If I/O is the bottleneck, then yes, it won't do much good but apart from the defaultdict which everyone has suggested, it's the only other thing I can think of worth trying.
      – Noufal Ibrahim
      Nov 23 at 6:59










    • @bib Are all the lines in the file of the same length?
      – Noufal Ibrahim
      Nov 23 at 7:01










