How to parse very big files in Python?
I have a very big TSV file (1.5 GB) that I want to parse. I'm using the following function:

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = {}
    for row in eval:
        ids = row.split("\t")
        if ids[0] not in evalIDs.keys():
            evalIDs[ids[0]] = []
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs

It has been running for more than 10 hours and it is still not finished. I don't know how to accelerate this step, or whether there is another method to parse such a file.
python bigdata
Maybe duplicated with stackoverflow.com/questions/17444679/reading-a-huge-csv-file
– Guillaume Jacquenot
Nov 23 at 6:36
what operating system?
– Patrick Artner
Nov 23 at 6:39
I'm working on a Linux server with 130 GB of RAM.
– bib
Nov 23 at 6:41
@GuillaumeJacquenot nope, not at all. OP isn't reading the lines first.
– Jean-François Fabre
Nov 23 at 7:01
@bib Stepping back a little, what do you plan to use dictionary eval for?
– Noufal Ibrahim
Nov 23 at 7:59
4 Answers
Maybe you can make it somewhat faster; change this:

if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])

to

evalIDs.setdefault(ids[0], []).append(ids[1])

The first solution searches the "evalIDs" dictionary 3 times.
setdefault is slower than a defaultdict. timeit.timeit(lambda: d.setdefault('x', []).append(1)) reports 0.4583683079981711 and timeit.timeit(lambda: c['x'].append(1)) reports 0.28720847200020216, where d is {} and c is collections.defaultdict(list). All the answers have recommended so. Why have you selected this as the correct one? The solution here is inferior to the others mentioned.
– Noufal Ibrahim
Nov 26 at 4:45
I can't measure significant difference (Python 3.7.1), but the OP should measure it.
– kantal
Nov 26 at 7:21
The defaultdict is roughly twice as fast as the setdefault in my measurement (Python 3.5.3), and I think that's reasonable given how setdefault evaluates its arguments each time you call it (a new empty list is created each time).
– Noufal Ibrahim
Nov 26 at 9:13
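For reference, the comparison described in the comments above can be reproduced with a small micro-benchmark along these lines; this is only a sketch, and the exact numbers will vary with machine and Python version:

import timeit
from collections import defaultdict

d = {}                   # plain dict, used with setdefault
c = defaultdict(list)    # defaultdict(list), no per-call default argument

# setdefault builds a fresh empty list argument on every call,
# even when the key already exists
t_setdefault = timeit.timeit(lambda: d.setdefault('x', []).append(1))
t_defaultdict = timeit.timeit(lambda: c['x'].append(1))

print("setdefault: ", t_setdefault)
print("defaultdict:", t_defaultdict)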
several issues here:

- testing for keys with if ids[0] not in evalIDs.keys() takes forever in python 2, because keys() is a list. .keys() is rarely useful anyway. A better way already is if ids[0] not in evalIDs, but, but...
- why not use a collections.defaultdict instead?
- why not use the csv module?
- overriding the eval built-in (well, not really an issue seeing how dangerous it is)
my proposal:
import csv, collections

def readEvalFileAsDictInverse(evalFile):
    with open(evalFile, "r") as handle:
        evalIDs = collections.defaultdict(list)
        cr = csv.reader(handle, delimiter='\t')
        for ids in cr:
            evalIDs[ids[0]].append(ids[1])
    return evalIDs
the magic evalIDs[ids[0]].append(ids[1]) creates a list if it doesn't already exist. It's also portable and very fast whatever the python version, and saves an if.
I don't think it could be faster with default libraries, but a pandas solution probably would.
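A pandas version of the same grouping might look roughly like the sketch below; the function name is made up here, and it assumes the file has at least two tab-separated columns and no header row:

import pandas as pd

def readEvalFileAsDictInverse_pandas(evalFile):
    # header=None: no header row, so the columns get integer labels 0, 1, ...
    df = pd.read_csv(evalFile, sep="\t", header=None, dtype=str)
    # group column 1 by column 0 and collect the values into lists
    return df.groupby(0)[1].apply(list).to_dict()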
Some suggestions:
Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault(). dict.setdefault() will create the default value every time, that's a time burner - defaultdict(list) does not - it is optimized:
from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = defaultdict(list)
    for row in eval:
        ids = row.split("\t")
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
If your keys are valid file names you might want to investigate awk for much more performance than doing this in python. Something along the lines of

awk -F $'\t' '{print > $1}' file1

will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here) - You would need to grab your created files with os.walk or similar means. Each line inside the files will still be tab-separated and contain the ID in front.

If your keys are not filenames in their own right, consider storing your different lines into different files and only keep a dictionary of key, filename around.
After splitting the data, load the files as lists again:
Create testfile:
with open ("file.txt","w") as w:
w.write("""
1ttatati
2tyippti
3turksti
1tTTtatati
2tYYyippti
3tUUurksti
1ttttttttatati
2tyyyyyyyippti
3tuuuuuuurksti
""")
Code:
# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename characters, make it a valid name"""
    return k  # assuming k is a valid file name, else modify it

evalFile = "file.txt"
files = {}

with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")  # omit ,*rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))
        # this will open and close files _a lot_ - you might want to keep file handles
        # instead in your dict - but that depends on the key/data/lines ratio in
        # your data - if you have few keys, file handles ought to be better, if you
        # have many it does not matter
        with open(fn, "a") as f:
            f.write(value + "\n")

# create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)
Output:
# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'],
'2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
'3': ['urks', 'UUurks', 'uuuuuuurks']}
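As the comment in the code above hints, the repeated open/close in append mode can be avoided by keeping one handle per key open. A rough sketch of that variant, reusing make_filename from above and only sensible when the number of distinct keys is modest:

from contextlib import ExitStack

evalFile = "file.txt"
handles = {}

with ExitStack() as stack, open(evalFile) as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")
        if key not in handles:
            # open each per-key file once; ExitStack closes them all at the end
            handles[key] = stack.enter_context(open(make_filename(key), "a"))
        handles[key].write(value + "\n")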
why creating a defaultdict and test the keys?
– Jean-François Fabre
Nov 23 at 6:59
@Jean-FrançoisFabre copy & paste error - thanks for pointing out
– Patrick Artner
Nov 23 at 7:13
- Change evalIDs to a collections.defaultdict(list). You can avoid the if to check if a key is there.
- Consider splitting the file externally using split(1) or even inside python using a read offset. Then use multiprocessing.pool to parallelise the loading.
Can you please give more details? I don't have any idea about multiprocessing.pool.
– bib
Nov 23 at 6:56
parallelizing the loading won't do any good if I/O is the bottleneck
– Jean-François Fabre
Nov 23 at 6:57
If I/O is the bottleneck, then yes, it won't do much good but apart from the defaultdict which everyone has suggested, it's the only other thing I can think of worth trying.
– Noufal Ibrahim
Nov 23 at 6:59
@bib Are all the lines in the file of the same length?
– Noufal Ibrahim
Nov 23 at 7:01
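A rough sketch of the split-and-parallelise idea from this answer is below. The chunk file names are hypothetical (e.g. pieces produced beforehand with split(1)), and as noted in the comments this only helps if I/O is not the bottleneck:

import glob
from collections import defaultdict
from multiprocessing import Pool

def parse_chunk(path):
    # build a partial key -> [values] dict from one chunk of the original TSV
    partial = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            ids = line.rstrip("\n").split("\t")
            if len(ids) >= 2:
                partial[ids[0]].append(ids[1])
    return partial

if __name__ == "__main__":
    chunks = glob.glob("eval_chunk_*")   # hypothetical names of the split pieces
    evalIDs = defaultdict(list)
    with Pool() as pool:
        for partial in pool.map(parse_chunk, chunks):
            # merge the per-chunk dicts into one
            for key, values in partial.items():
                evalIDs[key].extend(values)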