How to parse very big files in Python?
I have a very big TSV file (1.5 GB) that I want to parse. I'm using the following function:

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = {}
    for row in eval:
        ids = row.split("\t")
        if ids[0] not in evalIDs.keys():
            evalIDs[ids[0]] = []
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs

It has been running for more than 10 hours and it is still not finished. I don't know how to accelerate this step, or whether there is another method to parse such a file.
python bigdata
Maybe duplicated with stackoverflow.com/questions/17444679/reading-a-huge-csv-file
– Guillaume Jacquenot
Nov 23 at 6:36
what operating system?
– Patrick Artner
Nov 23 at 6:39
I'm working on a Linux server with 130 GB of RAM.
– bib
Nov 23 at 6:41
@GuillaumeJacquenot nope, not at all. OP isn't reading the lines first.
– Jean-François Fabre
Nov 23 at 7:01
@bib Stepping back a little, what do you plan to use dictionary eval for?
– Noufal Ibrahim
Nov 23 at 7:59
4 Answers
Maybe you can make it somewhat faster; change this:

if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])

to

evalIDs.setdefault(ids[0], []).append(ids[1])

The first solution searches the "evalIDs" dictionary 3 times.
setdefault is slower than a defaultdict. timeit.timeit(lambda: d.setdefault('x', []).append(1)) reports 0.4583683079981711 and timeit.timeit(lambda: c['x'].append(1)) reports 0.28720847200020216, where d is {} and c is collections.defaultdict(list). All the answers have recommended so. Why have you selected this as the correct one? The solution here is inferior to the others mentioned.
– Noufal Ibrahim
Nov 26 at 4:45
I can't measure significant difference (Python 3.7.1), but the OP should measure it.
– kantal
Nov 26 at 7:21
The defaultdict is roughly twice as fast as the setdefault in my measurement (Python 3.5.3), and I think that's reasonable given how setdefault evaluates its arguments each time you call it (a new empty list is created each time).
– Noufal Ibrahim
Nov 26 at 9:13
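For reference, the comparison described in the comments above can be reproduced with a small micro-benchmark along these lines; this is only a sketch, and the exact numbers will vary with machine and Python version:

import timeit
from collections import defaultdict

d = {}                   # plain dict, used with setdefault
c = defaultdict(list)    # defaultdict(list), no per-call default argument

# setdefault builds a fresh empty list argument on every call,
# even when the key already exists
t_setdefault = timeit.timeit(lambda: d.setdefault('x', []).append(1))
t_defaultdict = timeit.timeit(lambda: c['x'].append(1))

print("setdefault: ", t_setdefault)
print("defaultdict:", t_defaultdict)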
several issues here:

- testing for keys with if ids[0] not in evalIDs.keys() takes forever in python 2, because keys() is a list. .keys() is rarely useful anyway. A better way already is if ids[0] not in evalIDs, but, but...
- why not use a collections.defaultdict instead?
- why not use the csv module?
- overriding the eval built-in (well, not really an issue seeing how dangerous it is)
my proposal:
import csv, collections

def readEvalFileAsDictInverse(evalFile):
    with open(evalFile, "r") as handle:
        evalIDs = collections.defaultdict(list)
        cr = csv.reader(handle, delimiter='\t')
        for ids in cr:
            evalIDs[ids[0]].append(ids[1])
    return evalIDs
the magic evalIDs[ids[0]].append(ids[1]) creates a list if it doesn't already exist. It's also portable and very fast whatever the python version, and saves an if.
I don't think it could be faster with default libraries, but a pandas solution probably would.
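A pandas version of the same grouping might look roughly like the sketch below; the function name is made up here, and it assumes the file has at least two tab-separated columns and no header row:

import pandas as pd

def readEvalFileAsDictInverse_pandas(evalFile):
    # header=None: no header row, so the columns get integer labels 0, 1, ...
    df = pd.read_csv(evalFile, sep="\t", header=None, dtype=str)
    # group column 1 by column 0 and collect the values into lists
    return df.groupby(0)[1].apply(list).to_dict()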
Some suggestions:
Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault(). dict.setdefault() will create the default value every time, that's a time burner - defaultdict(list) does not - it is optimized:
from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = defaultdict(list)
    for row in eval:
        ids = row.split("\t")
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
If your keys are valid file names you might want to investigate awk for much more performance than doing this in python. Something along the lines of

awk -F $'\t' '{print > $1}' file1

will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here) - You would need to grab your created files with os.walk or similar means. Each line inside the files will still be tab-separated and contain the ID in front.

If your keys are not filenames in their own right, consider storing your different lines into different files and only keep a dictionary of key, filename around.
After splitting the data, load the files as lists again:
Create testfile:
with open ("file.txt","w") as w:
w.write("""
1ttatati
2tyippti
3turksti
1tTTtatati
2tYYyippti
3tUUurksti
1ttttttttatati
2tyyyyyyyippti
3tuuuuuuurksti
""")
Code:
# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename characters, make it a valid name"""
    return k  # assuming k is a valid file name, else modify it

evalFile = "file.txt"
files = {}

with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")  # omit ,*rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))
        # this will open and close files _a lot_ - you might want to keep file handles
        # instead in your dict - but that depends on the key/data/lines ratio in
        # your data - if you have few keys, file handles ought to be better, if you
        # have many it does not matter
        with open(fn, "a") as f:
            f.write(value + "\n")

# create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)
Output:
# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'],
'2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
'3': ['urks', 'UUurks', 'uuuuuuurks']}
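As the comment in the code above hints, the repeated open/close in append mode can be avoided by keeping one handle per key open. A rough sketch of that variant, reusing make_filename from above and only sensible when the number of distinct keys is modest:

from contextlib import ExitStack

evalFile = "file.txt"
handles = {}

with ExitStack() as stack, open(evalFile) as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")
        if key not in handles:
            # open each per-key file once; ExitStack closes them all at the end
            handles[key] = stack.enter_context(open(make_filename(key), "a"))
        handles[key].write(value + "\n")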
why creating a defaultdict and test the keys?
– Jean-François Fabre
Nov 23 at 6:59
@Jean-FrançoisFabre copy & paste error - thanks for pointing out
– Patrick Artner
Nov 23 at 7:13
- Change evalIDs to a collections.defaultdict(list). You can avoid the if to check if a key is there.
- Consider splitting the file externally using split(1) or even inside python using a read offset. Then use multiprocessing.pool to parallelise the loading.
Can you please give more details? I don't have any idea about multiprocessing.pool.
– bib
Nov 23 at 6:56
parallelizing the loading won't do any good if I/O is the bottleneck
– Jean-François Fabre
Nov 23 at 6:57
If I/O is the bottleneck, then yes, it won't do much good but apart from the defaultdict which everyone has suggested, it's the only other thing I can think of worth trying.
– Noufal Ibrahim
Nov 23 at 6:59
@bib Are all the lines in the file of the same length?
– Noufal Ibrahim
Nov 23 at 7:01
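A rough sketch of the split-and-parallelise idea from this answer is below. The chunk file names are hypothetical (e.g. pieces produced beforehand with split(1)), and as noted in the comments this only helps if I/O is not the bottleneck:

import glob
from collections import defaultdict
from multiprocessing import Pool

def parse_chunk(path):
    # build a partial key -> [values] dict from one chunk of the original TSV
    partial = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            ids = line.rstrip("\n").split("\t")
            if len(ids) >= 2:
                partial[ids[0]].append(ids[1])
    return partial

if __name__ == "__main__":
    chunks = glob.glob("eval_chunk_*")   # hypothetical names of the split pieces
    evalIDs = defaultdict(list)
    with Pool() as pool:
        for partial in pool.map(parse_chunk, chunks):
            # merge the per-chunk dicts into one
            for key, values in partial.items():
                evalIDs[key].extend(values)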