Instant access to a line in a large file without loading the file
In one of my recent projects I need to perform this simple task, but I'm not sure what the most efficient way to do it is.
I have several large text files (>5 GB) and I need to continuously extract random lines from them. The requirements are: I can't load the files into memory, I need to do this very efficiently (>>1000 lines per second), and preferably with as little pre-processing as possible.
The files consist of many short lines (~20 million lines). The "raw" files have varying line lengths, but with a little pre-processing I can make all lines the same length (though the perfect solution would not require pre-processing).
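To make the fixed-length idea concrete, here is a minimal sketch of the kind of access I'd expect it to allow; the file name and the 81-byte width (80 characters plus a newline) are made up for the example:

    # Random access into a file whose lines all have the same length.
    # LINE_WIDTH includes the trailing newline; both values here are hypothetical.
    LINE_WIDTH = 81  # 80 characters + '\n'

    def get_line(f, line_number):
        """Seek straight to the line's byte offset and read it."""
        f.seek(line_number * LINE_WIDTH)
        return f.read(LINE_WIDTH).rstrip(b'\n')

    with open('big_file.txt', 'rb') as f:
        print(get_line(f, 12_345_678))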
I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it's not usable here).
The next solution I thought of is to create some kind of index. I found this solution, but it's very outdated, so it needs some work to get running, and even then I'm not sure whether the overhead created while processing the index file won't slow the process down to the time scale of the solutions above.
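For illustration, this is the kind of index I have in mind: one sequential pass records each line's byte offset, after which any line is a single seek away. A sketch, assuming the offsets fit in memory (20 million int64 offsets is only ~160 MB) and the index is saved with numpy so it can be reused across runs:

    import numpy as np

    def build_index(path):
        """One pass over the file, recording the byte offset of each line start."""
        offsets = [0]
        with open(path, 'rb') as f:
            for line in f:
                offsets.append(offsets[-1] + len(line))
        return np.array(offsets[:-1], dtype=np.int64)  # drop the trailing EOF offset

    def read_line(f, offsets, i):
        f.seek(offsets[i])
        return f.readline().rstrip(b'\n')

    offsets = build_index('big_file.txt')
    np.save('big_file.idx.npy', offsets)  # reuse later: offsets = np.load('big_file.idx.npy')
    with open('big_file.txt', 'rb') as f:
        print(read_line(f, offsets, 1_000_000))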
Another solution is converting the file into a binary file and then getting instant access to lines that way. I couldn't find any Python package that supports binary-text work, and I feel that creating a robust parser this way could take a very long time and introduce many hard-to-diagnose errors down the line due to small miscalculations/mistakes.
The final solution I thought of is using some kind of database (SQLite in my case), which would require transferring the lines into a database and loading them that way.
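Sketching what that could look like with the sqlite3 module from the standard library (the file and table names are made up); SQLite's implicit rowid is indexed, so single-row lookups should stay fast:

    import sqlite3

    con = sqlite3.connect('lines.db')
    con.execute('CREATE TABLE IF NOT EXISTS lines (line TEXT)')

    # One-time load: stream the text file in, one row per line.
    with open('big_file.txt', 'r') as f:
        con.executemany('INSERT INTO lines (line) VALUES (?)',
                        ((line.rstrip('\n'),) for line in f))
    con.commit()

    # Fetch a single line by its (1-based) rowid.
    row = con.execute('SELECT line FROM lines WHERE rowid = ?', (1_000_000,)).fetchone()
    print(row[0])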
Note: I will also load thousands of (random) lines each time, so solutions that work better for groups of lines will have an advantage.
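To make the group aspect concrete, a sketch of the kind of batched read I have in mind, building on the offset-index idea above; visiting the lines in file order keeps the seeks moving forward instead of jumping randomly:

    import random

    def read_lines_batch(f, offsets, line_numbers):
        """Fetch many lines, reading them in offset order for better locality."""
        order = sorted(range(len(line_numbers)), key=lambda k: line_numbers[k])
        out = [None] * len(line_numbers)
        for k in order:
            f.seek(offsets[line_numbers[k]])
            out[k] = f.readline().rstrip(b'\n')
        return out

    # e.g. 5000 random lines out of ~20 million:
    with open('big_file.txt', 'rb') as f:
        batch = read_lines_batch(f, offsets, random.sample(range(20_000_000), 5000))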
Thanks in advance,
Art.
python performance text loading large-data
asked Nov 26 '18 at 14:41 by artembus
I recommend SQLite. It fits this problem very well, and there is nothing to install.
– MEdwin
Nov 26 '18 at 14:44
As MEdwin said, and as you alluded to, I think your best bet is to move away from a plain text file to either some form of SQL database or HDF5. You might be able to read the file in quicker if you pickle it or something, but in my experience that doesn't make a huge difference.
– John Rouhana
Nov 26 '18 at 14:46
Another possible approach is to seek to a random position in the file and move backwards/forwards until you can isolate a line, and repeat...
– Jon Clements♦
Nov 26 '18 at 14:49
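A rough sketch of the approach Jon Clements describes, with a made-up file name; note that it samples each line with probability proportional to the length of the preceding line, so it is only approximately uniform:

    import os
    import random

    def random_line(path):
        """Seek to a random byte, skip the partial line landed in, return the next one."""
        size = os.path.getsize(path)
        with open(path, 'rb') as f:
            f.seek(random.randrange(size))
            f.readline()              # discard the (probably partial) current line
            line = f.readline()
            if not line:              # landed in the last line: wrap to the start
                f.seek(0)
                line = f.readline()
            return line.rstrip(b'\n')

    print(random_line('big_file.txt'))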
1 Answer
As said in the comments, I believe using HDF5 would be a good option.
This answer shows how to read that kind of file.
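A minimal sketch of what that could look like with h5py; the file names, dataset name, and the assumed 80-byte maximum line width are illustrative, and the one-time conversion below reads the whole file at once for brevity (a real 5 GB conversion would stream in chunks):

    import h5py
    import numpy as np

    # One-time conversion: store the lines as a fixed-width bytes dataset.
    with open('big_file.txt', 'rb') as f:
        lines = np.array([l.rstrip(b'\n') for l in f], dtype='S80')  # truncates past 80 bytes
    with h5py.File('big_file.h5', 'w') as h5:
        h5.create_dataset('lines', data=lines)

    # Later: random access without loading the dataset into memory.
    with h5py.File('big_file.h5', 'r') as h5:
        ds = h5['lines']
        idx = np.sort(np.random.choice(len(ds), size=5000, replace=False))  # h5py needs sorted, unique indices
        batch = ds[idx]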
answered Nov 26 '18 at 15:00 by Pedro Torres