Instant access to line from a large file without loading the file

In one of my recent projects I need to perform this simple task, but I'm not sure what the most efficient way to do it is.

I have several large text files (>5 GB each) and I need to continuously extract random lines from them. The requirements are: I can't load the files into memory, the extraction has to be very fast (far more than 1000 lines per second), and I would prefer to do as little pre-processing as possible.

The files consist of many short lines (about 20 million lines). The "raw" files have varying line lengths, but with a short pre-processing step I can make all lines the same length (though the perfect solution would not require pre-processing).
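To make the equal-length idea concrete, here is a minimal sketch, assuming every line has already been padded to the same number of bytes (newline included). The function name `read_fixed_lines` and the `record_size` parameter are made up for illustration. With fixed-size records, line `n` starts at byte `n * record_size`, so no index is needed.

```python
def read_fixed_lines(path, line_numbers, record_size):
    """Return the requested 0-based lines, in the order asked for.

    Assumes every line occupies exactly `record_size` bytes,
    including its trailing newline.
    """
    lines = {}
    with open(path, "rb") as f:
        # Visit offsets in ascending order so the seeks only move forward,
        # which tends to be kinder to the OS read-ahead for batches.
        for n in sorted(set(line_numbers)):
            f.seek(n * record_size)
            lines[n] = f.read(record_size).rstrip(b"\n")
    return [lines[n] for n in line_numbers]
```

If the padding is done with spaces, the caller would also need to strip those from each returned line.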

I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it is not usable here).

The next solution I thought about is to create some kind of index. I found this solution, but it's very outdated and needs some work to get running, and even then I'm not sure whether the overhead of processing the index file would slow the process down to the time scale of the solutions above.
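For reference, the index idea can be sketched with the standard library alone. This is not the outdated solution mentioned above, just an illustration under my own assumptions: the names `build_offset_index` and `read_lines` are hypothetical, and the index is held in memory as an `array` of 64-bit offsets (on the order of 160 MB for 20 million lines where items are 8 bytes each).

```python
from array import array

def build_offset_index(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = array("Q")  # unsigned integers, 8 bytes each on common platforms
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_lines(path, offsets, line_numbers):
    """Return the requested 0-based lines, in the order asked for."""
    lines = {}
    with open(path, "rb") as f:
        # Ascending order keeps the seeks moving forward through the file.
        for n in sorted(set(line_numbers)):
            f.seek(offsets[n])
            lines[n] = f.readline().rstrip(b"\r\n")
    return [lines[n] for n in line_numbers]
```

The one-time scan is the only pre-processing; the result could be written to disk with `offsets.tofile(...)` and reloaded with `array.fromfile(...)` so that the scan is not repeated on every run.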

Another solution is to convert the file into a binary file and get instant access to lines that way. For this approach I couldn't find any Python package that supports working with text stored in a binary format, and I feel that writing a robust parser myself could take a very long time and lead to many hard-to-diagnose errors down the line because of small miscalculations or mistakes.

The final solution I thought about is using some kind of database (SQLite in my case), which would require transferring the lines into a database and loading them from there.
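A rough sketch of what the SQLite route might look like with the standard `sqlite3` module. The table name `lines`, the column names, and the helper names `build_db` and `fetch_lines` are all assumptions for illustration, as is the UTF-8 encoding of the text file. The line number is stored as the integer primary key, so each lookup is a key search rather than a scan.

```python
import sqlite3

def build_db(text_path, db_path):
    """One-time conversion: copy every line into a table keyed by line number."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS lines (id INTEGER PRIMARY KEY, content TEXT)"
    )
    with open(text_path, encoding="utf-8") as f:  # encoding is an assumption
        con.executemany(
            "INSERT INTO lines (id, content) VALUES (?, ?)",
            ((i, line.rstrip("\n")) for i, line in enumerate(f)),
        )
    con.commit()
    con.close()

def fetch_lines(db_path, line_numbers, chunk=500):
    """Fetch a batch of 0-based line numbers, returned in the order asked for."""
    con = sqlite3.connect(db_path)
    found = {}
    ids = list(set(line_numbers))
    # Query in chunks to stay below SQLite's limit on bound parameters
    # per statement.
    for start in range(0, len(ids), chunk):
        part = ids[start:start + chunk]
        marks = ",".join("?" * len(part))
        rows = con.execute(
            f"SELECT id, content FROM lines WHERE id IN ({marks})", part
        )
        found.update(rows)  # each row is an (id, content) pair
    con.close()
    return [found[n] for n in line_numbers]
```

Fetching with `IN (...)` in chunks suits the "thousands of lines each time" case, since one statement returns several hundred lines at once.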

Note: I will also load thousands of (random) lines each time, so solutions that work better for groups of lines will have an advantage.

Thanks in advance,

Art.

asked Nov 26 '18 at 14:41

artembus

67111

2

I recommend SQLite. It fits this problem very well, and there is nothing to install.

– MEdwin
Nov 26 '18 at 14:44

2

As MEdwin said and as you alluded to, I think your best bet is to change the file away from a text file, and into either some form of SQL file or HDF5. You might be able to read in the file quicker if you pickle it or something, but in my experience, that doesn't make a huge difference.

– John Rouhana
Nov 26 '18 at 14:46

2

Another possible approach is to seek to a random position in the file and move backwards/forwards until you can isolate a line, and repeat...

– Jon Clements♦
Nov 26 '18 at 14:49
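The seek-to-a-random-position idea from the last comment could look roughly like this. It is only a sketch under assumptions: the names `random_line` and `sample_lines` are made up, the file must be non-empty, and instead of moving backwards it skips forward to the next line start. Note that this does not sample lines uniformly when line lengths vary: a line is returned whenever the random byte lands in the line before it, so lines that follow long lines are picked more often.

```python
import os
import random

def random_line(f, size):
    """Return one line from binary file `f` of `size` bytes (size > 0)."""
    f.seek(random.randrange(size))
    f.readline()          # discard the rest of the line we landed in
    line = f.readline()
    if not line:          # landed in the last line: wrap around to the first
        f.seek(0)
        line = f.readline()
    return line.rstrip(b"\r\n")

def sample_lines(path, k):
    """Draw `k` lines (with replacement) without any pre-processing."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        return [random_line(f, size) for _ in range(k)]
```

If all lines have the same length, the bias disappears and every line is equally likely.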

python performance text loading large-data

1 Answer

As said in the comments, I believe using HDF5 would be a good option.
This answer shows how to read that kind of file.

answered Nov 26 '18 at 15:00

Pedro Torres

693413
