Instant access to line from a large file without loading the file



























In one of my recent projects I need to perform this simple task, but I'm not sure what the most efficient way to do it is.



I have several large text files (>5 GB) and I need to continuously extract random lines from them. The requirements are: I can't load the files into memory, I need to do this very efficiently (well over 1000 lines per second), and I would prefer to do as little pre-processing as possible.



The files consist of many short lines (~20 million lines). The "raw" files have varying line lengths, but with a short pre-processing step I can make all lines the same length (though the ideal solution would not require pre-processing).
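
For the fixed-length variant, the byte offset of line n is just n times the record size, so each lookup is a single seek. The following is a minimal sketch, assuming every line has been padded with spaces to `RECORD_SIZE` bytes (newline included). The names `read_fixed_width_lines`, `RECORD_SIZE` and the temporary demo file are hypothetical and only there for illustration.

```python
import os
import random
import tempfile


def read_fixed_width_lines(path, line_numbers, record_size):
    """Return {line_number: bytes} for a file in which every line
    occupies exactly record_size bytes, newline included."""
    fetched = {}
    with open(path, "rb") as f:
        # Visiting offsets in ascending order keeps the reads roughly sequential.
        for n in sorted(line_numbers):
            f.seek(n * record_size)
            # Strip the newline and the space padding (assumption: real
            # content never ends in spaces).
            fetched[n] = f.read(record_size).rstrip(b"\n ")
    return fetched


# Demo on a small temporary file padded to 16 bytes per line.
RECORD_SIZE = 16
fixed_demo_lines = [f"line {i}" for i in range(1000)]
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    for text in fixed_demo_lines:
        tmp.write(text.encode("ascii").ljust(RECORD_SIZE - 1) + b"\n")
    fixed_demo_path = tmp.name

fixed_wanted = random.sample(range(len(fixed_demo_lines)), 5)
fixed_fetched = read_fixed_width_lines(fixed_demo_path, fixed_wanted, RECORD_SIZE)
os.remove(fixed_demo_path)
```

Because a whole batch of line numbers is passed in at once, the file is opened only once per batch, which should suit the "thousands of lines each time" access pattern.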



I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it is not usable here).



The next solution I thought about is to create some kind of index. I found this solution, but it's very outdated, so it needs some work to get running, and even then I'm not sure whether the overhead of processing the index file would slow things down to the time scale of the solutions above.
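
To make the index idea concrete, here is a minimal sketch (not the outdated solution referred to above) that scans the file once, stores the starting byte offset of every line, and then uses one seek per requested line. The function names `build_offset_index` and `read_indexed_lines` and the demo data are assumptions made for illustration. With about 20 million lines and 8 bytes per offset, the index itself would take roughly 160 MB.

```python
import array
import os
import random
import tempfile


def build_offset_index(path):
    """Scan the file once and record the byte offset at which each line starts."""
    offsets = array.array("Q")  # unsigned 64-bit integers
    position = 0
    with open(path, "rb") as f:
        for raw_line in f:
            offsets.append(position)
            position += len(raw_line)
    return offsets


def read_indexed_lines(path, offsets, line_numbers):
    """Return {line_number: bytes}, using one seek per requested line."""
    fetched = {}
    with open(path, "rb") as f:
        for n in sorted(line_numbers):
            f.seek(offsets[n])
            fetched[n] = f.readline().rstrip(b"\r\n")
    return fetched


# Demo on a small temporary file with lines of varying length.
index_demo_lines = ["x" * (i % 7) + f" row {i}" for i in range(1000)]
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("\n".join(index_demo_lines).encode("utf-8") + b"\n")
    index_demo_path = tmp.name

index_offsets = build_offset_index(index_demo_path)
index_wanted = random.sample(range(len(index_demo_lines)), 5)
index_fetched = read_indexed_lines(index_demo_path, index_offsets, index_wanted)
os.remove(index_demo_path)
```

The index only has to be built once per file. It could be written to disk with `array.tofile` and reloaded with `array.fromfile`, so later runs skip the scan.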



Another solution is converting the file into a binary file and then getting instant access to lines that way. For this solution I couldn't find any Python package that supports binary-text work, and I feel that writing a robust parser this way could take a very long time and could create many hard-to-diagnose errors down the line because of small miscalculations or mistakes.



The final solution I thought about is using some kind of database (SQLite in my case), which would require transferring the lines into a database and loading them from there.
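
For the database route, a minimal sketch with the standard-library `sqlite3` module could look like the following. The table name `lines`, the column names, and the helper `fetch_lines` are hypothetical. The demo uses an in-memory database with generated rows, whereas a real run would connect to a file on disk and insert the lines of the text file.

```python
import random
import sqlite3

# ":memory:" keeps the demo self-contained; a real run would pass a file path.
conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY makes line_no an alias for SQLite's rowid,
# so lookups by line number go straight through the table's B-tree.
conn.execute("CREATE TABLE lines (line_no INTEGER PRIMARY KEY, content TEXT)")

db_demo_lines = [f"row {i}" for i in range(1000)]
with conn:  # one transaction for the whole bulk insert
    conn.executemany(
        "INSERT INTO lines (line_no, content) VALUES (?, ?)",
        enumerate(db_demo_lines),
    )


def fetch_lines(connection, line_numbers, chunk_size=500):
    """Return {line_no: content}, querying in chunks to stay below
    SQLite's limit on bound parameters per statement."""
    fetched = {}
    numbers = list(line_numbers)
    for start in range(0, len(numbers), chunk_size):
        chunk = numbers[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        query = f"SELECT line_no, content FROM lines WHERE line_no IN ({placeholders})"
        fetched.update(connection.execute(query, chunk))
    return fetched


db_wanted = random.sample(range(len(db_demo_lines)), 5)
db_fetched = fetch_lines(conn, db_wanted)
conn.close()
```

Fetching a group of line numbers per query, rather than one query per line, is the part that targets the batch requirement.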



Note: I will also load thousands of (random) lines each time, so solutions that work better for groups of lines will have an advantage.



Thanks in advance,



Art.










python performance text loading large-data
















asked Nov 26 '18 at 14:41









artembus









  • 2





    I recommend sqlite. It fits this problem very well, and there is no need to install anything.

    – MEdwin
    Nov 26 '18 at 14:44






  • 2





    As MEdwin said and as you alluded to, I think your best bet is to change the file away from a text file, and into either some form of SQL file or HDF5. You might be able to read in the file quicker if you pickle it or something, but in my experience, that doesn't make a huge difference.

    – John Rouhana
    Nov 26 '18 at 14:46






  • 2





    Another possible approach is to seek to a random position in the file and move backwards/forwards until you can isolate a line, and repeat...

    – Jon Clements
    Nov 26 '18 at 14:49
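
The seek-to-a-random-position idea from the last comment needs no pre-processing at all. A minimal sketch, assuming a non-empty file opened in binary mode, follows. `random_line` is a hypothetical helper. Note that this version does not sample uniformly: a line is picked with probability roughly proportional to the length of the line before it, which may or may not matter for the use case.

```python
import os
import random
import tempfile


def random_line(f, file_size):
    """Return one line chosen by seeking to a random byte offset."""
    position = random.randrange(file_size)
    f.seek(position)
    if position > 0:
        f.readline()  # discard the rest of the line we landed in
    line = f.readline()
    if not line:
        # We landed in the last line, so wrap around to the first one.
        f.seek(0)
        line = f.readline()
    return line.rstrip(b"\r\n")


# Demo on a small temporary file with lines of varying length.
seek_demo_lines = ["y" * (i % 5) + f" item {i}" for i in range(1000)]
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("\n".join(seek_demo_lines).encode("utf-8") + b"\n")
    seek_demo_path = tmp.name

seek_size = os.path.getsize(seek_demo_path)
with open(seek_demo_path, "rb") as demo_file:
    seek_picked = [random_line(demo_file, seek_size) for _ in range(20)]
os.remove(seek_demo_path)
```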


























1 Answer
































As mentioned in the comments, I believe using HDF5 would be a good option.
This answer shows how to read that kind of file.






        answered Nov 26 '18 at 15:00









        Pedro Torres
































