Collapse only a subset of the dataset with “if”












0















I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.



With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.



For example, replace does not alter the data for which foreign != 1:



. sysuse auto, clear
(1978 Automobile Data)

. replace mpg = 16 if foreign == 1
(22 real changes made)


However, it appears that collapse applies to the data that meets the if criteria and drops the rest:



. count if mpg > -1
74

. * all the data has mpg > -1

. count if foreign == 1
22

. collapse (mean) mpg if foreign == 1

. count if mpg > -1
1


There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.



That is in fact what I want to do with my data, so what should I do differently?



@NickCox helpfully suggested something like this:



. save "temp/whatever"
file temp/whatever.dta saved

. sysuse auto, clear
(1978 Automobile Data)

. drop if foreign == 1
(22 observations deleted)

. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)


That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.



Any other suggestions would be appreciated.










share|improve this question




















  • 1





    Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.

    – RGecon
    Nov 25 '18 at 17:58








  • 1





    @RGecon we now have a question for this: How to ask high quality reproducible questions in Stata

    – Pearly Spencer
    Nov 25 '18 at 18:13
















0















I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.



With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.



For example, replace does not alter the data for which foreign != 1:



. sysuse auto, clear
(1978 Automobile Data)

. replace mpg = 16 if foreign == 1
(22 real changes made)


However, it appears that collapse applies to the data that meets the if criteria and drops the rest:



. count if mpg > -1
74

. * all the data has mpg > -1

. count if foreign == 1
22

. collapse (mean) mpg if foreign == 1

. count if mpg > -1
1


There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.



That is in fact what I want to do with my data, so what should I do differently?



@NickCox helpfully suggested something like this:



. save "temp/whatever"
file temp/whatever.dta saved

. sysuse auto, clear
(1978 Automobile Data)

. drop if foreign == 1
(22 observations deleted)

. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)


That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.



Any other suggestions would be appreciated.










share|improve this question




















  • 1





    Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.

    – RGecon
    Nov 25 '18 at 17:58








  • 1





    @RGecon we now have a question for this: How to ask high quality reproducible questions in Stata

    – Pearly Spencer
    Nov 25 '18 at 18:13














0












0








0








I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.



With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.



For example, replace does not alter the data for which foreign != 1:



. sysuse auto, clear
(1978 Automobile Data)

. replace mpg = 16 if foreign == 1
(22 real changes made)


However, it appears that collapse applies to the data that meets the if criteria and drops the rest:



. count if mpg > -1
74

. * all the data has mpg > -1

. count if foreign == 1
22

. collapse (mean) mpg if foreign == 1

. count if mpg > -1
1


There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.



That is in fact what I want to do with my data, so what should I do differently?



@NickCox helpfully suggested something like this:



. save "temp/whatever"
file temp/whatever.dta saved

. sysuse auto, clear
(1978 Automobile Data)

. drop if foreign == 1
(22 observations deleted)

. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)


That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.



Any other suggestions would be appreciated.










share|improve this question
















I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.



With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.



For example, replace does not alter the data for which foreign != 1:



. sysuse auto, clear
(1978 Automobile Data)

. replace mpg = 16 if foreign == 1
(22 real changes made)


However, it appears that collapse applies to the data that meets the if criteria and drops the rest:



. count if mpg > -1
74

. * all the data has mpg > -1

. count if foreign == 1
22

. collapse (mean) mpg if foreign == 1

. count if mpg > -1
1


There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.



That is in fact what I want to do with my data, so what should I do differently?



@NickCox helpfully suggested something like this:



. save "temp/whatever"
file temp/whatever.dta saved

. sysuse auto, clear
(1978 Automobile Data)

. drop if foreign == 1
(22 observations deleted)

. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)


That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.



Any other suggestions would be appreciated.







stata






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 24 '18 at 10:52









Pearly Spencer

10.1k173360




10.1k173360










asked Nov 19 '18 at 6:49









RGeconRGecon

439




439








  • 1





    Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.

    – RGecon
    Nov 25 '18 at 17:58








  • 1





    @RGecon we now have a question for this: How to ask high quality reproducible questions in Stata

    – Pearly Spencer
    Nov 25 '18 at 18:13














  • 1





    Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.

    – RGecon
    Nov 25 '18 at 17:58








  • 1





    @RGecon we now have a question for this: How to ask high quality reproducible questions in Stata

    – Pearly Spencer
    Nov 25 '18 at 18:13








1




1





Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.

– RGecon
Nov 25 '18 at 17:58







Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.

– RGecon
Nov 25 '18 at 17:58






1




1





@RGecon we now have a question for this: How to ask high quality reproducible questions in Stata

– Pearly Spencer
Nov 25 '18 at 18:13





@RGecon we now have a question for this: How to ask high quality reproducible questions in Stata

– Pearly Spencer
Nov 25 '18 at 18:13












1 Answer
1






active

oldest

votes


















2














collapse with if works this way:



Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.



Those observations not selected disappear.



It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
(The community has often asked for save with if: savesome from SSC is one work-around.)



If you want to collapse some of the observations but leave the others unchanged, then you can try



A. this strategy



A1. use your dataset



A2. keep if what you want unchanged and save those observations



A3. use your dataset again



A4. collapse to taste



A5. append the dataset from A2



sysuse auto, clear
keep if !foreign
save domestic
sysuse auto, clear
collapse mpg if foreign
gen make = "All foreign"
append using domestic


or B. this one:



B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed



B2. collapse feeding that identifier to by().



sysuse auto, clear 
replace make = "All foreign" if foreign
collapse mpg, by(make)


Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.






share|improve this answer

























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53369610%2fcollapse-only-a-subset-of-the-dataset-with-if%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2














    collapse with if works this way:



    Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.



    Those observations not selected disappear.



    It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
    (The community has often asked for save with if: savesome from SSC is one work-around.)



    If you want to collapse some of the observations but leave the others unchanged, then you can try



    A. this strategy



    A1. use your dataset



    A2. keep if what you want unchanged and save those observations



    A3. use your dataset again



    A4. collapse to taste



    A5. append the dataset from A2



    sysuse auto, clear
    keep if !foreign
    save domestic
    sysuse auto, clear
    collapse mpg if foreign
    gen make = "All foreign"
    append using domestic


    or B. this one:



    B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed



    B2. collapse feeding that identifier to by().



    sysuse auto, clear 
    replace make = "All foreign" if foreign
    collapse mpg, by(make)


    Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.






    share|improve this answer






























      2














      collapse with if works this way:



      Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.



      Those observations not selected disappear.



      It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
      (The community has often asked for save with if: savesome from SSC is one work-around.)



      If you want to collapse some of the observations but leave the others unchanged, then you can try



      A. this strategy



      A1. use your dataset



      A2. keep if what you want unchanged and save those observations



      A3. use your dataset again



      A4. collapse to taste



      A5. append the dataset from A2



      sysuse auto, clear
      keep if !foreign
      save domestic
      sysuse auto, clear
      collapse mpg if foreign
      gen make = "All foreign"
      append using domestic


      or B. this one:



      B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed



      B2. collapse feeding that identifier to by().



      sysuse auto, clear 
      replace make = "All foreign" if foreign
      collapse mpg, by(make)


      Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.






      share|improve this answer




























        2












        2








        2







        collapse with if works this way:



        Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.



        Those observations not selected disappear.



        It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
        (The community has often asked for save with if: savesome from SSC is one work-around.)



        If you want to collapse some of the observations but leave the others unchanged, then you can try



        A. this strategy



        A1. use your dataset



        A2. keep if what you want unchanged and save those observations



        A3. use your dataset again



        A4. collapse to taste



        A5. append the dataset from A2



        sysuse auto, clear
        keep if !foreign
        save domestic
        sysuse auto, clear
        collapse mpg if foreign
        gen make = "All foreign"
        append using domestic


        or B. this one:



        B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed



        B2. collapse feeding that identifier to by().



        sysuse auto, clear 
        replace make = "All foreign" if foreign
        collapse mpg, by(make)


        Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.






        share|improve this answer















        collapse with if works this way:



        Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.



        Those observations not selected disappear.



        It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
        (The community has often asked for save with if: savesome from SSC is one work-around.)



        If you want to collapse some of the observations but leave the others unchanged, then you can try



        A. this strategy



        A1. use your dataset



        A2. keep if what you want unchanged and save those observations



        A3. use your dataset again



        A4. collapse to taste



        A5. append the dataset from A2



        sysuse auto, clear
        keep if !foreign
        save domestic
        sysuse auto, clear
        collapse mpg if foreign
        gen make = "All foreign"
        append using domestic


        or B. this one:



        B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed



        B2. collapse feeding that identifier to by().



        sysuse auto, clear 
        replace make = "All foreign" if foreign
        collapse mpg, by(make)


        Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Nov 24 '18 at 17:58

























        answered Nov 19 '18 at 10:19









        Nick CoxNick Cox

        25k42038




        25k42038






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53369610%2fcollapse-only-a-subset-of-the-dataset-with-if%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Lallio

            Unable to find Lightning Node

            Futebolista