Collapse only a subset of the dataset with “if”
I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.
With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.
For example, replace does not alter the data for which foreign != 1:
. sysuse auto, clear
(1978 Automobile Data)
. replace mpg = 16 if foreign == 1
(22 real changes made)
However, it appears that collapse applies to the data that meets the if criteria and drops the rest:
. count if mpg > -1
74
. * all the data has mpg > -1
. count if foreign == 1
22
. collapse (mean) mpg if foreign == 1
. count if mpg > -1
1
There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.
That is in fact what I want to do with my data, so what should I do differently?
@NickCox helpfully suggested something like this:
. save "temp/whatever"
file temp/whatever.dta saved
. sysuse auto, clear
(1978 Automobile Data)
. drop if foreign == 1
(22 observations deleted)
. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)
That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.
Any other suggestions would be appreciated.
stata
add a comment |
I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.
With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.
For example, replace does not alter the data for which foreign != 1:
. sysuse auto, clear
(1978 Automobile Data)
. replace mpg = 16 if foreign == 1
(22 real changes made)
However, it appears that collapse applies to the data that meets the if criteria and drops the rest:
. count if mpg > -1
74
. * all the data has mpg > -1
. count if foreign == 1
22
. collapse (mean) mpg if foreign == 1
. count if mpg > -1
1
There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.
That is in fact what I want to do with my data, so what should I do differently?
@NickCox helpfully suggested something like this:
. save "temp/whatever"
file temp/whatever.dta saved
. sysuse auto, clear
(1978 Automobile Data)
. drop if foreign == 1
(22 observations deleted)
. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)
That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.
Any other suggestions would be appreciated.
stata
1
Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.
– RGecon
Nov 25 '18 at 17:58
1
@RGecon we now have a question for this: How to ask high quality reproducible questions in Stata
– Pearly Spencer
Nov 25 '18 at 18:13
add a comment |
I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.
With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.
For example, replace does not alter the data for which foreign != 1:
. sysuse auto, clear
(1978 Automobile Data)
. replace mpg = 16 if foreign == 1
(22 real changes made)
However, it appears that collapse applies to the data that meets the if criteria and drops the rest:
. count if mpg > -1
74
. * all the data has mpg > -1
. count if foreign == 1
22
. collapse (mean) mpg if foreign == 1
. count if mpg > -1
1
There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.
That is in fact what I want to do with my data, so what should I do differently?
@NickCox helpfully suggested something like this:
. save "temp/whatever"
file temp/whatever.dta saved
. sysuse auto, clear
(1978 Automobile Data)
. drop if foreign == 1
(22 observations deleted)
. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)
That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.
Any other suggestions would be appreciated.
stata
I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.
With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.
For example, replace does not alter the data for which foreign != 1:
. sysuse auto, clear
(1978 Automobile Data)
. replace mpg = 16 if foreign == 1
(22 real changes made)
However, it appears that collapse applies to the data that meets the if criteria and drops the rest:
. count if mpg > -1
74
. * all the data has mpg > -1
. count if foreign == 1
22
. collapse (mean) mpg if foreign == 1
. count if mpg > -1
1
There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.
That is in fact what I want to do with my data, so what should I do differently?
@NickCox helpfully suggested something like this:
. save "temp/whatever"
file temp/whatever.dta saved
. sysuse auto, clear
(1978 Automobile Data)
. drop if foreign == 1
(22 observations deleted)
. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)
That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.
Any other suggestions would be appreciated.
stata
stata
edited Nov 24 '18 at 10:52
Pearly Spencer
10.1k173360
10.1k173360
asked Nov 19 '18 at 6:49
RGeconRGecon
439
439
1
Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.
– RGecon
Nov 25 '18 at 17:58
1
@RGecon we now have a question for this: How to ask high quality reproducible questions in Stata
– Pearly Spencer
Nov 25 '18 at 18:13
add a comment |
1
Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.
– RGecon
Nov 25 '18 at 17:58
1
@RGecon we now have a question for this: How to ask high quality reproducible questions in Stata
– Pearly Spencer
Nov 25 '18 at 18:13
1
1
Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.
– RGecon
Nov 25 '18 at 17:58
Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.
– RGecon
Nov 25 '18 at 17:58
1
1
@RGecon we now have a question for this: How to ask high quality reproducible questions in Stata
– Pearly Spencer
Nov 25 '18 at 18:13
@RGecon we now have a question for this: How to ask high quality reproducible questions in Stata
– Pearly Spencer
Nov 25 '18 at 18:13
add a comment |
1 Answer
1
active
oldest
votes
collapse with if works this way:
Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.
Those observations not selected disappear.
It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
(The community has often asked for save with if: savesome from SSC is one work-around.)
If you want to collapse some of the observations but leave the others unchanged, then you can try
A. this strategy
A1. use your dataset
A2. keep if what you want unchanged and save those observations
A3. use your dataset again
A4. collapse to taste
A5. append the dataset from A2
sysuse auto, clear
keep if !foreign
save domestic
sysuse auto, clear
collapse mpg if foreign
gen make = "All foreign"
append using domestic
or B. this one:
B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed
B2. collapse feeding that identifier to by().
sysuse auto, clear
replace make = "All foreign" if foreign
collapse mpg, by(make)
Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53369610%2fcollapse-only-a-subset-of-the-dataset-with-if%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
collapse with if works this way:
Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.
Those observations not selected disappear.
It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
(The community has often asked for save with if: savesome from SSC is one work-around.)
If you want to collapse some of the observations but leave the others unchanged, then you can try
A. this strategy
A1. use your dataset
A2. keep if what you want unchanged and save those observations
A3. use your dataset again
A4. collapse to taste
A5. append the dataset from A2
sysuse auto, clear
keep if !foreign
save domestic
sysuse auto, clear
collapse mpg if foreign
gen make = "All foreign"
append using domestic
or B. this one:
B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed
B2. collapse feeding that identifier to by().
sysuse auto, clear
replace make = "All foreign" if foreign
collapse mpg, by(make)
Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.
add a comment |
collapse with if works this way:
Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.
Those observations not selected disappear.
It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
(The community has often asked for save with if: savesome from SSC is one work-around.)
If you want to collapse some of the observations but leave the others unchanged, then you can try
A. this strategy
A1. use your dataset
A2. keep if what you want unchanged and save those observations
A3. use your dataset again
A4. collapse to taste
A5. append the dataset from A2
sysuse auto, clear
keep if !foreign
save domestic
sysuse auto, clear
collapse mpg if foreign
gen make = "All foreign"
append using domestic
or B. this one:
B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed
B2. collapse feeding that identifier to by().
sysuse auto, clear
replace make = "All foreign" if foreign
collapse mpg, by(make)
Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.
add a comment |
collapse with if works this way:
Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.
Those observations not selected disappear.
It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
(The community has often asked for save with if: savesome from SSC is one work-around.)
If you want to collapse some of the observations but leave the others unchanged, then you can try
A. this strategy
A1. use your dataset
A2. keep if what you want unchanged and save those observations
A3. use your dataset again
A4. collapse to taste
A5. append the dataset from A2
sysuse auto, clear
keep if !foreign
save domestic
sysuse auto, clear
collapse mpg if foreign
gen make = "All foreign"
append using domestic
or B. this one:
B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed
B2. collapse feeding that identifier to by().
sysuse auto, clear
replace make = "All foreign" if foreign
collapse mpg, by(make)
Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.
collapse with if works this way:
Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.
Those observations not selected disappear.
It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
(The community has often asked for save with if: savesome from SSC is one work-around.)
If you want to collapse some of the observations but leave the others unchanged, then you can try
A. this strategy
A1. use your dataset
A2. keep if what you want unchanged and save those observations
A3. use your dataset again
A4. collapse to taste
A5. append the dataset from A2
sysuse auto, clear
keep if !foreign
save domestic
sysuse auto, clear
collapse mpg if foreign
gen make = "All foreign"
append using domestic
or B. this one:
B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed
B2. collapse feeding that identifier to by().
sysuse auto, clear
replace make = "All foreign" if foreign
collapse mpg, by(make)
Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.
edited Nov 24 '18 at 17:58
answered Nov 19 '18 at 10:19
Nick CoxNick Cox
25k42038
25k42038
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53369610%2fcollapse-only-a-subset-of-the-dataset-with-if%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Thanks very much for your help and guidance in improving this question! I am not a stackoverflow etiquette expert either and learned several things about using the platform (eg when to put something in comments vs the question, when to edit the question in response to expert suggestions) that I will do differently in the future.
– RGecon
Nov 25 '18 at 17:58
1
@RGecon we now have a question for this: How to ask high quality reproducible questions in Stata
– Pearly Spencer
Nov 25 '18 at 18:13