How to efficiently edit massive amounts of data
Here is an example dataset. Let me show everyone what I am doing to my data, and then I will explain what I am struggling with. I apologize if the title isn't an accurate description. I tried my best, but I am a bit new at this. Feel free to change it to something more suitable if needed.
Location   sample1  sample2  sample3
chr1:1234  0/1      1/1      0/0
chr2:5678  0/0      0/0      0/0
chr3:2345  1/1      1/1      1/1
chr4:6789  0/1      1/1      ./.
I use this to convert them to either a YES, NO, or MAYBE
# First pass: map the genotype strings to numeric codes
replacement <- function(x) {
  x <- replace(x, which(x == './.'), 0.1)
  x <- replace(x, which(x == '0/0'), 0)
  x <- replace(x, which(x == '0/1'), 1)
  x <- replace(x, which(x == '1/1'), 2)
  x  # return the recoded column
}
test <- apply(test.data.set, 2, replacement)
test.data.2 <- as.data.frame(test)
# Second pass: map the numeric codes to labels
replacement <- function(x) {
  x <- replace(x, which(x == '0.1'), "MAYBE")
  x <- replace(x, which(x == '0'), "NO")
  x <- replace(x, which(x == '1'), "YES")
  x <- replace(x, which(x == '2'), "YES")
  x  # return the recoded column
}
test.data.3 <- apply(test.data.2, 2, replacement)
test.data.4 <- as.data.frame(test.data.3)
Dataset after running
Location   sample1  sample2  sample3
chr1:1234  YES      YES      NO
chr2:5678  NO       NO       NO
chr3:2345  YES      YES      YES
chr4:6789  YES      YES      MAYBE
So what I wrote above currently works for me. However, I have a new dataset that contains about 300 samples (columns) and, I'm not even sure, but easily 500 million rows, so I need to alter well over a billion "cells". I tried running this on a cluster with 256 GB of memory and it just times out. I know what I wrote above is far from the "smoothest" way of altering my data. Does anyone have suggestions to streamline this process? I feel like dplyr has to have some kind of way to do this.
Any help would be amazing! Feel free to ask any questions if you need clarifications.
r dplyr
2
I think you are missing MAYBE in your 2nd replacement function
– prosoitos
Nov 27 '18 at 18:53
2
Why don't you just go from the original 0/1 type data straight to the NO and YES data? Why the intermediate step?
– John Paul
Nov 27 '18 at 18:54
1
Sorry @prosoitos that was a typo! Fixed it
– Brian
Nov 27 '18 at 18:56
1
And if you use dplyr::case_when you don't have to create a function and can apply the change directly
– prosoitos
Nov 27 '18 at 18:58
2
@Brian - just do dput(head(test))
– fugu
Nov 27 '18 at 19:24
asked Nov 27 '18 at 18:45 by Brian, edited Nov 27 '18 at 18:56
1 Answer
library(tidyverse)
Recreate your data:
df <- tibble(
  Location = letters[1:4],
  sample1 = c("0/1", "0/0", "1/1", "0/1"),
  sample2 = c("1/1", "0/0", "1/1", "1/1"),
  sample3 = c("0/0", "0/0", "1/1", "./.")
)
Code:
df %>% mutate_at(
  vars(-Location),
  funs(case_when(
    . == "1/1" | . == "0/1" ~ "YES",
    . == "0/0" ~ "NO",
    . == "./." ~ "MAYBE"
  ))
)
Result:
# A tibble: 4 x 4
  Location sample1 sample2 sample3
  <chr>    <chr>   <chr>   <chr>
1 a        YES     YES     NO
2 b        NO      NO      NO
3 c        YES     YES     YES
4 d        YES     YES     MAYBE
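A note for anyone on a newer dplyr: `funs()` has been deprecated since dplyr 0.8.0, and `across()` (available from dplyr 1.0.0) is the current way to apply one function over several columns. Below is a minimal sketch of the same recode written that way, assuming dplyr >= 1.0.0 is installed. The name `recoded` is only illustrative, and I have not timed this on data of the size in the question.

```r
library(dplyr)

# Same toy data as above
df <- tibble(
  Location = letters[1:4],
  sample1 = c("0/1", "0/0", "1/1", "0/1"),
  sample2 = c("1/1", "0/0", "1/1", "1/1"),
  sample3 = c("0/0", "0/0", "1/1", "./.")
)

# Recode every column except Location in a single pass;
# values that match none of the conditions become NA
recoded <- df %>%
  mutate(across(
    -Location,
    ~ case_when(
      .x %in% c("1/1", "0/1") ~ "YES",
      .x == "0/0" ~ "NO",
      .x == "./." ~ "MAYBE"
    )
  ))
```

Going straight from the genotype strings to the labels also drops the intermediate numeric step from the question, as suggested in the comments.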
1
Sorry for the late response. It took me a bit to get a subdataset to test this one. Works great. I am going to put it to the test later today!
– Brian
Nov 28 '18 at 13:27
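For the full-size file, one option that may help with the memory problem is to avoid loading everything at once and recode the file in chunks. The following is only a sketch in base R under some assumptions: the input is a tab-separated text file with a header row, the first column is the location, and the helper names `recode_genotypes` and `recode_file_in_chunks` and the `chunk_rows` default are made up for illustration. It uses a named lookup vector instead of `case_when`.

```r
# Named lookup vector: names are the raw genotype codes, values are the labels
lookup <- c("0/0" = "NO", "0/1" = "YES", "1/1" = "YES", "./." = "MAYBE")

# Hypothetical helper: recode every column except `id_col` by name lookup
recode_genotypes <- function(df, id_col = "Location") {
  sample_cols <- setdiff(names(df), id_col)
  df[sample_cols] <- lapply(
    df[sample_cols],
    function(col) unname(lookup[as.character(col)])  # unmatched codes become NA
  )
  df
}

# Hypothetical helper: stream a tab-separated file through recode_genotypes()
recode_file_in_chunks <- function(in_path, out_path, chunk_rows = 100000L) {
  con <- file(in_path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  writeLines(header, out_path)  # start the output file with the header
  col_names <- strsplit(header, "\t", fixed = TRUE)[[1]]
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break  # end of file
    chunk <- read.table(text = lines, sep = "\t", header = FALSE,
                        col.names = col_names, colClasses = "character",
                        check.names = FALSE, quote = "", comment.char = "")
    chunk <- recode_genotypes(chunk, id_col = col_names[1])
    write.table(chunk, out_path, append = TRUE, sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
  invisible(out_path)
}

# Small demonstration on a temporary file holding the example data
in_path <- tempfile(fileext = ".tsv")
out_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "Location\tsample1\tsample2\tsample3",
  "chr1:1234\t0/1\t1/1\t0/0",
  "chr2:5678\t0/0\t0/0\t0/0",
  "chr3:2345\t1/1\t1/1\t1/1",
  "chr4:6789\t0/1\t1/1\t./."
), in_path)
recode_file_in_chunks(in_path, out_path, chunk_rows = 2L)
result <- read.table(out_path, header = TRUE, sep = "\t",
                     colClasses = "character")
```

Only `chunk_rows` rows are held in memory at a time, so the chunk size can be tuned to whatever the machine handles comfortably.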
answered Nov 27 '18 at 19:20 by prosoitos