How to efficiently edit massive amounts of data

Here is an example dataset. I will first show what I am doing to my data and then explain what I am struggling with. I apologize if the title isn't an accurate description; I am a bit new at this, so feel free to change it to something more suitable.



Location   sample1  sample2  sample3
chr1:1234  0/1      1/1      0/0
chr2:5678  0/0      0/0      0/0
chr3:2345  1/1      1/1      1/1
chr4:6789  0/1      1/1      ./.


I use this to convert them to either YES, NO, or MAYBE:



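# Step 1: replace each genotype string with a numeric code.
replacement <- function(x) {
  x <- replace(x, which(x == './.'), 0.1)
  x <- replace(x, which(x == '0/0'), 0)
  x <- replace(x, which(x == '0/1'), 1)
  x <- replace(x, which(x == '1/1'), 2)
  x  # return the recoded column explicitly
}

test <- apply(test.data.set, 2, replacement)

test.data.2 <- as.data.frame(test)

# Step 2: replace each numeric code with its label.
replacement <- function(x) {
  x <- replace(x, which(x == '0.1'), "MAYBE")
  x <- replace(x, which(x == '0'), "NO")
  x <- replace(x, which(x == '1'), "YES")
  x <- replace(x, which(x == '2'), "YES")
  x  # return the recoded column explicitly
}

test.data.3 <- apply(test.data.2, 2, replacement)

test.data.4 <- as.data.frame(test.data.3)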


Dataset after running:

Location   sample1  sample2  sample3
chr1:1234  YES      YES      NO
chr2:5678  NO       NO       NO
chr3:2345  YES      YES      YES
chr4:6789  YES      YES      MAYBE


What I wrote above currently works for me. However, I have a new dataset that contains about 300 samples (columns) and, although I'm not sure of the exact number, easily 500 million rows, so I need to alter over a billion "cells". I tried running this on a cluster with 256 GB of memory and it just times out. I know what I wrote above is far from the "smoothest" way of altering my data. Does anyone have suggestions to streamline this process? I feel like dplyr must have some way to do this.



Any help would be amazing! Feel free to ask any questions if you need clarifications.

  • I think you are missing MAYBE in your 2nd replacement function
    – prosoitos, Nov 27 '18 at 18:53

  • Why don't you just go from the original 0/1 type data straight to the NO and YES data? Why the intermediate step?
    – John Paul, Nov 27 '18 at 18:54

  • Sorry @prosoitos that was a typo! Fixed it
    – Brian, Nov 27 '18 at 18:56

  • And if you use dplyr::case_when you don't have to create a function and can apply the change directly
    – prosoitos, Nov 27 '18 at 18:58

  • @Brian - just do dput(head(test))
    – fugu, Nov 27 '18 at 19:24
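
As an illustration of the suggestion in the comments to skip the intermediate numeric step, the recoding can be done in one pass with a named lookup vector in base R. This is only a minimal sketch, not the asker's or any commenter's code: the names `genotypes`, `genotype_labels` and `sample_cols` are made up for the example, and the data frame simply mirrors the sample table in the question.

```r
# Hypothetical example data mirroring the table in the question.
genotypes <- data.frame(
  Location = c("chr1:1234", "chr2:5678", "chr3:2345", "chr4:6789"),
  sample1  = c("0/1", "0/0", "1/1", "0/1"),
  sample2  = c("1/1", "0/0", "1/1", "1/1"),
  sample3  = c("0/0", "0/0", "1/1", "./."),
  stringsAsFactors = FALSE  # keep genotype codes as character, not factor
)

# Named lookup vector: names are genotype codes, values are the labels.
genotype_labels <- c("0/0" = "NO", "0/1" = "YES", "1/1" = "YES", "./." = "MAYBE")

# Recode every column except Location by indexing the lookup vector
# with the column's values; unname() drops the genotype codes as names.
sample_cols <- setdiff(names(genotypes), "Location")
recoded <- genotypes
recoded[sample_cols] <- lapply(
  genotypes[sample_cols],
  function(column) unname(genotype_labels[column])
)

print(recoded)
```

Any genotype code that is not listed in `genotype_labels` becomes `NA` under this approach, so it is worth checking for unexpected codes before relying on the result.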
















Tags: r, dplyr

edited Nov 27 '18 at 18:56 by Brian
asked Nov 27 '18 at 18:45 by Brian

1 Answer

library(tidyverse)

Recreate your data:

df <- tibble(
  Location = letters[1:4],
  sample1 = c("0/1", "0/0", "1/1", "0/1"),
  sample2 = c("1/1", "0/0", "1/1", "1/1"),
  sample3 = c("0/0", "0/0", "1/1", "./.")
)

Code:

# Recode every column except Location.
# A formula lambda is used here because funs() is deprecated since dplyr 0.8.0.
df %>% mutate_at(
  vars(-Location),
  ~ case_when(
    . == "1/1" | . == "0/1" ~ "YES",
    . == "0/0" ~ "NO",
    . == "./." ~ "MAYBE"
  )
)

Result:

# A tibble: 4 x 4
  Location sample1 sample2 sample3
  <chr>    <chr>   <chr>   <chr>
1 a        YES     YES     NO
2 b        NO      NO      NO
3 c        YES     YES     YES
4 d        YES     YES     MAYBE
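
For the full-size data described in the question, holding every row in memory at once may not be feasible, so one option is to stream the file in blocks and write the recoded rows out as you go. The following is a rough base R sketch of that idea and is not part of the original answer. It assumes the data sits in a tab-delimited text file with a header line and `Location` as the first column; `recode_file`, `chunk_lines` and the file paths are hypothetical names chosen for the example, and it has not been benchmarked at that scale.

```r
# Named lookup vector: genotype code -> label.
genotype_labels <- c("0/0" = "NO", "0/1" = "YES", "1/1" = "YES", "./." = "MAYBE")

# Recode a tab-delimited genotype file block by block.
# `chunk_lines` is a hypothetical tuning parameter: how many rows to hold in memory at once.
recode_file <- function(in_path, out_path, chunk_lines = 100000L) {
  input <- file(in_path, open = "r")
  output <- file(out_path, open = "w")
  on.exit({ close(input); close(output) })

  # Copy the header line through unchanged.
  writeLines(readLines(input, n = 1L), output)

  repeat {
    lines <- readLines(input, n = chunk_lines)
    if (length(lines) == 0L) break  # end of file reached

    # Split each line into fields: the first is Location, the rest are genotypes.
    fields <- strsplit(lines, "\t", fixed = TRUE)
    recoded <- vapply(fields, function(row) {
      row[-1] <- genotype_labels[row[-1]]  # unknown codes become NA
      paste(row, collapse = "\t")
    }, character(1))
    writeLines(recoded, output)
  }
  invisible(out_path)
}

# Small demonstration on a temporary file (assumed layout, two data rows).
in_path <- tempfile(fileext = ".tsv")
out_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "Location\tsample1\tsample2\tsample3",
  "chr1:1234\t0/1\t1/1\t0/0",
  "chr4:6789\t0/1\t1/1\t./."
), in_path)
recode_file(in_path, out_path, chunk_lines = 1L)
result_lines <- readLines(out_path)
print(result_lines)
```

Because each block is processed independently, memory use is bounded by `chunk_lines` rather than by the size of the file; the trade-off is that the result lives on disk rather than in a data frame.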





  • Sorry for the late response. It took me a bit to get a subdataset to test this one. Works great. I am going to put it to the test later today!
    – Brian, Nov 28 '18 at 13:27

answered Nov 27 '18 at 19:20 by prosoitos