How to efficiently edit massive amounts of data

Here is an example dataset. I'll show what I am doing to my data and then explain what I am struggling with. I apologize if the title isn't an accurate description; I tried my best, but I am a bit new at this. Feel free to change it to something more fitting if needed.

Location   sample1  sample2  sample3
chr1:1234  0/1      1/1      0/0
chr2:5678  0/0      0/0      0/0
chr3:2345  1/1      1/1      1/1
chr4:6789  0/1      1/1      ./.

I use this to convert each value to YES, NO, or MAYBE:

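# Step 1: map genotype strings to codes (apply() passes character columns,
# so the codes end up stored as the strings "0.1", "0", "1", "2")
replacement <- function(x) {
  x <- replace(x, which(x == './.'), 0.1)
  x <- replace(x, which(x == '0/0'), 0)
  x <- replace(x, which(x == '0/1'), 1)
  x <- replace(x, which(x == '1/1'), 2)
  x  # return the modified column explicitly
}

test <- apply(test.data.set, 2, replacement)
test.data.2 <- as.data.frame(test)

# Step 2: map the codes to YES / NO / MAYBE labels
replacement <- function(x) {
  x <- replace(x, which(x == '0.1'), "MAYBE")
  x <- replace(x, which(x == '0'), "NO")
  x <- replace(x, which(x == '1'), "YES")
  x <- replace(x, which(x == '2'), "YES")
  x  # return the modified column explicitly
}

test.data.3 <- apply(test.data.2, 2, replacement)
test.data.4 <- as.data.frame(test.data.3)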

Dataset after running:

Location   sample1  sample2  sample3
chr1:1234  YES      YES      NO
chr2:5678  NO       NO       NO
chr3:2345  YES      YES      YES
chr4:6789  YES      YES      MAYBE

What I wrote above currently works for me. However, I have a new dataset that contains about 300 samples (columns) and, I'm not even sure, easily 500 million rows, so I need to alter over a billion "cells". I tried running this on a cluster with 256 GB of memory and it just times out. I know what I wrote above is far from the "smoothest" way of altering my data. Does anyone have suggestions to streamline this process? I feel like dplyr must have some way to do this.

Any help would be amazing! Feel free to ask any questions if you need clarifications.

edited Nov 27 '18 at 18:56

asked Nov 27 '18 at 18:45

Brian


2

I think you are missing MAYBE in your 2nd replacement function

– prosoitos
Nov 27 '18 at 18:53

2

Why don't you just go from the original 0/1 type data straight to the NO and YES data? Why the intermediate step?

– John Paul
Nov 27 '18 at 18:54

1

Sorry @prosoitos that was a typo! Fixed it

– Brian
Nov 27 '18 at 18:56

1

And if you use dplyr::case_when you don't have to create a function and can apply the change directly

– prosoitos
Nov 27 '18 at 18:58

2

@Brian - just do dput(head(test))

– fugu
Nov 27 '18 at 19:24
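
John Paul's comment above suggests skipping the intermediate numeric codes. A minimal base R sketch of that idea, assuming the question's data sits in a data frame with a `Location` column, is a single named-vector lookup per column. The names `lookup`, `sample_cols`, and `converted` are illustrative and do not come from the original post.

```r
# Stand-in for the question's data (the real object is test.data.set).
test.data.set <- data.frame(
  Location = c("chr1:1234", "chr2:5678", "chr3:2345", "chr4:6789"),
  sample1 = c("0/1", "0/0", "1/1", "0/1"),
  sample2 = c("1/1", "0/0", "1/1", "1/1"),
  sample3 = c("0/0", "0/0", "1/1", "./."),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: genotype string -> label, in one step.
lookup <- c("0/0" = "NO", "0/1" = "YES", "1/1" = "YES", "./." = "MAYBE")

# Every column except Location is treated as a sample column (assumption).
sample_cols <- setdiff(names(test.data.set), "Location")

converted <- test.data.set
# Index the lookup vector by name; values not in the table become NA.
converted[sample_cols] <- lapply(
  test.data.set[sample_cols],
  function(x) unname(lookup[x])
)

print(converted)
```

Each column is scanned once instead of eight times, and the intermediate `as.data.frame()` conversions are not needed.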

r dplyr

1 Answer

library(tidyverse)

Recreate your data:

df <- tibble(
  Location = letters[1:4],
  sample1 = c("0/1", "0/0", "1/1", "0/1"),
  sample2 = c("1/1", "0/0", "1/1", "1/1"),
  sample3 = c("0/0", "0/0", "1/1", "./.")
)

Code:

df %>% mutate_at(
  vars(-Location),
  funs(case_when(
    . == "1/1" | . == "0/1" ~ "YES",
    . == "0/0" ~ "NO",
    . == "./." ~ "MAYBE"
  ))
)

Result:

# A tibble: 4 x 4
  Location sample1 sample2 sample3
  <chr>    <chr>   <chr>   <chr>
1 a        YES     YES     NO
2 b        NO      NO      NO
3 c        YES     YES     YES
4 d        YES     YES     MAYBE
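
In more recent dplyr releases, `funs()` is deprecated and `mutate_at()` has been superseded by `across()`. Assuming dplyr 1.0.0 or later is installed, an equivalent of the code above could look like the following sketch. The name `result` is only for illustration.

```r
library(dplyr)

# Same example data as in the answer.
df <- tibble(
  Location = letters[1:4],
  sample1 = c("0/1", "0/0", "1/1", "0/1"),
  sample2 = c("1/1", "0/0", "1/1", "1/1"),
  sample3 = c("0/0", "0/0", "1/1", "./.")
)

# Recode every column except Location with the same case_when() rules.
result <- df %>%
  mutate(across(
    -Location,
    ~ case_when(
      .x %in% c("0/1", "1/1") ~ "YES",
      .x == "0/0" ~ "NO",
      .x == "./." ~ "MAYBE"
    )
  ))

print(result)
```

Any value that matches none of the three rules comes back as `NA`, which makes unexpected genotype strings easy to spot.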

answered Nov 27 '18 at 19:20

prosoitos


1

Sorry for the late response. It took me a bit to get a subdataset to test this one. Works great. I am going to put it to the test later today!

– Brian
Nov 28 '18 at 13:27
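
For the full-size data described in the question, copying the whole table at each step may be part of the cost. One option that might help is to overwrite each column by reference with `set()` from the data.table package. This is not part of the answer above and has not been benchmarked here. The sketch assumes data.table is installed, and `dt` and `lookup` are hypothetical names.

```r
library(data.table)

# Small stand-in for the real data; column names follow the answer's example.
dt <- data.table(
  Location = letters[1:4],
  sample1 = c("0/1", "0/0", "1/1", "0/1"),
  sample2 = c("1/1", "0/0", "1/1", "1/1"),
  sample3 = c("0/0", "0/0", "1/1", "./.")
)

# Hypothetical lookup table: genotype string -> label.
lookup <- c("0/0" = "NO", "0/1" = "YES", "1/1" = "YES", "./." = "MAYBE")

# Overwrite one column at a time, in place, so dt itself is never copied.
for (col in setdiff(names(dt), "Location")) {
  set(dt, j = col, value = unname(lookup[dt[[col]]]))
}

print(dt)
```

Only one recoded column exists as a temporary at any time, which keeps peak memory closer to the size of the table itself. Whether that is enough for 500 million rows depends on how the file is read in, which this sketch does not address.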
