Number duplicate count
I have a dataframe:
df <- data.frame(sample = c('S1', 'S1', 'S2', 'S3', 'S4', 'S4'), event = c(1,1,4,2,3,12), start = c(100, 20, 30, 500, 300, 200), end = c(350, 480, 60, 700, 300, 200))
sample event start end
S1 1 100 350
S1 1 20 480
S2 4 30 60
S3 2 500 700
S4 3 300 300
S4 12 200 200
I want to count the number of distinct events in each sample, and mutate the sample name to reflect this.
For example, sample S4 has two distinct events, 3 and 12. Here I would want to achieve this result:
sample event start end
S1 1 100 350
S1 1 20 480
S2 4 30 60
S3 2 500 700
S4.1 3 300 300
S4.2 12 200 200
Here's what I'm trying, which instead produces S4.2 and S4.2:
df %>%
  group_by(sample) %>%
  dplyr::mutate(event_count = n_distinct(event)) %>%
  dplyr::mutate(sample_mod = as.character(ifelse(event_count == 1, as.character(sample), paste(sample, event_count, sep = '.'))))
sample event start end event_count sample_mod
1 S1 1 100 350 1 S1
2 S1 1 20 480 1 S1
3 S2 4 30 60 1 S2
4 S3 2 500 700 1 S3
5 S4 3 300 300 2 S4.2
6 S4 12 200 200 2 S4.2
How can I modify this mid-pipe to achieve my desired output?
r dplyr
asked Nov 21 at 17:22
fugu
2 Answers
Accepted answer:
After grouping by 'sample', get the number of distinct elements in 'event', then use that count as a logical condition to modify the values in 'sample' to unique values (with make.unique):
df %>%
  group_by(sample) %>%
  mutate(n = n_distinct(event)) %>%
  ungroup() %>%
  mutate(sample_mod = case_when(n > 1 ~ make.unique(as.character(sample)),
                                TRUE ~ as.character(sample)))
# A tibble: 6 x 6
# sample event start end n sample_mod
# <fct> <dbl> <dbl> <dbl> <int> <chr>
#1 S1 1 100 350 1 S1
#2 S1 1 20 480 1 S1
#3 S2 4 30 60 1 S2
#4 S3 2 500 700 1 S3
#5 S4 3 300 300 2 S4
#6 S4 12 200 200 2 S4.1
edited Nov 21 at 17:33
answered Nov 21 at 17:25
akrun
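The output above labels the two S4 rows S4 and S4.1, whereas the question asks for S4.1 and S4.2. A minimal sketch of one way to get that numbering inside a dplyr pipe, assuming events may be numbered by order of first appearance within each sample. The match()/unique() indexing and the name res are illustrative choices, not part of either answer:

```r
library(dplyr)

df <- data.frame(
  sample = c("S1", "S1", "S2", "S3", "S4", "S4"),
  event  = c(1, 1, 4, 2, 3, 12),
  start  = c(100, 20, 30, 500, 300, 200),
  end    = c(350, 480, 60, 700, 300, 200),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(sample) %>%
  dplyr::mutate(
    # Only rename samples that contain more than one distinct event.
    sample_mod = if (n_distinct(event) > 1) {
      # match() against unique() numbers events 1, 2, ... by first appearance.
      paste(sample, match(event, unique(event)), sep = ".")
    } else {
      sample
    }
  ) %>%
  ungroup()

res$sample_mod
```

Because the suffix comes from the event value rather than the row position, repeated events inside a multi-event sample would share a suffix under this sketch.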
But that also renames S1 as S1 and S1.1. I don't want to do this as these are not distinct events
– fugu, Nov 21 at 17:27
@fugu Please check the output. It is not renaming S1
– akrun, Nov 21 at 17:28
@fugu Are you sure that you applied the code correctly as I am not able to replicate the issue you showed
– akrun, Nov 21 at 17:29
@fugu Sorry, I can't replicate the issue. Are you loading plyr also with dplyr. Then use dplyr::mutate
– akrun, Nov 21 at 17:34
That was indeed the issue (re: reproducibility)
– fugu, Nov 21 at 17:35
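The renaming of S1 reported in the comments can be reproduced without plyr, on the assumption (my reading of the thread, not something it confirms in detail) that a masked mutate simply ignores the grouping. The sketch below drops group_by to imitate that, so n_distinct sees the whole event column; the name ungrouped is hypothetical:

```r
library(dplyr)

df <- data.frame(
  sample = c("S1", "S1", "S2", "S3", "S4", "S4"),
  event  = c(1, 1, 4, 2, 3, 12),
  stringsAsFactors = FALSE
)

# No group_by(): n is the distinct-event count of the whole column (5),
# so the n > 1 branch applies to every row, including both S1 rows.
ungrouped <- df %>%
  dplyr::mutate(n = n_distinct(event)) %>%
  dplyr::mutate(sample_mod = case_when(n > 1 ~ make.unique(sample),
                                       TRUE ~ sample))

ungrouped$sample_mod
```

Calling dplyr::mutate explicitly, as suggested in the comments, keeps the grouped behaviour even when another package that exports mutate is attached later.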
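library(data.table)
setDT(df)  # convert df to a data.table by reference
# Within each sample, take rows in event order and number each run of equal events
df[order(event),
   sample := {
     rid <- rleid(event)
     # Append the run id only when the sample has more than one distinct event
     if (any(rid > 1)) paste0(sample, '.', rid) else sample
   },
   by = sample]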
df
# sample event start end
# 1: S1 1 100 350
# 2: S1 1 20 480
# 3: S2 4 30 60
# 4: S3 2 500 700
# 5: S4.1 3 300 300
# 6: S4.2 12 200 200
Data used (note stringsAsFactors = F):
df <- data.frame(sample = c('S1', 'S1', 'S2', 'S3', 'S4', 'S4'), event = c(1,1,4,2,3,12), start = c(100, 20, 30, 500, 300, 200), end = c(350, 480, 60, 700, 300, 200), stringsAsFactors = F)
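As a small illustration of why the answer sorts by event before calling rleid: rleid numbers runs of consecutive equal values, so a value that reappears after a different one starts a new run. The vector events and the two result names are made up for this sketch:

```r
library(data.table)

events <- c(3, 12, 3)  # hypothetical events for one sample, with 3 repeated

# Unsorted: the second 3 starts a new run, giving three ids.
unsorted_ids <- rleid(events)

# Sorted: equal events are adjacent, so they share one id.
sorted_ids <- rleid(sort(events))

unsorted_ids
sorted_ids
```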
Benchmark:
dt <- function(df){
  setDT(df)
  df[order(event)
     , sample := {
       rid <- rleid(event)
       if(any(rid > 1)) paste0(sample, '.', rid)
       else sample }
     , by = sample]
}
dply <- function(df){
  df %>%
    group_by(sample) %>%
    mutate(n = n_distinct(event)) %>%
    ungroup %>%
    mutate(sample = case_when(n > 1 ~ make.unique(as.character(sample)),
                              TRUE ~ as.character(sample)))
}
df <- rbindlist(replicate(1000, df, simplify = F))
microbenchmark::microbenchmark(dt(df), dply(df))
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt(df) 1.750972 1.970664 2.332920 2.075279 2.391176 8.306448 100
# dply(df) 5.982349 6.277939 7.046036 6.566759 7.036501 15.112181 100
edited Nov 21 at 17:48
answered Nov 21 at 17:41
IceCreamToucan