Object size increases hugely when transposing a data frame
I have a data frame of ca. 50,000 RNA transcripts in rows, with 10,000 different samples in columns. The size of the data frame is 4.9GB.
I then have to transpose the data in order to subset it properly later:
df <- data.frame(t(df))
After the transpose, the object size has ballooned to 70GB. Why is this happening? Should transposing data really change the file size that much?
str() of the first 20 columns:
str(df[1:20])
Classes 'tbl_df', 'tbl' and 'data.frame': 56202 obs. of 20 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
$ Description : chr "DDX11L1" "WASH7P" "MIR1302-11" "FAM138A" ...
$ GTEX-1117F-0226-SM-5GZZ7: num 0.1082 21.4 0.1602 0.0505 0 ...
$ GTEX-111CU-1826-SM-5GZYN: num 0.1158 11.03 0.0643 0 0 ...
$ GTEX-111FC-0226-SM-5N9B8: num 0.021 16.75 0.0467 0.0295 0 ...
$ GTEX-111VG-2326-SM-5N9BK: num 0.0233 8.172 0 0.0326 0 ...
$ GTEX-111YS-2426-SM-5GZZQ: num 0 7.658 0.0586 0 0 ...
$ GTEX-1122O-2026-SM-5NQ91: num 0.0464 9.372 0 0 0 ...
$ GTEX-1128S-2126-SM-5H12U: num 0.0308 10.08 0.1367 0.0861 0.1108 ...
$ GTEX-113IC-0226-SM-5HL5C: num 0.0936 13.56 0.2079 0.131 0.0562 ...
$ GTEX-117YX-2226-SM-5EGJJ: num 0.121 9.889 0.0537 0.0677 0 ...
$ GTEX-11DXW-0326-SM-5H11W: num 0.0286 9.121 0.0635 0 0 ...
$ GTEX-11DXX-2326-SM-5Q5A2: num 0 6.698 0.0508 0.032 0 ...
$ GTEX-11DZ1-0226-SM-5A5KF: num 0.0237 9.835 0 0.0664 0 ...
$ GTEX-11EI6-0226-SM-5EQ64: num 0.0802 13.1 0 0 0 ...
$ GTEX-11EM3-2326-SM-5H12B: num 0.0223 8.904 0.0496 0.0625 0.0402 ...
$ GTEX-11EMC-2826-SM-5PNY6: num 0.0189 16.59 0 0.0265 0.034 ...
$ GTEX-11EQ8-0226-SM-5EQ5G: num 0.0931 15.1 0.0689 0.0869 0 ...
$ GTEX-11EQ9-2526-SM-5HL66: num 0.0777 9.838 0 0 0 ...
r dataframe memory transpose
asked yesterday by Phil D (new contributor); edited yesterday by Henrik
1 Answer (accepted)
First, you write that:
I then have to transpose this dataset in order to subset it properly later,
To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think it could be of general interest to dissect the issue.
The increase in object size is most likely because the class of the object changed when it was transposed, and objects of different classes have different sizes.
I will try to illustrate this with some examples. We begin with the change of class.
Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:
# set number of rows and columns
nr <- 5
nc <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
                y = sample(letters, nr, replace = TRUE),
                matrix(runif(nr * nc), nrow = nr),
                stringsAsFactors = FALSE)
Transpose it:
d_t <- t(d)
Check the structure (str) of the original data and its transposed sibling:
str(d)
# 'data.frame': 5 obs. of 7 variables:
# $ x : chr "g" "j" "o" "x" ...
# $ y : chr "x" "y" "r" "q" ...
# $ X1: num 0.206 0.177 0.687 0.384 0.77
# $ X2: num 0.498 0.718 0.992 0.38 0.777
# $ X3: num 0.935 0.212 0.652 0.126 0.267
# $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
# $ X5: num 0.482 0.6 0.494 0.186 0.827
str(d_t)
# chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "x" "y" "X1" "X2" ...
# ..$ : NULL
The data frame has become a character matrix. How did this happen? Well, check the help text for the transpose method for data frames, ?t.data.frame:
A data frame is first coerced to a matrix: see as.matrix.
OK, see ?as.matrix:
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]
Whereas a data frame is a list where each column can be of a different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix when you transpose it with t. You then coerce the matrix to a data frame, in which all columns are character (or factor, depending on your stringsAsFactors setting); check str(data.frame(d_t)).
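To see the coercion rule in isolation, here is a minimal sketch with two made-up data frames (num_only and mixed are illustrative names, not taken from the question). One has only numeric columns, the other has one additional character column:

```r
# Only numeric columns: the transpose stays numeric
num_only <- data.frame(a = c(1.5, 2), b = c(3, 4))

# One character column is enough to turn everything into character
mixed <- data.frame(id = c("x", "y"), a = c(1.5, 2),
                    stringsAsFactors = FALSE)

typeof(t(num_only))  # "double"
typeof(t(mixed))     # "character"
```

So it is the presence of the character columns, not the transpose itself, that changes the storage type.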
In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:
# original data frame
object.size(d)
# 2360 bytes
# transposed df - a character matrix
object.size(d_t)
# 3280 bytes
The transposed object is clearly larger. If we increase the number of rows and the number of numeric columns to mimic your data better, the relative difference is even larger:
nr <- 56202
nc <- 20
# (re-run the code above that creates d and d_t with these dimensions)
object.size(d)
# 9897712 bytes
object.size(d_t)
# 78299656 bytes
Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:
onedigit_int <- sample(1:9, 1e4, replace = TRUE)
onedigit_num <- as.numeric(onedigit_int)
onedigit_char <- as.character(onedigit_int)
object.size(onedigit_int)
# 40048 bytes
object.size(onedigit_num)
# 80048 bytes
object.size(onedigit_char)
# 80552 bytes
For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors about 8 bytes per element. The single-character vector requires hardly more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with multi-digit vectors (which you seem to have) and their corresponding vectors of multi-character strings:
multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
multidigit_num <- as.numeric(multidigit_int)
multidigit_char <- as.character(multidigit_int)
object.size(multidigit_int)
# 40048 bytes
object.size(multidigit_num)
# 80048 bytes
object.size(multidigit_char)
# 637360 bytes
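The integer vector still occupies 4 bytes for each element, and the numeric vector still occupies 8 bytes for each element. However, the character vector now takes roughly 64 bytes per element: in addition to the 8-byte pointer, each distinct string has to be stored, and nearly all of these multi-digit strings are distinct (the one-digit vector had only nine distinct values).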
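As a rough illustration of where the extra bytes come from: as far as I understand it (Advanced R describes this as the global string pool), R stores each distinct string only once, and a character vector holds pointers to those strings. The sketch below uses two made-up vectors of equal length, one with two distinct strings and one with all strings distinct. The exact sizes depend on your platform, so none are shown.

```r
n <- 1e4

# Two distinct strings, recycled to length n
few_distinct <- rep(c("0.1082", "21.4"), length.out = n)

# n distinct strings of similar length (made-up values)
all_distinct <- sprintf("0.%07d", seq_len(n))

size_few <- as.numeric(object.size(few_distinct))
size_all <- as.numeric(object.size(all_distinct))

size_few / n  # bytes per element when almost nothing is distinct
size_all / n  # bytes per element when every string is distinct
```

Expression values written out as text are almost all distinct, which puts a transposed expression table close to the second case.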
Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.
Transposing a data frame with columns of different classes is very rarely sensible. And if all columns are of the same class, then we may just as well use a matrix from the start.
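If a transpose really is needed, one possible approach is to keep the annotation columns aside and transpose only the numeric part as a matrix. This is a minimal sketch with a made-up table; the column names Name and Description follow the question, everything else (the values, s1, s2) is hypothetical:

```r
# Toy expression table: annotation columns plus numeric sample columns
d <- data.frame(Name = c("g1", "g2", "g3"),
                Description = c("a", "b", "c"),
                s1 = c(0.10, 21.4, 0.16),
                s2 = c(0.12, 11.0, 0.06),
                stringsAsFactors = FALSE)

# Select the numeric columns only and convert them to a numeric matrix
num_cols <- vapply(d, is.numeric, logical(1))
m <- as.matrix(d[num_cols])
rownames(m) <- d$Name

# The transpose of a numeric matrix is still numeric
m_t <- t(m)
typeof(m_t)  # "double"
dim(m_t)     # 2 samples in rows, 3 genes in columns
```

The annotation (Name, Description) stays in the original data frame and can be looked up via the column names of m_t.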
Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham.
Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.
– Phil D
yesterday
Hi @PhilD Thanks for your feedback. Regarding your need (or not ;) ) for transpose, without seeing your data and a more thorough description of what you are trying to achieve, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds like an alternative you may consider is to melt your data from wide to long. Then merge/subset. Cheers
– Henrik
yesterday
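A minimal sketch of the wide-to-long idea from the comment above, written in base R so that it runs without extra packages. All names and values here (expr, samples, S1 to S3, the tissue labels) are made up for illustration, and the real sample annotation file may look different:

```r
# Made-up expression table: genes in rows, samples in columns
expr <- data.frame(Name = c("ENSG_A", "ENSG_B"),
                   S1 = c(0.11, 21.4),
                   S2 = c(0.12, 11.0),
                   S3 = c(0.02, 16.8),
                   stringsAsFactors = FALSE)

# Made-up sample annotation with the tissue type of each sample
samples <- data.frame(sample = c("S1", "S2", "S3"),
                      tissue = c("Lung", "Liver", "Lung"),
                      stringsAsFactors = FALSE)

# Wide to long: one row per gene/sample combination
sample_cols <- setdiff(names(expr), "Name")
long <- data.frame(Name = rep(expr$Name, times = length(sample_cols)),
                   sample = rep(sample_cols, each = nrow(expr)),
                   value = unlist(expr[sample_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Merge in the tissue type, then subset on it
merged <- merge(long, samples, by = "sample")
lung_long <- merged[merged$tissue == "Lung", ]

# Alternative without reshaping: pick the matching sample columns directly
lung_ids <- samples$sample[samples$tissue == "Lung"]
lung_wide <- expr[, c("Name", lung_ids)]
```

With tens of thousands of genes and samples the long table gets very large, so the column-selection variant at the end may be the more practical of the two. In both cases the expression values stay numeric.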
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
5
down vote
accepted
First, you write that:
I then have to transpose this dataset in order to subset it properly later,
To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.
The increase in object size is most likely due to that the class
of the object before and after transposing has changed, together with the fact that objects of different class have different size.
I will try to illustrate this with some examples. We begin with the change of class.
Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:
# set number of rows and columns
nr <- 5
nc <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
y = sample(letters, nr, replace = TRUE),
matrix(runif(nr * nc), nrow = nr),
stringsAsFactors = FALSE)
Transpose it:
d_t <- t(d)
Check the str
ucture of the original data and its transposed sibling:
str(d)
# 'data.frame': 5 obs. of 7 variables:
# $ x : chr "g" "j" "o" "x" ...
# $ y : chr "x" "y" "r" "q" ...
# $ X1: num 0.206 0.177 0.687 0.384 0.77
# $ X2: num 0.498 0.718 0.992 0.38 0.777
# $ X3: num 0.935 0.212 0.652 0.126 0.267
# $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
# $ X5: num 0.482 0.6 0.494 0.186 0.827
str(d_t)
# chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "x" "y" "X1" "X2" ...
# ..$ : NULL
The data frame has became a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame
:
A data frame is first coerced to a matrix: see
as.matrix
.
OK, see ?as.matrix
:
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]
Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of t
ranspose. Then you coerce the matrix to data frame, where all columns are character (or factor
, depending on your stringsAsFactors
setting) - check str(data.frame(d_t))
.
In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:
# original data frame
object.size(d)
# 2360 bytes
# transposed df - a character matrix
object.size(d_t)
# 3280 bytes
The transposed object is clearly larger. If we increase the number rows and the number of numeric columns to mimic your data better, the relative difference is even larger:
nr <- 56202
nc <- 20
object.size(d)
# 9897712 bytes
object.size(d_t)
# 78299656 bytes
Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer
, numeric
, and character
vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:
onedigit_int <- sample(1:9, 1e4, replace = TRUE)
onedigit_num <- as.numeric(onedigit_int)
onedigit_char <- as.character(onedigit_int)
object.size(onedigit_int)
# 40048 bytes
object.size(onedigit_num)
# 80048 bytes
object.size(onedigit_char)
# 80552 bytes
For the single digits/characters, integer
vectors occupy 4 bytes per element, and numeric
and character
vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:
multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
multidigit_num <- as.numeric(multidigit_int)
multidigit_char <- as.character(multidigit_int)
object.size(multidigit_int)
# 40048 bytes
object.size(multidigit_num)
# 80048 bytes
object.size(multidigit_char)
# 637360 bytes
The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.
Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.
Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.
Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham
Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.
– Phil D
yesterday
1
Hi @PhilD Thanks for your feedback. Regarding your need (or not ;) ) for transpose, without seeing your data and a more thorough description on what you try to aceive, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds that an alternative you may consider is to melt your data from wide to long. Then merge/subset. Cheers
– Henrik
yesterday
add a comment |
up vote
5
down vote
accepted
First, you write that:
I then have to transpose this dataset in order to subset it properly later,
To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.
The increase in object size is most likely due to that the class
of the object before and after transposing has changed, together with the fact that objects of different class have different size.
I will try to illustrate this with some examples. We begin with the change of class.
Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:
# set number of rows and columns
nr <- 5
nc <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
y = sample(letters, nr, replace = TRUE),
matrix(runif(nr * nc), nrow = nr),
stringsAsFactors = FALSE)
Transpose it:
d_t <- t(d)
Check the str
ucture of the original data and its transposed sibling:
str(d)
# 'data.frame': 5 obs. of 7 variables:
# $ x : chr "g" "j" "o" "x" ...
# $ y : chr "x" "y" "r" "q" ...
# $ X1: num 0.206 0.177 0.687 0.384 0.77
# $ X2: num 0.498 0.718 0.992 0.38 0.777
# $ X3: num 0.935 0.212 0.652 0.126 0.267
# $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
# $ X5: num 0.482 0.6 0.494 0.186 0.827
str(d_t)
# chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "x" "y" "X1" "X2" ...
# ..$ : NULL
The data frame has became a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame
:
A data frame is first coerced to a matrix: see
as.matrix
.
OK, see ?as.matrix
:
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]
Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of t
ranspose. Then you coerce the matrix to data frame, where all columns are character (or factor
, depending on your stringsAsFactors
setting) - check str(data.frame(d_t))
.
In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:
# original data frame
object.size(d)
# 2360 bytes
# transposed df - a character matrix
object.size(d_t)
# 3280 bytes
The transposed object is clearly larger. If we increase the number rows and the number of numeric columns to mimic your data better, the relative difference is even larger:
nr <- 56202
nc <- 20
object.size(d)
# 9897712 bytes
object.size(d_t)
# 78299656 bytes
Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer
, numeric
, and character
vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:
onedigit_int <- sample(1:9, 1e4, replace = TRUE)
onedigit_num <- as.numeric(onedigit_int)
onedigit_char <- as.character(onedigit_int)
object.size(onedigit_int)
# 40048 bytes
object.size(onedigit_num)
# 80048 bytes
object.size(onedigit_char)
# 80552 bytes
For the single digits/characters, integer
vectors occupy 4 bytes per element, and numeric
and character
vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:
multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
multidigit_num <- as.numeric(multidigit_int)
multidigit_char <- as.character(multidigit_int)
object.size(multidigit_int)
# 40048 bytes
object.size(multidigit_num)
# 80048 bytes
object.size(multidigit_char)
# 637360 bytes
The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.
Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.
Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.
Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham
Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.
– Phil D
yesterday
1
Hi @PhilD Thanks for your feedback. Regarding your need (or not ;) ) for transpose, without seeing your data and a more thorough description on what you try to aceive, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds that an alternative you may consider is to melt your data from wide to long. Then merge/subset. Cheers
– Henrik
yesterday
add a comment |
up vote
5
down vote
accepted
up vote
5
down vote
accepted
First, you write that:
I then have to transpose this dataset in order to subset it properly later,
To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.
The increase in object size is most likely due to that the class
of the object before and after transposing has changed, together with the fact that objects of different class have different size.
I will try to illustrate this with some examples. We begin with the change of class.
Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:
# set number of rows and columns
nr <- 5
nc <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
y = sample(letters, nr, replace = TRUE),
matrix(runif(nr * nc), nrow = nr),
stringsAsFactors = FALSE)
Transpose it:
d_t <- t(d)
Check the str
ucture of the original data and its transposed sibling:
str(d)
# 'data.frame': 5 obs. of 7 variables:
# $ x : chr "g" "j" "o" "x" ...
# $ y : chr "x" "y" "r" "q" ...
# $ X1: num 0.206 0.177 0.687 0.384 0.77
# $ X2: num 0.498 0.718 0.992 0.38 0.777
# $ X3: num 0.935 0.212 0.652 0.126 0.267
# $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
# $ X5: num 0.482 0.6 0.494 0.186 0.827
str(d_t)
# chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "x" "y" "X1" "X2" ...
# ..$ : NULL
The data frame has became a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame
:
A data frame is first coerced to a matrix: see
as.matrix
.
OK, see ?as.matrix
:
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]
Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of t
ranspose. Then you coerce the matrix to data frame, where all columns are character (or factor
, depending on your stringsAsFactors
setting) - check str(data.frame(d_t))
.
In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:
# original data frame
object.size(d)
# 2360 bytes
# transposed df - a character matrix
object.size(d_t)
# 3280 bytes
The transposed object is clearly larger. If we increase the number rows and the number of numeric columns to mimic your data better, the relative difference is even larger:
nr <- 56202
nc <- 20
object.size(d)
# 9897712 bytes
object.size(d_t)
# 78299656 bytes
Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer
, numeric
, and character
vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:
onedigit_int <- sample(1:9, 1e4, replace = TRUE)
onedigit_num <- as.numeric(onedigit_int)
onedigit_char <- as.character(onedigit_int)
object.size(onedigit_int)
# 40048 bytes
object.size(onedigit_num)
# 80048 bytes
object.size(onedigit_char)
# 80552 bytes
For the single digits/characters, integer
vectors occupy 4 bytes per element, and numeric
and character
vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:
multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
multidigit_num <- as.numeric(multidigit_int)
multidigit_char <- as.character(multidigit_int)
object.size(multidigit_int)
# 40048 bytes
object.size(multidigit_num)
# 80048 bytes
object.size(multidigit_char)
# 637360 bytes
The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.
Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.
Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.
Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham
First, you write that:
I then have to transpose this dataset in order to subset it properly later,
To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.
The increase in object size is most likely due to that the class
of the object before and after transposing has changed, together with the fact that objects of different class have different size.
I will try to illustrate this with some examples. We begin with the change of class.
Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:
# set number of rows and columns
nr <- 5
nc <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
y = sample(letters, nr, replace = TRUE),
matrix(runif(nr * nc), nrow = nr),
stringsAsFactors = FALSE)
Transpose it:
d_t <- t(d)
Check the str
ucture of the original data and its transposed sibling:
str(d)
# 'data.frame': 5 obs. of 7 variables:
# $ x : chr "g" "j" "o" "x" ...
# $ y : chr "x" "y" "r" "q" ...
# $ X1: num 0.206 0.177 0.687 0.384 0.77
# $ X2: num 0.498 0.718 0.992 0.38 0.777
# $ X3: num 0.935 0.212 0.652 0.126 0.267
# $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
# $ X5: num 0.482 0.6 0.494 0.186 0.827
str(d_t)
# chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "x" "y" "X1" "X2" ...
# ..$ : NULL
The data frame has became a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame
:
A data frame is first coerced to a matrix: see
as.matrix
.
OK, see ?as.matrix
:
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]
Whereas a data frame is a list where each column can be of a different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of the transpose (t). Then you coerce the matrix to a data frame, where all columns are character (or factor, depending on your stringsAsFactors setting); check str(data.frame(d_t)).
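The coercion rule can be checked directly. The following is a minimal sketch on a small hypothetical data frame (one character column, three numeric columns; the names d, is_num and m_t are only for illustration). It shows that a single character column is enough to turn the whole matrix into character, while coercing only the numeric columns keeps the values as doubles:

```r
set.seed(1)
d <- data.frame(x = sample(letters, 5, replace = TRUE),
                matrix(runif(5 * 3), nrow = 5),
                stringsAsFactors = FALSE)

# Which columns are numeric?
is_num <- vapply(d, is.numeric, logical(1))

# Coercing everything: the character column forces a character matrix
typeof(as.matrix(d))
# [1] "character"

# Coercing only the numeric columns keeps the values as doubles
m_t <- t(as.matrix(d[is_num]))
typeof(m_t)
# [1] "double"
dim(m_t)
# [1] 3 5
```

So if a transpose is really needed, setting the identifier columns aside first avoids the conversion to character.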
In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:
# original data frame
object.size(d)
# 2360 bytes
# transposed df - a character matrix
object.size(d_t)
# 3280 bytes
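The transposed object is clearly larger. If we increase the number of rows and the number of numeric columns to mimic your data better, the relative difference is even larger:
nr <- 56202
nc <- 20
# re-create d and d_t as above with the new dimensions
d <- data.frame(x = sample(letters, nr, replace = TRUE),
                y = sample(letters, nr, replace = TRUE),
                matrix(runif(nr * nc), nrow = nr),
                stringsAsFactors = FALSE)
d_t <- t(d)
object.size(d)
# 9897712 bytes
object.size(d_t)
# 78299656 bytes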
Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:
onedigit_int <- sample(1:9, 1e4, replace = TRUE)
onedigit_num <- as.numeric(onedigit_int)
onedigit_char <- as.character(onedigit_int)
object.size(onedigit_int)
# 40048 bytes
object.size(onedigit_num)
# 80048 bytes
object.size(onedigit_char)
# 80552 bytes
For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector requires barely more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with multi-digit vectors (which you seem to have) and their corresponding vectors of multi-character strings:
multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
multidigit_num <- as.numeric(multidigit_int)
multidigit_char <- as.character(multidigit_int)
object.size(multidigit_int)
# 40048 bytes
object.size(multidigit_num)
# 80048 bytes
object.size(multidigit_char)
# 637360 bytes
The integer vector still occupies 4 bytes for each element, and the numeric vector still occupies 8 bytes for each element. However, the character vector now needs roughly 64 bytes per element, far more than the 8 bytes seen in the one-character case.
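A small sketch of why the per-element cost jumps, based on my understanding of how R stores strings: a character vector holds one 8-byte pointer per element, and each distinct string is stored (and counted by object.size) only once. The two hypothetical vectors below have the same length and the same string width; only the number of distinct values differs. Exact byte counts depend on the R version and platform, so none are shown:

```r
n <- 1e4

# Same length, same width (6 characters), but one distinct value vs. n distinct values
all_same   <- rep("123456", n)
all_unique <- sprintf("%06d", seq_len(n))

# Pointers only, plus a single stored string
object.size(all_same)

# Pointers plus n separately stored strings: several times larger
object.size(all_unique)
```

This agrees with the one-digit case above: 80552 - 80048 = 504 bytes for 9 distinct strings, i.e. 56 bytes per distinct string. Expression values with many decimals are almost all distinct, so nearly every element pays that cost after coercion to character.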
Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.
Transposing a data frame with columns of different classes is very rarely sensible. And if all columns are of the same class, then we may just as well use a matrix from the start.
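As a rough sketch of the "matrix from the start" idea, assuming a table laid out like yours with identifier columns followed by numeric sample columns (the data, sizes and the names n_genes, n_samples, m and m_t below are all hypothetical): keep the identifiers as row names, store only the numbers in the matrix, and transpose that.

```r
set.seed(1)
n_genes   <- 1000   # hypothetical sizes, far smaller than the real data
n_samples <- 50

# Hypothetical stand-in for the real table: two ID columns plus numeric samples
d <- data.frame(Name = paste0("gene", seq_len(n_genes)),
                Description = paste0("desc", seq_len(n_genes)),
                matrix(runif(n_genes * n_samples), nrow = n_genes),
                stringsAsFactors = FALSE)

# Keep the identifiers as row names and only the numeric columns as values
m <- as.matrix(d[, -(1:2)])
rownames(m) <- d$Name

# Transposing a numeric matrix keeps it numeric: samples in rows, genes in columns
m_t <- t(m)
typeof(m_t)
# [1] "double"

# The sizes before and after the transpose should be about the same
object.size(m)
object.size(m_t)
```

The second identifier column (Description here) is not lost; it stays in d and can be looked up by Name when needed.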
Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham
edited 20 hours ago
answered yesterday
Henrik
Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.
– Phil D
yesterday
1
Hi @PhilD Thanks for your feedback. Regarding your need (or not ;) ) for transpose, without seeing your data and a more thorough description of what you are trying to achieve, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds like an alternative you may consider is to melt your data from wide to long. Then merge/subset. Cheers
– Henrik
yesterday
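A minimal sketch of the melt-then-merge suggestion in base R, on made-up data. The tables wide and annot, their column names, and the tissue labels are all hypothetical; the real annotation file will look different. The last two lines show a second option that is not a melt at all: selecting the sample columns by ID, which needs neither a transpose nor a reshape.

```r
# Hypothetical wide table: genes in rows, samples in columns
wide <- data.frame(Name = c("geneA", "geneB", "geneC"),
                   s1 = c(0.1, 21.4, 0.2),
                   s2 = c(0.0, 11.0, 0.1),
                   s3 = c(0.3,  8.2, 0.0),
                   stringsAsFactors = FALSE)

# Hypothetical sample annotation with the tissue type of each sample
annot <- data.frame(sample = c("s1", "s2", "s3"),
                    tissue = c("Liver", "Lung", "Liver"),
                    stringsAsFactors = FALSE)

# Wide to long: one row per gene/sample pair; the values stay numeric
sample_cols <- setdiff(names(wide), "Name")
long <- data.frame(Name   = rep(wide$Name, times = length(sample_cols)),
                   sample = rep(sample_cols, each = nrow(wide)),
                   value  = unlist(wide[sample_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Merge with the annotation, then subset on tissue type
long  <- merge(long, annot, by = "sample")
liver <- long[long$tissue == "Liver", ]
nrow(liver)
# [1] 6

# Alternative without reshaping: pick the sample columns by ID
keep       <- annot$sample[annot$tissue == "Liver"]
wide_liver <- wide[, c("Name", keep)]
```

With tens of thousands of genes and samples the long table has one row per gene/sample pair, so the column-selection variant is probably the lighter of the two.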