Object size increases hugely when transposing a data frame























I have a data frame of roughly 50,000 RNA transcripts in rows and 10,000 different samples in columns. The size of the data frame is 4.9 GB.



I then have to transpose the data in order to subset it properly later:



df <- data.frame(t(df))



After the transpose, the object size has ballooned to 70 GB. Why is this happening? Should transposing the data really change the object size that much?



str() of the first 20 columns:



str(df[1:20])
Classes 'tbl_df', 'tbl' and 'data.frame': 56202 obs. of 20 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
$ Description : chr "DDX11L1" "WASH7P" "MIR1302-11" "FAM138A" ...
$ GTEX-1117F-0226-SM-5GZZ7: num 0.1082 21.4 0.1602 0.0505 0 ...
$ GTEX-111CU-1826-SM-5GZYN: num 0.1158 11.03 0.0643 0 0 ...
$ GTEX-111FC-0226-SM-5N9B8: num 0.021 16.75 0.0467 0.0295 0 ...
$ GTEX-111VG-2326-SM-5N9BK: num 0.0233 8.172 0 0.0326 0 ...
$ GTEX-111YS-2426-SM-5GZZQ: num 0 7.658 0.0586 0 0 ...
$ GTEX-1122O-2026-SM-5NQ91: num 0.0464 9.372 0 0 0 ...
$ GTEX-1128S-2126-SM-5H12U: num 0.0308 10.08 0.1367 0.0861 0.1108 ...
$ GTEX-113IC-0226-SM-5HL5C: num 0.0936 13.56 0.2079 0.131 0.0562 ...
$ GTEX-117YX-2226-SM-5EGJJ: num 0.121 9.889 0.0537 0.0677 0 ...
$ GTEX-11DXW-0326-SM-5H11W: num 0.0286 9.121 0.0635 0 0 ...
$ GTEX-11DXX-2326-SM-5Q5A2: num 0 6.698 0.0508 0.032 0 ...
$ GTEX-11DZ1-0226-SM-5A5KF: num 0.0237 9.835 0 0.0664 0 ...
$ GTEX-11EI6-0226-SM-5EQ64: num 0.0802 13.1 0 0 0 ...
$ GTEX-11EM3-2326-SM-5H12B: num 0.0223 8.904 0.0496 0.0625 0.0402 ...
$ GTEX-11EMC-2826-SM-5PNY6: num 0.0189 16.59 0 0.0265 0.034 ...
$ GTEX-11EQ8-0226-SM-5EQ5G: num 0.0931 15.1 0.0689 0.0869 0 ...
$ GTEX-11EQ9-2526-SM-5HL66: num 0.0777 9.838 0 0 0 ...









          1 Answer










          First, you write that:




          I then have to transpose this dataset in order to subset it properly later,




          To be honest, I doubt you have to, so this may be an XY problem. That said, I think it is of general interest to dissect the issue.





          The increase in object size is most likely because the class of the object changes when it is transposed, together with the fact that objects of different classes differ in size.



          I will try to illustrate this with some examples. We begin with the change of class.



          Create a toy data frame with a structure resembling yours: a couple of character columns and several numeric columns:



          # set number of rows and columns
          nr <- 5
          nc <- 5

          set.seed(1)
          d <- data.frame(x = sample(letters, nr, replace = TRUE),
                          y = sample(letters, nr, replace = TRUE),
                          matrix(runif(nr * nc), nrow = nr),
                          stringsAsFactors = FALSE)


          Transpose it:



          d_t <- t(d)


          Check the structure of the original data and its transposed sibling:



          str(d)
          # 'data.frame': 5 obs. of 7 variables:
          # $ x : chr "g" "j" "o" "x" ...
          # $ y : chr "x" "y" "r" "q" ...
          # $ X1: num 0.206 0.177 0.687 0.384 0.77
          # $ X2: num 0.498 0.718 0.992 0.38 0.777
          # $ X3: num 0.935 0.212 0.652 0.126 0.267
          # $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
          # $ X5: num 0.482 0.6 0.494 0.186 0.827

          str(d_t)
          # chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
          # - attr(*, "dimnames")=List of 2
          # ..$ : chr [1:7] "x" "y" "X1" "X2" ...
          # ..$ : NULL


          The data frame has become a character matrix. How did this happen? Check the help text for the transpose method for data frames, ?t.data.frame:




          A data frame is first coerced to a matrix: see as.matrix.




          OK, see ?as.matrix:




          The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]




          Whereas a data frame is a list in which each column can be of a different class, a matrix is just a vector with dimensions and can therefore hold only a single class. Because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of the transpose. You then coerce that matrix back to a data frame, in which all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).
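
          To make that last step concrete, here is a minimal check using the d_t object created above (the exact output depends on your R version and stringsAsFactors default):

          d_t_df <- data.frame(d_t)
          str(d_t_df)
          # all five columns are now character (or factor), including the former
          # numeric ones, e.g. "0.2059746" is stored as a string, not a double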





          In the second step, compare the size of the different objects, starting with the data frame and its transpose as created above:



          # original data frame
          object.size(d)
          # 2360 bytes

          # transposed df - a character matrix
          object.size(d_t)
          # 3280 bytes


          The transposed object is clearly larger. If we increase the number of rows and the number of numeric columns to better mimic your data, the relative difference is even larger:



          nr <- 56202
          nc <- 20
          # re-create d and d_t with these dimensions, using the same code as above

          object.size(d)
          # 9897712 bytes
          object.size(d_t)
          # 78299656 bytes




          Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:



          onedigit_int <- sample(1:9, 1e4, replace = TRUE)
          onedigit_num <- as.numeric(onedigit_int)
          onedigit_char <- as.character(onedigit_int)

          object.size(onedigit_int)
          # 40048 bytes

          object.size(onedigit_num)
          # 80048 bytes

          object.size(onedigit_char)
          # 80552 bytes


          For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors of multi-digit values (which you seem to have) and their corresponding vectors of multi-character strings:



          multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
          multidigit_num <- as.numeric(multidigit_int)
          multidigit_char <- as.character(multidigit_int)

          object.size(multidigit_int)
          # 40048 bytes

          object.size(multidigit_num)
          # 80048 bytes

          object.size(multidigit_char)
          # 637360 bytes


          The integer vector still occupies 4 bytes per element and the numeric vector 8 bytes per element. The character vector, however, grows with the strings: each element is an 8-byte pointer to a string in R's global string cache, and each distinct multi-character string in that cache takes additional memory of its own.



          Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.



          Transposing a data frame with columns of different classes is very rarely sensible. And if all columns are of the same class, then we may just as well use a matrix from the start.
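
          If the goal is simply to subset with samples in rows, a workaround (just a sketch, assuming df still has the X1/Name/Description columns shown in the question and that the Name values are unique) is to keep the identifier columns aside and transpose only the numeric part as a matrix, so the values stay numeric:

          ids  <- df[, c("X1", "Name", "Description")]
          expr <- as.matrix(df[, -(1:3)])    # numeric matrix: genes x samples
          rownames(expr) <- df$Name

          expr_t <- t(expr)                  # samples x genes, still numeric
          # expr_t can now be subset by sample, e.g. expr_t[some_sample_ids, ],
          # and matched back to the gene annotations in ids via colnames(expr_t)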





          Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham.





























          • Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself, as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge it with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.
            – Phil D
            yesterday












          • Hi @PhilD, thanks for your feedback. Regarding your need (or not ;) ) for the transpose: without seeing your data and a more thorough description of what you are trying to achieve, I can only guess (hence my "may"). Still, if you allow me to guess, an alternative you may consider is to melt your data from wide to long, then merge/subset. Cheers
            – Henrik
            yesterday
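
          For reference, a rough sketch of that wide-to-long idea using data.table::melt (the id columns follow the question's str() output; sample_info and tissue below are assumed, hypothetical annotation data):

          library(data.table)
          long <- melt(as.data.table(df),
                       id.vars = c("X1", "Name", "Description"),
                       variable.name = "sample_id",
                       value.name = "expression")

          # then merge with a (hypothetical) sample annotation table and filter,
          # e.g. long[sample_info, on = "sample_id"][tissue == "Liver"]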










