Object size increases hugely when transposing a data frame























I have a data frame of ca. 50,000 RNA transcripts in rows, with 10,000 different samples in columns. The size of the data frame is 4.9GB.



I then have to transpose the data in order to subset it properly later:



df <- data.frame(t(df))



After the transpose, the object size has ballooned to 70GB. Why is this happening? Should transposing data really change the file size that much?



str() of the first 20 columns:



str(df[1:20])
Classes 'tbl_df', 'tbl' and 'data.frame': 56202 obs. of 20 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
$ Description : chr "DDX11L1" "WASH7P" "MIR1302-11" "FAM138A" ...
$ GTEX-1117F-0226-SM-5GZZ7: num 0.1082 21.4 0.1602 0.0505 0 ...
$ GTEX-111CU-1826-SM-5GZYN: num 0.1158 11.03 0.0643 0 0 ...
$ GTEX-111FC-0226-SM-5N9B8: num 0.021 16.75 0.0467 0.0295 0 ...
$ GTEX-111VG-2326-SM-5N9BK: num 0.0233 8.172 0 0.0326 0 ...
$ GTEX-111YS-2426-SM-5GZZQ: num 0 7.658 0.0586 0 0 ...
$ GTEX-1122O-2026-SM-5NQ91: num 0.0464 9.372 0 0 0 ...
$ GTEX-1128S-2126-SM-5H12U: num 0.0308 10.08 0.1367 0.0861 0.1108 ...
$ GTEX-113IC-0226-SM-5HL5C: num 0.0936 13.56 0.2079 0.131 0.0562 ...
$ GTEX-117YX-2226-SM-5EGJJ: num 0.121 9.889 0.0537 0.0677 0 ...
$ GTEX-11DXW-0326-SM-5H11W: num 0.0286 9.121 0.0635 0 0 ...
$ GTEX-11DXX-2326-SM-5Q5A2: num 0 6.698 0.0508 0.032 0 ...
$ GTEX-11DZ1-0226-SM-5A5KF: num 0.0237 9.835 0 0.0664 0 ...
$ GTEX-11EI6-0226-SM-5EQ64: num 0.0802 13.1 0 0 0 ...
$ GTEX-11EM3-2326-SM-5H12B: num 0.0223 8.904 0.0496 0.0625 0.0402 ...
$ GTEX-11EMC-2826-SM-5PNY6: num 0.0189 16.59 0 0.0265 0.034 ...
$ GTEX-11EQ8-0226-SM-5EQ5G: num 0.0931 15.1 0.0689 0.0869 0 ...
$ GTEX-11EQ9-2526-SM-5HL66: num 0.0777 9.838 0 0 0 ...









      r dataframe memory transpose






      edited yesterday









      Henrik










      asked yesterday









      Phil D

























          1 Answer



          accepted










          First, you write that:




          I then have to transpose this dataset in order to subset it properly later,




          To be honest, I doubt you have to, so this may be an XY problem. That said, I think it could be of general interest to dissect the issue.





          The increase in object size is most likely because the class of the object changes when it is transposed, and objects of different classes have different sizes.



          I will try to illustrate this with some examples. We begin with the change of class.



          Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:



          # set number of rows and columns
          nr <- 5
          nc <- 5

          set.seed(1)
          d <- data.frame(x = sample(letters, nr, replace = TRUE),
                          y = sample(letters, nr, replace = TRUE),
                          matrix(runif(nr * nc), nrow = nr),
                          stringsAsFactors = FALSE)


          Transpose it:



          d_t <- t(d)


          Check the structure of the original data and its transposed sibling:



          str(d)
          # 'data.frame': 5 obs. of 7 variables:
          # $ x : chr "g" "j" "o" "x" ...
          # $ y : chr "x" "y" "r" "q" ...
          # $ X1: num 0.206 0.177 0.687 0.384 0.77
          # $ X2: num 0.498 0.718 0.992 0.38 0.777
          # $ X3: num 0.935 0.212 0.652 0.126 0.267
          # $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
          # $ X5: num 0.482 0.6 0.494 0.186 0.827

          str(d_t)
          # chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
          # - attr(*, "dimnames")=List of 2
          # ..$ : chr [1:7] "x" "y" "X1" "X2" ...
          # ..$ : NULL


          The data frame has become a character matrix. How did this happen? Well, check the help text for the transpose method for data frames, ?t.data.frame:




          A data frame is first coerced to a matrix: see as.matrix.




          OK, see ?as.matrix:




          The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]




          Whereas a data frame is a list in which each column can be of a different class, a matrix is just a vector with dimensions, and it can hold only one type. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix by the transpose. You then coerce the matrix to a data frame, in which all columns are character (or factor, depending on your stringsAsFactors setting); check str(data.frame(d_t)).
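The round trip described above can be sketched on a tiny example. The object names `toy`, `toy_t` and `toy_back` are made up for illustration, and `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version's default:

```r
# Toy data: one character column and two numeric columns
toy <- data.frame(id = c("a", "b", "c"),
                  v1 = c(0.1, 0.2, 0.3),
                  v2 = c(1.5, 2.5, 3.5),
                  stringsAsFactors = FALSE)

# t() goes through as.matrix(), which yields a character matrix here
toy_t <- t(toy)
typeof(toy_t)

# Converting back to a data frame leaves every column as character
toy_back <- data.frame(toy_t, stringsAsFactors = FALSE)
col_classes <- vapply(toy_back, class, character(1))
col_classes
```

The numbers are still there after the round trip, but only as text, so they would have to be converted back (for example with `as.numeric()`) before any arithmetic.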





          In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:



          # original data frame
          object.size(d)
          # 2360 bytes

          # transposed df - a character matrix
          object.size(d_t)
          # 3280 bytes


          The transposed object is clearly larger. If we increase the number of rows and the number of numeric columns to mimic your data better, the relative difference is even larger:



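          nr <- 56202
          nc <- 20

          # re-create d and d_t with the new dimensions, using the same code as above
          d <- data.frame(x = sample(letters, nr, replace = TRUE),
                          y = sample(letters, nr, replace = TRUE),
                          matrix(runif(nr * nc), nrow = nr),
                          stringsAsFactors = FALSE)
          d_t <- t(d)

          object.size(d)
          # 9897712 bytes
          object.size(d_t)
          # 78299656 bytes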




          Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:



          onedigit_int <- sample(1:9, 1e4, replace = TRUE)
          onedigit_num <- as.numeric(onedigit_int)
          onedigit_char <- as.character(onedigit_int)

          object.size(onedigit_int)
          # 40048 bytes

          object.size(onedigit_num)
          # 80048 bytes

          object.size(onedigit_char)
          # 80552 bytes


          For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors about 8 bytes per element. The single-character vector does not require noticeably more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors of multi-digit values (which you seem to have) and their corresponding vectors of multi-character strings:



          multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
          multidigit_num <- as.numeric(multidigit_int)
          multidigit_char <- as.character(multidigit_int)

          object.size(multidigit_int)
          # 40048 bytes

          object.size(multidigit_num)
          # 80048 bytes

          object.size(multidigit_char)
          # 637360 bytes


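          The integer vector still occupies 4 bytes per element, and the numeric vector still occupies 8 bytes per element. However, the size per element in the character vector is much larger. Each distinct string is stored as a separate object, and this vector holds almost 10,000 distinct multi-character strings, whereas the one-digit vector held only nine distinct strings that are stored once and shared.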
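As a rough check of how character vectors are measured, the sketch below compares two vectors with the same length and the same string width. It relies on `object.size()` counting each distinct string only once; the names `repeated` and `distinct` are made up for illustration, and the exact byte counts depend on the platform, so none are shown:

```r
n <- 1e4

# 10,000 copies of one seven-character string
repeated <- rep("0000001", n)

# 10,000 distinct seven-character strings ("0000001", "0000002", ...)
distinct <- sprintf("%07d", seq_len(n))

size_repeated <- as.numeric(object.size(repeated))
size_distinct <- as.numeric(object.size(distinct))

# Ratio of the two sizes; well above 1 when the strings are all different
size_distinct / size_repeated
```

Expression values coerced to text are mostly distinct strings, which is why a character matrix built from them ends up closer to the second case than the first.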



          Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.



          Transposing a data frame with columns of different classes is very rarely sensible. And if all columns are of the same class, then we may just as well use a matrix from the start.
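One way to act on this, sketched under the assumption that the identifier column can be moved into row names: keep the annotation apart and transpose only the numeric part. The table `expr` and its column `Name` are hypothetical stand-ins for the real data:

```r
set.seed(1)
n_genes <- 1000
n_samples <- 20

# Hypothetical expression table: gene identifiers plus numeric sample columns
expr <- data.frame(Name = paste0("gene", seq_len(n_genes)),
                   matrix(runif(n_genes * n_samples), nrow = n_genes),
                   stringsAsFactors = FALSE)

# Move the identifiers into row names and keep only the numeric columns
m <- as.matrix(expr[, -1])
rownames(m) <- expr$Name

# Transposing a numeric matrix keeps it numeric
m_t <- t(m)
typeof(m_t)

# Transposing the whole data frame gives a character matrix instead
all_t <- t(expr)
typeof(all_t)

# Compare the memory footprint of the two transposed objects
object.size(m_t)
object.size(all_t)
```

After this, samples are in rows and genes in columns, and the values are still doubles at 8 bytes each.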





          Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham.





























          • Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.
            – Phil D
            yesterday








          • Hi @PhilD, thanks for your feedback. Regarding your need (or not ;) ) for a transpose: without seeing your data and a more thorough description of what you are trying to achieve, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds like an alternative you may consider is to melt your data from wide to long, and then merge/subset. Cheers
            – Henrik
            yesterday
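Besides melting to long format, subsetting on tissue type may also be possible without any reshaping, by selecting sample columns directly. This is only a sketch under assumptions: the annotation table `samples` and its columns `sample_id` and `tissue` are hypothetical, since the real annotation file is not shown in the question:

```r
# Hypothetical expression table: genes in rows, samples in columns
expr <- data.frame(Name = c("geneA", "geneB", "geneC"),
                   S1 = c(0.1, 21.4, 0.2),
                   S2 = c(0.0, 11.0, 0.1),
                   S3 = c(0.3, 16.8, 0.0),
                   stringsAsFactors = FALSE)

# Hypothetical sample annotation: one row per sample
samples <- data.frame(sample_id = c("S1", "S2", "S3"),
                      tissue = c("Liver", "Lung", "Liver"),
                      stringsAsFactors = FALSE)

# Look up the sample IDs of the tissue of interest, then select those columns
liver_ids <- samples$sample_id[samples$tissue == "Liver"]
expr_liver <- expr[, c("Name", liver_ids)]
expr_liver
```

Because only columns are selected, the numeric columns keep their type and no character coercion takes place.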











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });






          Phil D is a new contributor. Be nice, and check out our Code of Conduct.










           

          draft saved


          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53409246%2fobject-size-increases-hugely-when-transposing-a-data-frame%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          5
          down vote



          accepted










          First, you write that:




          I then have to transpose this dataset in order to subset it properly later,




          To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.





          The increase in object size is most likely due to that the class of the object before and after transposing has changed, together with the fact that objects of different class have different size.



          I will try to illustrate this with some examples. We begin with the change of class.



          Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:



          # set number of rows and columns
          nr <- 5
          nc <- 5

          set.seed(1)
          d <- data.frame(x = sample(letters, nr, replace = TRUE),
          y = sample(letters, nr, replace = TRUE),
          matrix(runif(nr * nc), nrow = nr),
          stringsAsFactors = FALSE)


          Transpose it:



          d_t <- t(d)


          Check the structure of the original data and its transposed sibling:



          str(d)
          # 'data.frame': 5 obs. of 7 variables:
          # $ x : chr "g" "j" "o" "x" ...
          # $ y : chr "x" "y" "r" "q" ...
          # $ X1: num 0.206 0.177 0.687 0.384 0.77
          # $ X2: num 0.498 0.718 0.992 0.38 0.777
          # $ X3: num 0.935 0.212 0.652 0.126 0.267
          # $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
          # $ X5: num 0.482 0.6 0.494 0.186 0.827

          str(d_t)
          # chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
          # - attr(*, "dimnames")=List of 2
          # ..$ : chr [1:7] "x" "y" "X1" "X2" ...
          # ..$ : NULL


          The data frame has became a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame:




          A data frame is first coerced to a matrix: see as.matrix.




          OK, see ?as.matrix:




          The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]




          Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of transpose. Then you coerce the matrix to data frame, where all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).





          In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:



          # original data frame
          object.size(d)
          # 2360 bytes

          # transposed df - a character matrix
          object.size(d_t)
          # 3280 bytes


          The transposed object is clearly larger. If we increase the number rows and the number of numeric columns to mimic your data better, the relative difference is even larger:



          nr <- 56202
          nc <- 20

          object.size(d)
          # 9897712 bytes
          object.size(d_t)
          # 78299656 bytes




          Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:



          onedigit_int <- sample(1:9, 1e4, replace = TRUE)
          onedigit_num <- as.numeric(onedigit_int)
          onedigit_char <- as.character(onedigit_int)

          object.size(onedigit_int)
          # 40048 bytes

          object.size(onedigit_num)
          # 80048 bytes

          object.size(onedigit_char)
          # 80552 bytes


          For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:



          multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
          multidigit_num <- as.numeric(multidigit_int)
          multidigit_char <- as.character(multidigit_int)

          object.size(multidigit_int)
          # 40048 bytes

          object.size(multidigit_num)
          # 80048 bytes

          object.size(multidigit_char)
          # 637360 bytes


          The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.



          Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.



          Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.





          Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham






          share|improve this answer























          • Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.
            – Phil D
            yesterday








          • 1




            Hi @PhilD Thanks for your feedback. Regarding your need (or not ;) ) for transpose, without seeing your data and a more thorough description on what you try to aceive, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds that an alternative you may consider is to melt your data from wide to long. Then merge/subset. Cheers
            – Henrik
            yesterday















          up vote
          5
          down vote



          accepted










          First, you write that:




          I then have to transpose this dataset in order to subset it properly later,




          To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.





          The increase in object size is most likely due to that the class of the object before and after transposing has changed, together with the fact that objects of different class have different size.



          I will try to illustrate this with some examples. We begin with the change of class.



          Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:



          # set number of rows and columns
          nr <- 5
          nc <- 5

          set.seed(1)
          d <- data.frame(x = sample(letters, nr, replace = TRUE),
          y = sample(letters, nr, replace = TRUE),
          matrix(runif(nr * nc), nrow = nr),
          stringsAsFactors = FALSE)


          Transpose it:



          d_t <- t(d)


          Check the structure of the original data and its transposed sibling:



          str(d)
          # 'data.frame': 5 obs. of 7 variables:
          # $ x : chr "g" "j" "o" "x" ...
          # $ y : chr "x" "y" "r" "q" ...
          # $ X1: num 0.206 0.177 0.687 0.384 0.77
          # $ X2: num 0.498 0.718 0.992 0.38 0.777
          # $ X3: num 0.935 0.212 0.652 0.126 0.267
          # $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
          # $ X5: num 0.482 0.6 0.494 0.186 0.827

          str(d_t)
          # chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
          # - attr(*, "dimnames")=List of 2
          # ..$ : chr [1:7] "x" "y" "X1" "X2" ...
          # ..$ : NULL


          The data frame has became a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame:




          A data frame is first coerced to a matrix: see as.matrix.




          OK, see ?as.matrix:




          The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]




          Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of transpose. Then you coerce the matrix to data frame, where all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).





          In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:



          # original data frame
          object.size(d)
          # 2360 bytes

          # transposed df - a character matrix
          object.size(d_t)
          # 3280 bytes


          The transposed object is clearly larger. If we increase the number rows and the number of numeric columns to mimic your data better, the relative difference is even larger:



          nr <- 56202
          nc <- 20

          object.size(d)
          # 9897712 bytes
          object.size(d_t)
          # 78299656 bytes




          Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:



          onedigit_int <- sample(1:9, 1e4, replace = TRUE)
          onedigit_num <- as.numeric(onedigit_int)
          onedigit_char <- as.character(onedigit_int)

          object.size(onedigit_int)
          # 40048 bytes

          object.size(onedigit_num)
          # 80048 bytes

          object.size(onedigit_char)
          # 80552 bytes


          For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:



          multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
          multidigit_num <- as.numeric(multidigit_int)
          multidigit_char <- as.character(multidigit_int)

          object.size(multidigit_int)
          # 40048 bytes

          object.size(multidigit_num)
          # 80048 bytes

          object.size(multidigit_char)
          # 637360 bytes


          The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.



          Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.



          Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.





          Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham.





























          • Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.
            – Phil D
            yesterday








          • Hi @PhilD, thanks for your feedback. Regarding your need (or not ;) ) for the transpose: without seeing your data and a more thorough description of what you are trying to achieve, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds like an alternative you may consider is to melt your data from wide to long, and then merge/subset. Cheers
            – Henrik
            yesterday
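
          A rough sketch of the idea in the comment above, in base R. The objects `expr` and `samples`, the sample IDs `S1` to `S3` and the `tissue` column are all invented for illustration and are not the asker's data. The first variant subsets on tissue type without any transpose or reshape; the second goes from wide to long and then merges.

```r
# Wide expression table: genes in rows, samples in columns
expr <- data.frame(Name = c("g1", "g2"),
                   S1 = c(0.1, 21.4),
                   S2 = c(0.2, 11.0),
                   S3 = c(0.0, 7.6),
                   stringsAsFactors = FALSE)

# Sample annotation: one row per sample
samples <- data.frame(sample = c("S1", "S2", "S3"),
                      tissue = c("Lung", "Liver", "Lung"),
                      stringsAsFactors = FALSE)

# Variant 1: pick the columns of the wanted tissue directly
keep <- samples$sample[samples$tissue == "Lung"]
expr_lung <- expr[c("Name", keep)]

# Variant 2: wide to long, then merge and subset
sample_cols <- setdiff(names(expr), "Name")
long <- data.frame(
  Name   = rep(expr$Name, times = length(sample_cols)),
  sample = rep(sample_cols, each = nrow(expr)),
  value  = unlist(expr[sample_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long <- merge(long, samples, by = "sample")
lung <- long[long$tissue == "Lung", ]
```

          Note that the long format repeats the gene and sample identifiers on every row, so for data of the size in the question the first variant is likely to be much lighter on memory.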









































          edited 20 hours ago

























          answered yesterday









          Henrik






















