How to merge a list of data.tables without getting splitted columns?












1














I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.



# base R
system.time(
M1 <- Reduce(function(...) merge(..., all=TRUE), L)
)
# user system elapsed
# 5.05 0.00 5.20

# data.table
library(data.table)
L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
system.time(
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
)
# user system elapsed
# 0.12 0.00 0.12


Both approaches yield the same values, however there are some columns that are split with data.table.



base R:



set.seed(1)
car::some(M1, 5)
# sid id V3 V4 a b
# 60504 1 60504 -0.6964804 -1.210195 NA NA
# 79653 1 79653 -2.5287163 -1.087546 NA NA
# 111637 2 11637 0.7104236 NA -1.7377657 NA
# 171855 2 71855 0.2023342 NA -0.6334279 NA
# 272460 3 72460 -0.5098994 NA NA 0.2738896


data.table:



set.seed(1)
car::some(M2, 5)
# sid id V3.x V4 V3.y a V3 b
# 1: 1 60504 -0.6964804 -1.210195 NA NA NA NA
# 2: 1 79653 -2.5287163 -1.087546 NA NA NA NA
# 3: 2 11637 NA NA 0.7104236 -1.7377657 NA NA
# 4: 2 71855 NA NA 0.2023342 -0.6334279 NA NA
# 5: 3 72460 NA NA NA NA -0.5098994 0.2738896


Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)



Data



fun <- function(x){
set.seed(x)
data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
}
tmp <- lapply(1:3, fun)
df1 <- tmp[[1]]
df2 <- tmp[[2]]
df3 <- tmp[[3]]
rm(tmp)
names(df2)[4] <- c("a")
names(df3)[4] <- c("b")
L <- list(df1, df2, df3)


Related: 1, 2










share|improve this question





























    1














    I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.



    # base R
    system.time(
    M1 <- Reduce(function(...) merge(..., all=TRUE), L)
    )
    # user system elapsed
    # 5.05 0.00 5.20

    # data.table
    library(data.table)
    L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
    system.time(
    M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
    )
    # user system elapsed
    # 0.12 0.00 0.12


    Both approaches yield the same values, however there are some columns that are split with data.table.



    base R:



    set.seed(1)
    car::some(M1, 5)
    # sid id V3 V4 a b
    # 60504 1 60504 -0.6964804 -1.210195 NA NA
    # 79653 1 79653 -2.5287163 -1.087546 NA NA
    # 111637 2 11637 0.7104236 NA -1.7377657 NA
    # 171855 2 71855 0.2023342 NA -0.6334279 NA
    # 272460 3 72460 -0.5098994 NA NA 0.2738896


    data.table:



    set.seed(1)
    car::some(M2, 5)
    # sid id V3.x V4 V3.y a V3 b
    # 1: 1 60504 -0.6964804 -1.210195 NA NA NA NA
    # 2: 1 79653 -2.5287163 -1.087546 NA NA NA NA
    # 3: 2 11637 NA NA 0.7104236 -1.7377657 NA NA
    # 4: 2 71855 NA NA 0.2023342 -0.6334279 NA NA
    # 5: 3 72460 NA NA NA NA -0.5098994 0.2738896


    Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)



    Data



    fun <- function(x){
    set.seed(x)
    data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
    }
    tmp <- lapply(1:3, fun)
    df1 <- tmp[[1]]
    df2 <- tmp[[2]]
    df3 <- tmp[[3]]
    rm(tmp)
    names(df2)[4] <- c("a")
    names(df3)[4] <- c("b")
    L <- list(df1, df2, df3)


    Related: 1, 2










    share|improve this question



























      1












      1








      1


      1





      I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.



      # base R
      system.time(
      M1 <- Reduce(function(...) merge(..., all=TRUE), L)
      )
      # user system elapsed
      # 5.05 0.00 5.20

      # data.table
      library(data.table)
      L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
      system.time(
      M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
      )
      # user system elapsed
      # 0.12 0.00 0.12


      Both approaches yield the same values, however there are some columns that are split with data.table.



      base R:



      set.seed(1)
      car::some(M1, 5)
      # sid id V3 V4 a b
      # 60504 1 60504 -0.6964804 -1.210195 NA NA
      # 79653 1 79653 -2.5287163 -1.087546 NA NA
      # 111637 2 11637 0.7104236 NA -1.7377657 NA
      # 171855 2 71855 0.2023342 NA -0.6334279 NA
      # 272460 3 72460 -0.5098994 NA NA 0.2738896


      data.table:



      set.seed(1)
      car::some(M2, 5)
      # sid id V3.x V4 V3.y a V3 b
      # 1: 1 60504 -0.6964804 -1.210195 NA NA NA NA
      # 2: 1 79653 -2.5287163 -1.087546 NA NA NA NA
      # 3: 2 11637 NA NA 0.7104236 -1.7377657 NA NA
      # 4: 2 71855 NA NA 0.2023342 -0.6334279 NA NA
      # 5: 3 72460 NA NA NA NA -0.5098994 0.2738896


      Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)



      Data



      fun <- function(x){
      set.seed(x)
      data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
      }
      tmp <- lapply(1:3, fun)
      df1 <- tmp[[1]]
      df2 <- tmp[[2]]
      df3 <- tmp[[3]]
      rm(tmp)
      names(df2)[4] <- c("a")
      names(df3)[4] <- c("b")
      L <- list(df1, df2, df3)


      Related: 1, 2










      share|improve this question















      I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.



      # base R
      system.time(
      M1 <- Reduce(function(...) merge(..., all=TRUE), L)
      )
      # user system elapsed
      # 5.05 0.00 5.20

      # data.table
      library(data.table)
      L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
      system.time(
      M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
      )
      # user system elapsed
      # 0.12 0.00 0.12


      Both approaches yield the same values, however there are some columns that are split with data.table.



      base R:



      set.seed(1)
      car::some(M1, 5)
      # sid id V3 V4 a b
      # 60504 1 60504 -0.6964804 -1.210195 NA NA
      # 79653 1 79653 -2.5287163 -1.087546 NA NA
      # 111637 2 11637 0.7104236 NA -1.7377657 NA
      # 171855 2 71855 0.2023342 NA -0.6334279 NA
      # 272460 3 72460 -0.5098994 NA NA 0.2738896


      data.table:



      set.seed(1)
      car::some(M2, 5)
      # sid id V3.x V4 V3.y a V3 b
      # 1: 1 60504 -0.6964804 -1.210195 NA NA NA NA
      # 2: 1 79653 -2.5287163 -1.087546 NA NA NA NA
      # 3: 2 11637 NA NA 0.7104236 -1.7377657 NA NA
      # 4: 2 71855 NA NA 0.2023342 -0.6334279 NA NA
      # 5: 3 72460 NA NA NA NA -0.5098994 0.2738896


      Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)



      Data



      fun <- function(x){
      set.seed(x)
      data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
      }
      tmp <- lapply(1:3, fun)
      df1 <- tmp[[1]]
      df2 <- tmp[[2]]
      df3 <- tmp[[3]]
      rm(tmp)
      names(df2)[4] <- c("a")
      names(df3)[4] <- c("b")
      L <- list(df1, df2, df3)


      Related: 1, 2







      r merge data.table






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 23 at 0:18

























      asked Nov 23 at 0:07









      jay.sf

      4,49821438




      4,49821438
























          1 Answer
          1






          active

          oldest

          votes


















          5














          The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.



          The by argument in data.table::merge defaults to the shared key columns between the two tables (i.e. sid and id in this case). And since the tables have columns named V3, suffixes are appended to the new columns.



          So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:



          commcols <- Reduce(intersect, lapply(L, names))
          L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
          M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)





          share|improve this answer























          • Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.
            – jay.sf
            Nov 23 at 0:22












          • you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge
            – chinsoon12
            Nov 23 at 0:25












          • And this then continues with the as and bs and cs and ... from the dts to merge.
            – jay.sf
            Nov 23 at 0:26










          • just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))
            – chinsoon12
            Nov 23 at 0:27










          • I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?
            – jay.sf
            Nov 23 at 0:56













          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53439279%2fhow-to-merge-a-list-of-data-tables-without-getting-splitted-columns%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          5














          The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.



          The by argument in data.table::merge defaults to the shared key columns between the two tables (i.e. sid and id in this case). And since the tables have columns named V3, suffixes are appended to the new columns.



          So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:



          commcols <- Reduce(intersect, lapply(L, names))
          L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
          M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)





          share|improve this answer























          • Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.
            – jay.sf
            Nov 23 at 0:22












          • you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge
            – chinsoon12
            Nov 23 at 0:25












          • And this then continues with the as and bs and cs and ... from the dts to merge.
            – jay.sf
            Nov 23 at 0:26










          • just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))
            – chinsoon12
            Nov 23 at 0:27










          • I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?
            – jay.sf
            Nov 23 at 0:56


















          5














          The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.



          The by argument in data.table::merge defaults to the shared key columns between the two tables (i.e. sid and id in this case). And since the tables have columns named V3, suffixes are appended to the new columns.



          So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:



          commcols <- Reduce(intersect, lapply(L, names))
          L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
          M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)





          share|improve this answer























          • Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.
            – jay.sf
            Nov 23 at 0:22












          • you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge
            – chinsoon12
            Nov 23 at 0:25












          • And this then continues with the as and bs and cs and ... from the dts to merge.
            – jay.sf
            Nov 23 at 0:26










          • just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))
            – chinsoon12
            Nov 23 at 0:27










          • I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?
            – jay.sf
            Nov 23 at 0:56
















          5












          5








          5






          The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.



          The by argument in data.table::merge defaults to the shared key columns between the two tables (i.e. sid and id in this case). And since the tables have columns named V3, suffixes are appended to the new columns.



          So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:



          commcols <- Reduce(intersect, lapply(L, names))
          L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
          M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)





          share|improve this answer














          The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.



          The by argument in data.table::merge defaults to the shared key columns between the two tables (i.e. sid and id in this case). And since the tables have columns named V3, suffixes are appended to the new columns.



          So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:



          commcols <- Reduce(intersect, lapply(L, names))
          L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
          M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 23 at 0:36

























          answered Nov 23 at 0:13









          chinsoon12

          7,83111118




          7,83111118












          • Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.
            – jay.sf
            Nov 23 at 0:22












          • you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge
            – chinsoon12
            Nov 23 at 0:25












          • And this then continues with the as and bs and cs and ... from the dts to merge.
            – jay.sf
            Nov 23 at 0:26










          • just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))
            – chinsoon12
            Nov 23 at 0:27










          • I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?
            – jay.sf
            Nov 23 at 0:56




















          • Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.
            – jay.sf
            Nov 23 at 0:22












          • you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge
            – chinsoon12
            Nov 23 at 0:25












          • And this then continues with the as and bs and cs and ... from the dts to merge.
            – jay.sf
            Nov 23 at 0:26










          • just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))
            – chinsoon12
            Nov 23 at 0:27










          • I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?
            – jay.sf
            Nov 23 at 0:56


















          Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.
          – jay.sf
          Nov 23 at 0:22






          Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.
          – jay.sf
          Nov 23 at 0:22














          you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge
          – chinsoon12
          Nov 23 at 0:25






          you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge
          – chinsoon12
          Nov 23 at 0:25














          And this then continues with the as and bs and cs and ... from the dts to merge.
          – jay.sf
          Nov 23 at 0:26




          And this then continues with the as and bs and cs and ... from the dts to merge.
          – jay.sf
          Nov 23 at 0:26












          just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))
          – chinsoon12
          Nov 23 at 0:27




          just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))
          – chinsoon12
          Nov 23 at 0:27












          I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?
          – jay.sf
          Nov 23 at 0:56






          I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?
          – jay.sf
          Nov 23 at 0:56




















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.





          Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


          Please pay close attention to the following guidance:


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53439279%2fhow-to-merge-a-list-of-data-tables-without-getting-splitted-columns%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          A CLEAN and SIMPLE way to add appendices to Table of Contents and bookmarks

          Calculate evaluation metrics using cross_val_predict sklearn

          Insert data from modal to MySQL (multiple modal on website)