通过匹配日期合并2个数据帧

我有两个数据帧:

id      dates
MUM-1  2015-07-10
MUM-1  2015-07-11
MUM-1  2015-07-12
MUM-2  2014-01-14
MUM-2  2014-01-15
MUM-2  2014-01-16
MUM-2  2014-01-17

和:

id      dates      field1  field2
MUM-1  2015-07-10     1       0
MUM-1  2015-07-12     2       1
MUM-2  2014-01-14     4       3
MUM-2  2014-01-17     0       1

合并数据:

id      dates        field1   field2
MUM-1  2015-07-10      1         0
MUM-1  2015-07-11      na        na
MUM-1  2015-07-12      2         1
MUM-2  2014-01-14      4         3
MUM-2  2014-01-15      na        na
MUM-2  2014-01-16      na        na
MUM-2  2014-01-17      0         1   

代码:merge(x = df1,y = df2,by =’id’,all.x = T)

我正在使用合并,但由于两个数据帧的大小太大,处理时间太长.合并功能有替代方法吗?也许在dplyr?因此它在比较中处理速度很快.两个数据帧都有超过900K行.

最佳答案
您也可以按如下方式简单地加入,而不是使用与data.table合并:

setDT(df1)
setDT(df2)

df2[df1, on = c('id','dates')]

这给了:

> df2[df1]
      id      dates field1 field2
1: MUM-1 2015-07-10      1      0
2: MUM-1 2015-07-11     NA     NA
3: MUM-1 2015-07-12      2      1
4: MUM-2 2014-01-14      4      3
5: MUM-2 2014-01-15     NA     NA
6: MUM-2 2014-01-16     NA     NA
7: MUM-2 2014-01-17      0      1

使用dplyr执行此操作:

library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))

正如@Arun在评论中所提到的,对于具有七行的小型数据集,基准测试不是很有意义.所以我们创建一些更大的数据集:

dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
                  field1=sample(c(0,1,2,3,4), size=730, replace = TRUE),
                  field2=sample(c(0,1,2,3,4), size=730, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]

可以看出,@ Arun的方法稍快一些:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          jaap = dt2[dt1, on = c('id','dates')],
          pavo = merge(dt1,dt2,by="id",allow.cartesian=T),
          dplr = left_join(dt1, dt2, by=c("id","dates")),
          arun = dt1[dt2, c("fiedl1", "field2") := .(field1, field2), on=c("id", "dates")])

  test elapsed relative
4 arun   0.015    1.000
1 jaap   0.016    1.067
3 dplr   0.037    2.467
2 pavo   1.033   68.867

有关大型数据集的比较,请参见the answer of @Arun.

转载注明原文:通过匹配日期合并2个数据帧 - 代码日志