How can I vectorize this problem?

nan.xiao

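The method in post #16 is already quite efficient. A straightforward translation to Rcpp gains little additional performance.

Naive data:

R

   user  system elapsed
  1.060   0.293   1.367

Rcpp

   user  system elapsed
  0.703   0.000   0.709

Pathological data:

R

   user  system elapsed
  0.767   0.070   0.842

Rcpp

   user  system elapsed
  0.694   0.000   0.696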

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector test3(NumericVector x, NumericVector y) {

  int m = x.size() ;
  NumericVector out(m) ;
  int a ;

  // std::sort works in place, so vectors passed in from R may be reordered too
  std::sort(x.begin(), x.end()) ;
  std::sort(y.begin(), y.end()) ;

  for (int i = 0; i < m; ++i) {
    // number of elements of y that are <= x[i]
    a = sum(y <= x[i]) ;
    if (a != 0) {
      out[i] = y[a - 1] ;
    } else {
      out[i] = NA_REAL ;
    }
  }

  return out ;

}

My machine is from late 2006...

suckbunny

rm(list = ls())
set.seed(65535)
y <- 1:50000
x <- sample(y, 1000) + 0.5
#re=rep(0,1000)
tic <- proc.time()
x <- sort(x)
#y<-sort(y)
re <- sapply(x, function(x, y) { a <- sum(y <= x); if (a != 0) y[a] else NA }, y)
toc <- proc.time()
toc - tic
   user  system elapsed
   0.48    0.02    0.53

and

rm(list = ls())
set.seed(65535)
y <- 1:500000
x <- sample(y, 1000) + 0.5
tic <- proc.time()
x <- sort(x)
re <- sapply(x, function(x, y) { a <- sum(y <= x); if (a != 0) y[a] else NA }, y)
toc <- proc.time()
toc - tic
   user  system elapsed
   5.54    0.17    5.76

I spent ages trying to work out why the "same" program differed by a factor of ten.

It turned out that 50000 and 500000 differ by one zero!!

nan.xiao

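Reply to suckbunny (post #22): Yes, it is a bit of a trap. Look closely and everyone's experiment differs in some subtle way...

I think what matters in practice is how well a method scales with the size of x.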

suckbunny

The only drawback of the cut approach so far is that x must not contain duplicates.

rm(list = ls())
set.seed(65535)
y <- 1:50000
x <- sample(y, 1000) + 0.5
pt <- proc.time()
x <- sort(x)
# bin y by the sorted x values; bin i holds the y values in (x[i-1], x[i]]
z <- split(y, cut(y, c(min(y), x), include.lowest = TRUE))
re2 <- numeric(length(x))
for (i in seq_along(x)) {
  re2[i] <- max(z[[i]])
}
t2 <- proc.time() - pt
t2
   user  system elapsed
   0.07    0.00    0.14
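The no-duplicates restriction comes from cut itself, which refuses break points that are not unique. A minimal sketch on toy data (the vectors and the names `x_dup`, `dup_result` and `ok_result` are made up for illustration); deduplicating the breaks first, as in the revised code in the next post, avoids the error:

```r
y <- c(1, 2, 5, 6, 7, 9)
x_dup <- c(2, 4, 4, 8)            # sorted, but 4 appears twice

# Duplicated break points make cut() stop with an error; capture its message
dup_result <- tryCatch(
  cut(y, c(min(y), x_dup), include.lowest = TRUE),
  error = function(e) conditionMessage(e)
)
print(dup_result)

# Deduplicating the breaks (and starting from -Inf) avoids the error
ok_result <- cut(y, c(-Inf, unique(x_dup)), include.lowest = TRUE)
print(table(ok_result))
```

Starting the breaks at -Inf rather than min(y) also sidesteps a second source of duplicates, namely an x value equal to min(y).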

suckbunny

I've revised the code following the post below. My head was a mess at the time.

rm(list = ls())
set.seed(65535)
y <- 1:50000
x <- sample(y, 1000, replace = TRUE) + 0.5
#y <- c(1,2,5,6,7,9)
#x <- c(1,1,2,4,4,6,8)
pt <- proc.time()
rk <- rank(x)    # position of each original x within the sorted x
x <- sort(x)
uq <- unique(x)
z <- split(y, cut(y, c(-Inf, uq), include.lowest = TRUE))
re2 <- numeric(length(uq))
for (i in seq_along(uq)) {
  # an empty bin inherits the answer of the previous break point
  re2[i] <- ifelse(length(z[[i]]) == 0, re2[i-1], max(z[[i]]))
}
re2 <- rep(re2, rle(x)$lengths)   # one value per element of the sorted x
re2 <- re2[rk]                    # back to the original order of x
t2 <- proc.time() - pt
t2
   user  system elapsed
   0.06    0.02    0.21

set.seed(65535)
y <- 1:50000
x <- sample(y, 1000, replace = TRUE) + 0.5
system.time(re <- sapply(x, function(x, y) { y <- y[y <= x]; tail(y, 1) }, y))
   user  system elapsed
   1.11    0.00    1.11

all.equal(re, re2)

The gap is still clear.

Robert_Hoo

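Heh, my computer is too fast, so I quietly added a zero; otherwise the cut method takes less than 0.1 s (a test that finishes too quickly is unfair to the faster algorithm).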

rm(list = ls())
set.seed(65535)
y <- 1:50000
x <- sample(y, 1000, replace = TRUE) + 0.5
t1 <- system.time(re <- sapply(x, function(x, y) { y <- y[y <= x]; tail(y, 1) }, y))

#y <- c(1,2,5,6,7,9)
#x <- c(1,1,2,4,4,6,8)
pt <- proc.time()
rk <- rank(x)
x <- sort(x)
uq <- unique(x)
z <- split(y, cut(y, c(-Inf, uq), include.lowest = TRUE))
re2 <- numeric(length(uq))
for (i in seq_along(uq)) {
  re2[i] <- ifelse(length(z[[i]]) == 0, re2[i-1], max(z[[i]]))
}
re2 <- rep(re2, rle(x)$lengths)
re2 <- re2[rk]
t2 <- proc.time() - pt

t2
   user  system elapsed
   0.08    0.02    0.09
t1
   user  system elapsed
      1       0       1
all.equal(re, re2)
[1] TRUE

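Moreover, you will find that the gap between the two algorithms widens as the data grows; with both x and y one order of magnitude larger, their run times differ by more than a factor of 100.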

y <- 1:500000
x <- sample(y, 10000, replace = TRUE) + 0.5
t2
   user  system elapsed
   0.78    0.00    0.81
t1
   user  system elapsed
 111.03    0.02  111.50
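A back-of-the-envelope check of why the gap widens, assuming the cost of the sapply approach is dominated by scanning all of y once per element of x. The variable names are made up for illustration; the timings are the elapsed values quoted in this post:

```r
# Sizes used in the two runs reported above
m_small <- 1000;  n_small <- 50000
m_large <- 10000; n_large <- 500000

# sapply approach: one full pass over y for every element of x
comparisons_small <- m_small * n_small
comparisons_large <- m_large * n_large
work_ratio <- comparisons_large / comparisons_small
print(work_ratio)

# Ratios of the elapsed times quoted above (large run / small run)
elapsed_ratio <- c(sapply = 111.50 / 1, cut = 0.81 / 0.09)
print(round(elapsed_ratio, 1))
```

The comparison count grows by the product of the two size increases, which is in line with the sapply timing, while the cut timing grows roughly in step with the size of y alone.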

suckbunny

Reply to Robert_Hoo (post #26): Nice.

raphael210

Thanks to all the experts for the generous help. That said, today I stumbled on a function in base that does exactly this.

findInterval:

Find the indices of x in vec, where vec must be sorted (non-decreasingly); i.e., if i <- findInterval(x,v), we have v[i[j]] ≤ x[j] < v[i[j] + 1] where v[0] := - Inf, v[N+1] := + Inf, and N <- length(vec).

Better still, the function accepts an x that is unsorted and an x that contains duplicates, and it is extremely fast.

set.seed(65535)
y <- 1:50000
x <- sample(y, 1000, replace = TRUE) + 0.5
t1 <- system.time(re <- sapply(x, function(x, y) { y <- y[y <= x]; tail(y, 1) }, y))
t3 <- system.time(re3 <- y[findInterval(x, y)])

#y <- c(1,2,5,6,7,9)
#x <- c(1,1,2,4,4,6,8)
pt <- proc.time()
rk <- rank(x)
x <- sort(x)
uq <- unique(x)
z <- split(y, cut(y, c(-Inf, uq), include.lowest = TRUE))
re2 <- numeric(length(uq))
for (i in seq_along(uq)) {
  re2[i] <- ifelse(length(z[[i]]) == 0, re2[i-1], max(z[[i]]))
}
re2 <- rep(re2, rle(x)$lengths)
re2 <- re2[rk]
t2 <- proc.time() - pt

> t1
   user  system elapsed
   2.11    0.00    2.11
> t2
   user  system elapsed
   0.14    0.01    0.30
> t3
   user  system elapsed
      0       0       0
> all.equal(re,re2)
[1] TRUE
> all.equal(re,re3)
[1] TRUE
With larger data:
set.seed(65535)
y <- 1:500000
x <- sample(y, 1000, replace = TRUE) + 0.5
> t1
   user  system elapsed
  30.65    1.46   32.78
> t2
   user  system elapsed
   0.64    0.01    0.81
> t3
   user  system elapsed
   0.01    0.00    0.01
> all.equal(re,re2)
[1] TRUE
> all.equal(re,re3)
[1] TRUE
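For a case small enough to check by hand, here is findInterval on the toy vectors that appear commented out in the code above. The second half is an added caveat rather than something raised in the thread: an x below min(y) gets index 0, and indexing with 0 silently drops that element, so the sketch converts zeros to NA first (`x2` and `idx2` are made-up names).

```r
y <- c(1, 2, 5, 6, 7, 9)         # must be sorted in non-decreasing order
x <- c(1, 1, 2, 4, 4, 6, 8)      # may be unsorted and may repeat

idx <- findInterval(x, y)        # for each x, how many y values are <= x
print(idx)
print(y[idx])                    # largest y that does not exceed each x

# Caveat: an x below min(y) yields index 0, and y[0] drops that element
x2 <- c(0.5, 4, 10)
idx2 <- findInterval(x2, y)
idx2[idx2 == 0] <- NA            # keep the result aligned with x2
print(y[idx2])
```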

Once again, thanks to everyone for the enthusiastic discussion. I learned a great deal!

Ihavenothing

Reply to raphael210 (post #28):

Congratulations to the original poster on the knockout!

zggjtsgzczh

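R 3.0.0 seems to have been optimized: the same code runs a bit faster on R 3 than on R 2.x. One more thing: some of the posters above might consider a hardware upgrade if they process data regularly.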

suckbunny

Beaten in an instant...

Robert_Hoo

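Haha, after all that searching high and low, there it was all along... findInterval really is impressive.

I just wrote a new function, fun.Ind, that optimizes the search over y... For a large y its lookup is faster than findInterval.

It does nothing to optimize over x, though; it simply uses sapply.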

rm(list = ls())
set.seed(65535)
y <- 1:50000000
x <- sample(y, 1000, TRUE) + .5

t1 <- system.time(re1 <- y[findInterval(x, y)])

fun.Ind <- function(x, y) {
  sapply(x, FUN = function(x) {
    len.y <- length(y)
    # z = (lower, middle, upper) indices of the bracket being bisected
    z <- c(1, len.y/2, len.y)
    if (x < y[1]) Ind <- 0
    if (x >= y[len.y]) Ind <- len.y
    if (x < y[len.y] && x >= y[1]) {
      while (z[1] != z[2]) {
        temp <- z[2]
        if (x >= y[temp]) {
          z[1] <- floor(temp); z[2] <- floor((temp + z[3])/2)
        } else {
          z[3] <- floor(temp); z[2] <- floor((temp + z[1])/2)
        }
      }
      Ind <- z[2]
    }
    return(Ind)
  })
}

t2 <- system.time(re2 <- y[fun.Ind(x, y)])

pt <- proc.time()
rk <- rank(x)
x <- sort(x)
uq <- unique(x)
z <- split(y, cut(y, c(-Inf, uq), include.lowest = TRUE))
re3 <- numeric(length(uq))
for (i in seq_along(uq)) {
  re3[i] <- ifelse(length(z[[i]]) == 0, re3[i-1], max(z[[i]]))
}
re3 <- rep(re3, rle(x)$lengths)
re3 <- re3[rk]
t3 <- proc.time() - pt

all.equal(re1, re2)
[1] TRUE
all.equal(re2, re3)
[1] TRUE

t1  # findInterval
   user  system elapsed
   0.36    0.11    0.47
t2  # fun.Ind; could be faster still if the while loop were rewritten in C/C++
   user  system elapsed
   0.19    0.00    0.19
t3  # cut method
   user  system elapsed
  17.27    0.36   17.68
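fun.Ind is a bisection search over the sorted y. The sketch below is a simplified restatement of that idea, not the poster's code; `last_leq` is a hypothetical helper name and the toy vectors are made up. Each lookup takes about log2(length(y)) comparisons, whereas findInterval also validates that its `vec` argument is sorted, which touches every element; that may be part of why fun.Ind came out ahead here with a very large y and only 1000 lookups.

```r
# Hypothetical helper: index of the last element of sorted y that is <= x0,
# or 0 if there is none (the same convention findInterval uses)
last_leq <- function(x0, y) {
  lo <- 0L                 # invariant: every y[i] with i <= lo is <= x0
  hi <- length(y) + 1L     # invariant: every y[i] with i >= hi is >  x0
  while (hi - lo > 1L) {
    mid <- (lo + hi) %/% 2L
    if (y[mid] <= x0) lo <- mid else hi <- mid
  }
  lo
}

y <- c(1, 2, 5, 6, 7, 9)
x <- c(0.5, 1, 1, 2, 4, 4, 6, 8, 10)

idx <- vapply(x, last_leq, integer(1), y = y)
print(idx)
print(identical(idx, findInterval(x, y)))
```

Keeping the bounds as integers and halving the gap until it is 1 avoids the fractional midpoint and the separate edge-case branches.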

bjt

This kind of thing happens all the time; for example, I just came across this: http://dapengde.com/rinlife_easter/
