回复 第14楼 的 刘思喆:
Why do you think so? Now I'm only using element-wise matrix operations:
<br />
> system.time({mx=colSums(x); (colSums(y*x)-mx*mean(y))/(colSums(x*x)-mx*mx/length(y))})<br />
user system elapsed<br />
0.259 0.060 0.319<br />
In this version, I cannot see where BLAS is used. Actually
lm.fit uses QR decomposition, which usually depends on LAPACK/BLAS.
<br />
$ R CMD config BLAS_LIBS<br />
-L/home/***/lib64/R/lib -lRblas<br />
$ R CMD config LAPACK_LIBS<br />
-L/home/***/lib64/R/lib -lRlapack<br />
These seem pointing to the default LAPACK/BLAS of R on this system.</p>
In the last version of the code above, it seems to me that R_binary and real_binary haven't even used openmp. But colSums does use openmp. So, let's check the benefit of openmp:
<br />
(default.max=.Internal(setMaxNumMathThreads(4L))<br />
#[1] 1<br />
(default=.Internal(setNumMathThreads(1L))<br />
#[1] 1</p>
<p>set.seed(1001)<br />
y = rnorm(1000)<br />
x = matrix(rnorm(10000000), 1000, 10000)<br />
system.time(for (i in 1:ncol(x)) lm.fit(y = y, x = cbind(rep(1,NROW(x)),x[,i]))$coefficients[2])<br />
# user system elapsed<br />
# 3.643 0.090 3.748</p>
<p>system.time(cor(y,x)*sd(y)/apply(x,2L,sd))<br />
# user system elapsed<br />
# 1.448 0.044 1.497</p>
<p>system.time(replicate(10L, {mx=colSums(x); (y%*%x-mx*mean(y))/(colSums(x*x)-mx*mx/length(y))}))/10L<br />
# user system elapsed<br />
# 0.2331 0.0281 0.2612</p>
<p>system.time(replicate(10L, {mx=colSums(x); (colSums(y*x)-mx*mean(y))/(colSums(x*x)-mx*mx/length(y))}))/10L<br />
# user system elapsed<br />
# 0.3136 0.0548 0.3684</p>
<p>.Internal(setNumMathThreads(4L))<br />
#[1] 1</p>
<p>system.time(for (i in 1:ncol(x)) lm.fit(y = y, x = cbind(rep(1,NROW(x)),x[,i]))$coefficients[2])<br />
# user system elapsed<br />
# 3.670 0.000 3.671</p>
<p>system.time(cor(y,x)*sd(y)/apply(x,2L,sd))<br />
# user system elapsed<br />
# 1.262 0.028 1.290</p>
<p>system.time(replicate(10L, {mx=colSums(x); (y%*%x-mx*mean(y))/(colSums(x*x)-mx*mx/length(y))}))/10L<br />
# user system elapsed<br />
# 0.5100 0.0264 0.1997</p>
<p>system.time(replicate(10L, {mx=colSums(x); (colSums(y*x)-mx*mean(y))/(colSums(x*x)-mx*mx/length(y))}))/10L<br />
# user system elapsed<br />
# 0.8629 0.0567 0.2889</p>
<p>.Internal(setNumMathThreads(default))<br />
#[1] 4<br />
.Internal(setMaxNumMathThreads(default.max))<br />
#[1] 4<br />
This does not look very impressive...
</p>