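This post is based on the article "Regression on Large Datasets in R":
http://cos.name/2013/08/regression-of-large-dataset-in-r/
I rewrote the example with Oracle R Enterprise (ORE), which is another way to work around running out of memory on the R client (although it cannot take advantage of parallelism).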
# Generate the data
set.seed(123)
n <- 5000000
p <- 5
system.time(x <- matrix(rnorm(n * p), n, p))
x <- cbind(1, x)            # prepend a constant column of ones
bet <- c(2, rep(1, p))      # true coefficients
y <- c(x %*% bet) + rnorm(n)
gc()
dat <- as.data.frame(x)
rm(x)                       # free the matrix once it has been copied
gc()
dat$y <- y
rm(y)
gc()
colnames(dat) <- c(paste("x", 0:p, sep = ""), "y")
gc()
Save the data to the Oracle database
# A connection to the database was opened beforehand with ORE (Oracle R Enterprise), like this:
# ore.connect(user,sid,host,password,port,all=TRUE)

# Save the dat data frame to the Oracle database as a table named ORACLE_TAB_DAT
system.time(ore.create(dat,table="ORACLE_TAB_DAT"))
(1) Run the regression with ORE's ore.lm function. (Note: ORACLE_TAB_DAT is a table in Oracle, so ore.lm must be used instead of the regular lm function.)
system.time(mod <- ore.lm( y ~., ORACLE_TAB_DAT))
Output:
ORE> system.time(mod <- ore.lm( y ~., ORACLE_TAB_DAT))
   user  system elapsed
  0.263   0.007  50.474
ORE> summary(mod)

Call:
ore.lm(formula = y ~ ., data = ORACLE_TAB_DAT)

Residuals:
     Min       1Q   Median       3Q      Max
-101.320   -6.283   -0.106    6.088  113.854

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.361706   0.005354    7351  < 2e-16 ***
x0                 NA         NA      NA       NA
x1          19.681085   0.005354    3676  < 2e-16 ***
x2          19.678239   0.005355    3675  < 2e-16 ***
x3          19.680580   0.005351    3678  < 2e-16 ***
x4          23.861695   0.005354    4457  < 2e-16 ***
x5          23.936075   0.005352    4473 4.44e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.97 on 4999994 degrees of freedom
Multiple R-squared: 0.9414, Adjusted R-squared: 0.9414
F-statistic: 1.606e+07 on 5 and 4999994 DF, p-value: < 2.2e-16

ORE> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 753819 40.3    1166886 62.4  1073225 57.4
Vcells 920709  7.1    1598044 12.2  1125030  8.6
ORE> ls()
[1] "mod"
ORE>
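The gc() output shows that, with mod as the only object in the workspace, the R client uses only about 40 MB (Ncells) plus 7 MB (Vcells) of memory.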
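The NA row for x0 in the summary above is expected: x0 is the constant column of ones added by cbind(1, x), and the formula y ~ . adds its own intercept, so the two are perfectly collinear. The sketch below reproduces this on a small local data frame with plain lm rather than ore.lm. The names dat_small, fit and fit0 and the reduced n are assumptions made only for this illustration.

```r
# Small-scale version of the simulation (n reduced so it runs locally in plain R).
set.seed(123)
n <- 1000
p <- 5
x <- cbind(1, matrix(rnorm(n * p), n, p))   # first column is the constant 1
bet <- c(2, rep(1, p))
dat_small <- as.data.frame(x)
colnames(dat_small) <- paste("x", 0:p, sep = "")
dat_small$y <- c(x %*% bet) + rnorm(n)

# y ~ . adds an intercept on top of x0, so one of the two cannot be estimated.
fit <- lm(y ~ ., data = dat_small)
alias_x0 <- is.na(coef(fit)[["x0"]])        # TRUE: x0 is reported as NA

# Dropping the automatic intercept lets x0 take over that role.
fit0 <- lm(y ~ . - 1, data = dat_small)
coef(fit0)
```

Whether to drop x0 from the table or remove the intercept from the formula is a matter of taste; either way the fitted values are the same.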
(2) Run biglm on the ORACLE_TAB_DAT table in Oracle
# Run ore.tableApply; the table is passed to the function as dat
modList <- ore.tableApply(
  X=ORACLE_TAB_DAT,
  function(dat) {
    library(biglm)
    bigmod <- biglm(y ~ x0+x1+x2+x3+x4+x5, dat)
    summary(bigmod)
  })

# Pull the result into the local R client
modList_local <- ore.pull(modList)

# While the ore.tableApply call above runs, Oracle creates an object in the database with a name like ORE$185_117 to hold the result
ORE> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 742514 39.7    1166886 62.4   899071 48.1
Vcells 908294  7.0    1598044 12.2  1125030  8.6
ORE> modList_local <- ore.pull(modList)
ORE> gc()
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   742894  39.7    1590760   85.0   1026996   54.9
Vcells 70907371 541.0  143830781 1097.4 141748327 1081.5
modList_local takes up roughly 500 MB of local memory (Vcells grows from 7.0 MB to 541.0 MB), and the ORE$185_117 table is close to 600 MB.
ORE> modList_local$mat
                Coef     (95%      CI)          SE  p
(Intercept) 39.36171 39.35100 39.37241 0.005354317  0
x0                NA       NA       NA          NA NA
x1          19.68108 19.67038 19.69179 0.005353924  0
x2          19.67824 19.66753 19.68895 0.005354583  0
x3          19.68058 19.66988 19.69128 0.005351436  0
x4          23.86169 23.85099 23.87240 0.005353987  0
x5          23.93607 23.92537 23.94678 0.005351736  0
ORE> modList_local$rsq
[1] 0.9413915
ORE>
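Only the mat and rsq components are inspected above, so the function passed to ore.tableApply could return just those two pieces. This is a sketch and was not run against the database in this post. One possible reason for the large result (an assumption, not verified here) is that the summary object carries the fitted model object along with it. The helper name fit_small and the variable modSmall are hypothetical.

```r
# Hypothetical helper: fit with biglm, but return only the small summary pieces.
fit_small <- function(dat) {
  library(biglm)
  bigmod <- biglm(y ~ x0 + x1 + x2 + x3 + x4 + x5, dat)
  s <- summary(bigmod)
  # Keep the coefficient table and R-squared, drop everything else.
  list(mat = s$mat, rsq = s$rsq)
}

# With an ORE connection open, it would be used like the call in (2):
# modSmall <- ore.pull(ore.tableApply(ORACLE_TAB_DAT, fit_small))
# modSmall$mat
# modSmall$rsq
```

The returned list holds a 7 x 5 numeric matrix and a single number, so it should take far less memory than the full summary object, both in the database and on the client.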
~~HAVE FUN~~~