Liechi, what if the data source is fairly large, for example a database?
library(DBI)
library(dplyr) # provides tbl(), %>%, select(), filter(), summarise(); needs dbplyr installed
con <- dbConnect(
  RSQLite::SQLite(),
  "F:/lend-loan/lending-club-loan-data/lend_loan.sqlite"
)
What would be a sensible way to handle that? Anyone interested can take a look at this dataset:
https://github.com/XiangyunHuang/RGraphics/releases/download/v0.4/lend_loan.sqlite
Note that in this dataset missing values are stored as the empty string "". I wrote a query that checks a single field, for example emp_title:
tbl(con, "loan") %>%
  select(emp_title) %>%
  filter(emp_title == "") %>%
  summarise(ratio = n() / 2260668)
But when I do it like this for every column, it is very slow:
# Pull the entire table into memory first
lend_loan <- tbl(con, "loan") %>%
  collect()
# Share of "" values in each column
missing_lend_loan <- apply(lend_loan, 2, function(x) {
  mean(x == "")
})
Or:
mean_na <- function(x) mean(x == "")
missing_lend_loan <- tbl(con, "loan") %>%
  collect() %>%
  summarise_all(.funs = mean_na)
Am I going about this the wrong way?
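For comparison, here is a minimal sketch of one alternative: push the per-column computation into SQLite so that only a single summary row comes back, instead of calling collect() on the full table. It is not tested against lend_loan.sqlite. The toy in-memory table, the connection name con_demo, and the helper missing_ratio_sql() are all made up for illustration, and it assumes every column marks missing values with "" (NULL values would be counted as non-missing here).

```r
library(DBI)

# Toy in-memory database standing in for lend_loan.sqlite (assumption:
# the real "loan" table also stores missing values as "").
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "loan", data.frame(
  emp_title = c("", "teacher", "", "nurse"),
  grade     = c("A", "", "B", "C"),
  stringsAsFactors = FALSE
))

# Hypothetical helper: build one SELECT that computes, for every column,
# the share of rows equal to "". The aggregation runs inside the database.
missing_ratio_sql <- function(con, table) {
  cols <- dbListFields(con, table)
  exprs <- vapply(cols, function(col) {
    id <- as.character(dbQuoteIdentifier(con, col))
    paste0("AVG(CASE WHEN ", id, " = '' THEN 1.0 ELSE 0.0 END) AS ", id)
  }, character(1))
  paste0(
    "SELECT ", paste(exprs, collapse = ", "),
    " FROM ", as.character(dbQuoteIdentifier(con, table))
  )
}

# One row, one column per field, each holding that field's missing ratio
missing_demo <- dbGetQuery(con_demo, missing_ratio_sql(con_demo, "loan"))
print(missing_demo)

dbDisconnect(con_demo)
```

With the real data the same helper would be called as dbGetQuery(con, missing_ratio_sql(con, "loan")). Using AVG also avoids hard-coding the row count 2260668. Whether this is actually faster on the full table is something to measure rather than assume.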