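tctcab That is mostly the case. An `Authors@R` value of NA means the package has no other contributors, so I drop those packages and then extract the contributors from the `Author` field. Here is the cleaning function: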
if(file.exists('data/packages.rds')){
cran_packages <- as.data.frame( readRDS( file = 'data/packages.rds' ), stringsAsFactors = FALSE)
# Just download the file once and cache it locally; tools::CRAN_package_db() fetches it online every time, which is slow
}else{
# The dataset behind `address` changes every day
address <- 'https://cran.r-project.org/web/packages/packages.rds'
con <- url( address, "rb" )
cran_packages <- as.data.frame( readRDS( gzcon( con ) ), stringsAsFactors = FALSE)
close(con)
}
(NAmissing <- colnames( cran_packages )[ apply( cran_packages, 2, function( x ) { all( is.na(x) ) } ) ])
cran_packages <- cran_packages[, setdiff( colnames( cran_packages ), NAmissing )]
sub_cran_pkgs <- cran_packages[which(!is.na(cran_packages$`Authors@R`)),]
sub_cran_pkgs <- sub_cran_pkgs[,c("Package","Maintainer","Author","Authors@R")]
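clean_author2 <- function(x) {
  x <- gsub("\\[.*?\\]|\\(.*?\\)","",x) # drop [roles] and (comments), non-greedy
  x <- gsub("<([^<>]*)>","",x) # drop <email> parts
  x <- gsub("\\\n","",x) # drop newlines
  x <- gsub("(\\\t)|(\\\")|(\\\')|(')","",x) # drop tabs and quote characters
  x <- gsub(" +$", "", x) # strip trailing spaces
  x <- gsub("  ","",x) # remove double spaces left behind by the deletions above
  x # return the cleaned string explicitly
}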
author_names <- lapply(sub_cran_pkgs$Author, clean_author2)
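For example, some entries have nested parentheses like (()), which produces the scrambled result below. This is one case I happened to spot; there are surely others I cannot anticipate, and my regex skills are already stretched to the limit.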
> sub_cran_pkgs[54,]$Author
[1] "Hiroshi Akima [aut, cph] (Fortran code (TOMS 760, 761, 697 and 433)),\n Albrecht Gebhardt [aut, cre, cph] (R port (interp* functions), bicubic*\n functions),\n Thomas Petzold [ctb, cph] (aspline function),\n Martin Maechler [ctb, cph] (interp2xyz function + enhancements),\n YYYY Association for Computing Machinery, Inc. [cph] (covers code from\n TOMS 760, 761, 697 and 433)"
> author_names[54]
[[1]]
[1] "Hiroshi Akima),Albrecht Gebhardt, bicubic*functions),Thomas Petzold,Martin Maechler,YYYY Association for Computing Machinery, Inc."
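The non-greedy `\\(.*?\\)` stops at the first `)`, so the outer `)` of a nested comment survives. One possible workaround, only a sketch, is to delete the innermost `(...)` group repeatedly until the string stops changing. The names `strip_nested_parens` and `clean_author3` are made up for illustration, and the sketch assumes the parentheses are balanced.

```r
# Hypothetical helper: repeatedly delete innermost "(...)" groups,
# so nested comments such as "(a (b) c)" disappear completely.
strip_nested_parens <- function(x) {
  repeat {
    y <- gsub("\\([^()]*\\)", "", x)  # innermost groups only
    if (identical(y, x)) break        # nothing left to remove
    x <- y
  }
  x
}

# Hypothetical variant of clean_author2 built on the helper above.
clean_author3 <- function(x) {
  x <- strip_nested_parens(x)               # (comments), nested or not
  x <- gsub("\\[[^]]*\\]", "", x)           # [roles] such as [aut, cre]
  x <- gsub("<[^<>]*>", "", x)              # <email> parts
  x <- gsub("[\"']", "", x)                 # quote characters
  x <- gsub("[[:space:]]+", " ", x)         # newlines, tabs, repeated spaces
  x <- gsub(" ,", ",", x, fixed = TRUE)     # no space before a comma
  trimws(x)
}

example_author <- "A B [aut] (code (TOMS 760)),\n C D [ctb] (port (f* functions), g*\n functions)"
clean_author3(example_author)
```

Unbalanced parentheses would still slip through, so this narrows the problem rather than solving it.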
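yihui once said that `?regex` has to be read eight hundred times. What is one to do? tctcab, do you know of any accessible, easy-to-look-up references?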
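Since every row kept in `sub_cran_pkgs` has a non-NA `Authors@R`, another route that avoids regex altogether might be to evaluate that field into `person` objects and format only the names. This is just a sketch: the `authors_r` string below is a made-up stand-in for one `Authors@R` entry, and `eval(parse())` runs whatever code the field contains, so I would only use it on data I trust.

```r
# Made-up Authors@R entry for illustration, not copied from CRAN.
authors_r <- 'c(
  person("Hiroshi", "Akima", role = c("aut", "cph"),
         comment = "Fortran code (TOMS 760, 761, 697 and 433)"),
  person("Albrecht", "Gebhardt", role = c("aut", "cre", "cph"),
         comment = "R port (interp* functions)")
)'

# Evaluate the R expression stored in the field; the result is a
# vector of utils::person objects.
people <- eval(parse(text = authors_r))

# Keep only given and family names, so roles and nested comments
# never need to be stripped by hand.
person_names <- format(people, include = c("given", "family"))
person_names
```

For the real data this could be wrapped in `lapply(sub_cran_pkgs$`Authors@R`, ...)` together with `tryCatch()`, since some entries may fail to evaluate.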