How to extract contributor names from R packages

Cloud2016 · November 15, 2017

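Using pdb <- tools::CRAN_package_db() I can get the metadata for every R package on CRAN, and I want to extract each package's developer and contributors. For the developer I use the Maintainer field. Contributors have to come from the Author or Authors@R field. The Author field is just free-form text: I would have to recognize names in it and work out from context who is a contributor, and I don't know how to do that (I know regular expressions can pull out quite a few, but there are many cases they can't handle). So I looked at the Authors@R field and found that it holds strings containing person objects (see ?person for the class). The string first has to be turned into an expression and evaluated, x = eval(parse(text = XX)), which yields a person object, and then paste(x$given, x$family) gives the name. I thought I had found a shortcut, but many people don't follow the format, so the evaluation step either doesn't return a person object or returns a mess. That route is blocked too, and in the end there is no avoiding digging contributor names out of a pile of text.

So I'm asking the experts here for help.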
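For what it's worth, the Authors@R route described above can be made tolerant of malformed entries by wrapping the evaluation in tryCatch and checking the class of the result. The sketch below is only an illustration: parse_authors_r is a hypothetical helper name, and it assumes the field holds R code built from person() calls. Note that eval(parse()) runs whatever code the field contains, so it should only be used on data you trust.

```r
# Hypothetical helper: turn one Authors@R string into "Given Family" names.
# Returns NA when the string is missing, fails to parse or evaluate,
# or does not evaluate to a person object.
parse_authors_r <- function(txt) {
  if (is.na(txt)) return(NA_character_)
  p <- tryCatch(eval(parse(text = txt)), error = function(e) NULL)
  if (!inherits(p, "person")) return(NA_character_)
  vapply(seq_along(p), function(i) {
    # given may have several parts and family may be NULL; paste what exists
    paste(c(p[i]$given, p[i]$family), collapse = " ")
  }, character(1))
}

# Made-up example input, not taken from CRAN
parse_authors_r('c(person("Jane", "Doe", role = c("aut", "cre")), person("John", "Smith", role = "ctb"))')
parse_authors_r("this is not valid R (")
```

Packages where the helper returns NA would still need the Author text as a fallback.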

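tctcab · November 15, 2017

I took a look, and the Author column alone should be enough; the Authors@R column is actually NA for many packages.

In the Author column:

names are separated by commas, so a strsplit gives you the names;
occasionally there are fragments like [aut,dtc,cre] or <aaa@gmail.com>, which can simply be removed with a regex.

So just using the Author column should do.
The code is as follows: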


db <- tools::CRAN_package_db()

db_author <- db$Author
db_author_tidy <- gsub("\\[.*?\\]|<.*?>","",db_author)
db_author_tidy_split <- strsplit(db_author_tidy,",") 

db_author_list <- lapply(db_author_tidy_split,stringr::str_trim)

Result:

db_author_list[1:5]

[[1]]
[1] "Scott Fortmann-Roe"

[[2]]
[1] "Gaurav Sood"

[[3]]
[1] "Csillery Katalin"  "Lemaire Louisiane" "Francois Olivier"  "Blum Michael"     

[[4]]
[1] "Csillery Katalin"  "Lemaire Louisiane" "Francois Olivier"  "Blum Michael"     

[[5]]
[1] "Abdulmonem Alsaleh" "Robert Weeks"       "Ian Morison"       
[4] "RStudio"

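Cloud2016 · November 15, 2017

tctcab That holds for most of them. I take an NA in Authors@R to mean the package has no other contributors, so I drop those packages and then extract contributors from the Author field with a cleaning function: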

if (file.exists('data/packages.rds')) {
  # Use the saved local copy: tools::CRAN_package_db() downloads online and is slow
  cran_packages <- as.data.frame(readRDS(file = 'data/packages.rds'), stringsAsFactors = FALSE)
} else {
  # The dataset at this address changes every day
  address <- 'https://cran.r-project.org/web/packages/packages.rds'
  con <- url(address, "rb")
  cran_packages <- as.data.frame(readRDS(gzcon(con)), stringsAsFactors = FALSE)
  close(con)
}
# Names of columns that are entirely NA (printed, then dropped)
(NAmissing <- colnames(cran_packages)[apply(cran_packages, 2, function(x) { all(is.na(x)) })])
cran_packages <- cran_packages[, setdiff(colnames(cran_packages), NAmissing)]
# Keep only packages that have an Authors@R field
sub_cran_pkgs <- cran_packages[which(!is.na(cran_packages$`Authors@R`)), ]
sub_cran_pkgs <- sub_cran_pkgs[, c("Package", "Maintainer", "Author", "Authors@R")]
clean_author2 <- function(x) {
  x <- gsub("\\[.*?\\]|\\(.*?\\)", "", x)      # drop [roles] and (comments), non-greedy
  x <- gsub("<([^<>]*)>", "", x)               # drop <email> parts
  x <- gsub("\\\n", "", x)                     # drop newlines
  x <- gsub("(\\\t)|(\\\")|(\\\')|(')", "", x) # drop tabs and quote characters
  x <- gsub(" +$", "", x)                      # drop trailing spaces
  x <- gsub("  ", "", x)                       # delete every pair of consecutive spaces (removes them, does not collapse to one)
  x                                            # return the cleaned string explicitly
}
author_names <- lapply(sub_cran_pkgs$Author, clean_author2)

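For example, there are cases where parentheses are nested, (()), which leads to garbled results like the one below. This is just one I happened to spot; the other cases are impossible to guard against, and my regex skills are stretched to the limit.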

> sub_cran_pkgs[54,]$Author
[1] "Hiroshi Akima [aut, cph] (Fortran code (TOMS 760, 761, 697 and 433)),\n  Albrecht Gebhardt [aut, cre, cph] (R port (interp* functions), bicubic*\n    functions),\n  Thomas Petzold [ctb, cph] (aspline function),\n  Martin Maechler [ctb, cph] (interp2xyz function + enhancements),\n  YYYY Association for Computing Machinery, Inc. [cph] (covers code from\n    TOMS 760, 761, 697 and 433)"
> author_names[54]
[[1]]
[1] "Hiroshi Akima),Albrecht Gebhardt, bicubic*functions),Thomas Petzold,Martin Maechler,YYYY Association for Computing Machinery, Inc."
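One possible way around the nested parentheses, sketched under assumptions rather than offered as a complete fix: delete only the innermost (...) groups and repeat until the string stops changing, so the nesting is peeled off layer by layer. strip_nested and extract_names are hypothetical names, and the test string is a shortened version of the Author value shown above.

```r
# Hypothetical helper: remove [..] role tags, then peel off (...) comments
# from the inside out until no balanced pair is left.
strip_nested <- function(x) {
  x <- gsub("\\[[^]]*\\]", "", x)        # [aut, cre] style role tags
  repeat {
    y <- gsub("\\([^()]*\\)", "", x)     # innermost parentheses only
    if (identical(y, x)) break
    x <- y
  }
  x
}

# Hypothetical helper: clean one Author string and split it into names.
extract_names <- function(x) {
  parts <- strsplit(strip_nested(x), ",")[[1]]
  parts <- trimws(gsub("\\s+", " ", parts))  # collapse runs of whitespace
  parts[nzchar(parts)]                       # drop empty pieces
}

akima <- paste0(
  "Hiroshi Akima [aut, cph] (Fortran code (TOMS 760, 761, 697 and 433)),\n",
  "  Albrecht Gebhardt [aut, cre, cph] (R port (interp* functions), bicubic*\n",
  "    functions),\n",
  "  Thomas Petzold [ctb, cph] (aspline function)"
)
extract_names(akima)
```

This does not help with names that themselves contain a comma, such as "Association for Computing Machinery, Inc.", or with unbalanced parentheses; those would still be split or left in place.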

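yihui once said that ?regex needs to be read eight hundred times. What can I do? tctcab, do you know of any accessible, easy-to-search reference?

tctcab · November 15, 2017

Cloud2016

This clean_author2 really has me on my knees...

As for regex, just use it more; once you're used to it, it becomes muscle memory.

I think this website is pretty good; it has helped me a lot.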

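Cloud2016 · November 16, 2017

tctcab The regex ended up like that because there are quite a few other cases, as you'll see if you're willing to look at more of the data. Below I only pick out names whose length falls in a roughly reasonable range: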

temp <- nchar( unlist(db_author_list) )
sort( unlist(db_author_list)[which(temp > 5 & temp < 25 )] )[1:1000]

You'll find a whole bunch like this:

   [1] "'Andriyana"                "'Gijbels"                  "'Hadley Wickham'"          "'Hadley Wickham'"         
   [5] "'Ibrahim M. A.'"           "'Imanuel Costigan'"        "'Karthik Ram'"             "'R/all.is.numeric.R'"     
   [9] "'R/in.opererator.R'"       "'R/makeNames.R'"           "'R/read.xport.R'"          "'src/SASxport.c'"         
  [13] "'src/SASxport.h'"          "'SuiteSparse'"             "'sumtxt' from GitHub"      "'Verhasselt A.'"          
  [17] "\"Akantziliotou\""         "\"Alexandre Genin  \""     "\"Argyropoulos\""          "\"Bartlett\""             
  [21] "\"Belmans\""               "\"Burkoff\""               "\"Busetto\""               "\"Chen\""                 
  [25] "\"Clarke\""                "\"cph\")"                  "\"cph\")"                  "\"cph\")"                 
  [29] "\"cph\")"                  "\"cph\"))"                 "\"cph\")))"                "\"cph\")))"               
  [33] "\"cre\")"                  "\"cre\")"                  "\"cre\")"                  "\"cre\")"                 
  [37] "\"cre\")"                  "\"cre\")"                  "\"cre\")"                  "\"cre\")"                 
  [41] "\"cre\"))"                 "\"cre\"))"                 "\"cre\"))"                 "\"cre\"))"                
  [45] "\"cre\"))"                 "\"cre\"))"                 "\"cre\"))\n    )"          "\"cre\")))"               
  [49] "\"ctr\"))"                 "\"Curley\""                "\"Dang\""                  "\"DecisionPatterns \""    
  [53] "\"Developer\""             "\"Djennad\""               "\"Dunnington\""            "\"Duthie\""               
  [57] "\"Eckley\""                "\"Enea\""                  "\"Fan Zhang  \""           "\"Fearnhead\""            
  [61] "\"Ghalanos\""              "\"Graul\""                 "\"Grishin\""               "\"Hadley\""               
  [65] "\"Hansen\""                "\"Hassani\""               "\"Haynes\""                "\"Heller\""               
  [69] "\"Jim Pearson\""           "\"John Harrison\""         "\"Joubert\""               "\"Kane\""                 
  [73] "\"Killick\""               "\"Malik\""                 "\"Mariette\""              "\"Markus Loecher"         
  [77] "\"Matthias Bannert  \""    "\"McElduff\""              "\"McGearailt\""            "\"Metcalfe\""             
  [81] "\"Motpan\""                "\"Onglao\""                "\"Ospina\""                "\"Passos\""               
  [85] "\"Rigby\""                 "\"Ruau\""                  "\"Seidl\""                 "\"Shingo Yamamoto (gloops"
  [89] "\"Sidi\""                  "\"Simon"                   "\"Smith\""                 "\"Stasinopoulos\""        
  [93] "\"Ushey\""                 "\"Villa-Vialaneix\""       "\"Voudouris\""             "\"Waikato\""              
  [97] "\"Wijffels\""              "\"Wijffels\""              "\"World Bank Group\""      "\"Ziqi Lu  \""            
 [101] "& Kevin Grimm"             "& Nicholas D. Myers"       "& Nicole Heussen"          "(et al)"                  
 [105] "(et. al.)"                 "(et. al.)"                 "(Marie-Josee Fortin"       "(Tony) Jianguo Sun"       
 [109] "(Vega-Lite library)"       "Øystein Flagstad"          "Øyvind Hjelle"             "Øyvind Langsrud"          
 [113] "Øyvind Langsrud"           "Ľudmila Šimková"           "1990)."                    "4D Strategies"            
 [117] "697 and 433)"              "697 and 433))"             "A. Alexander Beaujean"     "A. Andrew M. MacDonald"   
 [121] "A. Arcagni"                "A. Bensadoun"              "A. Bensadoun"              "A. Bouvier"

And the entries longer than 25 characters, sort( unlist(db_author_list)[which(temp > 25 )] )[1:1000], all need further cleaning:

   [1] "'R/importConvertDateTime.R'"
   [2] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
   [3] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
   [4] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
   [5] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
   [6] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
   [7] "\"Jim Pearson\"  and \"U. S. Census Bureau\""                                                                                                                                                                                                                       
   [8] "\"Miller Zijie Zhu  \"\n             ))"                                                                                                                                                                                                                            
   [9] "(ii) the Knuth-TAOCP RNG from D. Knuth."                                                                                                                                                                                                                            
  [10] "Øyvind Langsrud and Bjørn-Helge Mevik"                                                                                                                                                                                                                              
  [11] "A. Bradley Duthie  (0000-0001-8343-4995)"                                                                                                                                                                                                                           
  [12] "A. Carpentier  (Matlab original)"                                                                                                                                                                                                                                   
  [13] "A. I. McLeod  and Mehmet Balcilar\n        ."                                                                                                                                                                                                                       
  [14] "A. I. McLeod and Hyukjun Gweon"                                                                                                                                                                                                                                     
  [15] "A. I. McLeod and N. M. Mohammad"                                                                                                                                                                                                                                    
  [16] "A. K. Nikoloulopoulos  and H. Joe"                                                                                                                                                                                                                                  
  [17] "A.I. McLeod  and Changjiang Xu"                                                                                                                                                                                                                                     
  [18] "A.I. McLeod and Justin Veenstra"                                                                                                                                                                                                                                    
  [19] "A.J. Perez-Luque; R. Moreno; R. Perez-Perez and F.J. Bonet"                                                                                                                                                                                                         
  [20] "AA (azarianaa1@mums.ac.ir)"                                                                                                                                                                                                                                         
  [21] "Aaron A. King  and Marguerite A. Butler"                                                                                                                                                                                                                            
  [22] "Aaron Clauset/Rouven Strauss and Miguel Rodriguez-Girones"                                                                                                                                                                                                          
  [23] "Aaron Robotham and Danail Obreschkow"                                                                                                                                                                                                                               
  [24] "Abba Krieger and William J. Blanford"                                                                                                                                                                                                                               
  [25] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
  [26] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
  [27] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
  [28] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
  [29] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
  [30] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
  [31] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
  [32] "Abbasali Khalili and Shili Lin ."                                                                                                                                                                                                                                   
  [33] "Abdolreza Mohammadi and Ernst Wit"                                                                                                                                                                                                                                  
  [34] "Abdulla Abdurakhmanov  (Code in xml2json.sjs is from https://code.google.com/p/x2js/)"                                                                                                                                                                              
  [35] "Abdullah Almsaeed  (Dashboard CSS)"                                                                                                                                                                                                                                 
  [36] "Abla Boudraa and Zebida Gheribi-Aoulmi"                                                                                                                                                                                                                             
  [37] "AC Del Re & William T. Hoyt"                                                                                                                                                                                                                                        
  [38] "AC Del Re & William T. Hoyt"                                                                                                                                                                                                                                        
  [39] "Academy of Sciences of the Czech Republic)"                                                                                                                                                                                                                         
  [40] "Achim Zeileis  (author of R wrappers to tth/ttm)"                                                                                                                                                                                                                   
  [41] "Achim Zeileis  (Contributions to dynrq code essentially\n    identical to his dynlm code)"                                                                                                                                                                          
  [42] "Achim Zeileis (R code) and the R community (fortunes).\n        Contributions (fortunes and/or code) by Torsten Hothorn"                                                                                                                                            
  [43] "Acho Arnold  (C original matrix library"                                                                                                                                                                                                                            
  [44] "Adam Kapelner and Justin Bleich (R package)"                                                                                                                                                                                                                        
  [45] "Adam Kapelner and Justin Bleich (R package)"                                                                                                                                                                                                                        
  [46] "Adam L. Pintar and Zachary H. Levine"                                                                                                                                                                                                                               
  [47] "Adam M. Johansen and Leah F. South"                                                                                                                                                                                                                                 
  [48] "Adam Pearce  (core contributor to d3-jetpack)"                                                                                                                                                                                                                      
  [49] "Adam Rahman and Wayne Oldford"                                                                                                                                                                                                                                      
  [50] "Adam Rahman and Wayne Oldford (R)"                                                                                                                                                                                                                                  
  [51] "Adam Sparks  (https://orcid.org/0000-0002-0061-8359)"                                                                                                                                                                                                               
  [52] "Adam Sparks  (https://orcid.org/0000-0002-0061-8359)"                                                                                                                                                                                                               
  [53] "Adam Sparks  (https://orcid.org/0000-0002-0061-8359)"                                                                                                                                                                                                               
  [54] "Adline Dsilva  (First version Matrix heatmap)"                                                                                                                                                                                                                      
  [55] "Adobe Systems Incorporated  (Source Sans Pro font)"                                                                                                                                                                                                                 
  [56] "Adobe Systems Incorporated  (Source Sans Pro font)"                                                                                                                                                                                                                 
  [57] "Adri van Os  (Author 'antiword' utility)"                                                                                                                                                                                                                           
  [58] "Adrian Baddeley  (C function 'BinDist' copied from package\n    'stats')"                                                                                                                                                                                           
  [59] "Adrian Barnett and Peter Baker"                                                                                                                                                                                                                                     
  [60] "Adrian Bowman and Adelchi Azzalini. \n         Ported to R by B. D. Ripley  up to version 2.0"                                                                                                                                                                      
  [61] "Adrian R. Waddell and R. Wayne Oldford"                                                                                                                                                                                                                             
  [62] "Adrian R. Waddell and R. Wayne Oldford"
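Many of the long entries above are several names joined by "and", "&" or ";" rather than by commas. As a rough sketch (split_more is a hypothetical name, and the separator list is an assumption based only on the entries visible here), a second splitting pass could look like this:

```r
# Hypothetical helper: split entries on " and ", "&" or ";", then strip
# wrapping quotes and surplus whitespace from each piece.
split_more <- function(x) {
  parts <- unlist(strsplit(x, "\\s+and\\s+|\\s*[;&]\\s*"))
  parts <- trimws(parts)
  parts <- gsub("^[\"']+|[\"']+$", "", parts)  # quotes at the ends only
  parts <- trimws(gsub("\\s+", " ", parts))
  parts[nzchar(parts)]
}

split_more(c(
  "A. I. McLeod and Hyukjun Gweon",
  "AC Del Re & William T. Hoyt",
  "\"Jim Pearson\"  and \"U. S. Census Bureau\""
))
```

Splitting on "and" will wrongly break up organization names that contain the word, and parenthetical notes such as "(R package)" still need the earlier cleaning, so this is only one more partial step.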

tctcab · November 16, 2017

Cloud2016 It really is a hassle. I'll take another look when I have time.

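Cloud2016 · January 20

I've stopped fiddling with this: I now extract names directly from the Authors@R field and ignore the packages where it is missing. The initial results of exploring the developer collaboration network are here.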
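As an illustration of how per-package name lists could feed a collaboration network, here is a minimal sketch. It is not the code behind the linked results: pkg_authors is made-up toy data, and the approach (one undirected edge per pair of co-authors, weighted by the number of shared packages) is an assumption.

```r
# Toy input: package name -> contributor names (invented for illustration)
pkg_authors <- list(
  pkgA = c("Jane Doe", "John Smith", "Ann Lee"),
  pkgB = c("Jane Doe", "Ann Lee"),
  pkgC = "Solo Dev"
)

# One row per co-author pair per package; names are sorted so that each
# pair always appears in the same order.
edges <- do.call(rbind, lapply(pkg_authors, function(a) {
  a <- sort(unique(a))
  if (length(a) < 2) return(NULL)   # single-author packages add no edges
  t(combn(a, 2))
}))
edge_df <- data.frame(from = edges[, 1], to = edges[, 2],
                      n = 1, stringsAsFactors = FALSE)

# Edge weight = number of packages the two people share
weights <- aggregate(n ~ from + to, data = edge_df, FUN = sum)
weights
```

The resulting data frame is an edge list that a graph package could take as input.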