How to extract contributor names from R packages

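pdb <- tools::CRAN_package_db() returns the metadata of every package on CRAN, and I want to extract the developer and the contributors of each package. For the developer I use the Maintainer field; contributors have to come from the Author or Authors@R field. The Author field is free-form text, so I would have to recognize names in the text and decide from context who is a contributor, which I don't know how to do (I know regular expressions can handle many cases, but there are plenty they can't). So I looked at the Authors@R field instead. Its values are strings containing person objects (see ?person). You first turn the string into an expression and evaluate it, x = eval(parse(text = XX)), to get a person object, then extract the names with paste(x$given, x$family). I thought I had found a shortcut, but many authors don't follow the format, so the evaluation either doesn't return a person object or returns a mess. That route is blocked too, and I still end up digging contributor names out of a pile of text.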
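
The Authors@R route described above can be made tolerant of malformed entries by wrapping the evaluation in tryCatch. This is only a minimal sketch: parse_authors_r is a hypothetical helper name (not from the thread), and it uses format() on the person object rather than paste(x$given, x$family). Note that eval(parse()) runs whatever code is in the field, so this is only reasonable on data you trust.

```r
# Hypothetical helper: parse one Authors@R string into "Given Family" names.
# Returns NA when the field is missing, fails to parse, or does not
# evaluate to a person object.
parse_authors_r <- function(txt) {
  if (is.na(txt)) return(NA_character_)
  res <- tryCatch(eval(parse(text = txt)), error = function(e) NULL)
  if (!inherits(res, "person")) return(NA_character_)
  # format.person() can be restricted to the name parts only
  unname(format(res, include = c("given", "family")))
}

parse_authors_r('c(person("Ada", "Lovelace", role = "aut"),
                   person("Alan", "Turing", role = "ctb"))')
```

Applied over the whole column, for example with lapply(pdb$`Authors@R`, parse_authors_r), the NA results mark the packages that still need the text-based approach.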

So I am asking the experts here for help.

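I had a look, and the Author column alone should be enough; the Authors@R column is actually NA for many packages.

In the Author column:

  • Names are separated by commas, so a strsplit yields the names.
  • Occasionally there are fragments such as [aut,dtc,cre] or <aaa@gmail.com>, which can simply be deleted with a regex.

So using the Author column directly should be fine.
The code: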


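# Download the CRAN package metadata (one row per package)
db <- tools::CRAN_package_db()

db_author <- db$Author
# Remove role tags such as [aut, cre] and e-mail addresses such as <a@b.c>
db_author_tidy <- gsub("\\[.*?\\]|<.*?>", "", db_author)
# Split each Author string on commas, then trim surrounding whitespace
db_author_tidy_split <- strsplit(db_author_tidy, ",")

db_author_list <- lapply(db_author_tidy_split, stringr::str_trim)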

Result

db_author_list[1:5]

[[1]]
[1] "Scott Fortmann-Roe"

[[2]]
[1] "Gaurav Sood"

[[3]]
[1] "Csillery Katalin"  "Lemaire Louisiane" "Francois Olivier"  "Blum Michael"     

[[4]]
[1] "Csillery Katalin"  "Lemaire Louisiane" "Francois Olivier"  "Blum Michael"     

[[5]]
[1] "Abdulmonem Alsaleh" "Robert Weeks"       "Ian Morison"       
[4] "RStudio"           

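    tctcab, that is mostly the case. I take an NA in Authors@R to mean that the package has no other contributors, so I drop those packages and then extract contributors from the Author field. The cleaning function: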

    if (file.exists('data/packages.rds')) {
      # Use the saved local copy; tools::CRAN_package_db() downloads online and is slow
      cran_packages <- as.data.frame(readRDS(file = 'data/packages.rds'), stringsAsFactors = FALSE)
    } else {
      # The data set at this address changes every day
      address <- 'https://cran.r-project.org/web/packages/packages.rds'
      con <- url(address, "rb")
      cran_packages <- as.data.frame(readRDS(gzcon(con)), stringsAsFactors = FALSE)
      close(con)
    }
    # Columns that are entirely NA
    (NAmissing <- colnames(cran_packages)[apply(cran_packages, 2, function(x) all(is.na(x)))])
    cran_packages <- cran_packages[, setdiff(colnames(cran_packages), NAmissing)]
    # Keep only packages that have an Authors@R field
    sub_cran_pkgs <- cran_packages[which(!is.na(cran_packages$`Authors@R`)), ]
    sub_cran_pkgs <- sub_cran_pkgs[, c("Package", "Maintainer", "Author", "Authors@R")]
    clean_author2 <- function(x) {
      x <- gsub("\\[.*?\\]|\\(.*?\\)", "", x)  # drop [roles] and (comments); not nesting-aware
      x <- gsub("<([^<>]*)>", "", x)           # drop <e-mail>
      x <- gsub("\n", "", x)                   # drop newlines
      x <- gsub("[\t\"']", "", x)              # drop tabs and quotes
      x <- gsub(" +$", "", x)                  # drop trailing spaces
      x <- gsub("  ", "", x)                   # delete each pair of consecutive spaces
      x
    }
    author_names <- lapply(sub_cran_pkgs$Author, clean_author2)

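    For example, parentheses can be nested like (()), which produces the garbled result below. This is just the one case I happened to spot; the others are impossible to anticipate, and my regex skills are already stretched thin.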

    > sub_cran_pkgs[54,]$Author
    [1] "Hiroshi Akima [aut, cph] (Fortran code (TOMS 760, 761, 697 and 433)),\n  Albrecht Gebhardt [aut, cre, cph] (R port (interp* functions), bicubic*\n    functions),\n  Thomas Petzold [ctb, cph] (aspline function),\n  Martin Maechler [ctb, cph] (interp2xyz function + enhancements),\n  YYYY Association for Computing Machinery, Inc. [cph] (covers code from\n    TOMS 760, 761, 697 and 433)"
    > author_names[54]
    [[1]]
    [1] "Hiroshi Akima),Albrecht Gebhardt, bicubic*functions),Thomas Petzold,Martin Maechler,YYYY Association for Computing Machinery, Inc."
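
One possible way around the nested-parentheses case shown above is to delete only the innermost (...) or [...] group and repeat until nothing changes, so the nesting is peeled off layer by layer. A sketch under that assumption; strip_nested and split_authors are hypothetical names, not part of the original code:

```r
# Repeatedly remove innermost (...) and [...] groups until the string is stable.
strip_nested <- function(x) {
  repeat {
    y <- gsub("\\([^()]*\\)|\\[[^\\[\\]]*\\]", "", x, perl = TRUE)
    if (identical(y, x)) break
    x <- y
  }
  x
}

# Split one cleaned Author string on commas and tidy the whitespace.
split_authors <- function(x) {
  parts <- strsplit(strip_nested(x), ",", fixed = TRUE)[[1]]
  parts <- trimws(gsub("[[:space:]]+", " ", parts))
  parts[nzchar(parts)]
}

x <- "Hiroshi Akima [aut, cph] (Fortran code (TOMS 760, 761)),\n  Albrecht Gebhardt [aut, cre] (R port (interp* functions), bicubic*\n    functions)"
split_authors(x)
```

This still splits names that themselves contain a comma, such as "Association for Computing Machinery, Inc.", and unbalanced parentheses are left in place, so it reduces the mess rather than eliminating it.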

    yihui once said that ?regex has to be read eight hundred times. What can I do? tctcab, do you know of any accessible, easy-to-search references?

      Cloud2016

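      This clean_author2 leaves me speechless...


      As for regex, just keep using it; once you are used to it, it becomes muscle memory.

      I think this website is quite good; it has helped me a lot.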

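        tctcab, the regex ended up like that because there are quite a few other cases, which you will see if you are willing to look through more of the data like this. Below I keep only the entries whose length is in a roughly plausible range for a name.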

        temp <- nchar( unlist(db_author_list) )
        sort( unlist(db_author_list)[which(temp > 5 & temp < 25 )] )[1:1000] 

        You will find a bunch like this:

           [1] "'Andriyana"                "'Gijbels"                  "'Hadley Wickham'"          "'Hadley Wickham'"         
           [5] "'Ibrahim M. A.'"           "'Imanuel Costigan'"        "'Karthik Ram'"             "'R/all.is.numeric.R'"     
           [9] "'R/in.opererator.R'"       "'R/makeNames.R'"           "'R/read.xport.R'"          "'src/SASxport.c'"         
          [13] "'src/SASxport.h'"          "'SuiteSparse'"             "'sumtxt' from GitHub"      "'Verhasselt A.'"          
          [17] "\"Akantziliotou\""         "\"Alexandre Genin  \""     "\"Argyropoulos\""          "\"Bartlett\""             
          [21] "\"Belmans\""               "\"Burkoff\""               "\"Busetto\""               "\"Chen\""                 
          [25] "\"Clarke\""                "\"cph\")"                  "\"cph\")"                  "\"cph\")"                 
          [29] "\"cph\")"                  "\"cph\"))"                 "\"cph\")))"                "\"cph\")))"               
          [33] "\"cre\")"                  "\"cre\")"                  "\"cre\")"                  "\"cre\")"                 
          [37] "\"cre\")"                  "\"cre\")"                  "\"cre\")"                  "\"cre\")"                 
          [41] "\"cre\"))"                 "\"cre\"))"                 "\"cre\"))"                 "\"cre\"))"                
          [45] "\"cre\"))"                 "\"cre\"))"                 "\"cre\"))\n    )"          "\"cre\")))"               
          [49] "\"ctr\"))"                 "\"Curley\""                "\"Dang\""                  "\"DecisionPatterns \""    
          [53] "\"Developer\""             "\"Djennad\""               "\"Dunnington\""            "\"Duthie\""               
          [57] "\"Eckley\""                "\"Enea\""                  "\"Fan Zhang  \""           "\"Fearnhead\""            
          [61] "\"Ghalanos\""              "\"Graul\""                 "\"Grishin\""               "\"Hadley\""               
          [65] "\"Hansen\""                "\"Hassani\""               "\"Haynes\""                "\"Heller\""               
          [69] "\"Jim Pearson\""           "\"John Harrison\""         "\"Joubert\""               "\"Kane\""                 
          [73] "\"Killick\""               "\"Malik\""                 "\"Mariette\""              "\"Markus Loecher"         
          [77] "\"Matthias Bannert  \""    "\"McElduff\""              "\"McGearailt\""            "\"Metcalfe\""             
          [81] "\"Motpan\""                "\"Onglao\""                "\"Ospina\""                "\"Passos\""               
          [85] "\"Rigby\""                 "\"Ruau\""                  "\"Seidl\""                 "\"Shingo Yamamoto (gloops"
          [89] "\"Sidi\""                  "\"Simon"                   "\"Smith\""                 "\"Stasinopoulos\""        
          [93] "\"Ushey\""                 "\"Villa-Vialaneix\""       "\"Voudouris\""             "\"Waikato\""              
          [97] "\"Wijffels\""              "\"Wijffels\""              "\"World Bank Group\""      "\"Ziqi Lu  \""            
         [101] "& Kevin Grimm"             "& Nicholas D. Myers"       "& Nicole Heussen"          "(et al)"                  
         [105] "(et. al.)"                 "(et. al.)"                 "(Marie-Josee Fortin"       "(Tony) Jianguo Sun"       
         [109] "(Vega-Lite library)"       "Øystein Flagstad"          "Øyvind Hjelle"             "Øyvind Langsrud"          
         [113] "Øyvind Langsrud"           "Ľudmila Šimková"           "1990)."                    "4D Strategies"            
         [117] "697 and 433)"              "697 and 433))"             "A. Alexander Beaujean"     "A. Andrew M. MacDonald"   
         [121] "A. Arcagni"                "A. Bensadoun"              "A. Bensadoun"              "A. Bouvier"               
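
Many of the entries above look like leftovers of person(...) calls: stray quotes, closing parentheses, and bare role codes such as "cph" or "cre". A rough post-filter could strip the quoting characters and drop tokens that are nothing but a role code. This is only a sketch; tidy_tokens is a hypothetical helper, and the role list is the set of codes documented in ?person:

```r
# Strip quotes and parentheses, normalise whitespace, and drop bare role codes.
tidy_tokens <- function(x) {
  x <- gsub("[\"'()]", "", x)
  x <- trimws(gsub("[[:space:]]+", " ", x))
  role_codes <- c("aut", "com", "cph", "cre", "ctb", "ctr",
                  "dtc", "fnd", "rev", "ths", "trl")
  x[nzchar(x) & !(x %in% role_codes)]
}

tidy_tokens(c("'Hadley Wickham'", "\"cph\")", "\"cre\"))\n    )", "\"Fan Zhang  \""))
```

Removing parentheses this bluntly turns "(Tony) Jianguo Sun" into "Tony Jianguo Sun" and does nothing about entries such as file paths or "(et al)", so the output still needs a manual look.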

        And the ones longer than 25 characters, sort( unlist(db_author_list)[which(temp > 25 )] )[1:1000], all need further cleaning:

        [1] "'R/importConvertDateTime.R'"                                                                                                                                                                                                                                        
           [2] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
           [3] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
           [4] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
           [5] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
           [6] "\"Jim Pearson\"  \n  with data provided by \"U. S. Census Bureau\""                                                                                                                                                                                                 
           [7] "\"Jim Pearson\"  and \"U. S. Census Bureau\""                                                                                                                                                                                                                       
           [8] "\"Miller Zijie Zhu  \"\n             ))"                                                                                                                                                                                                                            
           [9] "(ii) the Knuth-TAOCP RNG from D. Knuth."                                                                                                                                                                                                                            
          [10] "Øyvind Langsrud and Bjørn-Helge Mevik"                                                                                                                                                                                                                              
          [11] "A. Bradley Duthie  (0000-0001-8343-4995)"                                                                                                                                                                                                                           
          [12] "A. Carpentier  (Matlab original)"                                                                                                                                                                                                                                   
          [13] "A. I. McLeod  and Mehmet Balcilar\n        ."                                                                                                                                                                                                                       
          [14] "A. I. McLeod and Hyukjun Gweon"                                                                                                                                                                                                                                     
          [15] "A. I. McLeod and N. M. Mohammad"                                                                                                                                                                                                                                    
          [16] "A. K. Nikoloulopoulos  and H. Joe"                                                                                                                                                                                                                                  
          [17] "A.I. McLeod  and Changjiang Xu"                                                                                                                                                                                                                                     
          [18] "A.I. McLeod and Justin Veenstra"                                                                                                                                                                                                                                    
          [19] "A.J. Perez-Luque; R. Moreno; R. Perez-Perez and F.J. Bonet"                                                                                                                                                                                                         
          [20] "AA (azarianaa1@mums.ac.ir)"                                                                                                                                                                                                                                         
          [21] "Aaron A. King  and Marguerite A. Butler"                                                                                                                                                                                                                            
          [22] "Aaron Clauset/Rouven Strauss and Miguel Rodriguez-Girones"                                                                                                                                                                                                          
          [23] "Aaron Robotham and Danail Obreschkow"                                                                                                                                                                                                                               
          [24] "Abba Krieger and William J. Blanford"                                                                                                                                                                                                                               
          [25] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
          [26] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
          [27] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
          [28] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
          [29] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
          [30] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
          [31] "Abbas Parchami (Department of Statistics"                                                                                                                                                                                                                           
          [32] "Abbasali Khalili and Shili Lin ."                                                                                                                                                                                                                                   
          [33] "Abdolreza Mohammadi and Ernst Wit"                                                                                                                                                                                                                                  
          [34] "Abdulla Abdurakhmanov  (Code in xml2json.sjs is from https://code.google.com/p/x2js/)"                                                                                                                                                                              
          [35] "Abdullah Almsaeed  (Dashboard CSS)"                                                                                                                                                                                                                                 
          [36] "Abla Boudraa and Zebida Gheribi-Aoulmi"                                                                                                                                                                                                                             
          [37] "AC Del Re & William T. Hoyt"                                                                                                                                                                                                                                        
          [38] "AC Del Re & William T. Hoyt"                                                                                                                                                                                                                                        
          [39] "Academy of Sciences of the Czech Republic)"                                                                                                                                                                                                                         
          [40] "Achim Zeileis  (author of R wrappers to tth/ttm)"                                                                                                                                                                                                                   
          [41] "Achim Zeileis  (Contributions to dynrq code essentially\n    identical to his dynlm code)"                                                                                                                                                                          
          [42] "Achim Zeileis (R code) and the R community (fortunes).\n        Contributions (fortunes and/or code) by Torsten Hothorn"                                                                                                                                            
          [43] "Acho Arnold  (C original matrix library"                                                                                                                                                                                                                            
          [44] "Adam Kapelner and Justin Bleich (R package)"                                                                                                                                                                                                                        
          [45] "Adam Kapelner and Justin Bleich (R package)"                                                                                                                                                                                                                        
          [46] "Adam L. Pintar and Zachary H. Levine"                                                                                                                                                                                                                               
          [47] "Adam M. Johansen and Leah F. South"                                                                                                                                                                                                                                 
          [48] "Adam Pearce  (core contributor to d3-jetpack)"                                                                                                                                                                                                                      
          [49] "Adam Rahman and Wayne Oldford"                                                                                                                                                                                                                                      
          [50] "Adam Rahman and Wayne Oldford (R)"                                                                                                                                                                                                                                  
          [51] "Adam Sparks  (https://orcid.org/0000-0002-0061-8359)"                                                                                                                                                                                                               
          [52] "Adam Sparks  (https://orcid.org/0000-0002-0061-8359)"                                                                                                                                                                                                               
          [53] "Adam Sparks  (https://orcid.org/0000-0002-0061-8359)"                                                                                                                                                                                                               
          [54] "Adline Dsilva  (First version Matrix heatmap)"                                                                                                                                                                                                                      
          [55] "Adobe Systems Incorporated  (Source Sans Pro font)"                                                                                                                                                                                                                 
          [56] "Adobe Systems Incorporated  (Source Sans Pro font)"                                                                                                                                                                                                                 
          [57] "Adri van Os  (Author 'antiword' utility)"                                                                                                                                                                                                                           
          [58] "Adrian Baddeley  (C function 'BinDist' copied from package\n    'stats')"                                                                                                                                                                                           
          [59] "Adrian Barnett and Peter Baker"                                                                                                                                                                                                                                     
          [60] "Adrian Bowman and Adelchi Azzalini. \n         Ported to R by B. D. Ripley  up to version 2.0"                                                                                                                                                                      
          [61] "Adrian R. Waddell and R. Wayne Oldford"                                                                                                                                                                                                                             
          [62] "Adrian R. Waddell and R. Wayne Oldford"                                     
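
A large share of the long entries above are simply several names joined by "and", "&", ";" or "/" instead of a comma. One more splitting pass on those separators would handle them. A minimal sketch, with split_more as a hypothetical helper name:

```r
# Split entries on " and ", " & ", ";" and "/" and trim the pieces.
split_more <- function(x) {
  parts <- unlist(strsplit(x, "\\s+(and|&)\\s+|\\s*[;/]\\s*", perl = TRUE))
  parts <- trimws(parts)
  parts[nzchar(parts)]
}

split_more(c("A.J. Perez-Luque; R. Moreno; R. Perez-Perez and F.J. Bonet",
             "AC Del Re & William T. Hoyt"))
```

Because "/" is treated as a separator, this should run only after parenthesised comments and URLs (for example the ORCID links) have been removed, otherwise those get cut into pieces as well.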