yanglei
No problem, good question. I suggest trying r-devel.R first, since it has less data. Here is how I would track the problem down:
- Build the year-month sequence and check whether `trans_month` is in this format, e.g. "1997-April":

      # Load the packages
      library('rvest')
      library(purrr)
      library(tidyverse)

      ## Create year month list
      current_month = as.Date(cut(Sys.Date(), "month"))
      date_seq = seq(from = as.Date("1997-04-01"), to = current_month, by = "1 month")
      # Note: %B is locale-dependent; it only gives English month names under an English (or C) LC_TIME locale
      trans_month = format(as.Date(date_seq), "%Y-%B")

      ### Make a function to fetch the summarized data frame for every month
      # Navigation links that should be excluded from every monthly page
      unnecessary = c("thread.html#start", "author.html#start", "date.html#start",
                      "http://www.stat.math.ethz.ch/mailman/listinfo/r-devel")

- Run it locally first, using April 1997 as the example:

      # Specify the URL of the page to be scraped
      url_base <- 'https://stat.ethz.ch/pipermail/r-devel/%s/subject.html'
      test_web = read_html(sprintf(url_base, '1997-April'))

      # Extract the URLs
      url_ <- test_web %>% rvest::html_nodes("a") %>% rvest::html_attr("href")
      # Extract the link text
      link_ <- test_web %>% rvest::html_nodes("a") %>% rvest::html_text()

      test_df = data.frame(link = link_, url = url_)
      # Drop anchors without an href, then drop the navigation links
      test_df2 = test_df[which(!is.na(test_df$url)), ]
      test_df3 = test_df2[which(!test_df2$url %in% unnecessary), ]
      # Replace line breaks inside subject lines with spaces
      test_df3$link = gsub("[\r\n]", " ", test_df3$link)
      # Turn the relative hrefs into absolute URLs
      test_df3$url = paste0("https://stat.ethz.ch/pipermail/r-devel/", "1997-April", "/", test_df3$url)

      # One row per subject: URL of its first message and number of messages
      test_df4 = test_df3 %>%
        group_by(link) %>%
        dplyr::mutate(Link_url = dplyr::first(url)) %>%
        group_by(link, Link_url) %>%
        summarize(count = n())
      test_df4 = data.frame(test_df4)

Running it one step at a time like this makes it obvious where things go wrong. Finally, check whether your `test_df4` looks like this:

    > head(test_df4)
                                                                  link                                                      Link_url count
    1 R-alpha: contributed packages -- Yes, use library/<package>/.. ! https://stat.ethz.ch/pipermail/r-devel/1997-April/017072.html     2
    2                                                  R-alpha: ==NULL https://stat.ethz.ch/pipermail/r-devel/1997-April/017131.html     1
    3                                              R-alpha: as.numeric https://stat.ethz.ch/pipermail/r-devel/1997-April/017023.html     3
    4                                  R-alpha: Be nice to the English https://stat.ethz.ch/pipermail/r-devel/1997-April/017051.html     1
    5                                              R-alpha: binom.test https://stat.ethz.ch/pipermail/r-devel/1997-April/017075.html     1
    6                  R-alpha: Bug & Patch in dbeta.c (0.50 - PreR 7) https://stat.ethz.ch/pipermail/r-devel/1997-April/017048.html     3
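
A note on the `trans_month` check: `format(x, "%B")` takes month names from the current `LC_TIME` locale, so on a machine with a non-English locale the labels may not come out as "1997-April", and the monthly URLs built from them would then not match the archive. Below is a minimal sketch of one way to build the same labels independently of the locale, using base R's built-in `month.name` constant. The helper name `year_month_label` and the short four-month range are made up for illustration and are not part of the original script.

```r
# Hypothetical helper (not in the original script): build "YYYY-Month"
# labels with English month names regardless of the LC_TIME locale.
year_month_label <- function(dates) {
  dates <- as.Date(dates)
  year  <- format(dates, "%Y")               # four-digit year as text
  month <- as.integer(format(dates, "%m"))   # month number, 1 to 12
  paste0(year, "-", month.name[month])       # month.name is always English
}

# Short illustrative range; the real script runs up to the current month.
date_seq    <- seq(from = as.Date("1997-04-01"),
                   to   = as.Date("1997-07-01"),
                   by   = "1 month")
trans_month <- year_month_label(date_seq)
print(trans_month)
```

If these labels differ from what `format(date_seq, "%Y-%B")` returns on your machine, the locale is a likely reason the per-month requests fail.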