A web-scraping exercise: what are the hottest discussions on the R mailing list?
Building on Jiena's code, here is a version with minimal dependencies.
# Install the required dependencies if they are not already installed
packages <- c("rvest", "knitr")
lapply(packages, function(pkg) {
  if (system.file(package = pkg) == "") install.packages(pkg)
})
# Make sure date formatting works correctly even in a Chinese locale on Windows
Sys.setlocale("LC_TIME", "C")
# Build the sequence of formatted year-month strings
all_months <- format(
  seq(
    from = as.Date("1997-04-01"),
    to = Sys.Date(), by = "1 month"
  ),
  "%Y-%B"
)
# Clean up the post subjects
clean_discuss_topic <- function(x) {
  # drop square brackets and their contents
  x <- gsub("(\\[.*?\\])", "", x)
  # drop a trailing newline \n
  x <- gsub("(\\\n)$", "", x)
  # collapse two or more spaces into a single space
  x <- gsub("( {2,})", " ", x)
  x
}
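As a quick check, the cleaner can be applied to a made-up subject line (the example string below is only an illustration, not taken from the list):

# strips the bracketed tag, the trailing newline, and the doubled spaces
clean_discuss_topic("[Rd]  R 3.6.0  release\n")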
library(magrittr)
x <- "2019-February"
base_url <- "https://stat.ethz.ch/pipermail/r-devel"
# The part below could be wrapped into a function
# whose input is the date x and whose output is a Markdown table (a sketch follows after this block)
# Scrape the page for the given month
scrap_webpage <- xml2::read_html(paste(base_url, x, "subject.html", sep = "/"))
# Extract the tail of each link URL
tail_url <- scrap_webpage %>%
  rvest::html_nodes("a") %>%
  rvest::html_attr("href")
# Extract the discussion topic of each link
discuss_topic <- scrap_webpage %>%
  rvest::html_nodes("a") %>%
  rvest::html_text()
# Combine the URLs and the topics into a data frame
discuss_df <- data.frame(discuss_topic = discuss_topic, tail_url = tail_url)
# Drop records that are not actual posts
discuss_df <- discuss_df[grepl(pattern = "\\.html$", x = discuss_df$tail_url), ]
# Clean up the topic text
discuss_df$discuss_topic <- clean_discuss_topic(discuss_df$discuss_topic)
# Deduplicate: keep only the first post of each topic
discuss_uni_df <- discuss_df[!duplicated(discuss_df$discuss_topic), ]
# Count posts per topic
discuss_count_df <- as.data.frame(table(discuss_df$discuss_topic), stringsAsFactors = FALSE)
# Rename the columns of discuss_count_df
colnames(discuss_count_df) <- c("discuss_topic", "count")
# Merge the two data frames by topic
discuss <- merge(discuss_uni_df, discuss_count_df, by = "discuss_topic")
# Add the full URL of each discussion
discuss <- transform(discuss, full_url = paste(base_url, x, tail_url, sep = "/"))
# Keep the topic, its URL, and the thread length (number of posts)
discuss <- discuss[, c("discuss_topic", "full_url", "count")]
# Sort by thread length and write out as a Markdown table
discuss[order(discuss$count, decreasing = TRUE), ] %>%
  knitr::kable(format = "markdown", row.names = FALSE) %>%
  cat(file = paste0(x, "-disuss.md"), sep = "\n")
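As the comment above suggests, everything from read_html() down to the kable() output can be wrapped into a single function of the month x and looped over all_months. A minimal sketch under that assumption (the function name scrape_month and the example lapply() call at the end are my own illustration, not part of the original post):

scrape_month <- function(x, base_url = "https://stat.ethz.ch/pipermail/r-devel") {
  # scrape the subject page for the given month
  scrap_webpage <- xml2::read_html(paste(base_url, x, "subject.html", sep = "/"))
  tail_url <- rvest::html_attr(rvest::html_nodes(scrap_webpage, "a"), "href")
  discuss_topic <- rvest::html_text(rvest::html_nodes(scrap_webpage, "a"))
  discuss_df <- data.frame(discuss_topic = discuss_topic, tail_url = tail_url)
  discuss_df <- discuss_df[grepl("\\.html$", discuss_df$tail_url), ]
  discuss_df$discuss_topic <- clean_discuss_topic(discuss_df$discuss_topic)
  # first post per topic, plus a per-topic count
  discuss_uni_df <- discuss_df[!duplicated(discuss_df$discuss_topic), ]
  discuss_count_df <- as.data.frame(table(discuss_df$discuss_topic), stringsAsFactors = FALSE)
  colnames(discuss_count_df) <- c("discuss_topic", "count")
  discuss <- merge(discuss_uni_df, discuss_count_df, by = "discuss_topic")
  discuss <- transform(discuss, full_url = paste(base_url, x, tail_url, sep = "/"))
  discuss <- discuss[order(discuss$count, decreasing = TRUE), c("discuss_topic", "full_url", "count")]
  cat(knitr::kable(discuss, format = "markdown", row.names = FALSE),
      file = paste0(x, "-disuss.md"), sep = "\n")
}
# e.g. write one table per month for the three most recent months
# lapply(tail(all_months, 3), scrape_month)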
The full set of dependencies is listed below. This is just about ready for a scheduled job on Travis: the only remaining step is to push the output Markdown file back to GitHub.
tools::package_dependencies(packages, recursive = TRUE) %>% unlist() %>% unique()
[1] "xml2" "httr" "magrittr" "selectr" "curl" "jsonlite"
[7] "mime" "openssl" "R6" "methods" "stringr" "Rcpp"
[13] "tools" "askpass" "utils" "glue" "stringi" "sys"
[19] "stats" "evaluate" "highr" "markdown" "yaml" "xfun"
I don't yet know how to post a Markdown table on this forum, i.e. the contents of the file 2019-February-disuss.md,
so please head over to <https://github.com/XiangyunHuang/RGraphics/issues/5>
If all the posts were scraped, could they be classified according to the relationships between their subjects, with a Shiny front end that shows, by time and by thread length, the hottest discussions within each category?
If the author and the repliers of each post were extracted as well, we could also look at the activity of particular people.
Judging by the post IDs, which are six digits long, there are fewer than a million posts; Stack Overflow is already at eight digits.
yihui The Shiny app has been upgraded: the data is loaded only once, when the app is opened, and the slider is now only used to filter the time range.
I also drew a simple bar chart showing the total number of replies per month.
This forum probably doesn't support table syntax.
--- <https://d.cosx.org/d/420350-pagedown-pdf/9>
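A minimal sketch of that kind of app, assuming a data frame discuss_all with a date column and a count column; the data file name, the column names, and the code below are my own illustration, not yihui's actual source:

library(shiny)

# load the scraped data once, when the app starts (hypothetical file name)
discuss_all <- readRDS("discuss_all.rds")  # columns assumed: date (Date), count (integer)

ui <- fluidPage(
  sliderInput("period", "Time range",
              min = min(discuss_all$date), max = max(discuss_all$date),
              value = range(discuss_all$date)),
  plotOutput("monthly")
)

server <- function(input, output) {
  output$monthly <- renderPlot({
    # the slider only filters the time range; the data is never re-read
    d <- subset(discuss_all, date >= input$period[1] & date <= input$period[2])
    # total replies per month, shown as a simple bar chart
    monthly <- tapply(d$count, format(d$date, "%Y-%m"), sum)
    barplot(monthly, las = 2, main = "Replies per month")
  })
}

shinyApp(ui, server)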
yanglei
No worries, good question. I'd suggest trying r-devel.R first, since the data volume is smaller. Here is the debugging walk-through:
- Build the year-month sequence and check whether trans_month has this kind of format, e.g. "1997-April":

# Loading the rvest package
library('rvest')
library(purrr)
library(tidyverse)
## Create year month list
current_month = as.Date(cut(Sys.Date(), "month"))
date_seq = seq(from = as.Date("1997-04-01"), to = current_month, by = "1 month")
trans_month = format(as.Date(date_seq), "%Y-%B")
### Make a function to fetch the summarized data frame for every month
unnecessary = c("thread.html#start", "author.html#start", "date.html#start",
                "http://www.stat.math.ethz.ch/mailman/listinfo/r-devel")
- Run it locally first, taking April 1997 as the example:

# Specifying the url for desired website to be scraped
url_base <- 'https://stat.ethz.ch/pipermail/r-devel/%s/subject.html'
test_web = read_html(sprintf(url_base, '1997-April'))
# Extract the URLs
url_ <- test_web %>% rvest::html_nodes("a") %>% rvest::html_attr("href")
# Extract the link text
link_ <- test_web %>% rvest::html_nodes("a") %>% rvest::html_text()
test_df = data.frame(link = link_, url = url_)
test_df2 = test_df[which(!is.na(test_df$url)), ]
test_df3 = test_df2[which(!test_df2$url %in% unnecessary), ]
test_df3$link = gsub("[\r\n]", " ", test_df3$link)
test_df3$url = paste0("https://stat.ethz.ch/pipermail/r-devel/", "1997-April", "/", test_df3$url)
test_df4 = test_df3 %>%
  group_by(link) %>%
  dplyr::mutate(Link_url = dplyr::first(url)) %>%
  group_by(link, Link_url) %>%
  summarize(count = n())
test_df4 = data.frame(test_df4)

Running it step by step like this makes it obvious where anything goes wrong. Finally, check whether your test_df4 looks like the following:

> head(test_df4)
                                                               link                                                       Link_url  count
1  R-alpha: contributed packages -- Yes, use library/<package>/.. !  https://stat.ethz.ch/pipermail/r-devel/1997-April/017072.html      2
2  R-alpha: ==NULL                                                   https://stat.ethz.ch/pipermail/r-devel/1997-April/017131.html      1
3  R-alpha: as.numeric                                               https://stat.ethz.ch/pipermail/r-devel/1997-April/017023.html      3
4  R-alpha: Be nice to the English                                   https://stat.ethz.ch/pipermail/r-devel/1997-April/017051.html      1
5  R-alpha: binom.test                                               https://stat.ethz.ch/pipermail/r-devel/1997-April/017075.html      1
6  R-alpha: Bug & Patch in dbeta.c (0.50 - PreR 7)                   https://stat.ethz.ch/pipermail/r-devel/1997-April/017048.html      3
dapengde Jiena tctcab You are all welcome to write this up and submit it to the main site <https://cosx.org/contribute/>
dapengde @tctcab Cloud2016 @yihui
COS (this forum) has been scraped as well. From the site's founding in 2006 up to 2019-06-17 there are 28310 posts in total, spread over 1422 pages (in other words, if you wanted to scroll to the bottom by hand, you would have to click "Load more" 1421 times... so better to write a program). The first post is https://d.cosx.org/d/5-5, titled "hehe 瞎提意见", created at 2006-05-20T21:32:35+00:00.
- Top 10 discussions by number of replies:

> everything_cos2_replies[1:10,]
   Comment_count Participant_count                        URL create_Date last_post_Date active_days
1           1261              1184  https://d.cosx.org/d/1553  2006-09-03     2019-06-14        4667
2            923               850 https://d.cosx.org/d/13775  2009-01-16     2017-12-03        3243
3            802               714   https://d.cosx.org/d/818  2006-08-01     2019-04-27        4652
4            796               589  https://d.cosx.org/d/7865  2007-09-14     2018-07-18        3960
5            633               590  https://d.cosx.org/d/2013  2006-09-29     2010-09-21        1453
6            533               289  https://d.cosx.org/d/9233  2007-12-29     2018-08-06        3873
7            528               450  https://d.cosx.org/d/8949  2007-12-09     2018-11-18        3997
8            520               424 https://d.cosx.org/d/14985  2009-04-22     2019-05-29        3689
9            507               490 https://d.cosx.org/d/13875  2009-02-02     2017-12-01        3224
10           359                25 https://d.cosx.org/d/14141  2009-02-28     2017-10-19        3155
- Top 10 discussions by number of participants:

> everything_cos2_participants[1:10,]
   Comment_count Participant_count                        URL create_Date last_post_Date active_days
1           1261              1184  https://d.cosx.org/d/1553  2006-09-03     2019-06-14        4667
2            923               850 https://d.cosx.org/d/13775  2009-01-16     2017-12-03        3243
3            802               714   https://d.cosx.org/d/818  2006-08-01     2019-04-27        4652
4            633               590  https://d.cosx.org/d/2013  2006-09-29     2010-09-21        1453
5            796               589  https://d.cosx.org/d/7865  2007-09-14     2018-07-18        3960
6            507               490 https://d.cosx.org/d/13875  2009-02-02     2017-12-01        3224
7            528               450  https://d.cosx.org/d/8949  2007-12-09     2018-11-18        3997
8            520               424 https://d.cosx.org/d/14985  2009-04-22     2019-05-29        3689
9            344               293  https://d.cosx.org/d/2070  2006-10-05     2017-03-07        3806
10           533               289  https://d.cosx.org/d/9233  2007-12-29     2018-08-06        3873

- Top 10 discussions by activity span (date of the latest reply minus date of the original post):

> everything_cos2_active_days[1:10,]
   Comment_count Participant_count                       URL create_Date last_post_Date active_days
1            160               121   https://d.cosx.org/d/35  2006-05-25     2019-04-03        4696
2           1261              1184 https://d.cosx.org/d/1553  2006-09-03     2019-06-14        4667
3            802               714  https://d.cosx.org/d/818  2006-08-01     2019-04-27        4652
4            205               160    https://d.cosx.org/d/1  2006-05-19     2018-10-24        4541
5             34                31  https://d.cosx.org/d/464  2006-06-22     2018-10-19        4502
6             24                19 https://d.cosx.org/d/2739  2006-11-30     2019-03-12        4485
7             18                 7 https://d.cosx.org/d/3004  2006-12-15     2019-02-04        4434
8             42                26 https://d.cosx.org/d/2963  2006-12-12     2019-01-14        4416
9             44                30 https://d.cosx.org/d/6142  2007-05-01     2019-05-24        4406
10            30                20   https://d.cosx.org/d/16  2006-05-23     2018-05-30        4390
- CSV data: https://github.com/jienagu/tidyverse_examples/blob/master/COS_raw_data.csv
- Source code: https://github.com/jienagu/tidyverse_examples/blob/master/Web_scraping_COS_Chinese.R
I'm currently on Windows, which doesn't get along with Chinese, so anyone interested can add the Chinese titles based on the source code I posted; the title is the column named title in attributes.
I checked by hand: the COS forum runs on Flarum, which turns out to have a REST API for fetching discussions, at
https://d.cosx.org/api/discussions?page[limit]=50
I tried it, and each request returns at most 50 threads.
The nice part is that, instead of scraping 20 threads per page, this cuts the hammering of the COS server by roughly two thirds (28310 / 50 = 566.2 requests).
The code also gets much shorter: the rvest steps go away and the data can be read directly with jsonlite; even the link to the next page can be found in the links field returned by the API.
Final code:
firsturl = "https://d.cosx.org/api/discussions?page[limit]=50"
cos.li = list()
counter = 1
cos.li[[counter]] = jsonlite::fromJSON(firsturl)
# if the next link exists, scrape it.
while ("next" %in% names(cos.li[[counter]]$links)) {
  counter = counter + 1
  cos.li[[counter]] = jsonlite::fromJSON(cos.li[[counter - 1]]$links[['next']])
}
## combine all df
cos.all = do.call(rbind,
  lapply(cos.li, function(l) {
    df = l$data$attributes
    df$link = paste0("https://d.cosx.org/d/", l$data$id)
    return(df)
  }))
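A quick usage check (assumption: the attributes returned by the API include a title column carrying the Chinese titles, as Jiena noted above):

# peek at the first few Chinese titles and their links
head(cos.all[, c("title", "link")])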
But I ran into a problem: when the scrape reached "https://d.cosx.org/api/discussions?page[limit]=50?page[offset]=9550"
it hit an HTTP 500 internal server error. The URL opens fine in a browser, but reading it in R with jsonlite::fromJSON as above always fails. Strange.
tctcab I gave it a try, and I think the problem comes from the page two steps after 9550:
<https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=9650>
Opening it in a browser gives a 500 error as well.
To skip such pages, I combined the approaches of Jiena and tctcab above:
# get the max page (1422 pages on 2019-06-18)
get_maxpage <- function(page_range = 1421:1500){
  for (i in page_range) {
    print(paste(Sys.time(), i))
    COS_link <- xml2::read_html(paste0('https://d.cosx.org/all?page=', i))
    url_vector = rvest::html_attr(rvest::html_nodes(COS_link, "a"), "href")
    last_link = url_vector[length(url_vector)]
    last_number <- as.numeric(gsub("[https://d.cosx.org/all?page=]", "", last_link))
    if (last_number <= i - 1) {
      message('There are ', i, ' pages with 20 posts on each.')
      return(i)
    }
  }
}
# get json from cos
get_js <- function(url){
  print(paste(Sys.time(), url))
  mytry <- try(jsonlite::fromJSON(url))
  if (class(mytry) == 'try-error') return(NULL)
  jsonlite::fromJSON(url)
}
maxpage <- get_maxpage(1421:1500)
cos_url <- paste0('https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=', seq(0, maxpage * 20, 50) + 50)
cos_js <- lapply(cos_url, get_js)
As it turns out, the 500 error does not only occur when reading offset 9550; it shows up in several places, for example 10350 and 11650, because the page two steps after each of them also returns a 500 error in the browser:
<https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=10450>
<https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=11750>
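Since get_js() returns NULL for the pages that fail, those entries can be dropped before combining everything with tctcab's rbind step above. A minimal sketch of that last step (the names cos_js_ok and cos_all are my own, not from the original posts):

# drop the pages that came back as NULL (the 500 errors), then combine as before
cos_js_ok <- Filter(Negate(is.null), cos_js)
cos_all <- do.call(rbind, lapply(cos_js_ok, function(l) {
  df <- l$data$attributes
  df$link <- paste0("https://d.cosx.org/d/", l$data$id)
  df
}))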
The Shiny app has a new COS page! The Chinese titles are now in, Jiena