A web-scraping exercise: what are the hottest discussions on the R mailing lists?
yihui
Scraping is done (data from April 1997 to now); see here: https://github.com/jienagu/tidyverse_examples/blob/master/web_scraping_r_devel.R
The results (already sorted) look like this:
> head(everything2)
link Link_url count
1 [Rd] [RFC] A case for freezing CRAN https://stat.ethz.ch/pipermail/r-devel/2014-March/068548.html 64
2 [Rd] CRAN policies https://stat.ethz.ch/pipermail/r-devel/2012-March/063678.html 51
3 [Rd] surprising behaviour of names<- https://stat.ethz.ch/pipermail/r-devel/2009-March/052522.html 49
4 [Rd] legitimate use of ::: https://stat.ethz.ch/pipermail/r-devel/2013-August/067180.html 45
5 [Rd] Computer algebra in R - would that be an idea?? https://stat.ethz.ch/pipermail/r-devel/2005-July/033940.html 40
6 [Rd] if(--as-cran)? https://stat.ethz.ch/pipermail/r-devel/2012-September/064760.html 39
Here is the r-devel top ten:
> everything2[1:10,]
link Link_url count
1 [Rd] [RFC] A case for freezing CRAN https://stat.ethz.ch/pipermail/r-devel/2014-March/068548.html 64
2 [Rd] CRAN policies https://stat.ethz.ch/pipermail/r-devel/2012-March/063678.html 51
3 [Rd] surprising behaviour of names<- https://stat.ethz.ch/pipermail/r-devel/2009-March/052522.html 49
4 [Rd] legitimate use of ::: https://stat.ethz.ch/pipermail/r-devel/2013-August/067180.html 45
5 [Rd] Computer algebra in R - would that be an idea?? https://stat.ethz.ch/pipermail/r-devel/2005-July/033940.html 40
6 [Rd] if(--as-cran)? https://stat.ethz.ch/pipermail/r-devel/2012-September/064760.html 39
7 [Rd] declaring package dependencies https://stat.ethz.ch/pipermail/r-devel/2013-September/067446.html 39
8 [Rd] Suggestion: help(<package name>) https://stat.ethz.ch/pipermail/r-devel/2005-June/033480.html 38
9 [Rd] Bias in R's random integers? https://stat.ethz.ch/pipermail/r-devel/2018-September/076817.html 38
10 [Rd] R 3.0, Rtools3.0,l Windows7 64-bit, and permission agony https://stat.ethz.ch/pipermail/r-devel/2013-April/066388.html 37
The source code is in the post above.
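I looked into merging posts, and it turns out not to be simple, because:
- titles within the same thread may be inconsistent
- different threads may share the same name
I tried merging posts with identical titles, but that is not necessarily correct. For example, 2004, 2006, and 2007 each have a thread named "[Rd] Wish list".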
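To make the title-collision problem concrete, here is a small sketch. The data frame is made up for illustration (it is not scraped data), and keying on title plus year is only one possible workaround, not what the code in this thread does. It would still split a thread that runs across a year boundary.

```r
# Toy data, made up for illustration: one row per thread as seen in one month.
threads <- data.frame(
  title = c("[Rd] Wish list", "[Rd] Wish list", "[Rd] CRAN policies"),
  year  = c(2004L, 2007L, 2012L),
  reps  = c(10L, 20L, 51L),
  stringsAsFactors = FALSE
)

# Merging on the title alone lumps the 2004 and 2007 "Wish list" threads together.
by_title <- aggregate(reps ~ title, data = threads, FUN = sum)

# Merging on title plus year keeps them apart.
by_title_year <- aggregate(reps ~ title + year, data = threads, FUN = sum)

print(by_title)
print(by_title_year)
```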
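Also, scraping live runs into connection errors at the slightest slip, so I first download thread.html to local disk and then extract the information from it. That is also much faster (from 8 minutes down to 19 seconds).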
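The download-first approach could look roughly like the sketch below. `cache_page` and its `fetch` argument are hypothetical names of mine, not taken from the actual script, and the cache directory layout is an assumption.

```r
# Hypothetical helper: download `url` to `dest` only if it is not cached yet.
# `fetch` defaults to utils::download.file and can be swapped out for testing.
cache_page <- function(url, dest, fetch = utils::download.file) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    fetch(url, dest, quiet = TRUE)
  }
  dest
}

# Example call (needs network access, so it is left commented out):
# local_file <- cache_page(
#   "https://stat.ethz.ch/pipermail/r-devel/2019-February/thread.html",
#   file.path("cache", "2019-February", "thread.html")
# )
# page <- xml2::read_html(local_file)
```

A second run then reads from disk only, so a dropped connection costs one month's page rather than the whole crawl.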
Results:
After merging threads, the results differ slightly from Jiena's top ten:
# A tibble: 10 x 4
title link reps year_mon
<chr> <chr> <int> <chr>
1 "[Rd] [RFC] A case for freezi… https://stat.ethz.ch/pipermai… 69 2014-March
2 "[Rd] Wish list\n" https://stat.ethz.ch/pipermai… 62 2007-Janu…
3 "[Rd] CRAN policies\n" https://stat.ethz.ch/pipermai… 51 2012-March
4 "[Rd] legitimate use of :::\n" https://stat.ethz.ch/pipermai… 48 2014-May
5 "[Rd] NEWS.md support on CRAN… https://stat.ethz.ch/pipermai… 48 2015-May
6 "[Rd] surprising behaviour of… https://stat.ethz.ch/pipermai… 48 2009-March
7 "[Rd] declaring package depen… https://stat.ethz.ch/pipermai… 42 2013-Sept…
8 "[Rd] if(--as-cran)?\n" https://stat.ethz.ch/pipermai… 42 2012-Sept…
9 "[Rd] Bias in R's random inte… https://stat.ethz.ch/pipermai… 37 2018-Sept…
10 "[Rd] R 3.0, Rtools3.0,l Wind… https://stat.ethz.ch/pipermai… 36 2013-April
The Shiny version is here:
yihui
Scraping is done; see below:
- r-devel:
- r-help:
- Source code: https://github.com/jienagu/tidyverse_examples/blob/master/web_scraping_r_help.R
- CSV file: https://github.com/jienagu/tidyverse_examples/blob/master/web_scraping_r_help.csv
- Top ten:
> everything2[1:10,]
link Link_url count
1 [R] Problems in Recommending R https://stat.ethz.ch/pipermail/r-help/2009-February/379744.html 126
2 [R] R in the NY Times https://stat.ethz.ch/pipermail/r-help/2009-January/377079.html 118
3 [R] Google's R Style Guide https://stat.ethz.ch/pipermail/r-help/2009-August/402727.html 100
4 [R] How to comment in R https://stat.ethz.ch/pipermail/r-help/2009-February/380938.html 78
5 [R] Inefficiency of SAS Programming https://stat.ethz.ch/pipermail/r-help/2009-February/382798.html 66
6 [R] installing R on Ubuntu https://stat.ethz.ch/pipermail/r-help/2009-February/380559.html 64
7 [R] Popularity of R, SAS, SPSS, Stata... https://stat.ethz.ch/pipermail/r-help/2010-June/243043.html 64
8 [R] A comment about R: https://stat.ethz.ch/pipermail/r-help/2006-January/085205.html 63
9 [R] productivity tools in R? https://stat.ethz.ch/pipermail/r-help/2009-July/396050.html 56
10 [R] two questions for R beginners https://stat.ethz.ch/pipermail/r-help/2010-March/230116.html 55
Based on Jiena's code, here is a version with minimal dependencies
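# Install the required dependencies if they are missing
packages <- c("rvest", "knitr")
lapply(packages, function(pkg) {
  if (system.file(package = pkg) == "") install.packages(pkg)
})
# Make sure month names are formatted in English, even on a Chinese-locale Windows
Sys.setlocale("LC_TIME", "C")
# Build the sequence of months in the archive, e.g. "1997-April"
all_months <- format(
  seq(
    from = as.Date("1997-04-01"),
    to = Sys.Date(), by = "1 month"
  ),
  "%Y-%B"
)
# Clean up a post subject
clean_discuss_topic <- function(x) {
  # Remove square brackets and their contents, e.g. "[Rd]"
  x <- gsub("(\\[.*?\\])", "", x)
  # Remove the trailing newline
  x <- gsub("\n$", "", x)
  # Collapse runs of two or more spaces into a single space
  x <- gsub("( {2,})", " ", x)
  x
}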
library(magrittr)
x <- "2019-February"
base_url <- "https://stat.ethz.ch/pipermail/r-devel"
# The part below could be wrapped into a function:
# the input is the month x, the output is a markdown table
# Fetch the data for that month
scrap_webpage <- xml2::read_html(paste(base_url, x, "subject.html", sep = "/"))
# Extract the URL tails
tail_url <- scrap_webpage %>%
  rvest::html_nodes("a") %>%
  rvest::html_attr("href")
# Extract the discussion topic that goes with each link
discuss_topic <- scrap_webpage %>%
  rvest::html_nodes("a") %>%
  rvest::html_text()
# Combine URLs and topics into a data frame
discuss_df <- data.frame(
  discuss_topic = discuss_topic, tail_url = tail_url,
  stringsAsFactors = FALSE
)
# Drop rows that are not posts (links that do not end in .html)
discuss_df <- discuss_df[grepl(pattern = "\\.html$", x = discuss_df$tail_url), ]
# Clean up the topic text
discuss_df$discuss_topic <- clean_discuss_topic(discuss_df$discuss_topic)
# Deduplicate: keep only the first post of each topic
discuss_uni_df <- discuss_df[!duplicated(discuss_df$discuss_topic), ]
# Count posts per topic
discuss_count_df <- as.data.frame(table(discuss_df$discuss_topic), stringsAsFactors = FALSE)
# Rename the columns of discuss_count_df
colnames(discuss_count_df) <- c("discuss_topic", "count")
# Merge the two data frames by topic
discuss <- merge(discuss_uni_df, discuss_count_df, by = "discuss_topic")
# Add the full URL of each thread
discuss <- transform(discuss, full_url = paste(base_url, x, tail_url, sep = "/"))
# Keep the topic, its link, and the post count
discuss <- discuss[, c("discuss_topic", "full_url", "count")]
# Sort by post count and write the result out as a Markdown table
discuss[order(discuss$count, decreasing = TRUE), ] %>%
  knitr::kable(format = "markdown", row.names = FALSE) %>%
  cat(file = paste0(x, "-disuss.md"), sep = "\n")
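As the comment in the script says, the per-month part could be wrapped into a function. Below is one possible sketch, not the author's final version. `count_month_topics` is a name I made up; it takes an already parsed page so that it can be tried on a tiny inline HTML snippet (also made up) without touching the network, and it uses `trimws()` instead of the `clean_discuss_topic()` helper above to stay self-contained.

```r
# Hypothetical helper: count posts per topic on one parsed subject.html page.
count_month_topics <- function(page, month,
                               base_url = "https://stat.ethz.ch/pipermail/r-devel") {
  links <- rvest::html_nodes(page, "a")
  df <- data.frame(
    discuss_topic = rvest::html_text(links),
    tail_url = rvest::html_attr(links, "href"),
    stringsAsFactors = FALSE
  )
  # Keep links ending in .html only (anchors without href are NA and get dropped)
  df <- df[grepl("\\.html$", df$tail_url), ]
  # Drop list tags such as "[Rd]" and surrounding whitespace
  df$discuss_topic <- trimws(gsub("\\[.*?\\]", "", df$discuss_topic))
  # Count posts per topic and keep the first post of each topic for its link
  counts <- as.data.frame(table(df$discuss_topic), stringsAsFactors = FALSE)
  colnames(counts) <- c("discuss_topic", "count")
  first <- df[!duplicated(df$discuss_topic), ]
  out <- merge(first, counts, by = "discuss_topic")
  out$full_url <- paste(base_url, month, out$tail_url, sep = "/")
  out <- out[order(out$count, decreasing = TRUE),
             c("discuss_topic", "full_url", "count")]
  rownames(out) <- NULL
  out
}

# Usage on a tiny inline page (made-up HTML, so no network is needed)
page <- xml2::read_html(
  '<ul>
     <li><a href="000001.html">[Rd] topic A</a></li>
     <li><a href="000002.html">[Rd] topic A</a></li>
     <li><a href="000003.html">[Rd] topic B</a></li>
   </ul>'
)
res <- count_month_topics(page, "2019-February")
print(res)
```

For a real month, `page` would come from `xml2::read_html()` on that month's subject.html, and the result could be piped into `knitr::kable()` as in the script.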
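The full dependency list is below. This is about ready to be set up as a scheduled job on Travis; it only needs to push the generated markdown file back to GitHub.
tools::package_dependencies(packages, recursive = TRUE) %>% unlist %>% unique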
[1] "xml2" "httr" "magrittr" "selectr" "curl" "jsonlite"
[7] "mime" "openssl" "R6" "methods" "stringr" "Rcpp"
[13] "tools" "askpass" "utils" "glue" "stringi" "sys"
[19] "stats" "evaluate" "highr" "markdown" "yaml" "xfun"
I don't yet know how to post a Markdown table in the forum, i.e. the content of the file 2019-February-disuss.md, so please head over to <https://github.com/XiangyunHuang/RGraphics/issues/5>
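If all the posts were scraped, would it be possible to classify them by how their topics relate to each other, and build a visualization front end with shiny that shows the hot discussions in each category by time and reply count?
If the original poster and the repliers of each post were extracted as well, one could also look at what specific people have been up to.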
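A rough sketch of the per-author idea follows. It assumes each post is an `<li>` holding the subject in an `<a>` and the author in an `<i>`, which is my reading of the pipermail index pages and should be checked against a real page. The HTML below is made up, and real pages would also need their navigation items filtered out.

```r
# Made-up HTML that mimics the assumed index layout: subject in <a>, author in <i>.
page <- xml2::read_html(
  '<ul>
     <li><a href="000001.html">[Rd] topic A</a> <i>Alice</i></li>
     <li><a href="000002.html">[Rd] topic A</a> <i>Bob</i></li>
     <li><a href="000003.html">[Rd] topic B</a> <i>Alice</i></li>
   </ul>'
)

items <- rvest::html_nodes(page, "li")
posts <- data.frame(
  # html_node() returns one match per <li>, or a missing node that becomes NA
  topic  = trimws(rvest::html_text(rvest::html_node(items, "a"))),
  author = trimws(rvest::html_text(rvest::html_node(items, "i"))),
  stringsAsFactors = FALSE
)

# Number of posts per author, most active first
posts_per_author <- sort(table(posts$author), decreasing = TRUE)
print(posts_per_author)
```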
Judging by their length, the post IDs are six digits, i.e. fewer than a million, whereas SO is already at eight digits.
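yihui, the Shiny version has been upgraded: the data is loaded only once on startup, and the slider is used only to filter the time range.
I also drew a simple bar chart showing the total number of replies per month.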
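For anyone who wants to reproduce the monthly bar chart outside the app, here is a base-R sketch. The `posts` data frame is toy data I made up, and the column names `year_mon` and `count` are assumptions borrowed from the outputs earlier in the thread, not the app's actual code.

```r
# Parse English month names regardless of the system locale
invisible(Sys.setlocale("LC_TIME", "C"))

# Toy data, made up for illustration: reply counts of threads with their month.
posts <- data.frame(
  year_mon = c("2019-January", "2019-January", "2019-February"),
  count = c(5L, 3L, 7L),
  stringsAsFactors = FALSE
)

# Total replies per month
monthly <- tapply(posts$count, posts$year_mon, sum)

# tapply() orders the months alphabetically, so reorder them chronologically
month_date <- as.Date(paste0(names(monthly), "-01"), format = "%Y-%B-%d")
monthly <- monthly[order(month_date)]

barplot(monthly, las = 2, ylab = "Total replies")
```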
This forum probably doesn't support table syntax, right?
--- <https://d.cosx.org/d/420350-pagedown-pdf/9>