一个爬网页的练习：看看 R 邮件列表中最热门的讨论是什么

dapengde · 2019年6月15日

Cloud2016 好建议，我把 dygraphics 加上了，而 ggplot 改成了精确到年。

掉线可能是为了节省机时。因为 shinyapps.io 的免费机时有限，停一会儿看用户没操作，就给断了。

Cloud2016 · 2019年6月15日

dapengde 原来如此，谢谢解惑

tctcab · 2019年6月15日

容我插一句，
统计之都是不是也能趴一趴

就像这个月外面那个讨论极乐净土的长帖子就值得挖出来读一读啊。

想挖掘一些这种讨论的热火朝天的尘封的帖子，扩充@dapeng1978 那个sinx语录包

Cloud2016 · 2019年6月15日

tctcab 我甚至觉得可以整理一下作为一篇主站文章，而且保留讨论的形式，所有参与讨论的人都是作者

Jiena · 2019年6月16日

yanglei
没事，问的好。建议先试 r-devel.R，数量小一些。以下是侦察过程：

建立好年月数列，看看trans_month是不是这个格式，比如“1997-April”：

#Loading the rvest package
library('rvest')
library(purrr)
library(tidyverse)
## Create year month list
current_month=as.Date(cut(Sys.Date(), "month"))
date_seq=seq(from=as.Date("1997-04-01"),to=current_month , by="1 month")
trans_month=format(as.Date(date_seq), "%Y-%B")

### Make a function to fetch the summarized data frame for every month
unnecessary=c("thread.html#start",
              "author.html#start",
              "date.html#start",
              "http://www.stat.math.ethz.ch/mailman/listinfo/r-devel")

先以 1997 年 4 月为例子在本地跑：

  #Specifying the url for desired website to be scraped
  url_base <- 'https://stat.ethz.ch/pipermail/r-devel/%s/subject.html'
  
  test_web=read_html(sprintf(url_base, '1997-April' ) )
  # Extract the URLs
  url_ <- test_web %>%
    rvest::html_nodes("a") %>%
    rvest::html_attr("href")
  # Extract the link text
  link_ <- test_web %>%
    rvest::html_nodes("a") %>%
    rvest::html_text()
  test_df=data.frame(link = link_, url = url_)
  test_df2=test_df[which(!is.na(test_df$url)),]
  test_df3=test_df2[which(! test_df2$url %in% unnecessary),]
  test_df3$link=gsub("[\r\n]", " ", test_df3$link)
  test_df3$url=paste0("https://stat.ethz.ch/pipermail/r-devel/", "1997-April","/",test_df3$url)
  test_df4=test_df3 %>% 
    group_by(link) %>%
    dplyr::mutate(
      Link_url = dplyr::first(url)
    ) %>% group_by(link,Link_url ) %>% 
    summarize(
      count=n()
    ) 
  test_df4=data.frame(test_df4)

这样一步一步跑哪里出错一清二楚。最后看看您的 test_df4是不是如下：

> head(test_df4)
                                                                link                                                      Link_url count
1 R-alpha:  contributed packages -- Yes, use library/<package>/.. !  https://stat.ethz.ch/pipermail/r-devel/1997-April/017072.html     2
2                                                   R-alpha: ==NULL  https://stat.ethz.ch/pipermail/r-devel/1997-April/017131.html     1
3                                               R-alpha: as.numeric  https://stat.ethz.ch/pipermail/r-devel/1997-April/017023.html     3
4                                   R-alpha: Be nice to the English  https://stat.ethz.ch/pipermail/r-devel/1997-April/017051.html     1
5                                               R-alpha: binom.test  https://stat.ethz.ch/pipermail/r-devel/1997-April/017075.html     1
6                R-alpha: Bug & Patch  in  dbeta.c  (0.50 - PreR 7)  https://stat.ethz.ch/pipermail/r-devel/1997-April/017048.html     3

yanglei · 2019年6月16日

Jiena 可以了谢谢

dapengde · 2019年6月17日

tctcab 以前热衷于挖坟的时候我试过，遭遇统计之都瀑布流，不会扒，遂弃之。

晒泥版又升级了，添加了所有发帖回帖的作者统计数据。这导致数据量太大，载入需要花 10 秒 (为此特意加了个进度条随机来几句唐诗，免得没耐心等)。载入之后就好了。

R-devel 作者第一位是 Prof Brian Ripley，发了 4000 篇。前十里我只认识 Uwe Ligges 这个名字，发了将近 1000。

R-help 作者排第一的是 David Winsemius，居然写了 10000+ 多篇。Uwe Ligges 也发了 6000+。

Liechi · 2019年6月17日

dapengde 你这泥晒得真不错！

Cloud2016 · 2019年6月17日

dapengde Jiena tctcab 欢迎各位整理一下给主站投稿 <https://cosx.org/contribute/>

Jiena · 2019年6月17日

dapengde @tctcab Cloud2016 @yihui
统计之都（本坛子）也爬好了。从2006年建站开始截至到 2019-06-17 一共有28310篇帖子， 1422 页（也就是说如果您想手动爬到底的话，要单击“载入更多” 1421 次。。。所以还是写个程序吧。），第一篇帖子是： https://d.cosx.org/d/5-5 ，题目为”hehe 瞎提意见“，这篇帖子是 2006-05-20T21:32:35+00:00 创立的。

根据讨论回复数量的前十名：

> everything_cos2_replies[1:10,]
   Comment_count Participant_count                        URL create_Date last_post_Date active_days
1           1261              1184  https://d.cosx.org/d/1553  2006-09-03     2019-06-14        4667
2            923               850 https://d.cosx.org/d/13775  2009-01-16     2017-12-03        3243
3            802               714   https://d.cosx.org/d/818  2006-08-01     2019-04-27        4652
4            796               589  https://d.cosx.org/d/7865  2007-09-14     2018-07-18        3960
5            633               590  https://d.cosx.org/d/2013  2006-09-29     2010-09-21        1453
6            533               289  https://d.cosx.org/d/9233  2007-12-29     2018-08-06        3873
7            528               450  https://d.cosx.org/d/8949  2007-12-09     2018-11-18        3997
8            520               424 https://d.cosx.org/d/14985  2009-04-22     2019-05-29        3689
9            507               490 https://d.cosx.org/d/13875  2009-02-02     2017-12-01        3224
10           359                25 https://d.cosx.org/d/14141  2009-02-28     2017-10-19        3155

根据讨论参与人数的前十名帖子：

> everything_cos2_participants[1:10,]
   Comment_count Participant_count                        URL create_Date last_post_Date active_days
1           1261              1184  https://d.cosx.org/d/1553  2006-09-03     2019-06-14        4667
2            923               850 https://d.cosx.org/d/13775  2009-01-16     2017-12-03        3243
3            802               714   https://d.cosx.org/d/818  2006-08-01     2019-04-27        4652
4            633               590  https://d.cosx.org/d/2013  2006-09-29     2010-09-21        1453
5            796               589  https://d.cosx.org/d/7865  2007-09-14     2018-07-18        3960
6            507               490 https://d.cosx.org/d/13875  2009-02-02     2017-12-01        3224
7            528               450  https://d.cosx.org/d/8949  2007-12-09     2018-11-18        3997
8            520               424 https://d.cosx.org/d/14985  2009-04-22     2019-05-29        3689
9            344               293  https://d.cosx.org/d/2070  2006-10-05     2017-03-07        3806
10           533               289  https://d.cosx.org/d/9233  2007-12-29     2018-08-06        3873

*根据活动天数（最新的回复-初始贴的日期）最久的前十名：

> everything_cos2_active_days[1:10,]
   Comment_count Participant_count                       URL create_Date last_post_Date active_days
1            160               121   https://d.cosx.org/d/35  2006-05-25     2019-04-03        4696
2           1261              1184 https://d.cosx.org/d/1553  2006-09-03     2019-06-14        4667
3            802               714  https://d.cosx.org/d/818  2006-08-01     2019-04-27        4652
4            205               160    https://d.cosx.org/d/1  2006-05-19     2018-10-24        4541
5             34                31  https://d.cosx.org/d/464  2006-06-22     2018-10-19        4502
6             24                19 https://d.cosx.org/d/2739  2006-11-30     2019-03-12        4485
7             18                 7 https://d.cosx.org/d/3004  2006-12-15     2019-02-04        4434
8             42                26 https://d.cosx.org/d/2963  2006-12-12     2019-01-14        4416
9             44                30 https://d.cosx.org/d/6142  2007-05-01     2019-05-24        4406
10            30                20   https://d.cosx.org/d/16  2006-05-23     2018-05-30        4390

csv 数据： https://github.com/jienagu/tidyverse_examples/blob/master/COS_raw_data.csv
源代码： https://github.com/jienagu/tidyverse_examples/blob/master/Web_scraping_COS_Chinese.R

我目前用的是 window，跟中文八字不合，有兴趣的小伙伴们可以根据我贴的源代码加上中文标题，标题在 attributes 的列名就是title.

Jiena · 2019年6月17日

dapengde

以前热衷于挖坟的时候我试过，遭遇统计之都瀑布流

嗯嗯，确实是需要一些剑走偏锋的方法来挖这部分的信息。。。

tctcab · 2019年6月18日

Jiena

手动查了一下，cos论坛基于flarum，查了一下是有REST api来提取讨论的，地址是

https://d.cosx.org/api/discussions?page[limit]=50

试了一下每个请求最大返回50串，

这个好处是不用20串一扒减少了三分之二对cos服务器的轰炸…（28310/50=568.6）
另外代码也可以精简很多，省去rvest的操作，直接jsonlite读数据, 甚至下一页的链接也可以在api返回的links里找到

最终代码：

firsturl = "https://d.cosx.org/api/discussions?page[limit]=50" 
cos.li = list()

counter = 1
cos.li[[counter]] = jsonlite::fromJSON(firsturl)

# if the next link exists, scrape it.
while ("next" %in% names(cos.li[[counter]]$links)) {
  counter = counter +1
  cos.li[[counter]] = jsonlite::fromJSON(cos.li[[counter-1]]$links[['next']])
}

## combine all df
cos.all = do.call(rbind,
                  lapply(cos.li, function(l) {
                    df = l$data$attributes
                    df$link = paste0("https://d.cosx.org/d/", l$data$id)
                    return(df)
                  }))

不过碰到个问题，爬到“https://d.cosx.org/api/discussions?page[limit]=50?page[offset]=9550”
的时候出现http 500 内部错误，浏览器直接可以打开但是R里用上面jsonlite::fromJSON的话始终会出错，好奇怪

dapengde · 2019年6月18日

tctcab 试了一下，我觉得是问题来自 9550 的下下一页：

<https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=9650>

浏览器打开就是错误 500。

为了跳过这一页，我把楼上 Jiena tctcab 的方法柔在了一起：

# get the max page (1422 pages on  2019-06-18)
get_maxpage <- function(page_range = 1421:1500){
  for (i in page_range) {
    print(paste(Sys.time(), i))
    COS_link <- xml2::read_html(paste0('https://d.cosx.org/all?page=', i))
    url_vector= rvest::html_attr(rvest::html_nodes(COS_link, "a"), "href")
    last_link = url_vector[length(url_vector)]
    last_number <- as.numeric(gsub("[https://d.cosx.org/all?page=]", "",last_link) )
    if(last_number <= i - 1){
      message('There are ', i, ' pages with 20 posts on each.')
      return(i)
    }
  }
}

# get json from cos
get_js <- function(url){
  print(paste(Sys.time(), url))
  mytry <- try(jsonlite::fromJSON(url))
  if(class(mytry) == 'try-error') return(NULL)
  jsonlite::fromJSON(url)
}

maxpage <- get_maxpage(1421:1500)
cos_url <- paste0('https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=', seq(0, maxpage * 20, 50) + 50)

cos_js <- lapply(cos_url, get_js)

可以看到， 500 错误不仅出现在读 9550 时，还出现在多处，例如 10350 和 11650，因为他们的下下一页用浏览器打开就是 500 错误：

<https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=10450>
<https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=11750>

dapengde · 2019年6月18日

晒泥版新增 COS 页面！中文标题加进来啦 Jiena

<http://rchive.pzhao.org>

Liechi · 2019年6月18日

dapengde 这个 https 的打不开，上边 http 的可以。

COS 的链接们打不开。

tctcab · 2019年6月18日

dapengde
超链接里有引号，估计没拼好

dapengde · 2019年6月18日

Liechi tctcab 已修复。

tctcab · 2019年6月18日

dapengde
继续提意见：首先我觉得三个不同页面放在不同tab下不大好，因为默认为rdev，即使只对cos感兴趣的用户，刷新一下页面就又跑去加载rdev了，还是用单独页面好一。

其次，经常出现断开连接，在调整年限范围之后。

再次，数据处理时间还是好慢，是不是可以考虑sql

dapengde · 2019年6月18日

tctcab

你是说把他仨分成三个彼此独立的站点吗？那倒是容易，而且也快。
调整年限后断开连接的情况我还没遇见。一般情况下，用户几分钟没活动时就会自动断开，是为了节省 shinyapps.io 的机时。
确实慢，因为加载时要从 https://stat.ethz.ch/ 上读取最新的数据。等哪位把 trivis 弄好吧。此外，authors 统计数据载入得最慢，用处似乎不大，玩两天就删掉得了。

Jiena · 2019年6月18日

tctcab
嗯，好厉害，这样的确可以省很多！太棒啦！