tctcab, I gave it a try, and I think the problem comes from the page two steps after offset 9550:
<https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=9650>
Opening it in a browser returns a 500 error.
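For anyone who would rather check this from R than from the browser, here is a rough sketch. `api_url()` and `probe_offset()` are helper names I made up for illustration, and it assumes the httr package is installed; it is not part of the code posted earlier in the thread.

```r
# Build the API URL for one offset, using the same query string as the links in this thread.
# as.integer() keeps large offsets from being printed in scientific notation.
api_url <- function(offset, limit = 50) {
  paste0("https://d.cosx.org/api/discussions?page%5Blimit%5D=", as.integer(limit),
         "&page%5Boffset%5D=", as.integer(offset))
}

# Return the HTTP status code for one offset, or NA if the request itself fails
# (for example when there is no network connection). Requires the httr package.
probe_offset <- function(offset) {
  resp <- tryCatch(httr::GET(api_url(offset)), error = function(e) NULL)
  if (is.null(resp)) NA_integer_ else httr::status_code(resp)
}

# Usage (needs network access):
# sapply(c(9550, 9600, 9650), probe_offset)
```

A status of 500 for an offset would match what the browser shows; I have not checked what the other offsets return this way.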
To skip this page, I combined the approaches posted above by Jiena and tctcab:
# get the max page (1422 pages on 2019-06-18)
get_maxpage <- function(page_range = 1421:1500) {
  for (i in page_range) {
    print(paste(Sys.time(), i))
    COS_link <- xml2::read_html(paste0('https://d.cosx.org/all?page=', i))
    url_vector <- rvest::html_attr(rvest::html_nodes(COS_link, "a"), "href")
    last_link <- url_vector[length(url_vector)]
    # strip the URL prefix literally (fixed = TRUE), leaving only the page number;
    # without fixed = TRUE the brackets would be read as a regex character class
    last_number <- as.numeric(gsub("https://d.cosx.org/all?page=", "", last_link, fixed = TRUE))
    # stop at the first page whose last link does not point past page i - 1
    if (last_number <= i - 1) {
      message('There are ', i, ' pages with 20 posts on each.')
      return(i)
    }
  }
}
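# get json from cos; returns NULL when the request fails (e.g. on a 500 error)
get_js <- function(url) {
  print(paste(Sys.time(), url))
  mytry <- try(jsonlite::fromJSON(url))
  # inherits() is safe even when the result has more than one class
  if (inherits(mytry, 'try-error')) return(NULL)
  # reuse the parsed result instead of downloading the same URL a second time
  mytry
}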
maxpage <- get_maxpage(1421:1500)
# offsets 50, 100, ... in steps of 50; note that offset 0 is not requested
cos_url <- paste0('https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=', seq(0, maxpage * 20, 50) + 50)
cos_js <- lapply(cos_url, get_js)
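As the output shows, the 500 error occurs not only when reading offset 9550 but in several other places as well, for example 10350 and 11650, because the page two steps after each of them also returns a 500 error when opened in a browser: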
<https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=10450>
<https://d.cosx.org/api/discussions?page%5Blimit%5D=50&page%5Boffset%5D=11750>
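Since `get_js` returns NULL for a failed request, the failed offsets can be listed without scrolling through the printed log. A minimal sketch; `split_results()` is a hypothetical helper of mine, and the toy data below is made up so the example runs without network access:

```r
# Split a list of results (NULL = failed request) into the URLs that failed
# and the results that were downloaded successfully.
split_results <- function(urls, results) {
  failed <- vapply(results, is.null, logical(1))
  list(failed_urls = urls[failed], ok = results[!failed])
}

# Toy example with made-up data: the second request "failed".
toy_urls <- c("offset=9500", "offset=9550", "offset=9600")
toy_results <- list(list(data = 1), NULL, list(data = 3))
out <- split_results(toy_urls, toy_results)
out$failed_urls   # "offset=9550"
length(out$ok)    # 2
```

With the real objects this would be `split_results(cos_url, cos_js)`, which should give the full list of offsets that hit the 500 error.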