Scraping academician information from the web

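zhangbing4502431 · June 20, 2013

http://www.casad.cas.cn/channel.action?chnlid=209. Clicking a name on that page shows the information for that academician. I'd like to collect details such as each academician's birthplace and the school they graduated from. Entering it all by hand is too tedious. Could someone write a program to download this data? Many thanks.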

jcl · June 20, 2013

I'm still learning this myself; I just can't figure out how to set the parameters.

doctorjxd · June 20, 2013

[s:14]

zhangbing4502431 · June 20, 2013

Replying to doctorjxd (post #3): How did you produce that? Could you share the code?

wangpeng692 · June 20, 2013

Impressive. I'd like to see the code too.

zhangbing4502431 · June 21, 2013

Replying to wangpeng692 (post #5): I'm waiting for his code too. The wait is agonizing; the expert hasn't shown up.

manxingxing · June 23, 2013

library(stringr)
library(XML)

# Parse one academician's page and extract the personal details.
extract_from_url = function(url) {
  doc = htmlParse(url)
  txt = unlist(xpathApply(doc, "//div[@id='introduction']//dd[@class='lr']", xmlValue))
  get_info(txt)
}

# Extract personal information from the biography text.
get_info = function(str) {
  list(
    year = get_birth_year(str),
    month = get_birth_month(str),
    day = get_birth_day(str),
    addr = get_birth_addr(str),
    graduate = get_graduate_from(str)
  )
}

# Note: perl() has been deprecated since stringr 1.0.0. The default regex
# engine supports the lookarounds used here, so plain pattern strings are used.
# Each date pattern takes the digits that precede 年/月/日 in a clause ending in 生 (born).
get_birth_year = function(str) {
  as.numeric(str_extract(str, "\\d{4}(?=年[^。、，,.]*生)"))
}
get_birth_month = function(str) {
  as.numeric(str_extract(str, "\\d{1,2}(?=月[^。、，,.]*生)"))
}
get_birth_day = function(str) {
  as.numeric(str_extract(str, "\\d{1,2}(?=日[^。、，,.]*生)"))
}

# Birthplace: text after 生于 (born in), before 人。 (native of), or after 籍贯 (ancestral home).
alt1 = "(?<=生于)[^。、，,.[:digit:]]+"
alt2 = "[^。、，,.[:digit:]]+(?=人。)"
alt3 = "(?<=籍贯)[^。、，,.[:digit:]]+"
place_expr = paste(alt1, alt2, alt3, sep="|")
get_birth_addr = function(str) {
  str_extract(str, place_expr)
}

# School: text after 毕业于 (graduated from).
get_graduate_from = function(str) {
  str_extract(str, "(?<=毕业于)[^。、，,.[:digit:]]+")
}

## Collect the links to all academician pages
url = "http://sourcedb.cas.cn/sourcedb_ad_cas/zw2/ysxx_xxbwz/qtysmd/index.html"
doc = htmlParse(url, encoding='utf-8')
links = xpathSApply(doc, path="//td/a", fun=function(c){
  href = xmlGetAttr(c, 'href')
  names(href) = xmlValue(c)
  return(href)
})
# replace relative links with absolute links
links = sub(links, pattern="\\.\\.", replacement="http://sourcedb.cas.cn/sourcedb_ad_cas/zw2/ysxx_xxbwz")

# grab information from each URL
result = list()
for(i in seq_along(links)) {
  cat(names(links)[i], ":", links[i], "\n")
  Sys.sleep(0.3) # pause between requests so the server is not flooded
  result[[i]] = c(name = names(links)[i], extract_from_url(links[i]))
}
# combine the per-person lists into one data frame, one row per academician
result = do.call(rbind, lapply(result, as.data.frame))
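For anyone who wants to try the link-collection step without hitting the live site, here is a minimal sketch that runs the same XPath query on a small HTML fragment. The fragment, the names in it, and the `example/...` paths are all made up for illustration; only the `//td/a` query and the base URL come from the code in this post.

```r
library(XML)

# Hypothetical HTML fragment standing in for the index page.
# The names and href values are invented for illustration only.
html <- paste0(
  "<html><body><table><tr>",
  "<td><a href='../example/zhang_san.html'>Zhang San</a></td>",
  "<td><a href='../example/li_si.html'>Li Si</a></td>",
  "</tr></table></body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# One XPath query selects every <a> element nested inside a <td>.
nodes <- getNodeSet(doc, "//td/a")
links <- sapply(nodes, xmlGetAttr, "href")   # the href attribute of each link
names(links) <- sapply(nodes, xmlValue)      # the link text becomes the name

# Replace the first ".." in each relative link with the base URL.
base <- "http://sourcedb.cas.cn/sourcedb_ad_cas/zw2/ysxx_xxbwz"
links <- sub("\\.\\.", base, links)
print(links)
```

The result is a named character vector, which is why the download loop can use `names(links)[i]` for the person's name and `links[i]` for the page address.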

doctorjxd · June 23, 2013

Replying to manxingxing (post #7):

Nice! I'll study this.

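zhangbing4502431 · June 23, 2013

Replying to manxingxing (post #7): Thank you very much. I plan to analyze this data with spatial statistics methods and then post the results on this forum so we can all learn from them.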

wangpeng692 · June 26, 2013

I understand how you get from

http://www.casad.cas.cn/channel.action?chnlid=209

to this address:

http://sourcedb.cas.cn/sourcedb_ad_cas/zw2/ysxx_xxbwz/qtysmd/index.html

After that I'm lost.

What would I need to learn in order to understand this code?

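doctorjxd · June 26, 2013

Replying to wangpeng692 (post #10):

Regular expressions.

If you're short on time, read Chapter 7 of 《R数据操作》.

If you have the time, I'd still recommend Mastering Regular Expressions (《精通正则表达式》), published by O'Reilly.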
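As a small taste of the regular-expression part, here is a sketch of the lookahead and lookbehind idea that the extraction functions in this thread rely on. It uses base R (`regexpr` with `perl = TRUE`) rather than stringr, the `first_match` helper is a hypothetical name, and the biography sentence is invented for illustration, so treat it as a demonstration and not the thread author's exact code.

```r
# Hypothetical biography sentence, invented for illustration.
bio <- "张三，物理学家。1936年5月12日生于江苏南京。1958年毕业于北京大学物理系。"

# Hypothetical helper: first match of a Perl-style pattern, or NA if none.
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(text, m)
}

# Lookahead (?=...): four digits count only when "年 ... 生" follows them
# within the same clause, so a graduation year is not mistaken for a birth year.
year <- first_match("\\d{4}(?=年[^。、，,.]*生)", bio)

# Lookbehind (?<=...): take the text that comes right after a marker phrase,
# stopping at the next punctuation mark or digit.
place  <- first_match("(?<=生于)[^。、，,.\\d]+", bio)
school <- first_match("(?<=毕业于)[^。、，,.\\d]+", bio)

print(c(year = year, place = place, school = school))
```

The lookaround parts check the surrounding text without becoming part of the match, which is why only the year, the place, and the school name come back. This assumes the script is run in a UTF-8 locale.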

wangpeng692 · June 26, 2013

Replying to doctorjxd (post #11): Thanks for the pointers.