• R language
  • How many jobs does R actually offer? Answer: 2500

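This all started when someone counted the number of R jobs on indeed: http://www.datasciencecentral.com/profiles/blogs/sas-dominates-analytics-job-market-r-up-42

Then someone else on Weibo complained that the R count was inflated, so I went to indeed to see for myself.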

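Searching for R on indeed returns more than 37,000 results, but most of them have nothing to do with the R language, which makes counting R jobs difficult. Fortunately, indeed offers an exact phrase search, and R usually appears next to other software in job requirements. Combining these two facts makes it possible to estimate the number of R jobs.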

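For example, when a posting asks for SAS or R, the two names usually appear in English in one of four forms: "SAS, R", "SAS or R", "R, SAS", "R or SAS". With indeed's exact phrase search we can count the postings where SAS and R appear next to each other. The same method gives the counts for other software appearing immediately before or after R, and adding these up gives the total number of R jobs.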
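As a minimal sketch of the idea above, the four exact phrases for one tool can be generated and turned into search URLs. The helper name `phrase_queries` is hypothetical; the URL layout (`%22` for the quotes, `+` for spaces, the `l` and `radius` parameters) is copied from the URLs used in this post, and the phrases are built without the comma, as in those URLs.

```r
# Hypothetical helper: build the four exact-phrase queries for one tool.
# `tool` must already be URL-safe (e.g. "C%2B%2B" rather than "C++").
phrase_queries <- function(tool) {
  phrases <- c(
    paste(tool, "R"),      # e.g. "SAS R"
    paste(tool, "or R"),   # e.g. "SAS or R"
    paste("R", tool),      # e.g. "R SAS"
    paste("R or", tool)    # e.g. "R or SAS"
  )
  # Wrap each phrase in %22 (an encoded double quote) and use + for spaces
  paste0("%22", gsub(" ", "+", phrases, fixed = TRUE), "%22")
}

queries <- phrase_queries("SAS")
urls <- paste0("http://www.indeed.com/jobs?q=", queries,
               "&l=united+states&radius=0")
print(urls)
```

Each of the four URLs corresponds to one exact phrase search for that tool.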

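After counting across the main software packages, R comes out at roughly 3116 jobs. That is far lower than the result of simply typing R into indeed, but still higher than the 1693 obtained in the post linked above.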

require(RCurl)  # provides getURL()
require(XML)    # loaded in the original script; not used below
rm(list = ls()) # clears the workspace

# For each tool in x, count the indeed postings in which the tool appears
# next to R. Tool names must already be URL-encoded (e.g. "C%2B%2B" for C++).
RJobs <- function(x) {
  jobs <- matrix(0, length(x), 5)
  # URL templates; "AnotherTool" is a placeholder replaced by each tool name
  url1 <- c("http://www.indeed.com/jobs?q=%22AnotherTool+R%22&l=united+states&radius=0")
  url2 <- c("http://www.indeed.com/jobs?q=%22AnotherTool+or+R%22&l=united+states&radius=0")
  url3 <- c("http://www.indeed.com/jobs?q=%22R+AnotherTool%22&l=united+states&radius=0")
  url4 <- c("http://www.indeed.com/jobs?q=%22R+or+AnotherTool%22&l=united+states&radius=0")
  # All four phrases combined with "or", so each posting is counted once per tool
  url5 <- c("http://www.indeed.com/jobs?q=%22AnotherTool+R%22+or+%22AnotherTool+or+R%22+or+%22R+AnotherTool%22+or+%22R+or+AnotherTool+%22&l=United+States&radius=0")
  url <- c(url1, url2, url3, url4, url5)
  # One row per tool, one column per URL template
  url.new <- t(sapply(x, function(x) gsub("AnotherTool", x, url)))
  # Download one result page and extract the total from the "Jobs 1 to ..." line
  count.func <- function(page) {
    webpage <- getURL(page)
    webpage <- readLines(tc <- textConnection(webpage)); close(tc)
    aa <- grep("Jobs 1 to ", webpage)
    # No such line means no results; otherwise the total is the 7th token
    count <- if (length(aa) == 0) 0 else
      as.integer(gsub("[^0-9]", "", strsplit(webpage[aa], " ")[[1]][7]))
    return(count)
  }
  for (i in seq_along(x)) {
    for (j in 1:5) jobs[i, j] <- count.func(url.new[i, j])
  }
  colnames(jobs) <- c("* R", "* or R", "R *", "R or *", "All")
  rownames(jobs) <- x
  return(jobs)
}

soft <- c("SAS", "SPSS", "Minitab", "Stata", "JMP", "Statistica", "Systat", "BDMP",
          "Python", "Matlab", "Excel", "SQL", "java", "javascript", "perl", "PHP",
          "Fortran", "S-Plus", "Linux", "C%2B%2B", "Access", "Ruby", "Shell", "Coffeescript",
          "Gauss") ## C%2B%2B is the URL-encoded form of C++; a literal C++ in place of "AnotherTool" would not search correctly
system.time(jobs <- RJobs(soft))
jobs <- jobs[order(-jobs[, 'All']), ]  # sort by the combined count, descending
jobs

             * R * or R R * R or * All
SAS          467    120 311     79 961
Matlab       200     52 319     64 628
SPSS         216     42 145     34 434
Python       100      6  77     29 209
SQL          105      6  89     10 209
Stata         58     15  79     10 160
S-Plus        45     23  72      7 147
C%2B%2B       58      0   7      2  67
java          33      2  28      2  65
Excel         34      0  19      1  52
perl          21      2  25      4  52
JMP           24      6  20      2  50
Minitab       10      0  13      7  28
Ruby           6      2  14      0  22
Linux          7      0   6      1  14
Access        10      0   3      0  13
PHP            4      0   6      0  10
Statistica     3      0   5      0   8
javascript     3      0   2      0   5
Shell          1      0   3      0   4
Fortran        2      0   1      0   3
Systat         2      0   0      0   2
Gauss          1      0   0      0   1
BDMP           0      0   0      0   0
Coffeescript   0      0   0      0   0
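The count extraction in the script depends on the total being the 7th space-separated token of the "Jobs 1 to ..." line, which breaks if the surrounding markup shifts. A slightly more defensive variant might match the number after "of" instead. This is only a sketch: `extract_count` is a hypothetical name, and the sample line is made up on the assumption that the page shows something like "Jobs 1 to 10 of 2,513".

```r
# Hypothetical, offline-testable variant of the count extraction.
# Assumes the result line reads like "Jobs 1 to 10 of 2,513".
extract_count <- function(lines) {
  hit <- grep("Jobs 1 to ", lines, fixed = TRUE, value = TRUE)
  if (length(hit) == 0) return(0L)          # no result line: zero jobs
  # Pull out "of <number>" from the first matching line
  m <- regmatches(hit[1], regexpr("of [0-9,]+", hit[1]))
  if (length(m) == 0) return(0L)
  as.integer(gsub("[^0-9]", "", m))         # drop "of ", spaces and commas
}

# Made-up sample page, not real indeed markup
sample_page <- c("<html>", "<div>Jobs 1 to 10 of 2,513</div>", "</html>")
print(extract_count(sample_page))
```

Because it works on a character vector of lines, the function can be checked without hitting the website.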

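However, when several tools are listed in a row, this method double-counts some postings. For example, "SAS, R, Matlab" is counted once under "SAS, R" and again under "R, Matlab", so that posting shows up twice. The real number of postings should therefore be below 3116 but above 1558 (half of 3116, presumably because a posting is counted at most twice: once for the tool before R and once for the tool after it). So how do we remove the duplicates? One workable approach is to put all of these exact phrases into a single URL, so that the returned results are deduplicated automatically.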
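The double counting can be illustrated with a toy example. The posting IDs here are entirely made up; the point is only that summing per-tool counts overstates the total, while a union (which is what a single combined search effectively returns) does not.

```r
# Made-up posting IDs matched by each pair of phrases
sas_hits    <- c(101, 102, 103)   # postings matching "SAS R" etc.
matlab_hits <- c(103, 104)        # postings matching "R Matlab" etc.

# Posting 103 reads "SAS, R, Matlab" and so matches both searches
naive_total        <- length(sas_hits) + length(matlab_hits)
deduplicated_total <- length(union(sas_hits, matlab_hits))

print(c(naive = naive_total, deduplicated = deduplicated_total))
```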

So we open this monster of a URL and get the total number of R jobs: http://www.indeed.com/jobs?q=%22SAS+R%22+or+%22SAS+or+R%22+or+%22R+SAS%22+or+%22R+or+SAS%22+or+%22Matlab+R%22+or+%22Matlab+or+R%22+or+%22R+Matlab%22+or+%22R+or+Matlab%22+or+%22SPSS+R%22+or+%22SPSS+or+R%22+or+%22R+SPSS%22+or+%22R+or+SPSS%22+or+%22Python+R%22+or+%22Python+or+R%22+or+%22R+Python%22+or+%22R+or+Python%22+or+%22SQL+R%22+or+%22SQL+or+R%22+or+%22R+SQL%22+or+%22R+or+SQL%22+or+%22Stata+R%22+or+%22Stata+or+R%22+or+%22R+Stata%22+or+%22R+or+Stata%22+or+%22S-Plus+R%22+or+%22S-Plus+or+R%22+or+%22R+S-Plus%22+or+%22R+or+S-Plus%22+or+%22C%2B%2B+R%22+or+%22C%2B%2B+or+R%22+or+%22R+C%2B%2B%22+or+%22R+or+C%2B%2B%22+or+%22java+R%22+or+%22java+or+R%22+or+%22R+java%22+or+%22R+or+java%22+or+%22Excel+R%22+or+%22Excel+or+R%22+or+%22R+Excel%22+or+%22R+or+Excel%22+or+%22perl+R%22+or+%22perl+or+R%22+or+%22R+perl%22+or+%22R+or+perl%22+or+%22JMP+R%22+or+%22JMP+or+R%22+or+%22R+JMP%22+or+%22R+or+JMP%22+or+%22Minitab+R%22+or+%22Minitab+or+R%22+or+%22R+Minitab%22+or+%22R+or+Minitab%22+or+%22Ruby+R%22+or+%22Ruby+or+R%22+or+%22R+Ruby%22+or+%22R+or+Ruby%22+or+%22Linux+R%22+or+%22Linux+or+R%22+or+%22R+Linux%22+or+%22R+or+Linux%22+or+%22Access+R%22+or+%22Access+or+R%22+or+%22R+Access%22+or+%22R+or+Access%22+or+%22PHP+R%22+or+%22PHP+or+R%22+or+%22R+PHP%22+or+%22R+or+PHP%22&l=united+states&radius=0

So how many is it?

The answer: 2513!
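A combined query like the one used for this final count does not have to be typed by hand. The following is a sketch of how it might be assembled from a vector of tool names; `combined_url` is a hypothetical helper, and the URL layout is copied from the URLs in this post.

```r
# Hypothetical helper: one search URL that ORs together all four
# exact phrases for every tool, so overlapping postings are counted once.
# Tool names must already be URL-safe (e.g. "C%2B%2B" for C++).
combined_url <- function(tools) {
  phrases <- unlist(lapply(tools, function(tool) {
    c(paste(tool, "R"), paste(tool, "or R"),
      paste("R", tool), paste("R or", tool))
  }))
  # %22 is an encoded double quote; spaces become +
  quoted <- paste0("%22", gsub(" ", "+", phrases, fixed = TRUE), "%22")
  paste0("http://www.indeed.com/jobs?q=",
         paste(quoted, collapse = "+or+"),
         "&l=united+states&radius=0")
}

print(combined_url(c("SAS", "Matlab")))
```

Passing the 17 tools with at least 10 hits (SAS through PHP in the table) should reproduce a URL of the same shape as the one above.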

Kudos for being so thorough!

What is getURL() for here? It looks like readLines() can read a URL directly.

Reply to 谢益辉 (post #4): I just tried using readLines() directly, and it does work.

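Perhaps readLines() and getURL() handle the page differently: the former reads in a dozen or so more lines than the latter. I haven't looked into what exactly differs, though. Hopefully someone who knows can explain.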

Back again to give more kudos for the thoroughness!