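We usually treat one Weibo post as one topic, whereas a single person usually spans several topics; our Kai-Fu Lee, for example, follows a great many things! To reduce the interference from low-frequency topics, I think you could first compute TF-IDF for each celebrity's doc over the corpus of all celebrities, filter the terms on that, and only then run topic modeling, which also feels more reasonable.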
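One possible reading of the suggestion above, as a rough base R sketch: treat each person as one doc, score terms by TF-IDF across all people, and keep only the terms that clear a cutoff before topic modeling. The counts, the names `counts`, `tfidf`, `keep`, and the `threshold` value are all made up for illustration; the original post does not specify any of them.

```r
# Hypothetical term counts: rows are people (one doc per person), columns are terms
counts <- matrix(
  c(5, 0, 2, 1,
    4, 3, 0, 1,
    6, 0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("person_a", "person_b", "person_c"),
                  c("weibo", "startup", "travel", "haha"))
)

# Term frequency: counts normalised by each person's total token count
tf <- counts / rowSums(counts)

# Inverse document frequency: log(N / number of docs containing the term)
idf <- log(nrow(counts) / colSums(counts > 0))

# Weight every column (term) by its idf
tfidf <- sweep(tf, 2, idf, `*`)

# Keep terms whose best TF-IDF score across people reaches an assumed cutoff
threshold <- 0.1
keep <- colnames(tfidf)[apply(tfidf, 2, max) >= threshold]

# Reduced count matrix that would then be passed on to topic modeling
filtered <- counts[, keep, drop = FALSE]
print(keep)
```

Terms used by every person get an idf of zero here and drop out, so the cutoff mainly decides how distinctive a term has to be for at least one person.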

On the short-text problem you raised, you could look at how Hu Yihui handles Twitter in a classic visualization paper of his. He considers tf-idf better suited to short texts than LDA: "But in our application, we found LDA not any better than tf-idf for identifying meaningful clusters. Sometimes LDA clusters messages that have no words in common, because LDA treats them as belonging to the same topic. The problem is that with short messages such as tweets, these assignments are not always meaningful. With tf-idf, messages in the same cluster must share at least some words, making the cluster easier to interpret, even if it is not always semantically “correct.”"
For details see http://www.research.att.com/export/sites/att_labs/techdocs/TD_100840.pdf
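[Unknown user] Thanks, Yongzi-ge~ I tried both people and individual Weibo posts as docs and filtered the terms with tf-idf in each case, but I haven't tried applying the person-level filter to the posts; I'll try that later~ I also feel that clustering people on the LDA topics after dimension reduction works only so-so, but in my experiments, using each weibo as a doc gives quite good clusters of words ><...
If it's purely for recommendation, I think it still needs more experiments... for example, expanding the texts a bit, or training the model on a corpus of the content to be recommended.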
What's going on here?

[1] 1
[1] "MY_ERROR: Error in if (uid == roauth$webUser) stop(\"Can't search the current user, please change an account to login.\"): argument is of length zero\n"
Error in print(dim(a)) : object 'a' not found
In addition: Warning message:
Can not crawl any page now. May be forbidden by Sina temporarily.
[Unknown user] Did the simulated login succeed?
[Unknown user] It returned
Login successfully!
So it should have succeeded, I think.
4 days later
1
"MY_ERROR: Error: The pages were out of range!\n"
Error in print(dim(a)) : object 'a' not found
In addition: Warning messages:
1: In web.search.user(name ) : NAs introduced by coercion
2: Can not crawl any page now. May be forbidden by Sina temporarily.
>

I'm stuck here...
It's Hu Yifan, not Hu Yihui. Back when I was at AT&T, people there couldn't keep Yifan and Yihui apart either...
[Unknown user] Hu Yihui... I can't stop laughing...
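2 months later
@pudding How expensive is it really to determine the number of topics? I ran your function on a dtm of roughly 5000 x 300, and even after running all night it hadn't produced a result.
1 year later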
setdiff does indeed deduplicate on its own, so it isn't really suitable here; y = x[!x %in% stopwords] is all you need.
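To make the difference concrete, here is a small base R comparison. The token vector `x` and the `stopwords` list are made-up examples, not the data from the post.

```r
# Hypothetical tokens and stopword list, for illustration only
x <- c("the", "topic", "model", "the", "topic", "of", "weibo")
stopwords <- c("the", "of")

# setdiff() treats x as a set, so repeated tokens collapse to one
y_setdiff <- setdiff(x, stopwords)

# Logical indexing drops stopwords but keeps every remaining occurrence
y_keep <- x[!x %in% stopwords]

print(y_setdiff)
print(y_keep)

# Term counts are only meaningful on the version that keeps duplicates
print(table(y_keep))
```

With setdiff, "topic" shows up once; with the `%in%` version it shows up twice, which is what a term-frequency count needs.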