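Could someone take a look at the code below? I want a document-term matrix that includes simple words such as "我", "你", and "他", so why do only terms of three or more characters show up?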
> reuters <- tm_map(ovid, stripWhitespace)
> for (i in 1:dt$Length) {
+ re[[i]] <- mmseg4j(PlainTextDocument(reuters)[[i]], zj)
+ }
> reuters <- Corpus(VectorSource(re))
> meta(reuters[[2]])
Available meta data pairs are:
Author :
DateTimeStamp: 2012-07-21 03:03:37
Description :
Heading :
ID : 2
Language : en
Origin :
> dtm <- DocumentTermMatrix(reuters)
> inspect(dtm[1:5, 1:5])
A document-term matrix (5 documents, 5 terms)
Non-/sparse entries: 5/20
Sparsity : 80%
Maximal term length: 3
Weighting : term frequency (tf)
    Terms
Docs 不承认 打电话 没什么 一小时 重要的
   1      0      0      1      0      1
   2      0      1      0      0      0
   3      0      0      0      0      0
   4      0      0      0      1      0
   5      1      0      0      0      0
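
A possible explanation, offered as a guess rather than a confirmed diagnosis: in current tm releases, the `wordLengths` control option of `termFreq()` defaults to `c(3, Inf)`, so `DocumentTermMatrix()` silently drops every term shorter than three characters. That would match the "Maximal term length: 3" line and the absence of one- and two-character words. The sketch below uses a small made-up vector `segmented` in place of the mmseg4j output (the sample sentences are hypothetical, not the original data) and assumes the segmenter returns words separated by spaces. Older tm versions may name this option differently, so check `?termFreq` for the installed version.

```r
library(tm)

# Hypothetical stand-in for the segmented text: one string per document,
# words separated by spaces, as a segmenter such as mmseg4j would produce.
segmented <- c("我 给 你 打电话", "他 不承认", "你 说 没什么")
corpus <- VCorpus(VectorSource(segmented))

# Default control: terms shorter than 3 characters are discarded.
dtm_default <- DocumentTermMatrix(corpus)

# Lower the minimum term length to 1 so single-character words are kept.
dtm_all <- DocumentTermMatrix(corpus,
                              control = list(wordLengths = c(1, Inf)))

# Compare the vocabularies of the two matrices.
Terms(dtm_default)
Terms(dtm_all)
```

If the second call lists "我", "你", and "他" while the first does not, the same `control` argument should work on the real corpus: `DocumentTermMatrix(reuters, control = list(wordLengths = c(1, Inf)))`.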