Hurt by R again

dclong · October 18, 2011

PDF:

http://dl.dropbox.com/u/15284431/bug.pdf

A side complaint: the COS forum's file-size limit is a bit too strict. It is far too easy to go over it.

autoban · October 18, 2011

Not necessarily a bug... It's perfectly possible and sensible for a factor to have more levels than distinct observed values, e.g.,

factor(1:3, 1:4)
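To make the one-liner above concrete, here is a minimal sketch of what it produces. The only assumption is the variable name `f`, which is not in the original post. The second positional argument of `factor()` is `levels`, so it is spelled out here.

```r
# factor(1:3, 1:4) passes 1:4 positionally as the `levels` argument
f <- factor(1:3, levels = 1:4)

levels(f)          # the declared levels, including the unobserved "4"
nlevels(f)         # number of declared levels: 4
length(unique(f))  # number of distinct observed values: 3
table(f)           # the unobserved level appears with a count of zero
```

The declared levels and the observed values are two separate things, which is why the two counts can disagree.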

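dclong · October 18, 2011

Replying to remember, discover, invent (post #3):

But this means that after reading in a data frame, the levels function will not necessarily show the correct levels of one of its (factor) columns. In that case I don't see what use levels is. It could even be misleading.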

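autoban · October 18, 2011

1. If the data.frame was read in by read.table or a similar function, this should not happen unless it was subset somewhere along the way.

2. levels reports how many categories there should be in theory, but not every realization will contain an observation from every category. Think of multinomial data from a small sample: missing cells are quite likely. I don't see anything wrong with that.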
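The subsetting case in point 1 can be sketched as follows. The data frame `d` and its columns `g` and `y` are made up for illustration and have nothing to do with the poster's data.

```r
# Hypothetical toy data: a factor column with three observed levels
d <- data.frame(g = factor(c("a", "b", "c")), y = c(1.5, 2.5, 3.5))

# Subsetting rows keeps the full set of levels on the factor column
sub <- d[d$g != "c", ]
before <- nlevels(sub$g)   # still 3, although only "a" and "b" remain
length(unique(sub$g))      # 2

# droplevels() removes the levels that no longer occur
sub$g <- droplevels(sub$g)
nlevels(sub$g)             # 2
```

Whether to drop unused levels depends on the analysis: keeping them preserves the zero-count cells mentioned in point 2.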

dclong · October 19, 2011

Replying to remember, discover, invent (post #5):

And yet it did happen... The data were read in with read.table and never modified...

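yihui · October 19, 2011

Replying to dclong (post #2): Frankly, I now very much regret adding the file upload feature. A citizen of the 21st century must be able to find a file-hosting site within 21 seconds.

Replying to dclong (post #6): I find that hard to believe. Could you give a reproducible example?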

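dclong · October 19, 2011

Replying to Yihui Xie (post #7):

Uploading is easy: I just drop the file into my Dropbox and it's done. But I can't promise I will never delete the file.

Come by my office some day and I'll show you.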

dclong · October 19, 2011

Replying to Yihui Xie (post #7):

The problem can be reproduced on one of my Windows machines, on a Windows Server, and on a Linux server.

leephil · October 19, 2011

A question: how can the AHP algorithm be implemented in R?

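yihui · October 20, 2011

Replying to dclong (post #8): Since you are only using the file as an example, you can simply delete it once the problem is solved. There is no need to answer to the whole world on this.

Replying to leephil (post #10): Please don't hijack the thread, thanks.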

wnfd · October 20, 2011

I can't see the PDF at all, so I can't take part...

autoban · October 20, 2011

Replying to wnfd (post #12): It is still readable here, although I would prefer that it were not, since the file exposes some private information unnecessarily... I can easily figure out which lab this data set came from and who was supposed to do the analysis...

Replying to dclong (post #6): Not sure what's going on. If the missing one is actually in the data, then the problem might lie with unique() or as.character()?

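dclong · October 20, 2011

Replying to wnfd (post #12):

That is probably because Dropbox is not accessible from within China. I put the file in the Public folder of my Dropbox. In short, I used read.table to read a batch of data into a variable called corn. corn is a large data frame with a (factor) column named Ped09.

length(levels(corn$Ped09)) reports 221 levels, but length(unique(corn$Ped09)) reports only 220 distinct values. I did not modify the data in any way after reading it in.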

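nan.xiao · October 20, 2011

In principle, for a factor, nlevels(x) <==> length(levels(x)) is the standard and reliable way to count levels.

If you specifically need to handle the values as strings, you have to be careful and use things like as.is = TRUE and make.unique().

Why doesn't the original poster just run setdiff to see which element is the extra one? Then you will understand.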
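A minimal sketch of the setdiff check suggested above. The factor `ped` is a hypothetical stand-in for corn$Ped09, and the small table passed to read.table is invented for illustration; none of it comes from the poster's data.

```r
# Hypothetical stand-in for corn$Ped09: level "p3" is declared but never observed
ped <- factor(c("p1", "p2", "p2"), levels = c("p1", "p2", "p3"))

# Levels that do not occur among the observed values
unused <- setdiff(levels(ped), unique(as.character(ped)))
unused

# The same information from the frequency table: levels with a zero count
zero_count <- names(which(table(ped) == 0))

# Reading with as.is = TRUE keeps the column as character, so there are no
# factor levels that could drift away from the values
raw <- read.table(text = "id ped\n1 p1\n2 p2\n3 p2", header = TRUE, as.is = TRUE)
class(raw$ped)
```

Converting with as.character() before calling unique() keeps both arguments of setdiff as plain strings, so the comparison does not depend on how factors are handled.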

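dclong · October 20, 2011

Replying to remember, discover, invent (post #13):

I think your idea is probably right.

To be honest, though, I am now confused myself. I checked the raw data and found that for the factor level causing the problem, the data are indeed missing. But that still does not resolve my confusion: although some data are missing, all the factor levels are present, and my code only examines that factor, not the data.

Also, when Yihui questioned whether this was reproducible, I had my own doubts, so I deliberately tested it on one of my laptops and on two other servers. I copied the raw data file over, read in the data, and got different results from the two commands I mentioned. (It should not have been an illusion: the files are still on the server, and I replied to Yihui's post only after checking. The chance of misreading three times in a row is surely small.) Yet when I logged in to the server again today to check, the two commands gave the same result!

This eerie episode reminds me of an elusive bug I once mentioned on the forum. I call it elusive because I had run into it before but could never reproduce it afterwards, so nobody believed me. It seems that the right thing to do when I hit a bug is to record the whole process while it can still be reproduced.

Has anyone run into a problem similar to mine?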

nan.xiao · October 20, 2011

The creepiness has risen to a whole new level...

autoban · October 20, 2011

Replying to dclong (post #16):

(1) One reason for lack of repeatability could be that some .RData was loaded. You might want to try in a clean folder to see if this happens again.

(2) If the factor levels were generated by read.table(), there won't be an extra level. Instead one will be missing, because NA is not counted as a level, i.e., nlevels should report a smaller number than length(unique(as.character())). This is not what you have shown in the pdf. See the example below:

> (tmp=read.table(textConnection("a 1
+ NA 2
+ b 3")))
    V1 V2
1    a  1
2 <NA>  2
3    b  3
> nlevels(tmp[,1])
[1] 2
> unique(as.character(tmp[,1]))
[1] "a" NA  "b"

(3) So, my guess is that the data frame obtained the factor levels from somewhere else, instead of from read.table(). For example,

> (tmp=read.table(textConnection("a 1
+ NA 2
+ b 3")))
    V1 V2
1    a  1
2 <NA>  2
3    b  3
> levels(tmp[,1])=head(letters)
> nlevels(tmp[,1])
[1] 6
> unique(as.character(tmp[,1]))
[1] "a" NA  "b"
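Point (1) above can be checked with a few lines before re-reading the data. This is only a sketch: the object name `corn` is taken from the thread, while `has_rdata` and `stale` are names introduced here for illustration.

```r
# Is there a saved workspace in the current directory that R may have
# restored at startup?
has_rdata <- file.exists(".RData")
has_rdata

# Objects currently in the global environment; anything unexpected here may
# be left over from an earlier session
ls(envir = globalenv())

# Does an object named `corn` already exist before the file is read?
stale <- intersect("corn", ls(envir = globalenv()))
stale
```

Starting R with `R --vanilla` from a clean folder avoids restoring a saved workspace in the first place.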

dclong · October 21, 2011

Replying to nan.xiao (post #17):

Like a horror novel? Haha~

dclong · October 21, 2011

Replying to remember, discover, invent (post #18):

I really admire your expertise and persistence, which live up to your nickname.

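nan.xiao · October 21, 2011

Replying to dclong (post #19): Yes, exactly.

"Yet when I logged in to the server again today to check, the two commands gave the same result!"

That one sentence releases all the dramatic tension and foreshadowing built up earlier in a single stroke. Truly the finishing touch...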