A question about matrix computation; expert advice appreciated

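camelbbs · September 7, 2010

I'd like help with an algorithm. I asked about it before, but I still haven't figured it out.

Suppose I have a matrix like this: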

sox 7.2 3.8 6.8 9.2 5.6
sox 5.4 2.3 4.6 8.9 9.0
sox 6.7 NA 7.8 9.0 3.1
goo 2.4 6.7 NA 9.0 2.1
goo 2.1 5.6 7.8 9.7 1.2
pkk 2.5 4.3 6.5 4.9 0.2
pkk 2.1 3.4 3.2 NA 4.6
pkk 3.2 5.6 6.7 9.1 2.2
...

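The matrix is simple: some rows share the same name. I want to merge the rows with the same name column by column, taking the mean of each column as the merged value.

For example, for the rows named sox, the first column holds 7.2, 5.4, 6.7, so the mean of the first column is (7.2+5.4+6.7)/3 ≈ 6.4,

the second column holds 3.8, 2.3, NA, so its mean is (3.8+2.3)/2 = 3.05 ≈ 3 (the NA is skipped), and so on. The mean of each column becomes the final value, so the rows named sox are merged into:

sox 6.4 3 ....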
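The arithmetic above can be checked directly in R. This is only a small sketch; the names `sox_col1` and `sox_col2` are made up for illustration and do not come from the original post.

```r
# Values of the first two columns for the rows named "sox"
sox_col1 <- c(7.2, 5.4, 6.7)
sox_col2 <- c(3.8, 2.3, NA)

m1 <- mean(sox_col1)                # (7.2 + 5.4 + 6.7) / 3
m2 <- mean(sox_col2, na.rm = TRUE)  # (3.8 + 2.3) / 2, the NA is dropped

print(round(c(m1, m2), 2))
```

Without `na.rm = TRUE`, `mean(sox_col2)` would return `NA`, which is why every solution in this thread passes that argument.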

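That is the program I need to write. The idea is simple, but my skills are limited and the program I wrote is too slow.

How should this be written in R? Any guidance would be much appreciated. Thank you!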

easttiger · September 7, 2010

myfun <-
function(filename, dlm = "", what = character(0)){
   # determine ncol from the first line of the file
   d <- length(scan(filename, what=what, nlines = 1, sep=dlm))

   # scan the whole file into R as a character matrix
   M <- matrix(scan(filename, what=what, sep=dlm), ncol=d, byrow=TRUE)

   # first column: group names; remaining columns: numbers (NA entries stay NA)
   firstCol <- M[,1]
   laterCol <- matrix(as.numeric(M[,-1]), ncol=d-1, byrow=FALSE)

   # average each column within each group, ignoring NA
   aggregate(laterCol, list(group=firstCol), function(x) mean(x, na.rm=TRUE))
}

For example:

> myfun("c:/mydata.txt")

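easttiger · September 7, 2010

This assumes the data set is small. If it is large, it is better to first split the group names in the first column and the numeric values that follow into two separate files by hand; the program then needs a small change.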

camelbbs · September 7, 2010

Thanks very much! In fact, my data set is large: about 40,000 rows and 500 columns. How should I modify this function?

easttiger · September 8, 2010

Reply to camelbbs (post #4): If you can find a way to separate the first column of group names from the later columns of numbers, say with Excel or a script editor that has a "column editing mode", and save them in two files, mygroupname.txt and mydata.txt (both must contain the same number of rows), then the modified function looks like this:

myfun1 <-
function(groupName, dataFile, dlm = ""){
   # determine ncol from the first line of the data file
   d <- length(scan(dataFile, what=numeric(0), nlines = 1, sep=dlm))

   # scan the data file into R as a numeric matrix
   M <- matrix(scan(dataFile, what=numeric(0), sep=dlm), ncol=d, byrow=TRUE)

   # group names, one per row of M
   firstCol <- scan(groupName, what=character(0), sep=dlm)

   # average each column within each group, ignoring NA
   aggregate(M, list(group=firstCol), function(x) mean(x, na.rm=TRUE))
}

Usage:

> myfun1("c:/mygroupname.txt","c:/mydata.txt")

yanlinlin82 · September 8, 2010

Reply to camelbbs (post #1):

The key is still to make use of vectorized operations:

d <- read.table(textConnection(
'sox 7.2 3.8 6.8 9.2 5.6
sox 5.4 2.3 4.6 8.9 9.0
sox 6.7 NA 7.8 9.0 3.1
goo 2.4 6.7 NA 9.0 2.1
goo 2.1 5.6 7.8 9.7 1.2
pkk 2.5 4.3 6.5 4.9 0.2
pkk 2.1 3.4 3.2 NA 4.6
pkk 3.2 5.6 6.7 9.1 2.2'))

sapply(split(d[,-1], d[,1]), colMeans, na.rm = TRUE)
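For a data set of the size mentioned earlier in the thread (about 40,000 rows and 500 columns), another vectorized option is base R's `rowsum()`, which sums rows by group in compiled code. The sketch below is not from any poster; `group_means` is a hypothetical helper name. It computes group means as (sum of non-missing values) / (count of non-missing values).

```r
# Hypothetical helper: per-group column means that skip NA values
group_means <- function(x, group) {
  present <- !is.na(x)                  # TRUE where a value exists
  x0 <- x
  x0[!present] <- 0                     # NAs contribute 0 to the sums
  sums   <- rowsum(x0, group)           # per-group column sums
  counts <- rowsum(present * 1, group)  # per-group counts of non-NA values
  sums / counts                         # NaN if a group has no values in a column
}

x <- matrix(c(7.2, 3.8, 6.8, 9.2, 5.6,
              5.4, 2.3, 4.6, 8.9, 9.0,
              6.7, NA,  7.8, 9.0, 3.1,
              2.4, 6.7, NA,  9.0, 2.1,
              2.1, 5.6, 7.8, 9.7, 1.2,
              2.5, 4.3, 6.5, 4.9, 0.2,
              2.1, 3.4, 3.2, NA,  4.6,
              3.2, 5.6, 6.7, 9.1, 2.2), ncol = 5, byrow = TRUE)
g <- c("sox", "sox", "sox", "goo", "goo", "pkk", "pkk", "pkk")

res <- group_means(x, g)
print(res)
```

The result has one row per group (sorted by group name) and one column per data column, so no transpose is needed. Whether it is actually faster than `aggregate` on a given data set would have to be measured.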

camelbbs · September 8, 2010

The moderator is amazing. Thanks so much!

I took a look, and the key is the aggregate function. I didn't know it could be used this way; very clever.

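camelbbs · September 8, 2010

aggregate(M, list(group=firstCol), function(x)mean(x,na.rm=T));

This one line does a lot, and I don't fully understand it yet. As I read it, list() first reduces the row names to the distinct groups, then the data in matrix M are averaged column by column within each group, and the function given at the end is applied automatically to every column. Is that right?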
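One way to check that reading is to reproduce the `aggregate` result by hand. The sketch below uses a small made-up subset of the data (`M`, `firstCol`) and compares `aggregate` with an explicit "for each column, average within each group" written with `apply` and `tapply`. It illustrates the behavior and is not a description of how `aggregate` is implemented internally.

```r
# Small example: two numeric columns, five rows, two groups
M <- matrix(c(7.2, 3.8,
              5.4, 2.3,
              6.7, NA,
              2.4, 6.7,
              2.1, 5.6), ncol = 2, byrow = TRUE)
firstCol <- c("sox", "sox", "sox", "goo", "goo")

# The call from the thread: one output row per distinct group name
agg <- aggregate(M, list(group = firstCol), function(x) mean(x, na.rm = TRUE))

# The same thing spelled out: for each column, take the mean within each group
byHand <- apply(M, 2, function(col) tapply(col, firstCol, mean, na.rm = TRUE))

print(agg)
print(byHand)
```

Both results list the groups in sorted order (goo, then sox). In `aggregate`, the supplied function receives the values of one column for one group at a time.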

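easttiger · September 8, 2010

It is mainly about memory efficiency. The first column is character and the rest are numeric, so reading everything in one pass without distinguishing the types produces one large character matrix. Extracting the numeric part of that matrix then creates a second large matrix, so the memory cost is doubled. It is therefore better to read the first column and the remaining columns separately from the start, so that the matrix is not duplicated in memory.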
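An alternative to splitting the file by hand, not suggested by anyone in the thread, is to let `read.table` read each column with its own type through the `colClasses` argument, so the numeric columns should not need to pass through a large character matrix. A minimal sketch, using a temporary file as a stand-in for the real data file:

```r
# Write the sample data to a temporary file (stand-in for the real file)
path <- tempfile(fileext = ".txt")
writeLines(c("sox 7.2 3.8 6.8 9.2 5.6",
             "sox 5.4 2.3 4.6 8.9 9.0",
             "sox 6.7 NA 7.8 9.0 3.1",
             "goo 2.4 6.7 NA 9.0 2.1",
             "goo 2.1 5.6 7.8 9.7 1.2",
             "pkk 2.5 4.3 6.5 4.9 0.2",
             "pkk 2.1 3.4 3.2 NA 4.6",
             "pkk 3.2 5.6 6.7 9.1 2.2"), path)

# Count the fields on the first line to learn the number of columns
n <- length(scan(path, what = "", nlines = 1, quiet = TRUE))

# First column as character, all remaining columns as numeric
d <- read.table(path, colClasses = c("character", rep("numeric", n - 1)))

# Group means per column, ignoring NA
res <- aggregate(d[-1], list(group = d[[1]]), mean, na.rm = TRUE)
print(res)
```

Here `d[-1]` keeps the numeric columns as a data frame, and `na.rm = TRUE` is passed through `aggregate` to `mean`.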

camelbbs · September 10, 2010

Reply to yanlinlin82 (post #6):

Thanks, that is a good approach too, but the resulting matrix still needs to be transposed with t(), right?
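A quick check of the shapes supports this: `sapply` puts one group in each column, so `t()` is needed to get one row per group as in the original layout. The sketch below reuses the sample data; the variable names `res` and `tres` are illustrative only.

```r
d <- read.table(text = "sox 7.2 3.8 6.8 9.2 5.6
sox 5.4 2.3 4.6 8.9 9.0
sox 6.7 NA 7.8 9.0 3.1
goo 2.4 6.7 NA 9.0 2.1
goo 2.1 5.6 7.8 9.7 1.2
pkk 2.5 4.3 6.5 4.9 0.2
pkk 2.1 3.4 3.2 NA 4.6
pkk 3.2 5.6 6.7 9.1 2.2")

# One column per group, one row per data column (V2 ... V6)
res <- sapply(split(d[, -1], d[, 1]), colMeans, na.rm = TRUE)

# Transpose to get one row per group
tres <- t(res)

print(dim(res))
print(tres)
```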

wnfd · February 14, 2012

The method in post #6 is simple. I ran into a related problem today.

name A A B C C C
SF1 680.11 904.46 843.50 1288.24 1162.20 843.82
SRSF11 1075.75 872.80 758.74 936.98 763.94 818.37
SRSF11 1075.75 872.80 758.74 936.98 763.94 818.37
SRSF12 715.56 696.64 695.58 687.81 759.67 715.00
SRSF12 28.50 16.13 18.50 17.38 22.50 21.00
FOX1 134.48 134.86 150.68 114.23 111.77 157.93
FOX1 138.58 135.08 150.40 115.02 114.02 155.38
PRPF3 503.24 213.47 169.97 179.71 255.57 216.32
FOX2 91.55 193.67 214.80 271.79 317.23 249.17
FOX2 92.17 239.50 261.42 317.08 382.00 180.50

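How can I get a matrix whose column names are A, B, C and whose row names are SF1, SRSF11, SRSF12, FOX1, PRPF3, FOX2? (The value for SF1 in column A is the mean of all values in the data that belong to row name SF1 and column name A.)

One method I thought of is to apply the method from post #6 twice. Are there other ways?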
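One possible single-pass alternative, offered only as a sketch since no one answered in the thread: flatten the matrix and let `tapply` group every value by its row name and column name at once. The variable names (`txt`, `hdr`, `genes`, `groups`) are illustrative. The header is read separately because the column names repeat.

```r
txt <- "name A A B C C C
SF1 680.11 904.46 843.50 1288.24 1162.20 843.82
SRSF11 1075.75 872.80 758.74 936.98 763.94 818.37
SRSF11 1075.75 872.80 758.74 936.98 763.94 818.37
SRSF12 715.56 696.64 695.58 687.81 759.67 715.00
SRSF12 28.50 16.13 18.50 17.38 22.50 21.00
FOX1 134.48 134.86 150.68 114.23 111.77 157.93
FOX1 138.58 135.08 150.40 115.02 114.02 155.38
PRPF3 503.24 213.47 169.97 179.71 255.57 216.32
FOX2 91.55 193.67 214.80 271.79 317.23 249.17
FOX2 92.17 239.50 261.42 317.08 382.00 180.50"

# Header and body are read separately because column names are duplicated
hdr <- scan(text = txt, what = character(0), nlines = 1, quiet = TRUE)
d   <- read.table(text = txt, skip = 1, stringsAsFactors = FALSE)

genes  <- d[[1]]             # row names, with repeats
groups <- hdr[-1]            # column names, with repeats
m      <- as.matrix(d[, -1]) # numeric values only

# Label every cell with its row name and column name, then average per pair
res <- tapply(as.vector(m),
              list(gene = genes[row(m)], group = groups[col(m)]),
              mean, na.rm = TRUE)

# tapply sorts the labels; restore the order of first appearance
res <- res[unique(genes), unique(groups)]
print(res)
```

Each cell of `res` is the mean over all values sharing that row name and column name. Applying the post #6 method twice gives a mean of means instead, which should agree with this when there are no missing values but can differ when some values are `NA`.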