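Only 1% of genes are expressed under both experimental conditions: how should the correlation between the two conditions be assessed?
I need to assess the similarity between two conditions, A and B. Gene expression is quantified separately under each condition, and a gene that is not expressed under a condition is assigned a value of 0.
I have run into a special case: 99% of the genes are expressed under only one of the two conditions. Does this mean the two conditions are poorly correlated?
However, the Pearson correlation coefficient is 0.9 for the 1% of genes expressed under both conditions and 0.85 for all genes. How should this be interpreted?
A simulated data set is shown below. 1000 genes are expressed under both conditions; correlation (Pearson, R = 0.9):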
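set.seed(20180328)
n <- 1000     # number of genes expressed under both conditions
corr <- 0.9   # target Pearson correlation between x and y
x <- rnorm(n, 20, 2)
z <- rnorm(n, 20, 2)   # independent noise with the same distribution as x
# weight on x chosen so that cor(x, a * x + z) is corr in expectation
a <- corr / sqrt(1 - corr^2)
y <- (a * x + z) / 2 - 10
df1 <- data.frame(x = x, y = y)
c1 <- signif(cor(df1$x, df1$y, method = "spearman"), 2)
c2 <- signif(cor(df1$x, df1$y, method = "pearson"), 2)
plot(df1$x, df1$y, main = paste("spearman=", c1, "pearson=", c2, sep = " "), pch = 16)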
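The way y is built from x and z can be checked against the target correlation. The following is a minimal sketch, assuming x and z are independent with equal variance, as in the simulation; the names `theoretical`, `empirical` and `n_check`, the seed and the sample size are introduced here only for illustration.

```r
# Check that y = (a * x + z) / 2 - 10 has the intended Pearson correlation with x.
# Assumption: x and z are independent and have the same variance.
corr <- 0.9
a <- corr / sqrt(1 - corr^2)

# For independent x and z with equal variance, cor(x, a * x + z) = a / sqrt(a^2 + 1).
# Dividing by 2 and subtracting 10 is a positive linear rescaling, so it leaves
# the Pearson correlation unchanged.
theoretical <- a / sqrt(a^2 + 1)

# Empirical check on a larger sample than in the post (seed and size are arbitrary).
set.seed(1)
n_check <- 100000
x <- rnorm(n_check, 20, 2)
z <- rnorm(n_check, 20, 2)
y <- (a * x + z) / 2 - 10
empirical <- cor(x, y, method = "pearson")

print(c(theoretical = theoretical, empirical = empirical))
```

Substituting a = corr / sqrt(1 - corr^2) into a / sqrt(a^2 + 1) gives back corr, which is why the co-expressed genes alone show a Pearson coefficient close to 0.9.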
Now add 100000 records for genes expressed under only one of the two conditions (50000 per condition); correlation (Pearson, R = 0.85):
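## add genes expressed under only one condition (0 vs N)
m <- 50000   # number of single-condition genes added per condition
z1 <- rnorm(m, 1, 0.5)
z2 <- rnorm(m, 1, 0.5)
da1 <- data.frame(x = z1, y = 0)   # expressed in x only
da2 <- data.frame(x = 0, y = z2)   # expressed in y only
df2 <- rbind(df1, da1, da2)
c3 <- signif(cor(df2$x, df2$y, method = "spearman"), 2)
c4 <- signif(cor(df2$x, df2$y, method = "pearson"), 2)
# plot the combined data (df2), which c3 and c4 were computed on
plot(df2$x, df2$y, main = paste("spearman=", c3, "pearson=", c4, sep = " "), pch = 16)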
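One possible way to look into why Pearson stays high is to measure how much of the cross-product sum (the numerator of the Pearson coefficient) comes from the few co-expressed genes. This is only a rough sketch: it rebuilds the simulated data with the same seed and parameters, and it assumes that "expressed" means a nonzero value. The names `both`, `frac_both`, `cp` and `share_both` are hypothetical helpers, not part of the original analysis.

```r
# Rebuild the simulated data from the post (same seed and parameters, no plotting).
set.seed(20180328)
n <- 1000
m <- 50000
corr <- 0.9
x <- rnorm(n, 20, 2)
z <- rnorm(n, 20, 2)
y <- (corr / sqrt(1 - corr^2) * x + z) / 2 - 10
all_x <- c(x, rnorm(m, 1, 0.5), rep(0, m))
all_y <- c(y, rep(0, m), rnorm(m, 1, 0.5))

# Assumption: a gene counts as expressed when its value is nonzero.
both <- all_x != 0 & all_y != 0
frac_both <- mean(both)   # fraction of genes expressed under both conditions

pearson_both <- cor(all_x[both], all_y[both])            # co-expressed genes only
pearson_all  <- cor(all_x, all_y)                        # all genes
spearman_all <- cor(all_x, all_y, method = "spearman")   # rank-based, all genes

# Centered cross-products: their sum is the numerator of the Pearson coefficient.
cp <- (all_x - mean(all_x)) * (all_y - mean(all_y))
share_both <- sum(cp[both]) / sum(cp)   # part of that sum due to co-expressed genes

print(round(c(frac_both = frac_both, pearson_both = pearson_both,
              pearson_all = pearson_all, spearman_all = spearman_all,
              share_both = share_both), 3))
```

Under these assumptions, the co-expressed genes lie far from the origin while the single-condition genes cluster near it, so the small co-expressed group would be expected to dominate the cross-product sum. Rank-based Spearman does not weight points by their distance from the mean, which is one reason the two coefficients can disagree on data like this.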