• R language
  • R: "cannot allocate vector of size ..."?

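I want to process some DNA data, and I downsample it by a factor of 1000 for analysis and plotting. Sequences of 5000K are handled without trouble, but once the length exceeds 6000K an error is raised. Is the only fix to downsample the data points even further, or is there another way?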

library(Biostrings)
library(BiocGenerics)
library(stats4)
library(S4Vectors)
library(IRanges)
library(XVector)
library(seqinr)
library(parallel)
library(ggplot2)
library(reshape2)
seq<-readDNAStringSet(file.choose())
seq1<-as.character(seq)
seq2<-getSequence(seq1)
a<-cumsum(seq2=="A")
g<-cumsum(seq2=="G")
t<-cumsum(seq2=="T")
c<-cumsum(seq2=="C")
x<-a+g-c-t
y<-a+c-g-t
#z<-a+t-g-c
# Process the data: downsample by a factor of 1000
len<-floor(length(seq2)/1000)*1000
tpa<-matrix(a[1:len],ncol = 1000,byrow = TRUE)
tpc<-matrix(c[1:len],ncol = 1000,byrow = TRUE)
tpt<-matrix(t[1:len],ncol = 1000,byrow = TRUE)
tpg<-matrix(g[1:len],ncol = 1000,byrow = TRUE)
# Data for plotting
aa<-rowSums(tpa)/1000
cc<-rowSums(tpc)/1000
tt<-rowSums(tpt)/1000
gg<-rowSums(tpg)/1000

xlab<-c(1:length(aa))
tpAT<-aa-tt
tpGC<-gg-cc
tpname<-getName.SeqFastadna(seq1)
plot_data<-data.frame(var1=tpGC,var2=tpAT,xlab=xlab)
plot_data_long<-melt(plot_data,id="xlab")
ggplot(data = plot_data_long,aes(x=xlab,y=value,color=variable))+
  geom_line()+
  theme_gray()+xlab("n(kb)")+
  ylab("Base disparity")+
  labs(color=NULL)+
  scale_color_discrete(labels=c("GC disparity","AT disparity"))+
  theme(legend.position = c(1,1),legend.justification = c(1,1))+
  theme(legend.background = element_blank())+
  theme(legend.key = element_blank())+
  ggtitle(tpname)
Error: cannot allocate vector of size 64.0 Mb

    Isabel, please post the output of head(plot_data_long) and dim(plot_data_long).

    Also, as far as I can tell, the plotting part is equivalent to:

    ggplot(data = plot_data_long,aes(x = xlab,y= value,color = variable))+
    	geom_line()+
    	scale_color_discrete(labels = c("GC disparity","AT disparity"))+
    	labs(x = "n(kb)", y = "Base disparity", title = tpname, color = NULL)+
    	theme_gray() + 
    	theme(legend.position = c(1,1), legend.justification = c(1,1),
    		  legend.background = element_blank(), legend.key = element_blank())

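    Also, stop using R reserved words (or built-in functions) such as seq, c, and t as variable names, and be careful about reusing them as function names. Please read <http://adv-r.had.co.nz/Style.html>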
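A minimal sketch of why the naming advice matters, using only base R. The object names here are illustrative and not taken from the original post. A data vector named t happens to be tolerated, because R skips non-function objects when it looks up a call, but a function stored under the same name silently replaces the built-in one.

```r
m <- matrix(1:6, nrow = 2)

# A data vector that happens to be called t, as in the original post
t <- c(3, 1, 2)
# The call still reaches base::t, because R ignores non-functions here
dims_with_vector <- dim(t(m))

# A function stored under the same name really does mask base::t
t <- function(x) "oops"
masked_result <- t(m)

# Remove the masking object so that base::t is visible again
rm(t)
```

Descriptive names such as cum_a or cum_t (hypothetical) avoid both the confusion and the masking risk.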

    Finally, please read the newcomer guidelines and edit your post.

    > head(plot_data_long)
      xlab variable   value
    1    1     var1  32.225
    2    2     var1  76.389
    3    3     var1 115.496
    4    4     var1 165.891
    5    5     var1 231.758
    6    6     var1 259.611
    > dim(plot_data_long)
    [1] 9966    3

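    OK, thank you for the suggestions. I will keep all of this in mind from now on.
    I wanted to put the two series in one plot, and I did not know any other way to do that with ggplot2, so I merged them into a single data frame in order to plot by variable.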

      Isabel, using a separate data frame for each layer is fine too; every layer accepts its own data argument.
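A small sketch of the per-layer data idea, assuming ggplot2 is installed. The two data frames and their values are made up for illustration; they only stand in for the GC and AT disparity series.

```r
library(ggplot2)

# Toy stand-ins for the two series (values are invented for illustration)
gc_df <- data.frame(kb = 1:5, value = c(1, 3, 2, 5, 4))
at_df <- data.frame(kb = 1:5, value = c(2, 2, 4, 3, 6))

# No data at the top level: each geom_line() layer brings its own data frame
p <- ggplot(mapping = aes(x = kb, y = value)) +
  geom_line(data = gc_df, aes(color = "GC disparity")) +
  geom_line(data = at_df, aes(color = "AT disparity")) +
  labs(x = "n(kb)", y = "Base disparity", color = NULL)
```

Printing p draws both lines in one panel, so no melt() step is needed for this layout.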

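      Yes, that works. I changed my code to follow yours. Thank you, I learned something. But even this short piece of code keeps crashing my RStudio, and I still do not know why. It feels strange; is it related to my computer's memory?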

      sessionInfo()
      R version 3.4.2 (2017-09-28)
      Platform: i386-w64-mingw32/i386 (32-bit)
      Running under: Windows >= 8 (build 9200)
      Matrix products: default
      locale:
      [1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936
      [3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C
      [5] LC_TIME=Chinese (Simplified)_China.936
      
      attached base packages:
      [1] stats graphics grDevices utils datasets methods base 
      
      loaded via a namespace (and not attached):
      [1] compiler_3.4.2 tools_3.4.2 

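        Isabel, with only 10,000 rows of data, memory is definitely not the problem; even ten million rows would be fine. You should give everyone the sessionInfo() from when the code above was run, not from a freshly opened R session where nothing has been done.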
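A rough way to check the size claim above, using only base R. The toy data frame mimics the reported shape of plot_data_long (9966 rows, 3 columns); its contents are placeholders, not the real values.

```r
n_rows <- 9966L

# Same shape as the reported plot_data_long, filled with placeholder values
toy_long <- data.frame(
  xlab = rep(seq_len(n_rows / 2), times = 2),
  variable = factor(rep(c("var1", "var2"), each = n_rows / 2)),
  value = numeric(n_rows)
)
plot_bytes <- as.numeric(object.size(toy_long))

# For comparison: one double vector with six million elements (8 bytes each)
big_bytes <- as.numeric(object.size(numeric(6e6)))

format(object.size(toy_long), units = "Kb")
format(object.size(numeric(6e6)), units = "Mb")
```

The plotting data frame is well under 1 MB, while every full-length numeric vector built from a 6000K sequence costs tens of megabytes, so the pressure comes from the steps before the plot.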

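        Also, the code I gave is a drop-in replacement for your ggplot plotting code. You do not need to modify it; just paste it in and run it. To be honest, I would not dare run your plotting code in my own R session. Here is an example using R's built-in iris data: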

        library(ggplot2)
        tpname <- "iris data"
        ggplot(iris,aes(x = seq(150) ,y = Sepal.Length,color = Species)) +
        	geom_line() +
         	scale_color_discrete(labels = c("GC disparity","AT disparity","XXX"))+
        	labs(x = "n(kb)", y = "Base disparity", title = tpname, color = NULL)+
        	theme_gray() + 
        	theme(legend.position = c(1,1), legend.justification = c(1,1),
        		  legend.background = element_blank(), legend.key = element_blank())

        The resulting plot:
        [Figure: iris]

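        Thank you very much for your patient guidance. Then the problem must be in the code I wrote. I have been learning for a while, but I am still at the stage where I cannot write code on my own. How did you learn? After running my program, the session information looks like this: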

        sessionInfo()
        R version 3.4.2 (2017-09-28)
        Platform: i386-w64-mingw32/i386 (32-bit)
        Running under: Windows >= 8 (build 9200)
        
        Matrix products: default
        
        locale:
        [1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
        [3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
        [5] LC_TIME=Chinese (Simplified)_China.936    
        
        attached base packages:
        [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
        
        other attached packages:
        [1] stringr_1.2.0       reshape2_1.4.2      ggplot2_2.2.1       seqinr_3.4-5        Biostrings_2.44.2  
        [6] XVector_0.16.0      IRanges_2.10.5      S4Vectors_0.14.7    BiocGenerics_0.22.1
        
        loaded via a namespace (and not attached):
         [1] Rcpp_0.12.13     magrittr_1.5     zlibbioc_1.22.0  munsell_0.4.3    colorspace_1.3-2 rlang_0.1.4     
         [7] plyr_1.8.4       tools_3.4.2      grid_3.4.2       gtable_0.2.0     digest_0.6.12    ade4_1.7-8      
        [13] lazyeval_0.2.1   tibble_1.3.4     labeling_0.3     stringi_1.1.5    compiler_3.4.2   scales_0.5.0 
        memory.size(T)
        [1] 651.38
        memory.size(F)
        [1] 470.38
        memory.limit()
        [1] 2047

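          Isabel 1. The best book for learning ggplot2 is naturally ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham. The book's source files are on GitHub at <https://github.com/hadley/ggplot2-book>
          I compiled a PDF version by following its instructions; the build process is described at <https://cloud2016.github.io/post/how-to-compile-ggplot2-book/>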

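          2. Is your Win8 system 32-bit? If it is 64-bit, I recommend using 64-bit R. 32-bit R can use less of the system's memory, which you will notice when working with large files. With R freshly opened and nothing run, my memory usage looks like this: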
          > sessionInfo()
          R version 3.4.2 (2017-09-28)
          Platform: x86_64-w64-mingw32/x64 (64-bit)
          Running under: Windows 8.1 x64 (build 9600)
          
          Matrix products: default
          
          locale:
          [1] LC_COLLATE=Chinese (Simplified)_China.936 
          [2] LC_CTYPE=Chinese (Simplified)_China.936   
          [3] LC_MONETARY=Chinese (Simplified)_China.936
          [4] LC_NUMERIC=C                              
          [5] LC_TIME=Chinese (Simplified)_China.936    
          
          attached base packages:
          [1] stats     graphics  grDevices utils     datasets  methods   base     
          
          loaded via a namespace (and not attached):
          [1] compiler_3.4.2
          > memory.size(T)
          [1] 27.62
          > memory.size(F)
          [1] 25.74
          > memory.limit()
          [1] 8104

          In fact, the ?memory.limit help page states this quite clearly:

          Command-line flag --max-mem-size sets the maximum value of obtainable memory (including a very small amount of housekeeping overhead). This cannot exceed 3Gb on 32-bit Windows, and most versions are limited to 2Gb. The minimum is currently 32Mb.

          If 32-bit R is run on most 64-bit versions of Windows the maximum value of obtainable memory is just under 4Gb. For a 64-bit versions of R under 64-bit Windows the limit is currently 8Tb.

          3. Does the code I gave still raise an error when you run it? If so, post the error message.
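For the 32-bit versus 64-bit question in point 2, a quick sketch of how to check which build of R is running, using only base R. memory.limit() is Windows-specific, so this sketch relies on portable values instead.

```r
# Pointer size in bytes: 4 on a 32-bit build of R, 8 on a 64-bit build
pointer_bytes <- .Machine$sizeof.pointer
r_is_64bit <- pointer_bytes == 8L

# Architecture string of the running R, e.g. "i386" or "x86_64"
r_arch <- R.version$arch

cat("64-bit R:", r_is_64bit, "- arch:", r_arch, "\n")
```

In the session shown earlier in the thread, the Platform line i386-w64-mingw32/i386 (32-bit) carries the same information.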

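          My computer runs Win10. Is it 32-bit?
          The code you gave is fine and the plotting part works, but once I add my own code, I still get this error while the data is being read:
          Error: cannot allocate vector of size 64.0 Mb
          So my code must still have a problem; I will keep fixing it.
          I know who you are now, hehe. I have several books that you translated and I am studying them. Thank you ☺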

            Isabel 1. So that means the following line is where the problem is:

            seq <- readDNAStringSet(file.choose())

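            This is the first time I have seen file.choose() used to pick the file. Below is an example copied from ?readDNAStringSet; use it as a reference for how the file is read there.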

            ## Read a gzip-compressed FASTA file:
            filepath2 <- system.file("extdata", "someORF.fa.gz", package="Biostrings")
            fasta.seqlengths(filepath2, seqtype="DNA")
            YAL001C TFC3 SGDID:S0000001, Chr I from 152168-146596, reverse complement, Verified ORF 
                                                                                               5573 
                                YAL002W VPS8 SGDID:S0000002, Chr I from 142709-148533, Verified ORF 
                                                                                               5825 
                                YAL003W EFB1 SGDID:S0000003, Chr I from 141176-144162, Verified ORF 
                                                                                               2987 
            YAL005C SSA1 SGDID:S0000004, Chr I from 142433-138505, reverse complement, Verified ORF 
                                                                                               3929 
            YAL007C ERP2 SGDID:S0000005, Chr I from 139347-136700, reverse complement, Verified ORF 
                                                                                               2648 
                               YAL008W FUN14 SGDID:S0000006, Chr I from 135916-138512, Verified ORF 
                                                                                               2597 
                                YAL009W SPO7 SGDID:S0000007, Chr I from 134856-137635, Verified ORF 
                                                                                               2780 
            
            x2 <- readDNAStringSet(filepath2)
            x2
              A DNAStringSet instance of length 7
                width seq                                             names               
            [1]  5573 ACTTGTAAATATATCTTTTATT...TTATCGACCTTATTGTTGATAT YAL001C TFC3 SGDI...
            [2]  5825 TTCCAAGGCCGATGAATTCGAC...GTAAATTTTTTTCTATTCTCTT YAL002W VPS8 SGDI...
            [3]  2987 CTTCATGTCAGCCTGCACTTCT...GGTACTCATGTAGCTGCCTCAT YAL003W EFB1 SGDI...
            [4]  3929 CACTCATATCGGGGGTCTTACT...GTCCCGAAACACGAAAAAGTAC YAL005C SSA1 SGDI...
            [5]  2648 AGAGAAAGAGTTTCACTTCTTG...TATAATTTATGTGTGAACATAG YAL007C ERP2 SGDI...
            [6]  2597 GTGTCCGGGCCTCGCAGGCGTT...AGTTTTGGCAGAATGTACTTTT YAL008W FUN14 SGD...
            [7]  2780 CAAGATAATGTCAAAGTTAGTG...CTAAGGAAGAAAAAAAAATCAC YAL009W SPO7 SGDI...

            2. ? Then you must have me confused with someone else; I have not translated any books!!
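Since the error appears while the sequence is read and expanded rather than during plotting, one possible way to reduce memory pressure is sketched below in base R. It is not the method used in the thread. It assumes the sequence is available as a single string; with Biostrings that would be something like as.character(dna_set[[1]]), where dna_set is a hypothetical name for the object returned by readDNAStringSet(). The helper window_disparity is also hypothetical. It stores one byte per base in a raw vector and builds a single running disparity per pair of bases, instead of four full-length cumulative sums.

```r
# Hypothetical helper: mean cumulative disparity (plus base minus minus base)
# within consecutive windows of `window` bases.
window_disparity <- function(bases_raw, plus, minus, window = 1000L) {
  # Drop the incomplete last window, as the original code does
  n <- (length(bases_raw) %/% window) * window
  bases_raw <- bases_raw[seq_len(n)]
  # +1 where the base equals `plus`, -1 where it equals `minus`, 0 otherwise
  step <- (bases_raw == charToRaw(plus)) - (bases_raw == charToRaw(minus))
  running <- cumsum(step)
  # One row per window; rowMeans replaces rowSums(...) / window
  rowMeans(matrix(running, ncol = window, byrow = TRUE))
}

dna <- "AAGGTTCCAGAG"          # toy stand-in for a chromosome-sized string
bases_raw <- charToRaw(dna)    # one byte per base

at_disparity <- window_disparity(bases_raw, "A", "T", window = 4L)
gc_disparity <- window_disparity(bases_raw, "G", "C", window = 4L)
```

Because the mean of a difference equals the difference of the means, these values match aa - tt and gg - cc from the original code for the same window size. Whether this is enough on 32-bit R depends on the sequence length, so switching to 64-bit R remains the more robust fix.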