• R Language
  • Error in e(s, a) : java.lang.OutOfMemoryError: Java heap space

Large language models are developing at full speed these days, so I have recently started learning a bit about text processing.

library(StanfordCoreNLP)
library(StanfordCoreNLPjars)
library(NLP)
s <- as.String(paste("Stanford University is located in California.",
                    "It is a great university."))
s
## this works
p <- StanfordCoreNLP_Pipeline(annotators = c("pos", "lemma"))
annotate(s, p)
 id type     start end features
  1 sentence     1  45 constituents=<<integer,7>>
  2 word         1   8 word=Stanford, POS=NNP, lemma=Stanford
  3 word        10  19 word=University, POS=NNP, lemma=University
  4 word        21  22 word=is, POS=VBZ, lemma=be
  5 word        24  30 word=located, POS=VBN, lemma=locate
  6 word        32  33 word=in, POS=IN, lemma=in
  7 word        35  44 word=California, POS=NNP, lemma=California
  8 word        45  45 word=., POS=., lemma=.
  9 sentence    47  71 constituents=<<integer,6>>
 10 word        47  48 word=It, POS=PRP, lemma=it
 11 word        50  51 word=is, POS=VBZ, lemma=be
 12 word        53  53 word=a, POS=DT, lemma=a
 13 word        55  59 word=great, POS=JJ, lemma=great
 14 word        61  70 word=university, POS=NN, lemma=university
 15 word        71  71 word=., POS=., lemma=.
## this errors
p <- StanfordCoreNLP_Pipeline(annotators = c("pos", "lemma", "ner"))
annotate(s, p)

NER does not run; it throws an out-of-memory error:

Error in e(s, a) : java.lang.OutOfMemoryError: Java heap space
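
For what it's worth, a quick diagnostic sketch (my addition, not from the original post) to check how much heap the JVM actually received, assuming rJava is available:

library(rJava)
.jinit()
# maximum heap the running JVM may use, in MB
J("java.lang.Runtime")$getRuntime()$maxMemory() / 1024^2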

Download links for the two R packages used:
https://datacube.wu.ac.at/src/contrib/StanfordCoreNLP_0.1-11.tar.gz
English models (large download)
https://datacube.wu.ac.at/src/contrib/StanfordCoreNLPjars_4.5.5-1.tar.gz
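
The two packages can presumably also be installed straight from that repository (untested here; the repos URL is simply the one given above):

install.packages(
  c("StanfordCoreNLP", "StanfordCoreNLPjars"),
  repos = "https://datacube.wu.ac.at/",
  type = "source"
)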

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] NLP_0.2-1                   StanfordCoreNLPjars_4.5.5-1
[3] StanfordCoreNLP_0.1-11     

loaded via a namespace (and not attached):
[1] compiler_4.3.2 cli_3.6.2      xml2_1.3.6     rlang_1.1.3    rJava_1.0-11 

Finally

Does anyone have NLP / LLM course recommendations? Something accessible and hands-on.

    I tried setting the rJava parameter to start the JVM with more memory

    library(rJava)
    options(java.parameters = "-Xmx512m")
    # 2 GB heap
    options(java.parameters = "-Xmx2g")

    But enabling the NER annotator still fails with the out-of-memory error.
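
    One detail that may matter here (a sketch, not verified on this setup): java.parameters only takes effect if it is set before the JVM starts, so it has to come before any package initializes rJava in a fresh session, and the NER models may simply need more than 2 GB.

    # fresh R session: set the heap size BEFORE anything starts the JVM
    options(java.parameters = "-Xmx4g")  # 4 GB is a guess; adjust as needed
    library(rJava)
    library(NLP)
    library(StanfordCoreNLP)
    library(StanfordCoreNLPjars)
    p <- StanfordCoreNLP_Pipeline(annotators = c("pos", "lemma", "ner"))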

    I'd say just use spaCy, the current de facto standard; the other frameworks are hardly worth looking at... it has pretty much settled this field 😂

    Simplicity and speed are why I like spaCy. Because it arrived relatively late, its design leans heavily on pretrained embeddings, and it handles many traditional NLP tasks in one smooth pass. These days it can apparently also hook into various LLMs through spacy-llm.

      nan.xiao I only use a small set of the features.

      udpipe

      library(udpipe)
      # download the model and save it to the local udpipe/ directory
      udpipe::udpipe_download_model(language = "english", model_dir = "udpipe/")
      # use the model
      udpipe(
        x = c("apply", "applying", "applies", "applied", "data", "models"),
        object = udpipe_load_model(
          file = "udpipe/english-ewt-ud-2.5-191206.udpipe"
        )
      )
        doc_id paragraph_id sentence_id sentence start end term_id token_id    token
      1   doc1            1           1    apply     1   5       1        1    apply
      2   doc2            1           1 applying     1   8       1        1 applying
      3   doc3            1           1  applies     1   7       1        1  applies
      4   doc4            1           1  applied     1   7       1        1  applied
      5   doc5            1           1     data     1   4       1        1     data
      6   doc6            1           1   models     1   6       1        1   models
         lemma upos xpos                                                 feats
      1  apply VERB   VB                                          VerbForm=Inf
      2  apply VERB  VBG                                          VerbForm=Ger
      3  apply VERB  VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
      4  apply VERB  VBN                              Tense=Past|VerbForm=Part
      5   data NOUN   NN                                           Number=Sing
      6 models NOUN  NNS                                           Number=Plur
        head_token_id dep_rel deps            misc
      1             0    root <NA> SpacesAfter=\\n
      2             0    root <NA> SpacesAfter=\\n
      3             0    root <NA> SpacesAfter=\\n
      4             0    root <NA> SpacesAfter=\\n
      5             0    root <NA> SpacesAfter=\\n
      6             0    root <NA> SpacesAfter=\\n

      data comes out fine, but models is not reduced to model.

      spacyr

      library(spacyr)
      # download the model
      spacy_download_langmodel("en_core_web_sm")
      spacy_initialize(model = "en_core_web_sm")
      # verb lemmatization works fine
      spacy_parse(x = c("apply", "applying", "applies", "applied"))
        doc_id sentence_id token_id    token lemma  pos entity
      1  text1           1        1    apply apply VERB       
      2  text2           1        1 applying apply VERB       
      3  text3           1        1  applies apply VERB       
      4  text4           1        1  applied apply VERB   
      # nouns do not
      spacy_parse(x = c("data","models"))
        doc_id sentence_id token_id  token lemma  pos entity
      1  text1           1        1   data datum NOUN       
      2  text2           1        1 models model NOUN 

      Even data, a perfectly ordinary word, gets lemmatized oddly (to datum).

        Cloud2016 Just use the lemminflect extension; see StackOverflow:

        import spacy
        import lemminflect
        
        nlp = spacy.load("en_core_web_sm")
        doc = nlp("apply applying applies applied data models")
        for token in doc:
            print("%9s %9s %9s" % (token.text, token.lemma_, token._.lemma()))
        
        #     apply     apply     apply
        #  applying     apply     apply
        #   applies    applie     apply
        #   applied     apply     apply
        #      data      data      data
        #    models     model     model
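
        To stay within R, roughly the same thing should work through reticulate (a sketch, assuming spacy and lemminflect are installed in the Python environment reticulate picks up):

        library(reticulate)
        spacy <- import("spacy")
        lemminflect <- import("lemminflect")  # importing lemminflect registers the ._.lemma() extension
        builtins <- import_builtins()

        nlp <- spacy$load("en_core_web_sm")
        doc <- nlp("apply applying applies applied data models")
        for (token in builtins$list(doc)) {
          cat(sprintf("%9s %9s %9s\n",
                      token$text,                       # token text
                      token$lemma_,                     # spaCy's default lemma
                      py_get_attr(token, "_")$lemma())) # lemminflect's lemma
        }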