What is the difference between the two functions regexpr and regexec in R?

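I have recently been reading through base R's string-handling functions and came across grep/agrep/grepl/regexpr/gregexpr/regexec/gregexec + regmatches. Because these function names look so similar, I worked out the breakdown below.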

|Characters in the function name|Meaning|
| ---- | ---- |
|rep|regular expression (strictly, "re" = regular expression and "p" = print, from the ed command g/re/p)|
|g|global|
|a|approximate (fuzzy)|
|l|logical|
|reg|regular|
|expr|expression|
|exec|execute|

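My question: from experimenting, all I can tell is that regexpr and regexec return their results in different formats. I don't understand what other differences there are.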

my_vector <- c('>阿木<;>曼妮<', 'amu', 'amu<', '>.<')

regexpr(pattern = '[>.<]', my_vector)
## [1]  1 -1  4  1
## attr(,"match.length")
## [1]  1 -1  1  1

gregexpr(pattern = '[>.<]', my_vector)
## [[1]]
## [1] 1 4 6 9
## attr(,"match.length")
## [1] 1 1 1 1
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## 
## [[3]]
## [1] 4
## attr(,"match.length")
## [1] 1
## 
## [[4]]
## [1] 1 2 3
## attr(,"match.length")
## [1] 1 1 1

regexec(pattern = '[>.<]', my_vector)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 1
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## 
## [[3]]
## [1] 4
## attr(,"match.length")
## [1] 1
## 
## [[4]]
## [1] 1
## attr(,"match.length")
## [1] 1

gregexec(pattern = '[>.<]', my_vector)
## [[1]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    6    9
## attr(,"match.length")
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## 
## [[3]]
##      [,1]
## [1,]    4
## attr(,"match.length")
##      [,1]
## [1,]    1
## 
## [[4]]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## attr(,"match.length")
##      [,1] [,2] [,3]
## [1,]    1    1    1

    The difference isn't very big. I found a reference:

    regexpr(), gregexpr(): Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match.
    regexec(): This function searches a character vector for a regular expression, much like regexpr(), but it will additionally return the locations of any parenthesized sub-expressions. Probably easier to explain through demonstration.

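    The link to the example used in that book is dead, and I couldn't find the original data. I looked around for links that might be related, and I have to say that life in the US really is colorful.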
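Since the book's example data cannot be recovered, here is a minimal made-up sketch of what the quoted description means. The string `s` and the pattern `pat` are invented for illustration and do not come from the book. With two parenthesized groups in the pattern, regexpr() reports only the overall match, while regexec() also reports where each group starts and how long it is.

```r
s   <- "born 2021-05-18"           # hypothetical input, not from the book
pat <- "([0-9]{4})-([0-9]{2})"     # two parenthesized sub-expressions

m_expr <- regexpr(pat, s)          # integer vector: overall match only
m_exec <- regexec(pat, s)          # list: overall match, then each group

as.integer(m_expr)                 # start of the overall match: 6
attr(m_expr, "match.length")       # its length: 7

as.integer(m_exec[[1]])            # starts of match, group 1, group 2: 6 6 11
attr(m_exec[[1]], "match.length")  # their lengths: 7 4 2
```

The first entry of the regexec() result is the same position that regexpr() returns; the remaining entries are the extra per-group information.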

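    yuanfan When the regular expression contains no parenthesized groups, there is essentially no difference, and regexpr() can be regarded as a special case of regexec(). When it does contain parenthesized groups, regexec() returns the match information for each group, which regexpr() does not. The difference is easier to see in combination with regmatches().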

    x = c('>阿木<;>曼妮<', 'amu', 'amu<', '>.<')
    m1 = gregexpr('.(.<)(.)?', x)
    regmatches(x, m1)
    [[1]]
    [1] "阿木<;" "曼妮<" 
    
    [[2]]
    character(0)
    
    [[3]]
    [1] "mu<"
    
    [[4]]
    [1] ">.<"
    m2 = gregexec('.(.<)(.)?', x)
    regmatches(x, m2)
    [[1]]
         [,1]     [,2]   
    [1,] "阿木<;" "曼妮<"
    [2,] "木<"    "妮<"  
    [3,] ";"      ""     
    
    [[2]]
    character(0)
    
    [[3]]
         [,1] 
    [1,] "mu<"
    [2,] "u<" 
    [3,] ""   
    
    [[4]]
         [,1] 
    [1,] ">.<"
    [2,] ".<" 
    [3,] ""
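The same contrast shows up with the non-global pair regexpr()/regexec(), which is what the original question asks about. A small sketch, assuming only the ASCII elements of `x` above so that the result does not depend on the locale (the names `x_ascii`, `full_only`, and `with_groups` are made up here):

```r
x_ascii <- c("amu", "amu<", ">.<")   # ASCII subset of x, chosen for illustration
pat     <- ".(.<)(.)?"               # same pattern as above

# regexpr(): one overall match per element; regmatches() drops non-matching elements
full_only <- regmatches(x_ascii, regexpr(pat, x_ascii))
full_only
# [1] "mu<" ">.<"

# regexec(): one list element per input; each holds the overall match,
# followed by the text captured by each parenthesized group
with_groups <- regmatches(x_ascii, regexec(pat, x_ascii))
with_groups[[2]]
# [1] "mu<" "u<"  ""
```

Note that the two results also differ in shape: the regexpr() route gives a plain character vector that is shorter than the input when some elements do not match, whereas the regexec() route keeps one list element per input, with character(0) for non-matches.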