What is the difference between the regexpr and regexec functions?

yuanfan · September 12, 2024

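I have recently been reading about the string-handling functions in base R and came across grep/agrep/grepl/regexpr/gregexpr/regexec/gregexec + regmatches. Because the function names look so similar, I sorted out their components below.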

|Part of the function name|Meaning|
| ---- | ---- |
|re|regular expression (as in grep)|
|p|print (as in grep)|
|g|global|
|a|approximate (fuzzy)|
|l|logical|
|reg|regular|
|expr|expression|
|exec|execute|

My problem is that the only difference I can find between regexpr and regexec is the format of their output. I can't tell what else differs.

my_vector <- c('>阿木<;>曼妮<', 'amu', 'amu<', '>.<')

regexpr(pattern = '[>.<]', my_vector)
## [1]  1 -1  4  1
## attr(,"match.length")
## [1]  1 -1  1  1

gregexpr(pattern = '[>.<]', my_vector)
## [[1]]
## [1] 1 4 6 9
## attr(,"match.length")
## [1] 1 1 1 1
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## 
## [[3]]
## [1] 4
## attr(,"match.length")
## [1] 1
## 
## [[4]]
## [1] 1 2 3
## attr(,"match.length")
## [1] 1 1 1

regexec(pattern = '[>.<]', my_vector)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 1
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## 
## [[3]]
## [1] 4
## attr(,"match.length")
## [1] 1
## 
## [[4]]
## [1] 1
## attr(,"match.length")
## [1] 1

gregexec(pattern = '[>.<]', my_vector)
## [[1]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    6    9
## attr(,"match.length")
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## 
## [[3]]
##      [,1]
## [1,]    4
## attr(,"match.length")
##      [,1]
## [1,]    1
## 
## [[4]]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## attr(,"match.length")
##      [,1] [,2] [,3]
## [1,]    1    1    1

vickkk · September 12, 2024

The difference isn't large. I found a reference:

regexpr(), gregexpr(): Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match.
regexec(): This function searches a character vector for a regular expression, much like regexpr(), but it will additionally return the locations of any parenthesized sub-expressions. Probably easier to explain through demonstration.

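The link to the example used in that book is dead, and I couldn't find the original data. I looked around for links that might be related, and I have to say that life in the US really is colorful.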
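Since the book's own example is unavailable, here is a minimal sketch of the kind of demonstration the quote alludes to. The input strings and the pattern are made up for illustration and are not the book's data.

```r
# Hypothetical input: one string that matches, one that does not
x <- c("key=value", "no match here")
pattern <- "([a-z]+)=([a-z]+)"

# regexpr(): start and length of the whole match only
m_expr <- regexpr(pattern, x)
as.integer(m_expr)                 # 1 -1
attr(m_expr, "match.length")       # 9 -1

# regexec(): whole match first, then one entry per parenthesized group
m_exec <- regexec(pattern, x)
as.integer(m_exec[[1]])            # 1 1 5
attr(m_exec[[1]], "match.length")  # 9 3 5
as.integer(m_exec[[2]])            # -1
```

For the first string, regexec() reports the whole match (start 1, length 9) followed by the two groups, "key" (start 1, length 3) and "value" (start 5, length 5).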

yihui · September 12, 2024

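yuanfan, when the regular expression contains no parenthesized groups, there is basically no difference, and you can treat regexpr() as a special case of regexec(). When it does contain parenthesized groups, regexec() returns match information for each group, which regexpr() does not. The difference is easier to see in combination with regmatches().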

x = c('>阿木<;>曼妮<', 'amu', 'amu<', '>.<')
m1 = gregexpr('.(.<)(.)?', x)
regmatches(x, m1)

[[1]]
[1] "阿木<;" "曼妮<" 

[[2]]
character(0)

[[3]]
[1] "mu<"

[[4]]
[1] ">.<"

m2 = gregexec('.(.<)(.)?', x)
regmatches(x, m2)

[[1]]
     [,1]     [,2]   
[1,] "阿木<;" "曼妮<"
[2,] "木<"    "妮<"  
[3,] ";"      ""     

[[2]]
character(0)

[[3]]
     [,1] 
[1,] "mu<"
[2,] "u<" 
[3,] ""   

[[4]]
     [,1] 
[1,] ">.<"
[2,] ".<" 
[3,] ""
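The same contrast shows up in the non-global pair, regexpr() and regexec(). This is a minimal sketch that reuses the pattern above but, as an assumption made here to keep the example independent of encoding, only the ASCII elements of the vector.

```r
# ASCII-only subset of the vector used above (an assumption for this sketch)
x <- c("amu", "amu<", ">.<")
pattern <- ".(.<)(.)?"

# regexpr() + regmatches(): a character vector of whole matches;
# elements without a match are dropped
whole <- regmatches(x, regexpr(pattern, x))
whole            # "mu<" ">.<"

# regexec() + regmatches(): a list with one element per input string;
# each holds the whole match, then group 1, then group 2
groups <- regmatches(x, regexec(pattern, x))
groups[[1]]      # character(0), because "amu" does not match
groups[[2]]      # "mu<" "u<" ""
groups[[3]]      # ">.<" ".<" ""
```

Note that the regexpr() result is shorter than the input because non-matching elements are dropped, whereas the regexec() result keeps one list element per input string.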