[Solved] Escaping backslashes in strings and regular expressions

Heterogeneity

The question comes from one sentence in section 14.3.1 (Basic matches, https://r4ds.had.co.nz/strings.html#basic-matches) of the e-book R for Data Science:

If \ is used as an escape character in regular expressions, how do you match a literal \? Well you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write "\\\\" — you need four backslashes to match one!

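Seeing four backslashes made my head spin… that is not what I expected. I thought it should be three.
Here is my (wrong) reasoning: given a string whose content is \\\, R treats the first backslash as the string escape character.
Removing that first backslash leaves a string whose content is \\.
The second backslash is then treated by R (or rather, by the stringr package) as the regex escape character.
Removing that second backslash leaves a regular expression whose content is \, which is exactly what I want to match.

Where does this reasoning go wrong? I'd appreciate any pointers.

I have already read <https://stackoverflow.com/questions/18875852/why-string-replaceall-in-java-requires-4-slashes-in-regex-to-actually-r>, but it left me confused.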

tctcab

Heterogeneity

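Interesting question.

Where you went wrong: the string "\\\" is not valid in R. Try typing it and you will see that the third backslash escapes the closing quote…

The right way to think about it is already in the passage you quoted:
First, R chooses to represent regular expressions as strings, but a regular expression is not itself a string.
The character \ is special because it is the escape character both in R strings and in regular expressions.
So to match one \, the regular expression is \\, and in R that expression is written as "\\\\".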
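
A small sketch of the two layers, using only base R (`nchar`, `writeLines`, `grepl`); the test string is made up for illustration. The literal with four backslashes holds only two characters, and the regex engine reads those two as one literal backslash:

```r
# The literal "\\\\" is parsed by R into a 2-character string: \\
pattern <- "\\\\"
nchar(pattern)        # 2
writeLines(pattern)   # prints what the regex engine actually sees: \\

# "a\\b" is the 3-character string a\b (one literal backslash)
x <- "a\\b"
nchar(x)              # 3

# The regex \\ matches the single backslash in x
has_backslash <- grepl(pattern, x)
has_backslash         # TRUE
```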

To match two backslashes, \\, the regular expression is \\\\, and the R string then needs eight backslashes…

The eight-backslash code:

## a string whose content is a\\b (two literal backslashes)
writeLines("a\\\\b")
#> a\\b
## to match `\\`, the regex is `\\\\` (eight backslashes in the string literal)
writeLines("\\\\\\\\")
#> \\\\

## test
grepl("\\\\\\\\", "a\\\\b")
#> [1] TRUE

<sup>Created on 2019-03-08 by the reprex package (v0.2.1)</sup>

Cloud2016

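tctcab I found a solution.

Look at the help page ?grep: it has a fixed argument which, going by its description, means the pattern is taken as plain literal text. The argument is available in all the functions that support regular expressions.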

 fixed: logical.  If ‘TRUE’, ‘pattern’ is a string to be matched as is.  
           Overrides all conflicting arguments.

For example:

grep(x = c("a\\\\b",'a\\b'),pattern="\\\\",value=TRUE,fixed=TRUE)
[1] "a\\\\b"

grepl("\\\\\\\\", "a\\\\b",fixed=TRUE)
[1] FALSE
grepl("\\\\", "a\\\\b",fixed=TRUE)
[1] TRUE
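
To make the difference from the default visible, here is a minimal sketch with made-up inputs. It only uses base R `grepl`, and the point is that with `fixed = TRUE` neither `.` nor `\` has any regex meaning:

```r
# Hypothetical inputs: one contains a literal dot, one does not
x <- c("a.b", "axb")

# Default (fixed = FALSE): "." is a regex metacharacter, any character matches
regex_hits <- grepl(".", x)
regex_hits            # TRUE TRUE

# fixed = TRUE: "." is matched literally
fixed_hits <- grepl(".", x, fixed = TRUE)
fixed_hits            # TRUE FALSE

# A single literal backslash now needs only the string-level escape
one_backslash <- grepl("\\", "a\\b", fixed = TRUE)
one_backslash         # TRUE
```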

BTW, I find this feature very handy and worth pointing out on its own; it is off by default. Heterogeneity, once you use fixed you won't get dizzy anymore.

Cloud2016

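tctcab Does R have a good solution to this problem? Python's raw strings handle it fairly well; see the part about the trouble with backslashes in <http://www.liujiangblog.com/course/python/74>. In other words, matching a literal backslash \ in text only takes r"\\" in Python.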

Heterogeneity

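tctcab Thanks for the great answer.

After reading your post, I think this is where I went wrong: I saw strings and regular expressions as two different layers, with strings as the lower layer and regular expressions as the higher one. I assumed each layer looks at the whole string once, and only once, so each time only the first \ in the string gets processed as an escape (wrong), and what is left is passed up to the next layer.

But your remark

tctcab the third backslash escapes the quote…

shows that each layer goes through every character of the string. So every escape character gets processed. The character following an escape character is kept and passed up to the next layer. So a string of the form \a\b\c\d becomes abcd after escaping and is passed up to the next layer.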
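
A minimal sketch of each layer handling every escape in the string, not just the first, using a made-up pattern. One caveat on the `\a\b\c\d` example: in an R string literal, `\a` and `\b` are themselves escapes (bell and backspace), and `\c` and `\d` are rejected as unrecognized escapes, so the sketch uses doubled backslashes instead:

```r
# String layer: every \\ pair in the literal becomes one backslash
pattern <- "\\.\\d"
nchar(pattern)        # 4 characters: \ . \ d
writeLines(pattern)   # \.\d

# Regex layer: every remaining escape is interpreted,
# \. is a literal dot and \d is a digit
hits <- grepl(pattern, c("v.1", "v11", "v.x"))
hits                  # TRUE FALSE FALSE
```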

(What does the last line of your post mean? Is it a copyright notice?)

tctcab

Cloud2016
Ah, I couldn't find a raw string syntax in R.
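
R did gain raw string literals later: from R 4.0.0 on, `r"(...)"` is accepted. A minimal sketch, assuming R 4.0.0 or newer is installed; the string layer's escaping goes away, while the regex still needs its own `\\`:

```r
# Requires R >= 4.0.0 (raw string syntax)
pattern_raw <- r"(\\)"     # two characters: \\
pattern_esc <- "\\\\"      # the same two characters, written with escapes
same <- identical(pattern_raw, pattern_esc)
same                       # TRUE

# The regex \\ matches the single literal backslash in a\b
found <- grepl(pattern_raw, r"(a\b)")
found                      # TRUE
```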

Heterogeneity

Cloud2016 That blog is pretty good… I think I'll use it when I review Python.

tctcab

Cloud2016
Nice, that saves typing a lot of backslashes.

Heterogeneity

Cloud2016 Awesome!

Cloud2016

Heterogeneity tctcab I recently found a side effect of fixed=TRUE: backreferences can't be used.

See the description of the replacement argument:

a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.

In particular:

For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern.
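
The quoted restriction can be sketched as follows, with a made-up input string and base R `sub` only. With `fixed = TRUE` the parentheses in the pattern are plain characters, so there are no groups for a backreference to point at:

```r
x <- "key=value"

# fixed = FALSE (default): \\1 and \\2 refer to the parenthesized groups
swapped <- sub("(\\w+)=(\\w+)", "\\2=\\1", x)
swapped     # "value=key"

# fixed = TRUE: the pattern is literal text, which does not occur in x,
# so x comes back unchanged
unchanged <- sub("(\\w+)=(\\w+)", "\\2=\\1", x, fixed = TRUE)
unchanged   # "key=value"
```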

dapengde

Heterogeneity That's because the sample code he posted was run through the reprex package before posting; the last line is generated automatically by reprex.