R Moonlighting: Simple CAPTCHA Recognition

nan.xiao

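As the most powerful statistical computing environment (bar none), R has good support for digital image processing and is well suited to jobs like CAPTCHA recognition. Let's try it out on the simplest possible example.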

Generally speaking, getting a machine to "recognize" a CAPTCHA involves roughly three problems:

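Extracting glyph templates: extract the numeric information of every single character that can appear in the CAPTCHA and use it as the basis for comparison; each character can be thought of as one numeric matrix;
Binarization: following some criterion, represent the pixels of the character itself with 1 and the background and noise dots with 0 (or the other way round);
Matching: fetch a CAPTCHA, binarize it, compare it with the templates according to some criterion, and take the most similar template character as the recognition result.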
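To make the three steps concrete, here is a minimal base-R sketch on toy data. The 3x3 templates, the 0.5 threshold and the helper names to_binary() and match_glyph() are all made up for illustration; the real example in this post works on 15x10 matrices and uses package functions instead.

```r
# Step 1: toy 3x3 glyph templates (hypothetical data, not the real templates)
templates <- list(
  "1" = matrix(c(0, 1, 0,
                 0, 1, 0,
                 0, 1, 0), nrow = 3, byrow = TRUE),
  "7" = matrix(c(1, 1, 1,
                 0, 0, 1,
                 0, 0, 1), nrow = 3, byrow = TRUE)
)

# Step 2: binarize a grayscale matrix (dark ink -> 1, light background -> 0)
to_binary <- function(gray, threshold = 0.5) {
  (gray < threshold) * 1
}

# Step 3: return the name of the template identical to the glyph, or NA
match_glyph <- function(glyph, templates) {
  hit <- vapply(templates, function(tpl) all(tpl == glyph), logical(1))
  if (any(hit)) names(templates)[which(hit)[1]] else NA_character_
}

# A grayscale "7": ink is dark (0.1), background is light (0.9)
gray7 <- matrix(c(0.1, 0.1, 0.1,
                  0.9, 0.9, 0.1,
                  0.9, 0.9, 0.1), nrow = 3, byrow = TRUE)
recognized <- match_glyph(to_binary(gray7), templates)
print(recognized)
```

Exact equality is only reasonable because the CAPTCHA considered here has no noise; anything noisier needs a more tolerant comparison.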

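First we need a site that serves CAPTCHAs. After looking around, I decided to practice on a certain Chinese CMS, and picked a random site from its "success stories":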

http://gongxue.cn/Inc/Checkcode.asp

Each refresh of this page returns a new CAPTCHA made of 6 characters, a mix of digits and lowercase letters.

For convenience, create a folder to hold the templates and test files, and set it as the working directory:

dir.create("C:/checkcode/")
setwd("C:/checkcode/")

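From observation, this example needs the pixel information of 35 characters (10 digits + 25 lowercase letters; for some reason only "z" is missing). The CAPTCHA images turn out to be in bmp format. Crop them into single characters, save these as 35 separate bmp files (see attachment), and use GraphicsMagick to convert them to the pnm format that R can read: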

system("gm mogrify -format pnm C:/checkcode/*.bmp")

Read the images with read.pnm() from the pixmap package and store the results in the list pnm:

require(pixmap)
pnm = list()
length(pnm) = 35
for (i in 0:34) {
  pnm[[i+1]] = read.pnm(paste(i, ".pnm", sep = ""))
}

(For clarity, explicit loops are used throughout.)

Extract the numeric arrays of the three RGB channels with getChannels() from the pixmap package and store the results in the list channels:

channels = list()
length(channels) = 35
for (j in 1:35) {
  channels[[j]] = getChannels(pnm[[j]])
}

A simple weighted sum converts RGB to grayscale; the results go into the list graymat:

graymat = list()
length(graymat) = 35
for (k in 1:35) {
  ## weighted sum of the full red, green and blue channel matrices
  graymat[[k]] = 0.3 * channels[[k]][, , "red"] +
                 0.59 * channels[[k]][, , "green"] +
                 0.11 * channels[[k]][, , "blue"]
}
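To see what this weighted sum does, here is a small base-R sketch on a made-up 2x2 image. The array rgb_arr and the helper to_gray() are hypothetical; the only thing taken from the post is the layout (a rows x cols x 3 array with channels named "red", "green" and "blue") and the weights 0.3, 0.59 and 0.11.

```r
# A hypothetical 2x2 RGB image stored as a rows x cols x 3 array
rgb_arr <- array(0, dim = c(2, 2, 3),
                 dimnames = list(NULL, NULL, c("red", "green", "blue")))
rgb_arr[1, 1, ] <- c(1, 1, 1)   # white
rgb_arr[1, 2, ] <- c(1, 0, 0)   # pure red
rgb_arr[2, 1, ] <- c(0, 1, 0)   # pure green
rgb_arr[2, 2, ] <- c(0, 0, 1)   # pure blue

# Weighted sum of the three full channel matrices
to_gray <- function(a) {
  0.3 * a[, , "red"] + 0.59 * a[, , "green"] + 0.11 * a[, , "blue"]
}

gray <- to_gray(rgb_arr)
print(gray)
```

White maps to 1, and the three pure colors map to their own weights, so green pixels come out brightest and blue pixels darkest.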

Binarize the matrices with binarize() from the biclust package and store them in the list binmat:

require(biclust)
binmat = list()
length(binmat) = 35
for (l in 1:35) {
  binmat[[l]] = binarize(graymat[[l]][1:15, 1:10])
}

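This is the simplest possible case: the image contains no noise at all, so binarization works extremely well. You can type binmat[[1]] to inspect the binarized result for the character "0".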
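A raw 0/1 matrix printed at the console is a little hard to read, so a tiny helper that draws it as text can be handy. This is only a sketch: show_glyph() and the 5x4 matrix zero are made up for illustration, and whether 1 stands for ink or background in the real binmat depends on how the threshold was applied.

```r
# Render a 0/1 matrix as text: "#" for 1, "." for 0, one string per row
show_glyph <- function(bin) {
  rows <- apply(bin, 1, function(r) {
    paste(ifelse(r == 1, "#", "."), collapse = "")
  })
  cat(rows, sep = "\n")
  invisible(rows)
}

# Hypothetical 5x4 binary matrix shaped roughly like a "0"
zero <- matrix(c(0, 1, 1, 0,
                 1, 0, 0, 1,
                 1, 0, 0, 1,
                 1, 0, 0, 1,
                 0, 1, 1, 0), nrow = 5, byrow = TRUE)
rows <- show_glyph(zero)
```

With the real data the call would be show_glyph(binmat[[1]]).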

The template library is now in place, and the actual recognition can begin:

## Map list indices to characters
charTable = c(0:9, "a", "b", "c", "d", "e", "f",
              "g", "h", "i", "j", "k", "l", "m",
              "n", "o", "p", "q", "r", "s", "t",
              "u", "v", "w", "x", "y")

## Fetch a fresh CAPTCHA (mode = "wb" keeps the binary file intact on Windows)
download.file("http://gongxue.cn/Inc/Checkcode.asp",
              "test.bmp", mode = "wb", quiet = TRUE)

## Convert bmp to pnm
system("gm mogrify -format pnm C:/checkcode/test.bmp")

## Grayscale conversion, same weights as for the templates
testmat = getChannels(read.pnm("test.pnm"))
testmatgray = 0.3 * testmat[, , "red"] +
              0.59 * testmat[, , "green"] +
              0.11 * testmat[, , "blue"]

## Cut out the six 10-pixel-wide character cells and binarize each one
char = list()
length(char) = 6
for (i in 1:6) {
  char[[i]] = binarize(testmatgray[1:15, (i*10-9):(i*10)])
}

## Store the match for each character in result; the comparison could still be optimized
result = rep(NA, 6)
for (i in 1:6) {
  for (j in 1:35) {
    if (isTRUE(all.equal(char[[i]], binmat[[j]]))) {
      result[i] = charTable[j]
    }
  }
}

Print the recognition result:

print(paste(result, collapse = ""))
## [1] "v54c0s"
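The slicing step relies on every character sitting in its own fixed cell, 10 pixels wide, so character i occupies columns (i*10-9) to (i*10). A base-R sketch of that indexing on a dummy matrix is shown below; split_cells() and the dummy strip are made up for illustration.

```r
# Split a strip of equally wide character cells into a list of matrices
split_cells <- function(strip, n_chars = 6, cell_width = 10) {
  lapply(seq_len(n_chars), function(i) {
    cols <- ((i - 1) * cell_width + 1):(i * cell_width)
    strip[, cols, drop = FALSE]
  })
}

# Dummy 15x60 strip: every pixel of cell i holds the value i,
# which makes it easy to check that the cuts land in the right place
strip <- matrix(rep(1:6, each = 15 * 10), nrow = 15)
cells <- split_cells(strip)
print(sapply(cells, function(m) unique(as.vector(m))))
```

This only works for CAPTCHAs with a fixed layout; once characters shift or overlap, segmentation becomes a problem of its own.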

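That's it for this round of moonlighting. The example is simple because the CAPTCHA has no noise and the characters are not distorted; it is a very traditional, very standard CAPTCHA. For nastier CAPTCHAs, keeping a reasonably high recognition rate puts much higher demands on both the binarization method and the comparison method, and this is where statistical methods can show what they are good for. Compared with CAPTCHAs that simply pile on image complexity until even humans can hardly read them, I much prefer reCAPTCHA, which was acquired by Google; it is one of those rare, genuinely good ideas.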
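As one small step toward a more tolerant comparison, the exact-equality test can be swapped for a nearest-template rule based on the number of differing pixels (the Hamming distance). This is just a sketch of that swapped-in technique, not the method used above; the function names, the 3x3 toy templates and the max_diff tolerance are all assumptions.

```r
# Number of differing pixels between two 0/1 matrices of the same size
hamming <- function(a, b) sum(a != b)

# Return the label of the closest template, or NA if even the best match
# differs in more than max_diff pixels (max_diff is a made-up tolerance)
nearest_template <- function(glyph, templates, labels, max_diff = 2) {
  d <- vapply(templates, hamming, numeric(1), b = glyph)
  best <- which.min(d)
  if (d[best] <= max_diff) labels[best] else NA_character_
}

# Toy 3x3 templates for "l" and "t" (hypothetical data)
tpl_l <- matrix(c(1, 0, 0,
                  1, 0, 0,
                  1, 1, 1), nrow = 3, byrow = TRUE)
tpl_t <- matrix(c(1, 1, 1,
                  0, 1, 0,
                  0, 1, 0), nrow = 3, byrow = TRUE)
templates <- list(tpl_l, tpl_t)
labels <- c("l", "t")

# A "t" with one flipped pixel is still recognized
noisy_t <- tpl_t
noisy_t[3, 3] <- 1
guess <- nearest_template(noisy_t, templates, labels)
print(guess)
```

With the real data, templates would be binmat and labels would be charTable.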

Ihavenothing

Marking this as a featured post, no hesitation.

yihui

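I thought this was about recognizing CAPTCHAs with random noise... If a follow-up manages to recognize CAPTCHAs with random noise, it can go straight onto the main site.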

Still digging into this kind of thing on the first day of the Lunar New Year; now that's the geek spirit.

nan.xiao

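Well, the trouble with this sort of thing is that it hardly generalizes: almost every case needs its own analysis, and for anything slightly trickier it is not easy to work out the algorithm and write it up clearly.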

As for removing noise dots, though, I did run into an interesting example:

The registration CAPTCHA of Xianguo (xianguo.com): http://xianguo.com/pic/valid

download.file("http://xianguo.com/pic/valid",
              "valid.1.png",
              mode = "wb", quiet = TRUE)

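PNG format, 40*140 pixels, 6 characters mixing digits with upper- and lowercase letters, noise dots in a color close to that of the characters, and each character tilted by a random angle.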

require(png)
require(rimage)
img = readPNG("valid.1.png")[ , , 1:3] # RGBA -> RGB
imgmat = imagematrix(img) # add the imagematrix attribute to make plotting easy
graymat = rgb2grey(imgmat) # RGB to grayscale

Using thresholding() from rimage, compare the original image with the result of binarizing it directly:

op = par(mfrow = c(2, 2))
plot(graymat, main = "Original")
plot(thresholding(graymat, mode = "fixed"),
     main = "Threshold = 0.5")
plot(thresholding(graymat, mode = "fixed", th = 0.9),
     main = "Threshold = 0.9")
plot(thresholding(graymat, mode = "da"),
     main = "Auto Threshold by Discriminant Analysis")
par(op)

[attachment=214085,868]

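The result is poor, which probably has something to do with the noise dots being close in color to the characters. Staring hard at the image reveals a fatal regularity in the noise: although the dots are close in color to the characters, each one is basically a lone dark pixel whose four neighbors (above, below, left and right) are all white. That means most of the noise can be removed with almost no thought at all: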

graymat2 = graymat
for (i in 2:39) {
  for (j in 2:139) {
    ## whiten the pixel if its right, left, upper and lower neighbors are all white
    if (1 - graymat2[i, j+1] < 1e-15 &&
        1 - graymat2[i, j-1] < 1e-15 &&
        1 - graymat2[i-1, j] < 1e-15 &&
        1 - graymat2[i+1, j] < 1e-15) {
      graymat2[i, j] = 1
    }
  }
}
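The same rule (a pixel is turned white when its upper, lower, left and right neighbors are all white) can also be written without loops by comparing shifted copies of the matrix. The sketch below uses a plain numeric matrix rather than an imagematrix object; remove_isolated() and the 6x6 test image are made up for illustration.

```r
# Whiten every interior pixel whose four direct neighbors are all white
remove_isolated <- function(gray, tol = 1e-15) {
  nr <- nrow(gray)
  nc <- ncol(gray)
  stopifnot(nr >= 3, nc >= 3)
  white <- (1 - gray) < tol            # TRUE where a pixel is (almost) pure white
  i <- 2:(nr - 1)                      # interior rows
  j <- 2:(nc - 1)                      # interior columns
  isolated <- white[i - 1, j] & white[i + 1, j] &
              white[i, j - 1] & white[i, j + 1]
  out <- gray
  out[i, j][isolated] <- 1
  out
}

# Hypothetical 6x6 image: one lone dark dot and a two-pixel stroke
g <- matrix(1, nrow = 6, ncol = 6)
g[2, 2] <- 0.2                         # isolated noise dot
g[4, 4] <- 0.3                         # stroke pixel
g[4, 5] <- 0.3                         # stroke pixel next to it
cleaned <- remove_isolated(g)
print(cleaned)
```

The lone dot disappears while the two adjacent stroke pixels protect each other, which is exactly why the rule leaves the characters mostly intact.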

Now look at the image with the noise dots removed, and at what binarization does to it:

op = par(mfrow = c(2, 2))
plot(graymat2, main = "Noise Removed")
plot(thresholding(graymat2, mode = "fixed"),
     main = "Threshold = 0.5")
plot(thresholding(graymat2, mode = "fixed", th = 0.9),
     main = "Threshold = 0.9")
plot(thresholding(graymat2, mode = "da"),
     main = "Auto Threshold by Discriminant Analysis")
par(op)

[attachment=214085,869]

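The result is acceptable, with room for further improvement. It is worth mentioning that the rimage package ships with quite a few filters; they are interesting and a lot of fun to play with.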
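For anyone wondering what an automatic threshold "by discriminant analysis" might compute, one well-known criterion of this kind is Otsu's method, which picks the cut that maximizes the between-class variance of the dark and light pixels. The base-R sketch below illustrates that idea only; it is not necessarily what rimage does internally, and otsu_threshold() and the test image are made up.

```r
# Otsu-style threshold for a grayscale matrix with values in [0, 1]
otsu_threshold <- function(gray, levels = 256) {
  # Quantize intensities to integer bins 1..levels
  bins <- as.integer(pmin(floor(gray * levels) + 1, levels))
  p <- tabulate(bins, nbins = levels) / length(gray)
  omega <- cumsum(p)                   # share of pixels at or below each bin
  mu <- cumsum(p * seq_len(levels))    # cumulative (unnormalized) mean
  mu_total <- mu[levels]
  # Between-class variance for every candidate cut
  sigma_b <- (mu_total * omega - mu)^2 / (omega * (1 - omega))
  sigma_b[!is.finite(sigma_b)] <- -Inf # cuts that leave one class empty
  k <- which.max(sigma_b)
  k / levels                           # upper edge of the last "dark" bin
}

# Hypothetical image: half the pixels dark (0.2), half light (0.8)
gray <- matrix(rep(c(0.2, 0.8), each = 6), nrow = 3)
thr <- otsu_threshold(gray)
dark <- gray < thr
print(thr)
```

On a clean two-tone image any cut between the two gray levels is equally good; the interesting cases are images like the one above, where noise and strokes share similar intensities and no single global threshold separates them.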

The remaining ten thousand words are omitted ...

yihui

Now this is getting interesting.

superdesolator

How do you install GraphicsMagick on Windows?

twofat

Replying to superdesolator (post #6): same question...

fyuliang

Nice, I'm here to learn.

notingbad

xiongxi

Leaving my name here so I can find this thread again.