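When I read the Thousand Character Classic (qianziwen.txt) in R, I get an error and the final counts come out wrong. My code is below: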
download.file(url =
"http://dapengde.com/r4rookies/qianziwen.txt",
destfile = "c:/r4r/qianziwen.txt")
qzw <- readLines('c:/r4r/qianziwen.txt', encoding = 'UTF-8')
class(qzw)
length(qzw)
nchar(qzw)
qzwmerged <- paste(qzw, collapse = '')
qzwmerged <- gsub(' ', '', qzwmerged)
nchar(qzwmerged)
qzwsingle <- strsplit(qzwmerged, '')[[1]]
chardup <- qzwsingle[duplicated(qzwsingle)]
for(i in chardup) print(paste(i, grep(i, qzw, value = T)))
The output is as follows:
> download.file(url =
+ "http://dapengde.com/r4rookies/qianziwen.txt",
+ destfile = "c:/r4r/qianziwen.txt")
trying URL 'http://dapengde.com/r4rookies/qianziwen.txt'
Content type 'text/plain' length 2373 bytes
downloaded 2373 bytes
> qzw <- readLines('c:/r4r/qianziwen.txt', encoding = 'UTF-8')
Warning message:
In readLines("c:/r4r/qianziwen.txt", encoding = "UTF-8") :
incomplete final line found on 'c:/r4r/qianziwen.txt'
> class(qzw)
[1] "character"
> length(qzw)
[1] 125
> nchar(qzw)
Error in nchar(qzw) : invalid multibyte string, element 1
> qzwmerged <- paste(qzw, collapse = '')
> qzwmerged <- gsub(' ', '', qzwmerged)
> nchar(qzwmerged)
[1] 3105
> qzwsingle <- strsplit(qzwmerged, '')[[1]]
> chardup <- qzwsingle[duplicated(qzwsingle)]
> for(i in chardup) print(paste(i, grep(i, qzw, value = T)))
...
[56] "㸸 \xe6\xfd\xc9\xc8Բ�\xe0 \xd2\xf8\xd6\xf2쿻\xcd"
[57] "㸸 \xcfҸ\xe8�\xc6\xd1\xe7 �ӱ��\xd9\xf5\xfc"
[58] "㸸 �պ\xf3\xcb\xc3\xd0\xf8 �\xc0\xec\xeb\u009fA��"
...
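length(qzw) for the Thousand Character Classic should be 249, but my result is 125. nchar(qzwmerged) should be 1000, yet for some reason it comes out as 3105. The for loop also prints no readable Chinese text, only garbled bytes. This looks like a problem with recognizing Chinese characters, but I already specified encoding = 'UTF-8', so I don't know what is causing it. Could someone with more experience tell me how to fix this? Thanks in advance!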