Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


csv - R on Windows: character encoding hell

I am trying to import a CSV encoded as OEM-866 (Cyrillic charset) into R on Windows. I also have a copy that has been converted into UTF-8 w/o BOM. Both of these files are readable by all other applications on my system, once the encoding is specified.

Furthermore, on Linux, R can read these particular files with the specified encodings just fine. I can also read the CSV on Windows IF I do not specify the "fileEncoding" parameter, but this results in unreadable text. When I specify the file encoding on Windows, I always get the following errors, for both the OEM and the Unicode file:

Original OEM file import:

> oem.csv <- read.table("~/csv1.csv", sep=";", dec=",", quote="",fileEncoding="cp866")   #result:  failure to import all rows
Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  invalid input found on input connection '~/Revolution/RProject1/csv1.csv'
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  number of items read is not a multiple of the number of columns

UTF-8 w/o BOM file import:

> unicode.csv <- read.table("~/csv1a.csv", sep=";", dec=",", quote="", fileEncoding="UTF-8")   #result: failure to import all rows
Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  invalid input found on input connection '~/Revolution/RProject1/csv1a.csv'
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  number of items read is not a multiple of the number of columns

Locale info:

> Sys.getlocale()
   [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

What is it about R on Windows that is responsible for this? By this point I have tried pretty much everything I could, short of ditching Windows.

Thank You

(Additional failed attempts):

> Sys.setlocale("LC_ALL", "en_US.UTF-8")   #result: OS reports request to set locale to "en_US.UTF-8" cannot be honored
> options(encoding="UTF-8")   #result: now nothing can be imported
> noarg.unicode.csv <- read.table("~/Revolution/RProject1/csv1a.csv", sep=";", dec=",", quote="")   #result: mangled cyrillic
> encarg.unicode.csv <- read.table("~/Revolution/RProject1/csv1a.csv", sep=";", dec=",", quote="",encoding="UTF-8") #result: mangled cyrillic


1 Reply


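Your problem may be solved by replacing fileEncoding with encoding; the two parameters work differently in read.table (see ?read.table). fileEncoding re-encodes the file's contents as they are read, whereas encoding only marks the imported strings as being in a known encoding (Latin-1 or UTF-8) without converting them.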

oem.csv <- read.table("~/csv1.csv", sep=";", dec=",", quote="",encoding="cp866")
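If neither argument gives usable text, one possible workaround is to read the lines unconverted and re-encode them explicitly with iconv() before parsing. The following is only a sketch under assumptions, not something run against the asker's files: the sample row, the temporary file, and the names cp866_word, utf8_lines and oem_df are made up for illustration, and it assumes the local iconv implementation knows the name "CP866".

```r
# Hypothetical sample row "Минемум;1,5", written as CP866 (OEM-866) bytes.
cp866_word <- as.raw(c(0x8C, 0xA8, 0xAD, 0xA5, 0xAC, 0xE3, 0xAC))
tmp <- tempfile(fileext = ".csv")
writeBin(c(cp866_word, charToRaw(";1,5\n")), tmp)

# Read the lines without any conversion, then convert CP866 -> UTF-8 explicitly.
raw_lines  <- readLines(tmp, warn = FALSE)
utf8_lines <- iconv(raw_lines, from = "CP866", to = "UTF-8")

# Parse the converted text with the same separators as in the question.
oem_df <- read.table(text = utf8_lines, sep = ";", dec = ",", quote = "",
                     stringsAsFactors = FALSE)
print(oem_df)
```

The point of the design is that the conversion target is always UTF-8, so the text never has to pass through the Windows-1252 locale, which has no Cyrillic characters.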

In case that is not enough, here is a more complete answer, since there may be some non-obvious obstacles. In short: it is possible to work with Cyrillic in R on Windows (in my case, Windows 7).

You may need to try a few possible encodings to get things to work. For text mining, it is important that the encoding of your input variables (for example, search patterns typed into a script) matches the encoding of the data. The function Encoding() is very useful here; see also iconv(). With it you can check which encoding your own strings are in:

Encoding(variant <- "Минемум")
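To make the roles of Encoding() and iconv() more concrete, here is a small sketch. It writes the sample word with \u escapes so that it does not depend on how the script file itself is saved; the variable names are illustrative, and it again assumes the local iconv knows "CP866".

```r
# The word "Минемум" written with Unicode escapes; such strings are marked UTF-8.
word <- "\u041c\u0438\u043d\u0435\u043c\u0443\u043c"
print(Encoding(word))      # the encoding mark of the Cyrillic string
print(Encoding("part2"))   # pure ASCII strings carry no mark ("unknown")

# iconv() changes the bytes; Encoding() only reports or sets the mark.
as_cp866 <- iconv(word, from = "UTF-8", to = "CP866")
print(charToRaw(as_cp866)) # one byte per letter in CP866

# Converting back restores the original UTF-8 string.
round_trip <- iconv(as_cp866, from = "CP866", to = "UTF-8")
print(round_trip == word)
```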

In my case the encoding is UTF-8, although this may depend on system settings. So we can test with UTF-8 and UTF-8-BOM by making two test files in Notepad++, each with one line of Latin and one line of Cyrillic text.

Contents of UTF8_nobom_cyrillic.csv and UTF8_bom_cyrillic.csv:

part2, part3, part4
Минемум конкыптам, тхэопхражтуз, ед про

This can be imported into R by

raw_table1 <- read.csv("UTF8_nobom_cyrillic.csv", header = FALSE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", encoding = "UTF-8")
raw_table2 <- read.csv("UTF8_bom_cyrillic.csv", header = FALSE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", encoding = "UTF-8-BOM")

For the BOM file, I get regular Cyrillic in View(raw_table2), but gibberish in the console:

part2, part3, part4
??????μ????? ???????????????°?? ???…?¨?????…?€?°??????

More importantly, however, a script cannot match the text:

> grep("Минемум", as.character(raw_table2[2,1]))
integer(0)

For the UTF-8 file without a BOM, the result looks something like this in both View(raw_table1) and the console:

part2, part3, part4
<U+041C><U+0438><U+043D><U+0435><U+043C><U+0443><U+043C> <U+043A><U+043E><U+043D><U+043A><U+044B><U+043F><U+0442><U+0430><U+043C> <U+0442><U+0445><U+044D><U+043E><U+043F><U+0445><U+0440><U+0430><U+0436><U+0442><U+0443><U+0437> <U+0435><U+0434> <U+043F><U+0440><U+043E>

Importantly, however, a search for the word inside the table yields the correct result:

> grep("Минемум", as.character(raw_table1[2,1]))
[1] 1
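One possible explanation for the two different grep() results is that a pattern and a data string can fail to match when one of them carries the wrong encoding mark, even if the bytes are the same. The sketch below imitates that situation on purpose; it is an illustration of the mechanism under that assumption, not a confirmed diagnosis of the BOM import above, and the variable names are made up.

```r
# Search pattern, marked UTF-8 because it is written with Unicode escapes.
word <- "\u041c\u0438\u043d\u0435\u043c\u0443\u043c"

# Same bytes, but deliberately mislabelled as Latin-1 to imitate a bad import.
mislabelled <- word
Encoding(mislabelled) <- "latin1"
match_before <- grepl(word, mislabelled, fixed = TRUE)

# Correct the declared encoding; the bytes themselves are left untouched.
repaired <- mislabelled
Encoding(repaired) <- "UTF-8"
match_after <- grepl(word, repaired, fixed = TRUE)

print(c(before = match_before, after = match_after))
```

If an imported column shows this pattern, checking Encoding() on both the column and the search string is a cheap first diagnostic.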

Thus, it is possible to work with non-ASCII characters in R on Windows, although how well this works depends on your precise goals. I work with non-English Latin characters regularly, and UTF-8 lets me do so on Windows 7 with no issues. "WINDOWS-1252" has been useful for exporting to Microsoft applications such as Excel.

PS: The Russian words were generated at http://generator.lorem-ipsum.info/_russian, so they are essentially meaningless.

PPS: The warnings you mentioned still appear, but with no apparent significant effect.

