Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
641 views
in Technique[技术] by (71.8m points)

php - fgetcsv() drops characters with diacritics (i.e. non-ASCII) - how to fix?

Similar questions:
Some characters in CSV file are not read during PHP fgetcsv() ,
fgetcsv() ignores special characters when they are at the beginning of line

My application has a form where the users can upload a CSV file (its 5 internal users have always uploaded a valid file - comma-delimited, quoted, records end by LF), and the file is then imported into a database using PHP:

$fhandle = fopen($uploaded_file,'r');
while($row = fgetcsv($fhandle, 0, ',', '"', '\')) {
    print_r($row);
    // further code not relevant as the data is already corrupt at this point
}

For reasons I cannot change, the users are uploading the file encoded in the Windows-1250 charset - a single-byte, 8-bit character encoding.

The problem: and some (not all!) characters beyond 127 ("extended ASCII") are dropped in fgetcsv(). Example data:

"15","ústav"
"420","?pi?ák"
"7","Tmaň"

becomes

Array (
  0 => 15
  1 => "stav"
)
Array (
  0 => 420
  1 => "pi?ák"
)
Array (
  0 => 7
  1 => "Tma"
)

(Note that ? is kept, but ú is dropped)

The documentation for fgetcsv says that "since 4.3.5 fgetcsv() is now binary safe", but looks like it isn't. Am I doing something wrong, or is this function broken and I should look for a different way to parse CSV?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It turns out that I didn't read the documentation well enough - fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII < 127, but the documentation also says:

Note:

Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function

In other words, fgetcsv() tries to be binary-safe, but it's actually not (because it's also messing with the charset at the same time), and it will probably mangle the data it reads (as this setting is not configured in php.ini, but rather read from $LANG).

I've sidestepped the issue by reading the lines with fgets (which works on bytes, not characters) and using a CSV function from the comment in the docs to parse them into an array:

$fhandle = fopen($uploaded_file,'r');
while($raw_row = fgets($fhandle)) { // fgets is actually binary safe
    $row = csvstring_to_array($raw_row, ',', '"', "
");
    // $row is now read correctly
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...