Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
459 views
in Technique[技术] by (71.8m points)

language detection - PHP: How do I detect if an input string is Arabic

Is there a way to detect the language of the data being entered via the input field?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

hmm i may offer an improved version of DimaKrasun's function:

functoin is_arabic($string) {
    if($string === 'arabic') {
         return true;
    }
    return false;
}

okay, enough joking!

Pekkas suggestion to use the google translate api is a good one! but you are relying on an external service which is always more complicated etc.

i think Rushyos approch is good! its just not that easy. i wrote the following function for you but its not tested, but it should work...

    <?
function uniord($u) {
    // i just copied this function fron the php.net comments, but it should work fine!
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1));
    $k2 = ord(substr($k, 1, 1));
    return $k2 * 256 + $k1;
}
function is_arabic($str) {
    if(mb_detect_encoding($str) !== 'UTF-8') {
        $str = mb_convert_encoding($str,mb_detect_encoding($str),'UTF-8');
    }

    /*
    $str = str_split($str); <- this function is not mb safe, it splits by bytes, not characters. we cannot use it
    $str = preg_split('//u',$str); <- this function woulrd probably work fine but there was a bug reported in some php version so it pslits by bytes and not chars as well
    */
    preg_match_all('/.|
/u', $str, $matches);
    $chars = $matches[0];
    $arabic_count = 0;
    $latin_count = 0;
    $total_count = 0;
    foreach($chars as $char) {
        //$pos = ord($char); we cant use that, its not binary safe 
        $pos = uniord($char);
        echo $char ." --> ".$pos.PHP_EOL;

        if($pos >= 1536 && $pos <= 1791) {
            $arabic_count++;
        } else if($pos > 123 && $pos < 123) {
            $latin_count++;
        }
        $total_count++;
    }
    if(($arabic_count/$total_count) > 0.6) {
        // 60% arabic chars, its probably arabic
        return true;
    }
    return false;
}
$arabic = is_arabic('????? ??????? ???? ??? ???? ?????. ????? ?????? ?? ?????? ?? ???? ??????'); 
var_dump($arabic);
?>

final thoughts: as you see i added for example a latin counter, the range is just a dummy number b ut this way you could detect charsets (hebrew, latin, arabic, hindi, chinese, etc...)

you may also want to eliminate some chars first... maybe @, space, line breaks, slashes etc... the PREG_SPLIT_NO_EMPTY flag for the preg_split function would be useful but because of the bug I didn't use it here.

you can as well have a counter for all the character sets and see which one of course the most...

and finally you should consider chopping your string off after 200 chars or something. this should be enough to tell what character set is used.

and you have to do some error handling! like division by zero, empty string etc etc! don't forget that please... any questions? comment!

if you want to detect the LANGUAGE of a string, you should split into words and check for the words in some pre-defined tables. you don't need a complete dictionary, just the most common words and it should work fine. tokenization/normalization is a must as well! there are libraries for that anyway and this is not what you asked for :) just wanted to mention it


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...