Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


php - How to handle diacritics (accents) when rewriting 'pretty URLs'

I rewrite URLs to include the title of user-generated travel blogs.

I do this for both readability of URLs and SEO purposes.

 http://www.example.com/gallery/280-Gorges_du_Todra/

The first integer is the id, the rest is for us humans (but is irrelevant for requesting the resource).
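
Since only the leading integer identifies the resource, the request handler can ignore the slug completely. A minimal sketch of that lookup, assuming the /gallery/<id>-<slug>/ layout shown above; galleryIdFromPath is a hypothetical helper name, not something from the original post:

```php
<?php
// Hypothetical helper: pull the numeric id out of a pretty gallery path.
// The slug after the id is ignored, so an outdated or mangled slug
// still resolves to the same resource.
function galleryIdFromPath(string $path): ?int
{
    if (preg_match('#^/gallery/(\d+)(?:-[^/]*)?/?$#', $path, $m)) {
        return (int) $m[1];
    }
    return null; // not a gallery URL
}
```

With this, /gallery/280-Gorges_du_Todra/ and /gallery/280-anything_else/ both resolve to id 280.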

Now people can write titles containing any UTF-8 character, but most are not allowed in the URL. My audience is generally English speaking, but since they travel, they like to include names like

 Aït Ben Haddou

What is the proper way to translate this for display in a URL, using PHP on Linux?

So far I've seen several solutions:

  1. just strip all non-allowed characters and replace spaces. This has strange results:
    'Aït Ben Haddou' → /gallery/280-At_Ben_Haddou/
    Not really helpful.

  2. just strip all non-allowed characters, replace spaces, and leave the character code (as stackoverflow.com does), most likely because of the 'regex-hammer' used.
    This gives strange results: 'tést tést' → /questions/0000/t233st-t233st

  3. translate to the 'nearest equivalent'
    'Aït Ben Haddou' → /gallery/280-Ait_Ben_Haddou/
    But this goes wrong for German; for example, 'ü' should be transliterated as 'ue'.
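
To make the difference between options 1 and 3 concrete, here is a small sketch. Both function names are made up for illustration, and the map used for option 3 is deliberately tiny; it assumes UTF-8 input and a UTF-8 source file.

```php
<?php
// Option 1: drop every byte that is not an ASCII letter, digit or space,
// then turn spaces into underscores. The accented letter simply vanishes.
function slugByStripping(string $title): string
{
    $ascii = preg_replace('/[^A-Za-z0-9 ]/', '', $title);
    return str_replace(' ', '_', $ascii);
}

// Option 3: swap accented letters for a nearest ASCII equivalent first,
// then apply the same stripping step to whatever is left.
function slugByNearestEquivalent(string $title): string
{
    $map = ['ï' => 'i', 'é' => 'e', 'ü' => 'u']; // illustrative subset only
    return slugByStripping(strtr($title, $map));
}
```

For 'Aït Ben Haddou', slugByStripping gives At_Ben_Haddou and slugByNearestEquivalent gives Ait_Ben_Haddou. Note that the same map turns 'ü' into 'u', which is exactly the German problem described in option 3.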

For me, as a Dutch person, the 3rd result 'looks' the best.
I'm quite sure, however, that (1) many people will have a different opinion and (2) it is just plain wrong in the German example.

Another problem with the 3rd option is: how do I find all possible characters that can be converted to a 7-bit equivalent?

So the question is:

  1. What, in your opinion, is the most desirable result (within technical limits)?

  2. How can it be solved technically (to reach the desired result) with PHP?


1 Reply


Ultimately, you're going to have to give up on the idea of "correct" for this problem. Translating the string, no matter how you do it, destroys accuracy in the name of compatibility and readability. All three options are equally compatible, but #1 and #2 suffer in terms of readability. So just run with it and go for whatever looks best: option #3.

Yes, the translations are wrong for German, but unless you start requiring your users to specify what language their titles are in (and restricting them to only one), you're not going to solve that problem without far more effort than it's worth. (For example, running each word in the title through dictionaries for each known language and translating that word's diacritics according to the rules of its language would work, but it's excessive.)

Alternatively, if German is a higher concern than other languages, make your translation always use the German version when one exists: ä→ae, ë→e, ï→i, ö→oe, ü→ue.
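
The German-first idea can be sketched as a lookup table passed to strtr. This is only a sketch: germanFirstTranslit is a made-up name, the uppercase letters and ß are additions, and the two non-German entries are assumed to be ë and ï.

```php
<?php
// German-first transliteration table: where German has its own convention
// (ä -> ae, ö -> oe, ü -> ue) it wins; other letters just lose the diacritic.
// Assumes $title is UTF-8 and this source file is saved as UTF-8.
function germanFirstTranslit(string $title): string
{
    $map = [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue', // added for illustration
        'ß' => 'ss',                           // added for illustration
        'ë' => 'e',  'ï' => 'i',               // assumed non-German entries
    ];
    // With an array argument, strtr replaces all keys in a single pass,
    // so already-replaced text is never translated a second time.
    return strtr($title, $map);
}
```

The trade-off is that non-German words containing these letters get the German spelling too, which is the accuracy loss accepted above.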

Edit:

Oh, and as for the actual method, I'd translate the special cases, if any, via str_replace, then use iconv for the rest:

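// German special cases first (assumes $text is UTF-8)
$text = str_replace(array("ä", "ö", "ü", "ß"), array("ae", "oe", "ue", "ss"), $text);
// then let iconv approximate the rest (the result depends on the locale)
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);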
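
Putting those two steps into a complete slug builder might look like the sketch below. makeSlug and galleryUrl are illustrative names, the underscore separator follows the /gallery/280-Gorges_du_Todra/ example from the question, and the fallback for a failed iconv call is an assumption rather than part of the answer.

```php
<?php
// Sketch of a slug builder around the str_replace + iconv steps.
function makeSlug(string $title): string
{
    // 1. Special cases that should not fall through to iconv's defaults.
    $text = str_replace(
        ['ä', 'ö', 'ü', 'ß'],
        ['ae', 'oe', 'ue', 'ss'],
        $title
    );

    // 2. Let iconv approximate the rest. The output depends on the iconv
    //    implementation and the current locale: characters it cannot map
    //    may come back as '?' or make the call return false.
    $ascii = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        $ascii = $text; // assumed fallback; step 3 drops what is left
    }

    // 3. Collapse each run of non-alphanumeric bytes into a single '_'.
    $slug = preg_replace('/[^A-Za-z0-9]+/', '_', $ascii);
    return trim($slug, '_');
}

// Builds the pretty URL in the format used in the question.
function galleryUrl(int $id, string $title): string
{
    return '/gallery/' . $id . '-' . makeSlug($title) . '/';
}
```

For example, galleryUrl(280, 'Gorges du Todra') yields /gallery/280-Gorges_du_Todra/. What comes out for characters handled only by iconv, such as 'ï', varies between systems, so it is worth checking on the target server.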
