Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
mysql - What's the difference between utf8_general_ci and utf8_unicode_ci?

Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?

  Asked by KahWee Teng (translated from Stack Overflow)

These two collations are both for the UTF-8 character encoding. The differences are in how text is sorted and compared.

Note: You should use utf8mb4 rather than utf8. They both refer to the UTF-8 encoding, but the older utf8 had a MySQL-specific limitation: it stores at most three bytes per character, so characters numbered above U+FFFF (those outside the Basic Multilingual Plane, such as most emoji) cannot be used.
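The three-byte limit can be sketched in Python. This is only a model of the rule, assuming the older utf8 charset accepts exactly the code points up to U+FFFF; the helper names `utf8_byte_lengths` and `fits_in_utf8mb3` are hypothetical and nothing here talks to MySQL.

```python
def utf8_byte_lengths(text):
    """Return how many bytes each character of text needs in UTF-8."""
    return [len(ch.encode("utf-8")) for ch in text]


def fits_in_utf8mb3(text):
    """Hypothetical check: True if every code point is at most U+FFFF,
    i.e. it can be encoded in three bytes or fewer."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(utf8_byte_lengths("a\u00df\u20ac"))   # 'a', 'ß', '€' need 1, 2 and 3 bytes
print(utf8_byte_lengths("\U0001F600"))      # an emoji needs 4 bytes
print(fits_in_utf8mb3("Stra\u00dfe"))       # only BMP characters
print(fits_in_utf8mb3("hi \U0001F600"))     # contains a 4-byte character
```

Any string for which the second helper returns False would need a utf8mb4 column to be stored intact.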

Note: Newer versions of MySQL have updated Unicode sorting rules, available under names such as utf8mb4_0900_ai_ci for equivalent rules based on Unicode 9.0 - and with no equivalent general variant.


Key differences

  • utf8mb4_unicode_ci is based on the official Unicode rules for universal sorting and comparison, which sort accurately in a wide range of languages.

  • utf8mb4_general_ci is a simplified set of sorting rules which aims to do as well as it can while taking many shortcuts designed to improve speed. It does not follow the Unicode rules and will result in undesirable sorting or comparison in some situations, such as when using particular languages or characters.

    On modern servers, this performance boost will be all but negligible. It was devised at a time when servers had a tiny fraction of the CPU performance of today's computers.

Note: There is now an updated version of utf8mb4_unicode_ci called utf8mb4_0900_ai_ci. It is based on changes in Unicode version 9.0, and is also apparently faster. It adopts a new naming scheme whereby 0900 is the Unicode version and ai means accent-insensitive: as with the previous utf8mb4_unicode_ci, accents in letters are not considered significant.

Benefits of utf8mb4_unicode_ci over utf8mb4_general_ci

utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

As far as Latin (i.e. "European") languages go, there is not much difference between the Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL, but there are still a few differences:

  • For example, the Unicode collation sorts "ß" like "ss", and "Œ" like "OE", as people using those characters would normally want, whereas utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).

  • Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and the comparison should move on to the next character instead. utf8mb4_unicode_ci handles these properly.
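The two differences above can be illustrated with a toy model in Python. The mapping tables, the choice of the soft hyphen as the ignorable character, and the function names are all assumptions made for illustration; MySQL's real collations use full weight tables, and how utf8mb4_general_ci treats any given ignorable character is not modelled here.

```python
# Toy tables for illustration only; assumptions, not MySQL's real weights.
EXPANSIONS = {"\u00df": "ss", "\u0152": "OE", "\u0153": "oe"}  # ß, Œ, œ compare as two letters
SINGLE = {"\u00df": "s", "\u0152": "E", "\u0153": "e"}         # ß, Œ, œ compare as one letter
IGNORABLE = {"\u00ad"}                                         # soft hyphen


def unicode_like_key(text):
    """Comparison key that expands characters and skips ignorable ones."""
    parts = []
    for ch in text:
        if ch in IGNORABLE:
            continue  # move on to the next character
        parts.append(EXPANSIONS.get(ch, ch))
    return "".join(parts).lower()


def general_like_key(text):
    """Simplified key: one character in, one character out, nothing skipped."""
    return "".join(SINGLE.get(ch, ch) for ch in text).lower()


print(unicode_like_key("Stra\u00dfe") == unicode_like_key("STRASSE"))
print(general_like_key("Stra\u00dfe") == general_like_key("STRASSE"))
print(unicode_like_key("co\u00adoperate") == unicode_like_key("cooperate"))
```

Under these assumed tables the expanding key treats "Straße" and "STRASSE" as equal, while the one-to-one key treats "Straße" like "Strase" instead.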

In non-Latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci will depend heavily on the language used. For some languages, it'll be quite inadequate.

What should you use?

There is almost certainly no reason to use utf8mb4_general_ci anymore, as we have left behind the point where CPU speed was low enough for the performance difference to matter. Your database will almost certainly be limited by bottlenecks other than this.

In the past, some people recommended using utf8mb4_general_ci except when accurate sorting was going to be important enough to justify the performance cost. Today, that performance cost has all but disappeared, and developers are treating internationalization more seriously.

There's an argument to be made that if speed is more important to you than accuracy, you may as well not do any sorting at all. It's trivial to make an algorithm faster if you do not need it to be accurate. So, utf8mb4_general_ci is a compromise that's probably not needed for speed reasons and probably also not suitable for accuracy reasons.

One other thing I'll add is that even if you know your application only supports the English language, it may still need to deal with people's names, which can often contain characters used in other languages in which it is just as important to sort correctly. Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly.

What the parts mean

Firstly, ci is for case-insensitive sorting and comparison. This means it's suitable for textual data where case is not important. The other types of collation are cs (case-sensitive), for textual data where case is important, and bin, for where the encoding needs to match bit for bit, which is suitable for fields that really hold encoded binary data (including, for example, Base64).

Case-sensitive sorting leads to some weird results, and case-sensitive comparison can result in duplicate values differing only in letter case, so case-sensitive collations are falling out of favor for textual data. If case is significant to you, then otherwise ignorable punctuation and so on is probably also significant, and a binary collation might be more appropriate.
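The duplicate-value problem can be sketched in Python, whose default string comparison is exact and therefore case-sensitive. Using `str.casefold` as a stand-in for a ci collation is an assumption for illustration, and the helper name `case_duplicates` is hypothetical.

```python
from collections import defaultdict


def case_duplicates(values):
    """Group values that differ only in letter case."""
    groups = defaultdict(list)
    for value in values:
        groups[value.casefold()].append(value)
    return [group for group in groups.values() if len(group) > 1]


names = ["Alice", "alice", "Bob", "ALICE"]
print(len(set(names)))         # number of distinct values under exact comparison
print(case_duplicates(names))  # values a case-insensitive comparison would merge
```

A uniqueness check based on exact comparison accepts all four names, while the case-insensitive grouping shows that three of them are the same name.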

Next, unicode or general refers to the specific sorting and comparison rules, in particular the way text is normalized or compared. There are many different sets of rules for the utf8mb4 character encoding, with unicode and general being two that attempt to work well in all possible languages rather than one specific one. The differences between these two sets of rules are the subject of this answer.

Note that unicode uses rules from Unicode 4.0. Recent versions of MySQL add the rulesets unicode_520, using rules from Unicode 5.2, and 0900 (dropping the "unicode_" part), using rules from Unicode 9.0.

And lastly, utf8mb4 is of course the character encoding used internally. In this answer I'm talking only about Unicode-based encodings.
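The naming scheme described in this section can be sketched as a small Python parser. The function name `parse_collation` and the returned dictionary layout are hypothetical; the suffix meanings come from the text, except "as" (accent-sensitive), which the text does not mention and is included here as an assumption.

```python
# Suffix meanings from the text, plus "as", which is an addition not in the text.
SUFFIXES = {
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "ai": "accent-insensitive",
    "as": "accent-sensitive",
    "bin": "binary",
}


def parse_collation(name):
    """Split a collation name into charset, ruleset and sensitivity flags."""
    charset, *rest = name.split("_")
    flags = []
    # Peel known suffixes off the end; whatever remains names the ruleset.
    while rest and rest[-1] in SUFFIXES:
        flags.insert(0, rest.pop())
    return {
        "charset": charset,
        "rules": "_".join(rest),
        "flags": [SUFFIXES[flag] for flag in flags],
    }


for name in ("utf8mb4_unicode_ci", "utf8mb4_general_ci",
             "utf8mb4_unicode_520_ci", "utf8mb4_0900_ai_ci", "utf8mb4_bin"):
    print(name, parse_collation(name))
```

Peeling suffixes from the right keeps multi-part rulesets such as unicode_520 intact, and a bin collation ends up with an empty ruleset.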

