Sept 2021 edit:
I've been using MySQL 8.0 for a couple years now, so here's some updated info.
The MySQL manual now has a very informative page regarding conversion between utf8mb3
(currently also known as utf8
) and utf8mb4
. utf8mb3
is deprecated and will be removed eventually; and when it is removed, its current alias, utf8
, will refer to utf8mb4
instead.
With deprecated utf8mb3
, you can store up to 255 characters in an index, while with utf8mb4
, up to 191, when using COMPACT
or REDUNDANT
row format.
With COMPRESSED
or DYNAMIC
row format, index key prefixes can be up to 3072 bytes. With them, you can index up to 1024 characters for utf8mb3
, and 768 characters for utf8mb4
.
Below is my previous answer, which explains some of the logic behind the number of characters you can index versus the number of bytes.
I must revise my answer due to my research. I originally posted this (quoting myself):
I believe the answer is that you cannot know how many characters will
be in the index because you cannot know how many bytes your characters
will be (unless you do something to exclude multi-byte characters).
And I'm not sure, but it might still be correct, but not in quite the way I was thinking.
Here is the correct answer:
MySQL assumes 3 bytes per utf8 character. 255 characters is the maximum index size you can specify per column, because 256x3=768, which breaks the 767 byte limit.
If you don't specify index size, MySQL chooses the maximum size (i.e. 255 per column).
A UNIQUE constraint cannot be put on a utf8 column whose length is greater than 255, because a unique index must contain the entire cell value. But a regular index can be used - it will just index the first 255 characters (or first 767 bytes?). And that is where there is still some mystery for me.
The MySTERY:
I can see why MySQL assumes 3 bytes per character, for safety, because otherwise the UNIQUE constraint could be broken. But the docs seem to suggest that the index is actually sized in bytes, not characters. So, suppose you put a 255 char (765 byte) index on a varchar(256) column. If the characters you store are all ASCII, 1-byte characters, like A-Z, a-z, 0-9, then then you can fit the entire column into the 767 byte index. And it seems like that is what would actually happen.
Below is some more information from my original answer about characters, bytes, etc.
According to wikipedia, UTF-8 character can be 1,2, 3, or 4 bytes long.
But, according to this mysql documentation, the maximium character size is 3 bytes, and so any column index index over 255 characters might hit that byte limit. But as I understand it, it might not. If most of your characters are in the ASCII range, then your average character size will be closer to 1 byte. If your average character size is, for example, 1.3 bytes (mostly 1 byte, but a significant number of 2-3 byte characters), then you could specify an index of 767/1.3
So, if you are storing mostly 1-byte characters, your actual character limit would be more like:
767 / 1.3 = 590. But it turns out that is not the way it works. 255 characters is the limit.
As mentioned in this MySQL documentation,
Prefix limits are measured in bytes, whereas the prefix length in
CREATE INDEX statements is interpreted as number of characters for
nonbinary data types (CHAR, VARCHAR, TEXT). Take this into account
when specifying a prefix length for a column that uses a multi-byte
character set.
It seems that MySQL is advising people to do a calculation/guestimation like I just did in order to determine your key size for a varchar column. But in fact you cannot specify an index larger than 255 for utf8 columns.
Finally, if you refer back to my second link again, there is also this:
When the innodb_large_prefix configuration option is enabled, this
length limit is raised to 3072 bytes, for InnoDB tables that use the
DYNAMIC and COMPRESSED row formats.
So it seems like you can get much larger indexes if you want to, with a bit of tweaking. Just make sure the row formats are DYNAMIC or COMPRESSED. You can probably specify an index of 1023 or 1024 characters in that case.
By the way, it turns out that you can store 4-byte characters using [the utf8mb4 character set][4].
The utf8 character set apparently stores only ["plane 0" characters][5].
EDIT:
I just tried to create a composite index on a varchar(511) column with a tinyint(1) column and got the error message saying the max index size was 767 bytes. This makes me believe that MySQL assumes utf8 character set columns will contain 3 bytes per character (the maximum), and allows you to use 255 chars max. But perhaps that is only with composite indexes. I will update my answer as I find out more. But for now I'm leaving this as an edit.