February 21, 2015

MySQL, UTF8 and Indexing

See some pitfalls of using MySQL with UTF-8 when indexing your database.

When testing an application in MySQL 5.6, I came across a few interesting issues. These weren't necessarily changes in MySQL between version 5.5 and 5.6, but rather the packages I used to install MySQL 5.6.

The Situation

I usually use Ubuntu LTS server releases. That means that I'm currently using Ubuntu 14.04. Within this, I often grab repositories to grab the latest stable versions of software, such as PHP 5.6, MySQL 5.6, Nginx 1.6 and so on.

The PPA for MySQL 5.6 (ppa:ondrej/mysql-5.6) seemed to have a particular default on, which the regular old mysql-server-5.6 package did not.

The issue was that MySQL's strict mode was turned on when I installed MySQL via the PPA. This caused what used to be ignorable warnings into actual errors.

The particular issue I came across was having an index that was (potentially) too large. MySQL knew that indexing a VARCHAR(255) field could go over it's maximum index length of 767 bytes.

Not that I defend the existence of said warnings. Finding these was definitely an opportunity to right some passed wrongs!

Defaults

Most of us MySQL users are likely used to strict mode being off by default. In fact, a legacy application I work on has been chugging along nicely for the passed 10 years with this default never causing an issue.

Even if strict mode is on, we may not hit this particular issue because MySQL defaults to the Latin-1 character set.

The Latin-1 character set takes 1 byte per character, and so a VARCHAR(255) column would never hit the 767 byte limit on indexes. We'd only use up to 255 bytes.

A UTF-8 Pitfall

As mentioned, MySQL still defaults to the Latin-1 (aka ISO-8859-1 aka cp1252 aka wtfbbqsauce) character set.

However, I had the need to use UTF-8, thus supporting characters of various Eurasian alphabets.

UTF-8 characters may take from 1-4 bytes each. If we index 255 characters, that's a possible max of 1020 bytes used, which is obviously higher than 767 index byte limit.

MySQL would normally let this slide by silently truncating the index to 191 characters, for a possible total of 764 bytes used.

It would raise a warning, but not an error.

Sidenote

MySQL's UTF-8 character set actually only supports 3-byte characters. This would lead to a maximum of 765 byte indexes, which as you can see is just below the 767 byte limit.

However, as of MySQL 5.5.3+, you can use the utf8mb4 character set and utf8mb4_general_ci collation. This introduced support for 4-byte characters, and thus the full set of UTF-8 characters.

This is when we hit our error, since the new maximum index length on a VARCHAR(255) field becomes 1020 bytes.

The Issue:

In my installation of MySQL 5.6, the sql_mode was set to STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION.

mysql> show variables like 'sql_mode';
+---------------+--------------------------------------------+
| Variable_name | Value                                      |
+---------------+--------------------------------------------+
| sql_mode      | STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION |
+---------------+--------------------------------------------+
1 row in set (0.00 sec)

The addition of STRICT_TRANS_TABLES puts MySQL into strict mode when you use transaction-supported database engines. This includes our InnoDB tables.

With strict mode on, we get a nasty error when creating indexes on VARCHAR(255) fields.

There are three solutions to this:

Easiest Solution: Disable Strict Mode

Strict mode is enabled if you use the setting "STRICT_ALL_TABLES" or "STRICT_TRANS_TABLES". You can turn these off either in the my.cnf file or within each SQL Session with a query:

SET SESSION sql_mode = NO_ENGINE_SUBSTITUTION;

Here we're omitting "STRICT_TRANS_TABLES", which disables it. The default for me on MySQL 5.6 was as follows:

mysql> show variables like 'sql_mode';
+---------------+--------------------------------------------+
| Variable_name | Value                                      |
+---------------+--------------------------------------------+
| sql_mode      | STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION |
+---------------+--------------------------------------------+
1 row in set (0.00 sec)

Resetting the sql_mode per session (per connection made in the application) yields the following:

mysql> show variables like 'sql_mode';
+---------------+--------------------------------------------+
| Variable_name | Value                                      |
+---------------+--------------------------------------------+
| sql_mode      | NO_ENGINE_SUBSTITUTION                     |
+---------------+--------------------------------------------+
1 row in set (0.00 sec)

This is the easiest solution as it produces only warnings, which do not halt execution. This will silently truncate the index length to the 191 characters maximum.

While this is easy, and we could technically turn off strict mode on each database session in code, I was looking for a solution that would handle more use cases. This feels a bit like throwing the problem under a rug.

Medium Solution: Don't use VARCHAR(255)

You could opt to not use the full 255 characters available with a VARCHAR field.

You could instead use up to 191 characters, which leaves us just below the 767 byte limit.

CREATE TABLE users (
    username VARCHAR(60) NOT NULL,
    password VARCHAR(191) NULL NULL
) ENGINE=innodb DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

This may be an issue if you already have a production database with data that you cannot risk truncating, or if business requirements dictate needing more than 191 characters available in the varchar fields.

We had production data that couldn't get truncated, making this option a non-starter.

Hardest Solution: Change Storage, Enable Compression

This option is the most heavy-handed. It should not be done on databases which already have data, as it changes how InnoDB data is stored on the file system, setting off all sorts of alarms within MySQL after restarting it.

Note that the described settings here are the default on MySQL 5.7, which also comes with Strict Mode on.

With this option, we'll:

  1. Change the InnoDB file format to "Barracuda" which supports data compression
  2. Enable "large_prefix"
  3. Use dynamic row formatting

Within my.cnf you can add the following to accomplish the first two:

[mysqld]
innodb_large_prefix = 1
innodb_file_format  = Barracuda

In conjunction to this, when you create your tables, you need to set ROW_FORMAT=DYNAMIC along with your storage engine, character set and other settings.

For example:

CREATE TABLE users (
    username VARCHAR(255) NOT NULL,
    password VARCHAR(255) NULL NULL
) ENGINE=innodb DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci ROW_FORMAT=DYNAMIC

That's a lot of stuff to deal with just to get a complete index on a standard field like VARCHAR(255), but it works!

This has the nasty side-affect of creating a dependency between your application and database configuration. Such a setup can become prone to error when moving around different environments.

This is something I needed to avoid, since I'm working on an application installed on customer infrastructure outside of our control.

The more configuration needed to be done, the fewer hosting options customers have, and the more customer support we would take on supporting questions about this issue.

"Best" Solution: Use Smaller Indexes

This solution fixes the underlying issue and allows us to avoid issues caused by any customer's particular configuration of MySQL.

On VARCHAR fields (this hasn't been necessary on INT fields, and generally only FULLTEXT indexes are created on TEXT fields), we have the option to use smaller indexes; Up to the first 191 characters.

-- Index first 20 characters of the username column
ALTER TABLE users ADD INDEX username_index_name (username(20))

This lets you use the full 255 characters at the "expense" of index length, rather than getting the full index length at the expense of the varchar field length.

Depending on your use case, this solution can provide a speed boost to MySQL, as we're not storing as much information in the indexes (making it faster for MySQL to search the index). It definitely will remove warnings and errors about the index length when using the utf8mb4 character set.

This also makes an application less susceptible to issues to using within different/unknown MySQL environments.

Because this could be managed within our code, not create an issue where customers needed to do MySQL configuration changes, and wouldn't risk truncating "legacy" customer data, this was our most viable solution.

There's another story here about editing framework code to support variable-length indexes. It isn't often built into database abstraction libraries, since only MySQL supports it. That's another article, however.

Errata

Here's the bug report outlining this "issue" and some possible solutions, from which the "Hardest Solution" above came from.


Interestingly, of the 4 popular databases (SqlServer, MySQL, PostgreSQL, SQLite), MySQL is the only one to support variable index lengths when creating indexes.

This essentially means you can create "prefix" indexes, where you only index the first n characters of a field.


I couldn't find any init script or configuration that defined the difference between the PPA repository and the default MySQL 5.6 package. Perhaps the difference is in compilation.

I haven't dug any deeper on that point, so it's still a mystery to me. Both servers were Ubuntu Trusty 14.04.1 TLS 64 bit.


MySQL 5.6 still has a default character set of Latin1, but frameworks like Laravel use UTF-8 explicitly. You might run into this issue whether you're aware of it or not!


Fun fact: MariaDB is now the default available database in CentOS 7 and RedHat 7.


Strict mode will also raise errors when you do things like try to add a default value on a TEXT field. Luckily that is a less-destructive issue, as MySQL can handle adding a null or empty string value into a TEXT field, even when it is defined with NOT NULL. Go figure.

Side note, If you're interested in a service to help you manage MySQL-optimized, backup and (eventually) replication-enabled database servers, sign up here to let me know! The idea is to allow you to better manage your MySQL servers, taking advantage of many of MySQL's more advanced options, especially around backup and recovery.

All Topics