I Never Did Like Latin. From MySQL Latin-1 to Utf8mb4

I took Latin in High School. Never liked it much. The only thing I remember is stultus asinus, or “stupid donkey.” Well, the stupid donkey was me when I realized I didn't have my MySQL table properly configured to handle Unicode Encoding.

We just put the wraps on a new app that migrates WordPress Blog data to NixMash Spring. The processing mostly takes place between databases, but we use the WordPress REST API to import rendered WordPress post content in the NixMash Spring posts.post_content field.

This should not normally be a problem, but WordPress likes to use non-Latin HTML 4.0 special characters like &#8217; for RIGHT SINGLE QUOTATION MARK instead of a simple &#39; quote symbol. (It also strips <P/> tags, which is why we're using the REST API postContent().rendered() in the first place.) When that HTML content was saved to the database, this happened.

Keywords: MysqlDataTruncation, Hexadecimal

Latin-1 Legacy

We'll start with the initial configuration of the MySQL posts table, database and server, where Latin1 Character Sets were causing problems.

First the database and server where we see a bit of a mix-mash between utf8 and latin1.

Now let's look at the actual posts.post_content column. Yeah, I know. Let's just say this table and database was created a long time ago.

For a more complete picture of the issue at hand, here's the output of SHOW FULL COLUMNS.

UTF8MB4

We can change Character Set and Collation at the server level, the database, the table or even the column level. Here are some approaches for database, table or column.

We want to be thorough and want to not deal with Latin-1 Character Set handling in the future, so we're updating the MariaDB Server Configuration in /etc/mysql/mariadb.conf.d/50-server.cnf as well.

Updated Character Set and Collation Settings

Let's look at the final result. Here's our SHOW FULL COLUMNS output. I should mention that the HTML migration worked with only the post_content Character Set changed to UTF8MB4, but we wanted to be thorough.

The updated Server and Database Character Set variables.

Takeaways

I did a lot of reading about Unicode, MySQL Character Sets and Collation since this HTML import exception happened. Here are some good takeaway bits and links.

An interesting post from Nick Johnson on the WordPress Apostrophe issue.

Wikipedia's list of XML and HTML Character Entities.

“Unicode is certainly difficult, and the UTF-8 encoding has a couple of inconvenient properties. However, UTF-8 has become the defacto standard encoding on the web, surpassing ASCII, Latin-1, UCS-2 and UTF-16. Just use UTF-8 everywhere. The most important reason why you should support Unicode is that you shouldn't make unnecessary assumptions about user input…So not supporting other scripts isn't just a big f*ck you to other cultures, but sticking to Latin-1 doesn't even allow you to write proper English.”

“Latin-1 is a character set, UTF-8 is a character encoding. Comparing the two doesn't even make sense. Those are two completely different things. UTF-8 is in fact a valid encoding of Latin-1.”

–two responses to a Software Engineering StackExchange Question

“UTF-8 is prepared for world domination, Latin1 isn't.” –a StackOverflow Answer

Getting out of MySQL Character Set Hell. A very long post from 2009 but very much still worth the read.

From the previous “Character Set Hell” post, what to do before screwing with your Server Character Set and Collation Settings.

# service mysql stop
# cd /var/lib
# tar zcvf mysql-backup.tar.gz mysql
# service mysql start

How do I see what character set a MySQL database / table / column is?
— StackOverflow

Differences between utf8_general_ci and utf8_unicode_ci. Includes links to their collation symbol charts. From a MySQL Forums response.

“The Unicode Standard includes characters from the Basic Multilingual Plane (BMP) and supplementary characters that lie outside the BMP.  Utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length. For a supplementary character, utf8 cannot store the character at all, whereas utf8mb4 requires four bytes to store it.”  —MySQL Unicode Documentation

Posted January 26, 2017 04:16 PM EST

More Like This Post