A story of how happy faces can ruin your software

June 24, 2019

Have you ever heard of emojis: 😊? I guess you did, they're everywhere and people use them to express feelings, ideas or just as a way to enrich the language. Well that, and also they can also 💩 your software.

If you're reading this, you're probably aware that mysql utf8 support is not what's supposed to be (Mysql "utf8" encoding only supports three bytes per character. The real UTF-8 encoding — which everybody uses — needs up to four bytes per character). Also, Mysql already has a section trying to disambiguate this utf8 misunderstanding: Unicode support, while also released a new encoding of their own: utf8mb4. So I'm not going to go deeper into this.

You should already know what is this post about: supporting emojis in mysql 🤯. Even though there's a solution with some minor caveats (read more here), sometimes it's not so easy (or feasible) to achieve in a production environment. Especially if you've been running your mysql instance for a while and it has now way too much information.

This is the story of how we coped with this issue, the lessons we've learned and a workaround that may help you.

Acknowledging the problem and first attempt to solve it

The story starts with a well known error message (exhibits 1, 2, 3): Incorrect string value: '\xF0\x9F\x8D\xB0' for column 'content' at row 1. As it was a recurring issue happening very often (did I mention people love emojis?), we decided to invest some time and tackle it down, after several months of living with it, the time of emojis had come 🥊.

After a couple of Google searches, all the links listed above showed up. We found what we had to do, migrate our database or in the best case those columns having the problem as explained here.

After identifying the column hading the issue and confirming it was configured as utf8, we went to our DBA and explained what we were planning to do... and here's when it starts getting complicated. He explained to us that we had a tool called Percona to perform the DB transformations, besides having the chance of performing the change directly against production. We performed a couple of tests, and using Percona would require us a downtime of almost a day, while doing the alter directly to the DB would reduce the downtime to 8 hours.

Both options were obviously not possible.

Looking for alternatives

First option

Let's disallow emojis 😈! We went to our product team with the request to change client apps to avoid sending emojis because they're causing errors on the client app and the way we tell customers there was an error does not help them to fix it. They came back with the following answer: "Why not fixing it? It would be such a great feature we've been requesting for a while".

I was like: 🤦‍, but we knew it would turn that way.

Second alternative

Knowing that running a migration was not possible, we came out with the idea of creating a new column with encoding utf8mb4 and modifying the entity reading from that table to read either from one column or the other, but writing always to the new column. Something like this:

@Entity
public class SomeEntity {
    private String content;
    private String newContent;

    public String getContent() {
        if (this.newContent == null) {
            return this.content;
        } else {
            return this.newContent;
        }
    }

    public void setContent(String content) {
        this.newContent = content;
    }
}

Also, we'd have to create a background task to gradually migrate from one column to the other and remove the hacky code after the task finishes.

Great, we had a plan, but... what if that's not the only column you need to migrate? Well, you'd have to make the same process for every column, making the solution not so scalable. Can you guess how many columns we would have to migrate? After a quick search through the logs, we've found 27. It wouldn't work out.

The solution ends up being almost good, but you'll end up with mixed utf8 and utf8mb4, which can be a problem if you have to mix them in a query (more on this later).

Third alternative

We decided to look around other options. We knew that encoding characters is a well-known way of representing characters with limited character sets (URL encodes any non-ASCII, HTML uses entities to represent some characters, base64 is used to represent anything with ASCII characters). We could use that technique to encode the string before sending it to the database.

What are the drawbacks of this solution? First, it increases the storage size. If you have a column limited to 255 characters, then the maximum size you'll be able to store is less. Second, depending on the encoding algorithm of choice, you won't have readable columns, which can make it hard for other clients (if there are any accessing your DB without using your software layer).

To minimize those issues, we've decided to only encode the characters outside the Basic Multilingual Plane (BMP) which are not supported by mysql utf8 charset. Doing that, space will only be an issue for people introducing emojis (only a subset of the current customer will have the issue), and the only thing that won't be readable in the DB will be those emojis (which already are not readable because they're not there 🤷).

For us it was a fair option, we only had to come up with a way of easily applying the solution to the 27 different columns without a lot of effort. Luckily we've found a similar use case: encryption, which needs to encrypt a column before storing it into the database and decrypt it on retrieval. Googling around we found that JPA converter could be used to achieve that and it was easy to do something similar.

We talked to some other engineers, security guys, our DBA, and nobody was fully convinced on the solution (I reckon, it's kinda hacky), but it was the only viable solution we had and they gave us green light to implement it.

Implementation

Well, implementation wise, it wasn't that hard. Creating a converter is straight forward, and applying it to every single column was just a line of code (in 27 different places for us). We did add a toggle to disable the feature in case of need (more on this later).

After we made the change, the fix was not working (it couldn't be that easy, right?). The main suspect was the connection string, and after searching for a while (thanks Google again, this is your 3rd mention here), We've found this:

For Connector/J 5.1.46 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for character encoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3.

We were using Connector/J 5.1.45, so we got the ok to migrate to 5.1.47, and voilá, we had emojis 🎉!!

This is what our entities look like:

// import org.apache.commons.lang3.text.translate.*
@Converter
public class Utf8ToMysqlUtf8Encoder 
    implements AttributeConverter<String, String> {
    // BMP goes from 0000 to FFFF ref: https://en.wikipedia.org/wiki/Plane_(Unicode)
    // we decided to mimic apache commons html https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html#escapeHtml(java.io.Writer,%20java.lang.String)
    private static final CharSequenceTranslator UTF8_NON_BMP = new LookupTranslator(EntityArrays.BASIC_ESCAPE())
            .with(NumericEntityEscaper.outsideOf(0, 0xFFFF));
    private static final CharSequenceTranslator DECODER = new LookupTranslator(EntityArrays.BASIC_UNESCAPE())
            .with(new NumericEntityUnescaper());
    
    @Override
    public String convertToDatabaseColumn(String attribute) {
        return UTF8_NON_BMP.translate(attribute);
    }

    @Override
    public String convertToEntityAttribute(String dbData) {
        return DECODER.translate(dbData);
    }
}

@Entity
public class SomeEntity {
    @Converter(Utf8ToMysqlUtf8Encoder.java)
    private String content;

    public String getContent() {
        return this.content;
    }

    public void setContent(String content) {
        this.content = content;
    }
}

Some other learnings and challenges

What happens if you mix a query with utf8 and utf8mb4 columns?

Well, you shouldn't mix encodings. This happened to us in prod while we were trying to migrate partially to utf8mb4, but we came up with a way of reproducing it:

# Run mysql in a docker container as daemon
$ docker run --network=host -e MYSQL_ROOT_PASSWORD=root -d mysql:latest
# Open a connection and setup the test database:
$ docker run -it --network host --rm mysql mysql -h127.0.0.1 -uroot -p
mysql> create database test;
mysql> use test;
mysql> CREATE TABLE test (
  id bigint(20) NOT NULL AUTO_INCREMENT,
  content VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci,
  content_new VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  CONSTRAINT PK PRIMARY KEY (id)
); 
mysql> INSERT INTO test(content, content_new) VALUES ('AA', 'BB');
mysql> SET NAMES utf8mb4;
mysql> quit 
# Now back to bash (as mysql CLI resisted to allow emojis)
$ docker run -it --network host --rm mysql mysql -h127.0.0.1 -uroot -p -e 'use test; set names utf8mb4; select * from test where content_new != "😅";'
+----+---------+-------------+
| id | content | content_new |
+----+---------+-------------+
|  1 | AA      | BB          |
+----+---------+-------------+
# But running the same query agains the utf8 column:
$ docker run -it --network host --rm mysql mysql -h127.0.0.1 -uroot -p -e 'use test; set names utf8mb4; select * from test where content != "😅";'
ERROR 1267 (HY000) at line 1: Illegal mix of collations (utf8_unicode_ci,IMPLICIT) and (utf8mb4_0900_ai_ci,COERCIBLE) for operation '<>'

So, it might get complicated if you start mixing up character sets. Either migrate everything to utf8mb4 (if you can), or opt for a workaround as we did.

How to disable the Converter

We wanted to control the feature not relying on the release process, so we had to inject a bean dependency which allows us to control the behavior based on some external configuration.

This was another challenge because you can't inject a dependency into a converter unless you're using JPA 2.2, Spring 5.1 and Hibernate 5.3.0, which was not our case, so we ended up using the mutable static property hack proposed here

Final thoughts

No solution is always as easy as it looks
Never setback, be stubborn and you'll find your way out
Ask your teammates, the earlier you talk to them, the sooner you'll get feedback about your proposal