Short Ids

January 05, 2018 uuids

I recently encountered the following need. It had to pass a resource identifier by url, which is registered in a database (along with other data) and has a descriptive name (which must be unique), associated that is specified by a human. In principle I came up with 2 ways to identify the object:

Use an auto-incremental value of the database as a PK and add a restriction so that the name is not repeated, plus an index.
Use a generated random value (for example the UUIDs that are in fashion), plus the restriction and the index
Use the name as PK

Of the first option I do not like the fact of passing a number autoincremental in the url, which allows anyone to easily modify it and guess another resource for its id.

The second option is quite used, only that the UUID in its representation as string occupy 36 characters (32 if we remove the separators), which seems too long for an identifier. But if they have the advantage that they are URL friendly (that is, they do not need to be urlencode), and it is also unlikely that a user will have access to another resource (unless we have 2.71 quintillion resources, in which case the probability will be of 50%: more on the subject).

The third option involves making the urlencode of the name, it is not easy to deduct another resource and with respect to the length it could be a problem. But I think the counter is to reveal what is the resource: if we were identifying a company, we would be revealing the name of the company, and perhaps this is not desirable (for example if we are in a bidding system) .

The case of UUIDs can be solved using a base64 encoding instead of base16. The base64 encoding represents 6 bits per character and its alphabet allows you to use them in a URL. By representing 6 bits per character instead of 4, we go from 32 characters (128 bits / 4) to 22 characters (128/6, rounded). The issue is that 22 characters still seems a lot for an identifier, although it is still a good solution because using the 8 bits that represent each byte we would have a minimum of 16 characters, but they would not be URL friendly.

If the option of the UUIDs is useful, maybe you should take a look at http://hashids.org or you can also see some tests I did to represent the UUIDs in base 64: https://github.com/gastonfournier/short-ids and the small library https://github.com/gastonfournier/suuid

In my case, I wanted the id to have a correlation with the input data, in order to obtain the PK without accessing the base and thus not require adding another index. The function does not necessarily have to be reversible. What is the advantage of this? In principle not much, but suppose we have an index of some of our data among which is the identifier (an Elasticsearch for example), and we want to look for a data in the index, but we only have the name of the resource that we want to search. Without having to access the database, I could calculate the PK that would help me find the data in the index. If I had used the UUIDs, I would have needed to access the name of the resource to get the id and then search the index. As I said, you do not earn much... maybe it's a very particular case, but it's not more expensive to do it this way:

I take the name of the resource
I calculate the hashCode of the name (this gives an Integer with a low probability of collisions)
I convert the integer to binary (in Java an integer is represented by 32 bits, this is 1/4 of what the UUIDs occupy)
Use the algorithm to pass from binary to base64 (obtaining a string of 6 characters long: 32/6)