Ever wondered why sites like YouTube, Reddit or ImgUr use these 'weird' IDs that seem to be made up of a number of random letters and numbers? Well, they pretty much are! In this blog post we’re going to dive a bit into how they work, what the reasoning is behind them, and why you often don’t want to just use a database that creates incremental IDs.
If you’re not interested in the explanation and just want to see the code; here it is!
Introduction
So let’s start with the last section first. Why not simply use your database’s capability for auto incrementing IDs? Well; for a lot of applications they are definitely the best option. They’re small (typically 32 or 64 bit integers) which is beneficial to keeping your primary index small. Since they auto-increment, you can easily sort from old to new without needing an additional column.
But there are also downsides. First of all there is always the danger of an enumeration attack, especially (but not only!) in the case of user accounts. While you should of course be securing your application against this anyway, it’s easy to make mistakes. By making an ID easy to guess, such a mistake could easily let an attacker enumerate over things they are not allowed to see, simply by incrementing an ID.
Secondly; it’s often just not possible to have a central database to hand out IDs in large distributed system. Such a database could easily become a massive bottleneck when it’s being called hundreds or thousands of times per second.
So this often (but definitely not always) leads to the choice to have externally exposed IDs be large random numbers.
Universally Unique Identifiers
The most common form of these large random numbers are Universally/Globally Unique IDentifiers, commonly abbreviated to UUIDs or GUIDs.
A UUID is just a 128 bit integer, often formatted in hexadecimal in groups separated by a dash. For example; 123e4567-e89b-12d3-a456-426614174000. If you remove the dashes and count the length, it’s 32 characters long. Or; 16 bytes, or 128 bits. So there is nothing magical about it, a UUID with value zero would be formatted as 00000000-0000-0000-0000-000000000000
.
Or well, there is a little bit of 'magic': there are different versions of UUIDs. The older ones tried to guarantee uniqueness by incorporating things like in the case of version 1 and 2 the MAC address of a network interface. But since these can easily be altered nowadays, these versions fell out of fashion and today Version 4 is mostly used.
Version 4 UUDs are made up of 4 bits indicating the version, and then 2 or 3 bits that indicate the variant. The vast majority of UUIDs are version 4 variant 1, which leaves 122 bits available for the large random number. 2 to the power of 122 is a ridiculously large number, 5.3 undecillion. Or a 5 with 36 zeroes behind it. So a UUID is simply so large that it will take well past the heat death of the universe to generate a meaningful collision.
But how do you generate one? Well, the implementation in Java of UUID is really simple. Considering we typically use UUID.randomUUID()
we can check the sourcecode to see the implementation. It uses a SecureRandom instance to generate the 16 bytes of the UUID. It then sets the correct version and variant bits.
So; we have a simple method call in Java (UUID.randomUUID()
) we can use to generate industry standard IDs that won’t collide. Do we need anything else? Well, in most cases you don’t. The only downside is that they don’t look very nice. And that’s why sites often use alternatives!
Nice Random IDs
We now know the entropy (randomness) of a UUID is `2^122`, or roughly 5.3e+36. We know that we just need to use SecureRandom to generate a 'nice' looking ID with the same entropy.
What we’re going to do next is, instead of using random bits we then have to encode in hexadecimal, using random characters we can use directly and won’t need additional encoding. For this we’re going to use the same set of characters that Base64 uses: 0 to 9, a to z, A to Z, dash and underscore. These are the 64 (10 + 26 + 26 + 2) characters that are the 64 in base 64.
But how many characters do we need? We need to find the power of 64 that is at least as large as `2^122`. Well we can throw some math at it! To find the exponent we can use logarithms.
`x = log(2^122) -: log(64)`
So:
`x = 20.3`
So we need 20.3 characters to have an ID that is just as 'random' as a UUID. Since we can’t have a String with 20.3 characters, we’re going to round up to 21. So with 21 characters and 64 possible characters, our entropy is 64^21 = 8.5e+37. To do the calculation in Java:
var characters = (int)Math.ceil(
Math.log(Math.pow(2, 122)) / Math.log(64)
);
So, all that we need to do is create a class that sets up a list of characters we use, and then use that list to generate a set of characters.
Java Implementation
So first we create an array of characters with all characters we can randomly pick from:
for(int j = 0;j < 10;j++) {
DICT[i++] = (char)('0' + j);
}
for(int j = 0;j < 26;j++) {
DICT[i++] = (char)('A' + j);
DICT[i++] = (char)('a' + j);
}
DICT[i++] = '-';
DICT[i] = '_';
And then all there’s left to do is to use a SecureRandom instance to append random characters from this list to a StringBuilder:
var builder = new StringBuilder(length);
for(var i = 0;i < length;i++) {
builder.append(DICT[RANDOM.nextInt(DICT.length)]);
}
return builder.toString();
The complete source can be found in IdGenerator.java.
Shorter IDs
Our IDs that we intend to use instead of UUIDs should be 21 characters long. But what if we don’t need the entropy to be this high, and just want to prevent enumeration attacks. If we can assume that we can just use a single database, we can just check if the ID already exists and generate a new one when needed. If we use 64 bit IDs, that neatly fit into a long, that’s still a VERY large number. `2^64 = 1.8 * 10^19`!
So, if we plug in the `2^64` of a 64 bit integer into the equation:
`x = log(2^64) -: log(64)`
We find that x = 11! And that’s probably why YouTube uses 11 character video IDs. They probably just do it the other way around by generating a single 64 bit random int and turning it into a (sort of) base 64 encoded ID. If you drop down to 32 bits, you will need only 6 characters.
Conclusion
I hope this post demystifies the generation of UUIDs and random IDs a bit. As a developer we have multiple tools in our toolbox when it comes to creating IDs for records. While for most cases auto numbering IDs or just using UUIDs are fine, you can also opt for implementing a 'pretty' random similar to those used by YouTube or ImgUr. Just keep in mind that it’s a bit more involved than just passing in a UUID and that many databases support UUIDs natively.