Here at GoSimple, we utilize PostgreSQL as as our data platform, and today we’re going to be looking at the performance trade-offs between UUID and Integer primary keys.
Intro to UUIDs
First off, lets give a little introduction for UUIDs, which not everyone may be familiar with. A UUID (Universally Unique Identifier) is a 128 bit value, which is normally represented as 32 characters with dashes grouping them like so: 8-4-4-4-12 (an example being: 3a68015a-1c83-4104-a1a7-ee2f38a588fa). One common pitfall is to think of a UUID as a text value, it is not and should not be treated as such.
The main difference between a UUID and an Integer as a primary key, is if you were to generate random integers to use as an id for your database, you are very likely to run into collisions pretty quickly, there just aren’t enough possibilities which fit into 32 (or 64) bits to randomly generate a value from any system and not have a collision with another value you already generated; this is not the case for a UUID. To put it into perspective, here is a quote from the Wikipedia article on UUIDs:
Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.
Pros and cons
At GoSimple we utilize UUIDs for all of our surrogate primary key columns. Some of the benefits we’ve seen from this approach:
- Allows us to generate keys in systems external to the database when necessary.
- A ID can be unique across every table in your database (or even database in your cluster), you will never see another unrelated record (even of a different type) with that ID
- An improper query can not insert wrong data into a table that protected with foreign keys by referencing an id that may collide from another table
- It’s easier to spot mistakes in queries with joins between different relations where you may have accidentally used the incorrect column. With Integers you could have ID collisions between relations and return incorrect data without even knowing there was an issue.
- If we ever have tenants on different servers, splitting and merging data at a later point is no problem
Now those are the benefits we have observed by using UUIDs instead of Integers, but what are some of the downsides we’ve hit?
- UUIDs are 4 times larger to store than Integers, and 2 times larger than Big Integers
- This means larger tables, and larger indexes
- Joins are also slower due to the increased size
- Fewer tuples can fit into memory, so performance has the potential to be impacted by having to hit disk more often
- There is no native support for UUIDs in GiST indexes
- We use GiST indexes heavily for exclusion constraints on temporal tables, right now we have to hack around the lack of support
- Fragmented indexes due to the random nature of the UUID v4 we use to generate our values.
- Slower to generate a UUID than to get an Integer value from a sequence
For our use case, we feel the benefits provided by UUIDs outweigh the downsides.
Lets talk a little more about the observed performance impact from using UUID’s as primary keys. This is a topic that there are plenty of articles which scratch the surface, but none which I’ve come across which did an in-depth analysis on quantifying that impact on the whole, and identifying which pieces of the puzzle make up the greatest portion of the performance degradation. Continue Reading