GoSimple Twitter GoSimple LinkedIn GoSimple Facebook

Call us today 1.877.888.0342 |sales@gosimple.me |Request a Demo!

PostgreSQL primary key type analysis

Here at GoSimple, we utilize PostgreSQL as as our data platform, and today we’re going to be looking at the performance trade-offs between UUID and Integer primary keys.

Intro to UUIDs


First off, lets give a little introduction for UUIDs, which not everyone may be familiar with. A UUID (Universally Unique Identifier) is a 128 bit value, which is normally represented as 32 characters with dashes grouping them like so: 8-4-4-4-12 (an example being: 3a68015a-1c83-4104-a1a7-ee2f38a588fa). One common pitfall is to think of a UUID as a text value, it is not and should not be treated as such.

The main difference between a UUID and an Integer as a primary key, is if you were to generate random integers to use as an id for your database, you are very likely to run into collisions pretty quickly, there just aren’t enough possibilities which fit into 32 (or 64) bits to randomly generate a value from any system and not have a collision with another value you already generated; this is not the case for a UUID. To put it into perspective, here is a quote from the Wikipedia article on UUIDs:

Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.

Pros and cons


At GoSimple we utilize UUIDs for all of our surrogate primary key columns. Some of the benefits we’ve seen from this approach:

  • Allows us to generate keys in systems external to the database when necessary.
  • A ID can be unique across every table in your database (or even database in your cluster), you will never see another unrelated record (even of a different type) with that ID
    • An improper query can not insert wrong data into a table that protected with foreign keys by referencing an id that may collide from another table
    • It’s easier to spot mistakes in queries with joins between different relations where you may have accidentally used the incorrect column. With Integers you could have ID collisions between relations and return incorrect data without even knowing there was an issue.
  • If we ever have tenants on different servers, splitting and merging data at a later point is no problem

Now those are the benefits we have observed by using UUIDs instead of Integers, but what are some of the downsides we’ve hit?

  • UUIDs are 4 times larger to store than Integers, and 2 times larger than Big Integers
    • This means larger tables, and larger indexes
    • Joins are also slower due to the increased size
    • Fewer tuples can fit into memory, so performance has the potential to be impacted by having to hit disk more often
  • There is no native support for UUIDs in GiST indexes
    • We use GiST indexes heavily for exclusion constraints on temporal tables, right now we have to hack around the lack of support
  • Fragmented indexes due to the random nature of the UUID v4 we use to generate our values.
  • Slower to generate a UUID than to get an Integer value from a sequence

For our use case, we feel the benefits provided by UUIDs outweigh the downsides.

Performance impact


Lets talk a little more about the observed performance impact from using UUID’s as primary keys. This is a topic that there are plenty of articles which scratch the surface, but none which I’ve come across which did an in-depth analysis on quantifying that impact on the whole, and identifying which pieces of the puzzle make up the greatest portion of the performance degradation.

Hardware

To put into perspective the numbers which will be discussed, I’ll be performing these tests on two different sets of hardware.

HW1 – Medium VM

  • Debian 8
  • DigitalOcean $80/mo instance
  • 4 cores
  • 8GB ram
  • 20GB ssd

HW2 – Micro VM

  • Debian 8
  • DigitalOcean $5/mo instance
  • 1 core
  • 512MB ram
  • 20GB ssd

Testing guidelines

All queries will be run multiple times, and an average calculated, discarding the slowest run.

ID Generation

While it’s possible to generate all IDs in client software, it is still very useful to be able to generate your IDs inside the database automatically. Lets create a very simple test to compare three different methods of obtaining an ID within Postgres.

There are multiple methods to generate UUIDs which you can read more about here and here. Today we’ll be focusing on V1 (MAC address + time based), V1MC (virtual MAC address + time based), and V4 (pseudo random). These actually have quite different properties as far as indexing goes in Postgres. V4 UUIDs are, for all intents and purposes, generated randomly. Their distribution given enough time will be relatively even, and this wreaks havoc on a b-tree index. Since b-tree indexes are sorted, an insert of a random value will likely go somewhere in the middle of the index rather than at the end with a sequentially increasing value. If where the newly inserted value is supposed to go in the index does not have any more room, it first has to make room by moving around other parts of the index, and then it’s able to insert. V1 UUIDs will have a natural order to them based on the time they were generated (and computer they were generated from). This has the benefit that instead of being stuck at a random point in the index, it will just append to the end of the index, causing way less fragmentation.

In contrast to the UUID methods of ID generation, Integer IDs can be much simpler. A Postgres Serial column will create a sequence behind the scenes, and set the default value of the column as the nextval() of the sequence. To deal with concurrency, sequences will always increase every time nextval() is called, even if the transaction which is calling it is rolled back. That allows multiple connections to get values from a sequence without blocking each other, but it does allow for gaps.

Now lets compare the four methods on our different hardware for speed of generating IDs.

First off, lets create the needed extension and schema we’ll use throughout the rest of this article.

We’re now going to generate a million identifiers and see what the speed of each method of generation is.

And the results of this look like so:

generation

We can see that by far, sequences are the quickest to to generate, followed by UUID(v1), UUID(v4), and finally UUID(v1mc).

Insert performance

Insert performance is also a test on the effects of index fragmentation, so lets see the characteristics of each.

First we need to create some staging functions and tables:

We’re now left with some blank testing tables, and some staged tables filled with random strings to mitigate the performance impact of generating them on the fly.

The queries below were used for this next graph:

insert

This insert test is taking into account any index fragmentation, and ID generation costs, but that is the measured impact of inserting 1 million records into a table with different types of keys (or lack there of).

Now lets look at the cost of adding smaller sets of data (100,000 rows) to these already populated tables.

insert-inc

There is an obvious increase in the time taken to insert additional values for the UUID methods, which follows pretty closely to the initial insert performance observations.

Size of relations

Lets take a look at the size of the tables, and their indexes after everything we’ve done so far.

table-size

So here we’re able to quantify what the additional index bloat looks like. If we were to run a REINDEX on test.test4’s primary key index, it would shrink to the exact same size as the UUID(v1), and UUID(v1mc).

Lets take a look at what the bloat looks like after running that 100k row insert another 50 times.

Here is the breakdown on how long inserting the additional 5m rows in each table took.

insert-5m

I am not quite sure what caused the UUID(v4) to take so much longer on the micro hardware, but other than that oddity, everything still seems to match up to what is expected from previous observations.

Now every table has 6,200,000, lets take a look at the size of each table again.

table-size-2

For UUID(v4) bloat scaled just about linearly. The bloat with 1.2 million rows was 11MB, the bloat with 6.2 million rows is 55MB. Again, UUID(v1) has no index bloat, nor does our Integer (Sequence). The one oddity is that UUID(v1mc) does have bloat, which I did not expect.

Wrapping up


So far we’ve quantified some base line performance on the different primary key types, but only for the most basic test cases.  These tests were designed to isolate the effect each type of primary key has, and the results are much more pronounced than you’d likely see with a table designed to solve an actual business problem.

We also haven’t dug into join performance, the implications on wider tables, or offloading UUID generation to the client.  I hope to write a follow up article covering those soon.

I’d love to hear what others think about some of the unexpected results I encountered when testing.

Latest posts by Adam Brusselback (see all)

2 Responses to PostgreSQL primary key type analysis

  1. No mention of the human costs? The number one reason not to use UUIDs is how cumbersome they are to communicate. “Hey Bob, I’m getting an error rendering product five two B three seven zero six nine dash E one A C dash four five F zero dash nine D E C dash six two A two three F E six three zero B F…” Just awful. If it’s not a problem for you, great, but everywhere I’ve worked it’s been extremely commonplace to verbally communicate IDs, memorize IDs, jot them down on sticky notes, etc etc etc.

    • We have a couple of objects in our system which do need a human readable ID, and we used a surrogate integer id generated by a sequence for those use cases.

      You are totally right that for some objects which users refer to by ID often, a UUID is not something you want to display to those users unless you have no other choice. We have the fortune of our data being easily partitioned by client, and an overlap of integer surrogate ids between different clients is no big deal as long as we have a unique uuid to identify the exact record by if necessary.

Contact Us

3980 Tampa Rd, Ste 205

Oldsmar, FL 34677
Phone: 877.888.0342

Email: sales@gosimple.me

Partner Program

Know someone that could use GoSimple?

Check out our Partner Program!


View our privacy policy.

Copyright © 2016 GoSimple