Monday, January 14, 2008

Database Skills: A Sane Approach To Choosing Primary Keys

Welcome to the Database Programmer. This blog is for anyone who wants to learn about databases, both simple and advanced. Since all non-trivial websites require a database, and since database skills are different from coding skills, there is very good reason for any programmer to master the principles of database design and use.

Every Monday morning there is a new entry. The complete table of contents for the Monday morning series is here.

This week's entry is rather long. I strongly considered splitting it into two weeks, but decided to keep it as one so that it would be an easier source of reference in the future.

There is no One True Primary Key

There are several competing theories out there on how to choose primary keys. Most of them tell you to use a single kind of key for all tables, usually an integer. In contrast to those theories I have found that a robust application uses different kinds of keys for different kinds of tables. In the last 15 years I have worked on projects large and small, simple and complex. Sometimes I had total technical control, and sometimes I had to work with what others gave me, and sometimes it was a little of both. Today's essay reflects what I have worked out in those years, and how I build my tables today. My goal is to report what actually works, not to promote a particular theory about how everybody should do something.

This week we will see "rules of thumb". A rule of thumb is a guiding idea that will tend to hold true most of the time, but which you may decide to change in certain circumstances.

Rule of Thumb 1: Use Character Keys For Reference Tables

A reference table is one that tends to be strongly constant over time and has relatively few columns. Sometimes a reference table may come already populated by the programmer. Examples include tables of country codes (perhaps with international telephone prefixes), a table of provinces or states within a country, or a table of timezones. In this series I have been using the example of a school management program, for that program we might give the user a reference table of school subjects like history, math, physics and so forth.

For these tables it is best to make a character primary key, which we often call a "code", as in "timezone code" or "country code" or "subject code." The strategy is to make a code which can be used on its own as a meaningful value that people can understand. This gives us tables that are easier to use for both programmer and end-user.

Let's consider our school management program. We have a table of teachers (populated by the school staff), and a table of subjects which we have provided as a reference table. When a teacher joins the faculty, somebody must enter the subjects that that teacher is qualified to each. The tables below show two examples of what this table might look like, which is easier to read?

TEACHER - SUBJECT CROSS REFERENCE

EXAMPLE 1: INTEGER KEYS           EXAMPLE 2: CHARACTER KEYS

Teacher | Subject                 Teacher     |  Subject
--------+----------               ------------+-----------
72      | 28                      SRUSSEL     |  PHYSICS
72      | 32                      SRUSSEL     |  CALCULUS
72      | 72                      SRUSSEL     |  HISTORY
45      | 28                      ACLAYBORNE  |  PHYSICS
45      | 29                      ACLAYBORNE  |  CELLBIOLOGY
45      | 45                      ACLAYBORNE  |  RUSSIAN

The table of character keys is much easier to work with, for the simple reason that many times you can just use the codes themselves, so you can avoid a lot of JOINs to the main tables. With integers you must always JOIN to the master table so you can get a meaningful value to show the user. But not only is the table itself easier to read when you are debugging, it is easier to work with when writing queries:

-- The character key example is pretty simple:
SELECT teacher,subject FROM teachers_x_subjects

-- The integer key absolutely requires joins
SELECT x.teacher_id,x.subject_id
       t.name,s.description
  FROM teachers_x_subjects x
  JOIN teachers t ON x.teacher_id = t.teacher_id
  JOIN subjects s ON x.subject_id = s.subject_id

I often hear people say they do not like SQL because it is so complicated and they hate doing so many JOINs. It makes me wonder if the person is lost in a JOIN jungle caused by very bad advice about always using integer primary keys.

If you are using some kind of ORM system that tries to protect you from coding any SQL, that basic problem of over-complicated tables will still appear in your code. One way or another you must enter details that tell the ORM system how to get the descriptions, which would not be necessary if the keys were meaningful character values.

We can now see the surprising fact that the integer keys will slow us down in many situations. Not only do they have no performance advantage, but they actually hurt performance. The reason is because they require joins on almost every query. A 3-table query with two joins will always be much slower than a 1-table query with no joins. If you are using an ORM system that does not do JOIN's, but instead does separate fetches, then you have 3 round trips to the server instead of 1, and heaven forbid you have queries in a nested loop, the performance will simply crash and burn. All of this is kind of ironic since you so often hear people blindly repeat the dogmatic phrase "We will use integer keys for performance reasons..."

Rule of Thumb 2: Use Character Keys for Small Master Tables

Many database programmers use the term "master table" to mean any table that lists the properties of things that have some permanence, like customers, teachers, students, school subjects, countries, timezones, items (skus), and anything else that can be listed once and used many times in other places. Generally a master table has more columns than a simple reference table.

Some master tables are small and do not change often. In our ongoing example of a school management application, the list of teachers is a good example of a small master table. Compared to the list of students, which is much larger and changes every year, the table of teachers at most schools (except for huge state universities) will have only a few changes each year.

For tables like this it is good to allow the user to enter character keys if they want to. Some schools will insist on being allowed to choose their own codes like 'SRUSSEL' for "Saxifrage Russel", while others will say, "Why should I have to make up a code, can't the computer do that?"

For these tables I have found it useful to always define the primary key as a character column, and then to allow some flexibility in how it is generated. Common ways of generating codes include:

  1. Letting the user make up their own code
  2. Generating a code out of some other column or columns, like first letter of first name, plus 5 letters of last name, plus three numeric digits. (This used to be very popular in decades past).
  3. Generate a number.

The key idea here is to follow the needs of your users. Option #2 above is one of the most useful because it gives you the best of both worlds.

Rule of Thumb 3: Use Integers For Large Master Tables

Some master tables are large or they change often, or both. In our ongoing example of a school management application, the list of students will change every year, with many students coming and going. Another example is a doctor's office that has patients coming and going all of the time. I have found it best to use plain integer keys here because:

  • Unlike small master tables (like teachers) or reference tables (like school subjects), a code is not likely to have any meaning for the end-user, so the biggest argument for using it does not hold.
  • Unlike reference tables, the master table is likely to have many more columns and you will probably end up JOINing to the table many times. This means our other big reason for using codes, which is to avoid JOINs, does not hold either.
  • It is not realistic to expect end-users to be making up codes for large tables, and since the codes will have no value, why should the end-user be troubled with the job?
  • Writing algorithms to generate unique codes will run into more difficulties, and since the code has no value why bother?

Rule of Thumb 4: Use Integers For Transaction Tables

Many database programmers use the term "transaction table" to mean any kind of table that records some kind of interaction or event between master tables. In an eCommerce program the shopping cart tables are all transaction tables, they record the purchase of items by customers. In our school management program the actual classes taken by students are transactions, because they record specific interactions between students and teachers.

For these tables the auto-generated integer key tends to be the most useful. I am not going to present any arguments for this because most programmers find it self-evident. It should be enough to say that any attempt to use a compound key (like customer + date ) always ends up causing a problem by limiting what can be entered, so the meaningless integer key is the way to go.

Rule of Thumb 5: Use Multi-Column Keys In Cross References

A useful database will end up with a lot of cross reference tables in it. A cross-reference table is any table that lists various facts about how master tables relate to each other. These tables are extremely useful for validating transactions. In fact, next week's entry will be all about these tables and how to use them.

For now the important point is that the primary key of a cross-reference is a combination of the foreign keys. We do not make up an extra column, either integer or character.

TEACHER-SUBJECT CROSS REFERENCE

Teacher     |  Subject
------ -----+-------------
SRUSSEL     |  PHYSICS
SRUSSEL     |  CALCULUS
SRUSSEL     |  HISTORY
ACLAYBORNE  |  PHYSICS
ACLAYBORNE  |  CELLBIOLOGY
ACLAYBORNE  |  RUSSIAN

The SQL for this table would resemble something like this:

CREATE TABLE teachers_x_subjects (
    teacher char(10)
   ,subject char(10)
   ,primary key (teacher,subject)
   ,foreign key (teacher) references teachers(teacher)
   ,foreign key (subject) references subjects(subject)
)

The reasons for this are rather complex, and next week the entire entry will be devoted to this and similar ideas. For now we will note that this approach lets us validate teacher-class assignments so that no teacher is assigned to teach a class she is not qualified for. Using a new column as a primary key does not allow that, and therefore leads to more complicated and error-prone code.

Rule of Thumb 6: Use Given Keys For Non-Insert Imports

Many systems today that we create will interact with systems that already exist. A typical eCommerce program will get a list of items and maybe even customers from the company's main computer system.

For some of these tables, your own system will absolutely never make new rows. A very common example is a table of items on an eCommerce site that is loaded up from some other computer system.

For these tables, the simplest route is to use whatever key exists on the table as it is given to you. Any other route involves more work with no clear motivation for putting out the effort.

Rule of Thumb 7: Use Integer Keys for Import/Export Tables

Sometimes you may have a table whose original values come from another system, but unlike the previous case your own system is generating new rows for the table, and you may have to send these rows back to the original system.

One classic example of this is a list of customers. I created a website a few years ago where the list of customers is updated from a different system from time-to-time. However, new customers can also sign up online. Both systems are handing the customer list back and forth from time to time to keep them reconciled.

In these cases I have an integer primary key for the table because it follows Rule of Thumb 3, it is a large master table. The most important concept here is that you must not try to combine your key and the key from the original table. Keep the key from the original table in its own column, index on it, and use it for updates, but do not try to enforce it as a unique column. The other system must take care of its own key, and your system must take care of yours.

Rule of Thumb 8: Use An Object Id On All Tables

Back when people were getting excited about the concept of "Object-Relational Databases", they came up with the term "object id" to denote a column that contains some unique value but otherwise has no meaning. The same idea exists with different names, but Object ID is now the term that most people understand so that is the term I will use.

Your programs can be made simpler in many cases if you add an object id to every single table in addition to the primary key. An object id is useful specifically for user interface code. If you use an object id, then it is easier to write UPDATE and DELETE statements, and it is easier to write framework or ORM code that does these things for you.

If you are following these rules of thumb closely in your project then it is important not to use the object id as a primary key, and therefore you may never use it as a foreign key either. If you use an object id as the primary key then you lose a lot of the benefits of the character keys listed above.

Also if you follow these rules in your projects it means that your transaction tables have both an auto-generated primary key like CART_ID and an auto-generated object id. Some programmers are bothered by this because we don't like the idea that two columns appear to be doing the same thing, and we try to save a column. But personally this does not bother me because it helps me write robust applications, and this is not 1985 where a 10MB hard drive cost hundreds of dollars.

Absolute Rule 1: Only Atomic Values

This is not merely a "rule of thumb" but a rule that I follow absolutely. It is actually part of First Normal Form, which is that column values must be atomic, or indivisible. Another way to say it is that the column must not have "subvalues" buried in it.

I have included this rule here instead of with First Normal Form because when most programmers violate this rule they are making primary keys by combining different values together. In our example of a school program, if we have a list of the actual students taking classes in a given school year, you might have a squashed-up primary key column like this:

CLASS_CODE                | STUDENT
--------------------------+---------------
SRUSSEL-2007-PHYSICS      | NAI
SRUSSEL-2007-MATH         | PCLAYBORNE
ACLAYBORNE-2007-RUSSIAN   | JBOONE
ACLAYBORNE-2007-MATH      | NAI

There are two practical problems with doing this:

  1. You cannot use a foreign key to validate the sub-values, so you must code validation manually.
  2. Retrieving the sub-values requires extra code, either in the SELECT or in your client code. If the values were in separate columns this would not be necessary.

Absolute Rule 2: No Magic Values

Another rule that I follow is to absolutely never have magic values. A magic value is a value in a column that causes some non-obvious result. I have included this is in this essay because most programmers who break this rule do so by hard-coding special actions to occur based on values of keys in reference tables and master tables.

An example might be a table of teachers, where one of the teacher values is something like "SUBSTITUTE", and the program is hardcoded to do a lot of different things when it sees this value. Magic values are bad because the code is harder to debug. It may not be obvious to a programmer that some special value of the TEACHER column would cause special actions to occur. But if you have a column called FLAG_SUBSTITUTE then any programmer who must maintain code written by somebody else will have a much easier time of it.

Magic numbers also confuse end-users. It may seem obvious to us that the value "SUBSTITUTE" in the teacher column means substitute, but if this value causes other things to occur, and we are in the regular habit of having these values in lots of tables, then the compound effect can be lots and lots of phone calls from confused users, and big trouble for the software developer's bottom line.

Finally, magic numbers limit you. If you use the value "SUBSTITUTE" as a single teacher in the teachers file, then how do you keep track of the dozen-odd substitutes the school may hire in a year? The end-user is stuck here, they must use pen and paper. It is much better to allow them to enter the substitute as a regular faculty member with a FLAG_SUBSTITUE column to check off.

Magic numbers have plagued programming since long before databases came around. Here is a link to The Jargon File, which talks about magic numbers in other contexts.

Conclusion: Many Kinds of Tables, Many Kinds of Keys

This week we have seen that there can be many practical benefits to using different kinds of keys for different kinds of tables. Using the right kind of key for the right kind of table leads to simpler code and better performance, whether you code SQL directly or use an ORM system.

Remember always that your application will always follow the same structure as your tables. If the tables are designed well, the code will be lean, tight, efficient, and robust. Because table design is so important, it is best to know well the different kinds of tables there are: reference, master, cross-reference, and transaction, and to build the keys wisely.

Next week we will zoom into cross references and find out how important they are in good table design and how they will improve your applications.

Next Essay: Cross Reference Validation Pattern

22 comments:

zippy1981 said...

Ken another great article. I will have to add one things about integer keys. Sometimes it makes sense to use random keys.

In the world of Windows we have whats called a GUID (Globally Unique Identifier). This is basically a 128 bit integer that's randomly generated using a certain algorithm. The idea actually precedes windows and the standard is implemented on other platforms.

While having a 128 bit integer as an index column causes some performance concern, it has one advantage over sequential keys; You can easily merge data from different databases. Just dump the data from one database and insert it into the next one.

Jake said...

Great article, I was really looking forward to your discussion on this and I am much better prepared for my next project. One question though, If you have a multi column primary key like:

primary key (teacher,subject)

Can you use that key as a foreign key in another table, and if so, how do you reference it?

KenDowns said...

Zippy, thanks for the compliment!

KenDowns said...

Jake, you have anticipated next week's post, but here is the short answer:


create table x (
  ...
  ,teacher char(10)
  ,subject char(10)
  ,...other columns
  ,foreign key (teacher,subject)
   references teachers_x_subjects
     (teacher,subject)

  ,...other stuff
)

DolceRemì said...

Really great article: clear and very helpfull. I enjoyed it.

AkitaOnRails said...

Really depends. Those are not bad tips, but it depends on the circumstances. For instance, you can choose to not use integer pks "if" and only "if" those are completely opaque values. Meaning, you won't show them to the user, it is unimportant to the process. That's because that can lead to a lot of problems if you need to change the format of those varchars when some company policy or business rule change. For example, SKU ids.

In that case you want to use surrogate keys galore. Another thing is that it doesn't matter too much (not that it is ignorable, of course), if you plan to use ORM over your tables. Then you don't have to deal too much with all the joins and so forth.

Another case is when you want to have synchronized tables between physically separated databases. In that case you would want to use surrogate keys AND unique identifiers like UUIDs.

Actually it depends. Any "rule of thumb" has to be taken with a grain of salt. I would recommend that people learn the RDBMS ways first before deciding on anything.

Donald said...

AkitaOnRails: Based on the framework Andromeda which was developer by Ken Downs, which we are using, for a site we are acutally using sku as the primary key, on our products table and this has worked out really well, and we are able to change the primary key at any given time, granted changing primary keys is not the best of practices but it can be done as well as if a company rule changes and that changes product sku's there are more things to worry about.

Anonymous said...

Hello,

It would be great if you would provide examples in support of your ideas.

Examples mean clearly showing tables with field definitions and relations between tables.

ericdes said...

This is a very enlightening article. I read several books about designing databases but I always felt undecided on how to choose primary keys. I ended up doing what you advocate in most of my projects, with the exception of the object id column (I'll see how it helps). However I never put it into words and was ever wondering if another method was more appropriate (it's how I found you!). Thank you for sharing your experience!

KenDowns said...

@ericdes: Thanks! Glad you found it helpful.

igv said...

Great article. I have some concern though about Rule #1 though.
What about the case you will need to change teacher name (due to typo or just last name change etc.)?
You will need to change the name everywhere it exists and it can be easily forgotten in one place, which will cause problems.

KenDowns said...

@igv: regarding rule #1, let me ask what we would do if we had an integer key instead? The integer is not meant to convey any meaning, it is only an identifier. So far so good.

If we follow rule #1, and use character keys, the first thing to understand is that they are still just identifiers, which happen to be easier to work with because they are less abstract. It does not mean they carry an authoritative representation of some fact, such as an absolute statement of a person's name, it just means they are easier to work with and remember.

Back in the day when everybody did this (before the web), if Joan Smith (JSMITH) got married and became Joan Jones, they just left the key as-is. It worked fine.

Vijaya Kadiyala said...

Hi
This is really great article. I would like to add one point. Choosing a Primary key on Character columns is case by case.

Thanks -- Vijaya Kadiyala
www.dotnetvj.com

Anonymous said...

Thank you for this article.
Google is so full of useless spam. It's great to see that perls like this article exist...

Ben said...

Great series, just one comment.. The rule for object ID makes no sense to me as I haven't come across the concept before and there was no example of how or why I would use it. Otherwise very lucid and has helped me a great deal.

Steve Ash said...

@zippy
Just to clarify the "performance concern" that you allude to for using GUID keys:
1) 128 bits is wide, that means fewer rows per page, more pages taking up more buffer cache. This impacts every part of query processing from more data to pull from disk to more cpu time & cache misses when computing joins, etc.
2) Fragmentation and data packing-- guid keys necessarily cause more splits, which results in much less compaction, resulting in poor space utlization (i.e. more pages, etc.)
3) cache locality-- there is often (not always) some "temporal" locality when inserting adjacent records. I.e. inserting checks for a bank deposit: if there were four checks then you might insert all four into the "checks" table at the same time. With "random" keys, these checks will end up in different locations in the index. When you later must retrieve the checks in the transaction, you will be pulling from four different locations instead of from one location (in the case of sequential keys). The point is that you break the spacial locality which often benefits the intrinsic temporal locality present in a lot of tables/relationships.

Sequential keys have their problems as well (hot spots for latching, allocation), but in general I find blind adoption of GUIDs insidious.

@KenDowns
Regarding your response to IGV (and maybe I didn't read the discussion closely enough) -- but if you claim that the value has no intrinsic value and thus its only a "more meaningful" identifier-- then I think you've argued against your tip #2. I am more skeptical of breaking normalization in some of these cases, because subject names are not philosophically, intrinsically, forever-and-ever guaranteed to be the same thing. The first time you avoid the join and rely on the name in the FK, now you have incurred hefty debt when the customer changes their requirement or wants to change "calculus" to "calculus AB" and "calculus BC". Also, these tables you specifically reference as being small and rarely changing, thus the performance penalty of the join is just a bit of CPU as its all likely to be cached (under your definition). Im not disputing your rule of thumb for some circumstances, but I value the risk mitigation of post-development expense much higher than the risk mitigation of not knowing how to write a join in SQL (or the cost of typing the additional characters to do the join).

Steve

François Beausoleil said...

What would you advise I do with a table whose natural primary key would be URLs as found on the Web?

My primary key is a TEXT NOT NULL column, whose content is the URL (think crawler-like).

This will be a large master table, thus rule #3 argues it should be an integer.

Any other tips?

zippy1981 said...

François, Use varchar (or possibly nvarchar if you will be getting UTF-16) for the url field. Also cap it at like 1024 characters, not max. I don't even know if a text field can be a primary key.

François Beausoleil said...

PostgreSQL works just fine using TEXT primary columns.

Eric J. Schwarzenbach said...

I'm with you on everything but object ids. I don't think you really presented any good arguments about them. How does this make anything simpler for programs? What is the advantage for ui code? Why can't it simply use the PK? I've never used such a thing and never run into difficulty that made me think "if only I had another unique key that isn't the PK!".

KenDowns said...

@Eric, it is not simpler to have an objectId for any particular ad-hoc program, the PK is fine. But for library code it is simpler if the code can always assume the existence of some column such as "UniversalID" that it can use for SELECT, UPDATE, and DELETE. Makes the generation of SQL commands easier.

Richard said...

Really Really good article. Thanks for writing. Nice to see someone that bucks the "this is how its always been done mentality" Kudos!