Sunday, March 30, 2008

How the SQL UNION Affects Table Design

Welcome to the Database Programmer!

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Introducing the UNION

Last week we saw a JOIN. A JOIN will combine values from two (or more) different tables and put them into the same row of the output.

In contrast to the JOIN, which puts values into the same row, the output of a UNION is to add together rows from more than one query to make a larger result. Here is an example of a UNION of a customers table and a vendors table:

The image shows that rows from two tables are combined one after the other in the results. The SQL is here:

SELECT customer as code,'CUST' as type,name,state
  FROM customers
UNION ALL
SELECT customer as code,'VEND' as type,name,state
  FROM vendors

You Usually Want UNION ALL

The example above uses the syntax "UNION ALL", with the keyword "ALL" being added. You will see this "ALL" keyword on every example on this blog and in many examples elsewhere.

If the "ALL" keyword is not included, most database servers will examine every row of the results and attempt to eliminate duplicates. They do this by examining every single column of every row. This means if you pull a 10-column query of 3000 rows out of one table and 6000 rows out of another, the database will attempt to de-duplicate 9000 rows based on all ten columns. This has two disadvantages, which are:

  1. It is slow, like a snail crawling through glue, and
  2. Most people don't actually expect or want the query to be de-duplicated.

Including the "ALL" keyword tells the server to return all rows and not bother to try to de-duplicate them.

Numbering Columns

When you do a GROUP BY (which we will see in later weeks) or an ORDER BY, most database servers require you to list the columns by number, not by name. This is because the values in the result may be coming from columns that had different names in the original base tables.

Object Oriented Influences

The example above shows only three columns each for the customers and vendors tables. We can assume of course that they have more columns, and we can also guess that many of those columns will be the same for both tables. In any case where you use a UNION, you may find yourself asking why you have two tables, and if you should have only one.

To make matters worse, if you learned your table design in the school of Object Orientation then you will have a very strong desire to make a base class called "trading partner" and make customers and vendors into subclasses of that base class. Then you want your tables to reflect your classes so there you are with one table.

Nevertheless, this is usually a mistake. It will make your code more complicated and error prone. To understand why, let's look at UNION again.

The UNION clause allows you to combine information from separate similar sources only when needed. In other words, the UNION clause lets you combine information upon demand, without requiring a permanent combination.

If you combine vendors and customers you have a problem that checks can be issued to customers, or orders can be entered against vendor accounts. To prevent this, you need complicated business logic in your application, and additional columns in the tables. Moreover, a customer will have columns a vendor does not and vice-versa, so you need more logic to ignore some columns or hardcode them or otherwise handle them based on what operation is begin performed. If they are separate, simple foreign keys do the trick and you don't need that extra code. Once you know about the UNION clause, there is little incentive to combine entities that should not be combined.

Conclusion: Combine at Output

If you go back and review the essays on table design patterns you will see that good table design is all about separating facts out into many different tables. The goal of the separation is to store each fact in exactly one place. Using primary keys and foreign keys on a well normalized database ensures that data is correct while it is on the way in.

However, on the way out, you need to recombine those facts. Two weeks ago we saw that the JOIN combines facts from different tables into a row. This week we saw that we can use the UNION to combine results vertically, that is, to add the results of one query to the results of another. Judicial use of UNION makes your application lean and efficient by letting you normalize data to ensure correctness on the way in, while still combining facts where necessary on the way out.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other essays relating to SQL SELECT are:

6 comments:

APC said...

I often wish that the default behaviour of UNION was UNION ALL and we had to use a different syntax - such as UNION DISTINCT - to get the winnowing effect. Too many people use UNION when they ought to be using UNION ALL, through ignorance of the syntax or incomprehension of its implications.

Of course, UNION, like the other set operators, can be useful in data migration and data cleansing activities. But those aren't run-of-the-mill applications.

Cheers, APC

John Flack said...

While I understand your aversion to what you call "Object Oriented Influences", there can be good reasons for having a single table instead of two (or more). For instance, sometimes a vendor is also a customer. Or in one of our systems - a system for the federal government, sometimes the recipient of one of my client's grants makes sub-grants of its own, which my client needs to track. Some entities are recipients of both direct grants and sub-grants. Do I duplicate the information in my Grantee table into a Sub-Grantee table? Not a good idea. What we do is create an intersection table with foreign keys to an "Organization" table for grantor and grantee, plus either additional columns about the grant itself or foreign keys to that information.

KenDowns said...

John, the point of customers also being vendors is subtle, but clear. Customers and vendors are different sorts of business entities, mixing them together creates complications in the application code. Separating them prevents those complications. If this means a company is listed twice, this is usually easier to handle than debugging the code, especially since very often the contact information and relevant mailing addresses are different for each department. My argument is only that separate entities should be tracked separately, and to avoid the temptation to group them together because of occasional overlap.

With respect to your government example, I would need to know more before offering a conclusion, but based on what you have said there is no need for a table of "organizations", that table is a fictional abstraction. There is a table of "grantees" and everybody in it receives grants. The intersection table you mention records the facts of the subgrants when they happen. What you have here is a basic hierarchy of grants.

Bruno Martinez said...

I can't help to think that repetition is always wrong. By having similar tables, the same information is repeated in the schema definition. One has to update both simultaneously to keep the union well typed. It seems table repetition is a road to exponential explosion.

KenDowns said...

Bruno: the trick is to realize that it is not repetition. Two tables may have a similar list of columns, but that hardly means they are tracking the same entities. Salt and sugar look alike, should we store them together in a bowl? If two twins always wear the same clothes, are they only a single person?

But laying aside the theory, I can only offer my experience, which is that changing an occasional UNION is far less work than writing filter code to keep the salt and sugar separate after they've been mixed together.

Phaneesh said...

Thank you !! This is exactly what I was looking for.