Monday, June 29, 2009

Outer Joins

Outer joins are one of the most widely misunderstood table operators. As useful as they are, it is very easy to incorrectly apply predicates in outer join queries and introduce logical bugs that are very difficult to troubleshoot (or spend hours trying to figure out why a particular query does not produce the desired results). This article demonstrates common misunderstanding about outer joins and how to properly use them.

First, let's start with explanation of the logical processing of a query (listing here only the steps related to outer join operators, for full explanation of logical query processing read Anatomy of a Query).

1). FROM/CROSS JOIN: The FROM clause is processed to identify source tables; a cross join (Cartesian product) is formed between the two tables involved in the table operator.

2). ON: The ON clause predicates are applied and only rows that satisfy the predicate(s) (for which the predicates evaluate to TRUE) are included in the temporary result set.

3). ADD OUTER ROWS: The left side table in LEFT OUTER JOIN and right side table in RIGHT OUTER JOIN (or both tables in FULL OUTER JOIN) are the preserved tables. That means all rows (and selected attribute values) from this table(s) are present in the result set after the outer join operator is applied. At this phase the non-matching rows from the preserved table(s) are added back (non-matched based on the predicates in the ON clause). The attributes for the non-matched rows from the non-preserved table are added as NULLs.

4). WHERE: The predicates in the WHERE clause are applied and only rows for which the predicates evaluate to TRUE are included in the temporary result set.

The process repeats for any other table operators in the FROM clause taking the temporary result set from the prior table operator as left input.

The key concept here is to understand well phases 2, 3, and 4, and how filtering affects the result set. To illustrate this with example, let's create two tables with loans and customers. Each customer can have personal or business loans (or both), or no loans at all.

CREATE TABLE Loans (

loan_nbr INT NOT NULL,

customer_nbr INT NOT NULL,

loan_date DATETIME NOT NULL,

loan_amount DECIMAL(15, 2) NOT NULL,

loan_type CHAR(1) NOT NULL,

CONSTRAINT ck_loan_type

CHECK (loan_type IN ('P', 'B')), -- P=Personal; B=Business

CONSTRAINT pk_loans

PRIMARY KEY (loan_nbr));






INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(1, 1, '20080101', 1500.00, 'P');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(2, 2, '20080215', 1000.00, 'P');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(3, 1, '20080311', 5000.00, 'B');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(4, 3, '20080312', 2000.00, 'P');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(5, 4, '20080325', 1200.00, 'P');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(6, 3, '20080327', 4000.00, 'B');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(7, 5, '20080410', 3500.00, 'B');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(8, 2, '20080412', 2000.00, 'P');




CREATE TABLE Customers (

customer_nbr INT NOT NULL,

customer_name VARCHAR(35),

PRIMARY KEY (customer_nbr));




INSERT INTO Customers (customer_nbr, customer_name)

VALUES(1, 'Peter Brown');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(2, 'Jeff Davis');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(3, 'Kevin Fox');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(4, 'Donna Louis');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(5, 'Greg Barrow');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(6, 'Richard Douglas');


-- Add foreign key for Loans

ALTER TABLE Loans

ADD CONSTRAINT fk_loans_customers

FOREIGN KEY (customer_nbr)

REFERENCES Customers(customer_nbr);
CASE #1: Predicates on the non-preserved table attributes

The first request is to retrieve list of all customers and only the total personal loan amount (if any) they have. The following query seems to describe the request properly:

SELECT C.customer_name,

SUM(COALESCE(L.loan_amount, 0)) AS total

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

WHERE L.loan_type = 'P'

GROUP BY C.customer_name

ORDER BY customer_name;
However, the results do not seem correct:

customer_name    total

---------------- --------

Donna Louis 1200.00

Jeff Davis 3000.00

Kevin Fox 2000.00

Peter Brown 1500.00
Customers Greg and Richard are missing from the output. To understand the problems let's run this step by step:

1). Cross join:

SELECT C.customer_name, L.loan_amount, L.loan_type

FROM Customers AS C

CROSS JOIN Loans AS L;
This returns Cartesian product (every possible combination of a row from the Customers table and a row from the Loans table).

2). ON predicates:

SELECT C.customer_name, L.loan_amount, L.loan_type

FROM Customers AS C

INNER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr;
This results in the following output:

customer_name   loan_amount  loan_type

--------------- ------------ ---------

Peter Brown 1500.00 P

Jeff Davis 1000.00 P

Peter Brown 5000.00 B

Kevin Fox 2000.00 P

Donna Louis 1200.00 P

Kevin Fox 4000.00 B

Greg Barrow 3500.00 B

Jeff Davis 2000.00 P
At this stage only rows that match based on the predicate for customer match are included.

3). Add outer rows:

SELECT C.customer_name, L.loan_amount, L.loan_type

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr;
The query returns this result set:

customer_name   loan_amount   loan_type

--------------- ------------- ---------

Peter Brown 1500.00 P

Peter Brown 5000.00 B

Jeff Davis 1000.00 P

Jeff Davis 2000.00 P

Kevin Fox 2000.00 P

Kevin Fox 4000.00 B

Donna Louis 1200.00 P

Greg Barrow 3500.00 B

Richard Douglas
NULL NULL

Here the outer rows are added, resulting in adding back a row for customer Richard, who has no loans at all and was excluded in the prior phase when the ON clause predicates were applied.

4). WHERE predicates:

SELECT C.customer_name, L.loan_amount, L.loan_type

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

WHERE L.loan_type = 'P';
This result set is the base for the results in the first query attempt:

customer_name   loan_amount  loan_type

--------------- ------------ ---------

Peter Brown 1500.00 P

Jeff Davis 1000.00 P

Kevin Fox 2000.00 P

Donna Louis 1200.00 P

Jeff Davis 2000.00 P
Now it is very clear that the predicate in the WHERE clause filters the NULL for loan type for customer Richard, and customer Greg is excluded because he has only business loans. Incorrectly placing the predicate in the WHERE clause turns the outer join to inner join.

To correct this query it only requires to move the predicate on loan type from the WHERE clause to the ON clause:

SELECT C.customer_name,

SUM(COALESCE(L.loan_amount, 0)) AS total

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

AND L.loan_type = 'P'

GROUP BY C.customer_name

ORDER BY customer_name;

This results in correct result set listing all customers and the total personal loan amount.

customer_name    total

---------------- --------

Donna Louis 1200.00

Greg Barrow 0.00

Jeff Davis 3000.00

Kevin Fox 2000.00

Peter Brown 1500.00

Richard Douglas 0.00


CASE #2: Predicates on the preserved table attributes

Let's look at another example demonstrating how filtering affect the preserved table attributes. The request is to retrieve the total loan amount for customer Richard Douglas, even if Richard does not have any loans at all. Since it is required to return always Richard's account info, the following query seems to satisfy the request:

SELECT C.customer_name,

SUM(COALESCE(L.loan_amount, 0)) AS total

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

AND C.customer_name = 'Richard Douglas'

GROUP BY C.customer_name

ORDER BY customer_name;

However, the result set returns information for all customers:

customer_name   total

--------------- -----

Donna Louis 0.00

Greg Barrow 0.00

Jeff Davis 0.00

Kevin Fox 0.00

Peter Brown 0.00

Richard Douglas 0.00

As explained earlier, the reason for that is that outer rows are added back after the outer join predicates are applied, resulting in adding back all other customers. To get only Richard's loan information requires moving the predicate filtering on customer name to the WHERE clause:

SELECT C.customer_name,

SUM(COALESCE(L.loan_amount, 0)) AS total

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

WHERE C.customer_name = 'Richard Douglas'

GROUP BY C.customer_name

ORDER BY customer_name;

This results in correctly returning only Richard's loan information:

customer_name    total

---------------- -----

Richard Douglas 0.00

12 comments:

Brad Schulz said...

Excellent illustrations of an often-misunderstood subject.

Keep up the great work!

--Brad

Charles Kincaid said...

Excellent Atricle. Well written too. I had to try these for myself. It now has me thinking back on every querry I've coma across, or written, to see where this mistake might have come to play.

Michael M David said...

Multiple Outer joins in a single statement get even more complicated. They get processed from left to right. The Left Outer join for example only affects the tables to its right as it proceeds processing down the SQL statement(left-to-right) in a hierarchical fashion. The placement of the ON keyword can also cause execution nesting where multiple temporary working sets are created. This nesting can be coded but is usually produced automatically by outer join view expansion causing the view on the right to be fully materialized before being joined to the left table or view. This solves a lot of problems automatically. See my site for more info on outer joins and their use.

Erez said...

Very important article. Calrified the maater to me. I happaned to deal with this issues lately, and this article made order in my logic.

Thanks
Erez

Kent Waldrop said...

A good article, Plamen. It seems like I am constantly supporting queries where there is some kind of misunderstanding about an outer join.

The problem that I get into is when I am supporting a query such as this that is in production:

SELECT C.customer_name, L.loan_amount, L.loan_type
FROM Customers AS CLEFT OUTER JOIN Loans AS L
ON C.customer_nbr = L.customer_nbr
WHERE L.loan_type = 'P';

I see this kind of thing frequently in production. Now my problem is asking if the query is returning the results that the end user wants? Sometimes the answer is yes. Frequently I hear an answer similar to "nobody is complaining", but in fact noone seems to be certain.



Kent Waldrop

Robert Carnegie rja.carnegie@excite.com said...

One pick at, concerning joins of three or more tables. Well, two picks: you're describing an abstract logical process, but how your server actually executes the query is liable to be different for different queries that always return the same logical results. It doesn't strictly affect outcome but it can affect performance. But I guess that is a whole other story.

My main argument, though, was that, unless I'm being stupid and ignorant, in a query with many tables there is only one WHERE clause, at the end. (Well, together with its stepbrother, HAVING.) So you repeatedly construct the intermediate result set with ON - or the server does something equivalent to doing that - and then filter the final results with WHERE.

A smart server may not do exactly that. For instance, in a mathematically complex query, you could implicitly invoke an intermediate result set that, if it actually existed, would take more computers to store it than there are in the whole world. But that doesn't matter if your computer can give you the correct final result without commandeering the whole Internet to work it out.

Hey, can folks e-mail me if you coMe back on this? Thank!

Plamen Ratchev said...

Robert,

The point here is not to illustrate how the query engine will execute the joins and WHERE predicates. Physical execution is different from logical processing order. But understanding the logical processing order and how to place predicates correctly on the join conditions or in WHERE is the essence of writing correct outer joins. If you have more than two tables, simply the result set from the first two is used as virtual table that is then joined to the third table, and so on.

-- Plamen

Kent Waldrop said...

I agree that most problems that I see with outer joins are because of the WHERE clause. However, there is another class of problems in which the fault is less obvious.

Yesterday I was faced with a fragment similar in structure to this:

select
stuff
from A
left join b
on b.aKey = a.aKey
join c
on b.cKey = c.ckey

Yet another example of a "de facto inner join". The outer rows will not be included because the inner join with C constrains b.cKey to contain only non-null values. The syntax was corrected to:

select
stuff
from A
left join
( B join c
on b.cKey = c.cKey
) on b.aKey = a.aKey

Plamen Ratchev said...

Good point Kent! Sometimes it is easy to forget that the predicates on the next table operator can be the culprit, not the WHERE predicates. But as long as the fundamentals about outer joins are understood, those type of problems can be avoided.

Anonymous said...

Hope you're still answering questions.

What happens in classic SQL, where you didn't have a ON clause, but only the where clause? Did the introduction of the ON clause create a new algorithm for performing a query? The "problem" in case #2 doesn't even arise in classic SQL.

Plamen Ratchev said...

Anonymous,

I am not sure what you refer to as classic SQL. If you mean the old SQL-89 syntax for join:

SELECT A.column, B.column
FROM A, B
WHERE A.keycol = B.keycol;

Then this is inner join and there is no phase to add outer rows (step 3 is missing). If you refer to the proprietary outer join syntax available in SQL Server:

SELECT A.column, B.column
FROM A, B
WHERE A.keycol *= B.keycol;

This case if different and you do not have the same logical processing order.

Ranjith said...

Plamen very good illustration using the example. Thanks for pointing it out.