Entity-Attribute-Value (often referenced as EAV) table design is one of the “wonderful” things that newcomers to the SQL land discover very quickly! Traditional procedural programmers have very hard time understanding why tables have fixed number of columns, and adding a new column requires table and possibly code change. Their great problem solving skills from the procedural world help them find a “natural” solution – the EAV design – where a pair of key (describing the attribute) and value allow to store data with flexibility.
Example is storing all employee information in 3 columns: key column to identify employee (entity), attribute key column (attribute) to identify the attribute stored (name, birth date, SSN, date of hire, etc.), and value column (value) to store the actual value.
|
Listing 1 |
-- EAV table |
Here the pair of loan number and customer number identifies the entity. The code column stores the attribute (amount, date, and type of loan) and the value column represents that actual value.
A quick demonstration of the problems caused by this design. Let’s write a query to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008. Here is a first attempt to satisfy the request:
Listing 2 |
SELECT A.loan_nbr, |
One of the first issues to notice is the multiple joins to retrieve each attribute. This is because each attribute is stored in separate row and we need to put back together the customer loan info. Besides that seems the code should be OK… except one little problem! The nature of the SQL is such that nothing guarantees that the predicates in the WHERE clause will be executed in the order listed. They will be expanded in the query plan and the cast to DATETIME or DECIMAL can be applied to the loan type value which will result in conversion error. I described this in more detail in my post Predicates in SQL.
To fix this you may be tempted to try a different approach: using table expressions to retrieve each attribute. Listing 3 shows a version of such query.
Listing 3 |
SELECT A.loan_nbr, |
However, this query has the same problem because the table expressions are expanded in the query plan and nothing guarantees the order of executing the predicates. One approach to retrieve the requested data successfully is to use a pivoting technique. The query below illustrates that.
Listing 4 |
SELECT loan_nbr, |
Here the CASE expressions guarantee that only the correct values will be casted to the specific data type. But then how efficient is this query for such a trivial request?
The refactoring process
Usually the first step of correcting an EAV design is to create a normalized table for storing the data. In our case the table structure can look like Listing 5 (note that here for clarify the loan type is spelled out as entire word; in real business scenario it may be sufficient to indicate only P or B; alternatively create table to store loan types and reference via a foreign key constraint).
Listing 5 |
CREATE TABLE Loans ( |
Now each attribute is in separate column with appropriate data type. This guarantees the integrity of the data as well as now we can define constraints to restrict data values (like the CHECK constraint for loan type).
The next step is to convert and transfer data from the old format to the new table. We already know that a pivoting technique works and can be used here. Here is the query to transfer from EAV to normalized format.
Listing 6 |
INSERT INTO Loans |
The last step in the refactoring process is to replace the old EAV table with view for backward compatibility (if there is code referencing that original table that cannot be changed at this time). There are different methods to accomplish this. For simplicity here we can use unpivoting using UNION and query for each attribute (note that you have to drop the original EAV table first because the view has the same name).
Listing 7 |
CREATE VIEW EAV_Loans |
If there is any old code that references the original EAV table for data modifications you will have to add instead of triggers to the view to handle it.
Now, let’s answer the same data request to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008, this time using the normalized table.
Listing 8 |
SELECT loan_nbr, customer_nbr, loan_date, loan_amount |
Simple and elegant, just as it should be!
Labels: best practices, refactoring, t-sql programming