Tuesday, July 27, 2010

Refactoring Entity-Attribute-Value Design

Entity-Attribute-Value (often referenced as EAV) table design is one of the “wonderful” things that newcomers to the SQL land discover very quickly! Traditional procedural programmers have very hard time understanding why tables have fixed number of columns, and adding a new column requires table and possibly code change. Their great problem solving skills from the procedural world help them find a “natural” solution – the EAV design – where a pair of key (describing the attribute) and value allow to store data with flexibility.
    Example is storing all employee information in 3 columns: key column to identify employee (entity), attribute key column (attribute) to identify the attribute stored (name, birth date, SSN, date of hire, etc.), and value column (value) to store the actual value.

More on EAV...

Tony Andrews has an excellent article explaining the problems with EAV design: OTLT and EAV: the two big design mistakes all beginners make. It is worth to note that in some cases where it is required to store many attributes that change very often (new attributes added, or existing attributes deleted) and there is no need for referential or domain integrity and query logic based on these attributes then it may be a valid scenario to utilize EAV design.

Normally the value column has to be of some large length character data type, like VARCHAR(200), and data has to be casted to text when saved and then casted back to the original data type when retrieved. 
    Once all coding is done then comes the realization of how bad this approach is… And sometimes you just walk into a project where EAV has been already deployed and now it is your job to fix it. The typical signs of EAV implementation is the existence of very complex (and slow) queries to satisfy relatively trivial user requests for data and the lack of data integrity.

How do you fix EAV design?
Here is one example of EAV design and the refactoring process to correct it. In our case the information for bank loans is stored in a table representing EAV design. Listing 1 below shows the scripts to create the table and insert sample data.

Listing 1

-- EAV table
CREATE TABLE EAV_Loans (
 
loan_nbr INT NOT NULL,
 
customer_nbr INT NOT NULL,
 
code VARCHAR(30) NOT NULL,
 
value VARCHAR(200),
 
CONSTRAINT pk_eav_loans
 
PRIMARY KEY (loan_nbr, customer_nbr, code));

INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(1, 1, 'date', '20080110');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(1, 1, 'amount', '1500.00');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(1, 1, 'type', 'personal');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(2, 2, 'date', '20080215');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(2, 2, 'amount', '3500.00');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(2, 2, 'type', 'personal');

SELECT loan_nbr, customer_nbr, code, value
FROM EAV_Loans;

/*

loan_nbr    customer_nbr code      value
----------- ------------ --------- ----------
1           1            amount    1500.00
1           1            date      20080110
1           1            type      personal
2           2            amount    3500.00
2           2            date      20080215
2           2            type      personal


*/

Here the pair of loan number and customer number identifies the entity. The code column stores the attribute (amount, date, and type of loan) and the value column represents that actual value.
    A quick demonstration of the problems caused by this design. Let’s write a query to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008. Here is a first attempt to satisfy the request:

Listing 2

SELECT A.loan_nbr,
      
A.customer_nbr,
      
CAST(A.value AS DATETIME) AS loan_date,
      
CAST(B.value AS DECIMAL(15, 2)) AS loan_amount
FROM EAV_Loans AS A
INNER JOIN EAV_Loans AS B
  
ON A.loan_nbr = B.loan_nbr
 
AND A.customer_nbr = B.customer_nbr
INNER JOIN EAV_Loans AS C
  
ON A.loan_nbr = C.loan_nbr
 
AND A.customer_nbr = C.customer_nbr
WHERE A.code = 'date'
 
AND CAST(A.value AS DATETIME) >= '20080101'
 
AND CAST(A.value AS DATETIME) <  '20080201'
 
AND B.code = 'amount'
 
AND CAST(B.value AS DECIMAL(15, 2)) > 1000.00
 
AND C.code = 'type'
 
AND C.value = 'personal';

One of the first issues to notice is the multiple joins to retrieve each attribute. This is because each attribute is stored in separate row and we need to put back together the customer loan info. Besides that seems the code should be OK… except one little problem! The nature of the SQL is such that nothing guarantees that the predicates in the WHERE clause will be executed in the order listed. They will be expanded in the query plan and the cast to DATETIME or DECIMAL can be applied to the loan type value which will result in conversion error. I described this in more detail in my post Predicates in SQL.
    To fix this you may be tempted to try a different approach: using table expressions to retrieve each attribute. Listing 3 shows a version of such query.

Listing 3

SELECT A.loan_nbr,
      
A.customer_nbr,
      
loan_date,
      
loan_amount
FROM (SELECT loan_nbr, customer_nbr,
            
CAST(value AS DATETIME) AS loan_date
     
FROM EAV_Loans
     
WHERE code = 'date') AS A
INNER JOIN (SELECT loan_nbr, customer_nbr,
                  
CAST(value AS DECIMAL(15, 2)) AS loan_amount
           
FROM EAV_Loans
           
WHERE code = 'amount') AS B
  
ON A.loan_nbr = B.loan_nbr
 
AND A.customer_nbr = B.customer_nbr
INNER JOIN (SELECT loan_nbr, customer_nbr,
                  
value AS loan_type
           
FROM EAV_Loans
           
WHERE code = 'type') AS C
  
ON A.loan_nbr = C.loan_nbr
 
AND A.customer_nbr = C.customer_nbr
WHERE loan_date >= '20080101'
 
AND loan_date '20080201'
 
AND loan_amount > 1000.00
 
AND loan_type = 'personal';

However, this query has the same problem because the table expressions are expanded in the query plan and nothing guarantees the order of executing the predicates. One approach to retrieve the requested data successfully is to use a pivoting technique. The query below illustrates that.

Listing 4

SELECT loan_nbr,
      
customer_nbr,
      
loan_date,
      
loan_amount
FROM (
SELECT loan_nbr,
      
customer_nbr,
      
MAX(CASE WHEN code = 'date' THEN CAST(value AS DATETIME) END),
      
MAX(CASE WHEN code = 'amount' THEN CAST(value AS DECIMAL(15, 2)) END),
      
MAX(CASE WHEN code = 'type' THEN value END)
FROM EAV_Loans
GROUP BY loan_nbr, customer_nbr
     
) AS L(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
WHERE loan_date >= '20080101'
 
AND loan_date '20080201'
 
AND loan_amount > 1000.00
 
AND loan_type = 'personal';

Here the CASE expressions guarantee that only the correct values will be casted to the specific data type. But then how efficient is this query for such a trivial request?

The refactoring process
Usually the first step of correcting an EAV design is to create a normalized table for storing the data. In our case the table structure can look like Listing 5 (note that here for clarify the loan type is spelled out as entire word; in real business scenario it may be sufficient to indicate only P or B; alternatively create table to store loan types and reference via a foreign key constraint).

Listing 5

CREATE TABLE Loans (
 
loan_nbr INT NOT NULL,
 
customer_nbr INT NOT NULL,
 
loan_date DATETIME NOT NULL,
 
loan_amount DECIMAL(15, 2) NOT NULL,
 
loan_type VARCHAR(10) NOT NULL,
 
CONSTRAINT ck_loan_type
 
CHECK (loan_type IN ('personal', 'business')),
 
CONSTRAINT pk_loans
 
PRIMARY KEY (loan_nbr));

Now each attribute is in separate column with appropriate data type. This guarantees the integrity of the data as well as now we can define constraints to restrict data values (like the CHECK constraint for loan type).
    The next step is to convert and transfer data from the old format to the new table. We already know that a pivoting technique works and can be used here. Here is the query to transfer from EAV to normalized format.

Listing 6

INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
SELECT loan_nbr,
      
customer_nbr,
      
MAX(CASE WHEN code = 'date' THEN CAST(value AS DATETIME) END),
      
MAX(CASE WHEN code = 'amount' THEN CAST(value AS DECIMAL(15, 2)) END),
      
MAX(CASE WHEN code = 'type' THEN value END)
FROM EAV_Loans
GROUP BY loan_nbr, customer_nbr;

The last step in the refactoring process is to replace the old EAV table with view for backward compatibility (if there is code referencing that original table that cannot be changed at this time).  There are different methods to accomplish this. For simplicity here we can use unpivoting using UNION and query for each attribute (note that you have to drop the original EAV table first because the view has the same name).

Listing 7

CREATE VIEW EAV_Loans
(loan_nbr, customer_nbr, code, value)
AS
SELECT
loan_nbr, customer_nbr,
      
CAST('date' AS VARCHAR(30)),
      
CONVERT(VARCHAR(200), loan_date, 112)
FROM Loans
UNION
SELECT
loan_nbr, customer_nbr,
      
CAST('amount' AS VARCHAR(30)),
      
CAST(loan_amount AS VARCHAR(200))
FROM Loans
UNION
SELECT
loan_nbr, customer_nbr,
      
CAST('type' AS VARCHAR(30)),
      
CAST(loan_type AS VARCHAR(200))
FROM Loans;

If there is any old code that references the original EAV table for data modifications you will have to add instead of triggers to the view to handle it.

Now, let’s answer the same data request to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008, this time using the normalized table.

Listing 8

SELECT loan_nbr, customer_nbr, loan_date, loan_amount
FROM Loans
WHERE loan_date >= '20080101'
 
AND loan_date '20080201'
 
AND loan_amount > 1000.00
 
AND loan_type = 'personal';

Simple and elegant, just as it should be!

5 comments:

Mohammad said...

Hi,
It is very good article. But some of people (myself included) known the EAV as Open Schema.

Anonymous said...

Yes pivot refacturing has worked well, however this does not change the problem of "adding a new column requires table and possibly code changes" when using EAV.

Anonymous said...

"adding a new column requires table and possibly code changes"
Code change: Depends on how you write the code. You know you can access metadata...
Table change: Runtime DDL or EAV, each is to be avoided, so you can choose.

Pete said...

Good article -- thanks. I work with EAV a lot in healthcare, where it can be highly useful because of the "sparse matrix" nature of the data. Question for you -- do you know of an algorithm or technique for transforming an EAV table to a flat table "on the fly" when the number of attributes that must become columns is potentially large and/or little is known about them a priori?

Your method involves knowing all of the attributes in advance, hard-coding the new table that will represent them as columns, and then constructing the query that transforms. I would like to be able to write a query that takes any EAV table about which I have little prior knowledge and transforms and presents the data in a flattened format.

Thanks in advance for any advice!

PS -- if this cannot be accomplished in SQL, I would also be interested in a programmatic algorithm using arrays.

Anonymous said...

Good article.

I've written about the many problems of EAV, AKA MUCK.
http://rodgersnotes.wordpress.com/2010/09/21/muck-massively-unified-code-key-generic-three-table-data-model/
Based on frustrating experience.

If anyone brings up the idea of using MUCK/EAV, feel free to refer them to the article.

Best!