Refactoring Entity-Attribute-Value Design

Entity-Attribute-Value (often referenced as EAV) table design is one of the “wonderful” things that newcomers to the SQL land discover very quickly! Traditional procedural programmers have very hard time understanding why tables have fixed number of columns, and adding a new column requires table and possibly code change. Their great problem solving skills from the procedural world help them find a “natural” solution – the EAV design – where a pair of key (describing the attribute) and value allow to store data with flexibility.
Example is storing all employee information in 3 columns: key column to identify employee (entity), attribute key column (attribute) to identify the attribute stored (name, birth date, SSN, date of hire, etc.), and value column (value) to store the actual value.

More on EAV...

Tony Andrews has an excellent article explaining the problems with EAV design: OTLT and EAV: the two big design mistakes all beginners make. It is worth to note that in some cases where it is required to store many attributes that change very often (new attributes added, or existing attributes deleted) and there is no need for referential or domain integrity and query logic based on these attributes then it may be a valid scenario to utilize EAV design.

Normally the value column has to be of some large length character data type, like VARCHAR(200), and data has to be casted to text when saved and then casted back to the original data type when retrieved.
Once all coding is done then comes the realization of how bad this approach is… And sometimes you just walk into a project where EAV has been already deployed and now it is your job to fix it. The typical signs of EAV implementation is the existence of very complex (and slow) queries to satisfy relatively trivial user requests for data and the lack of data integrity.

How do you fix EAV design?
Here is one example of EAV design and the refactoring process to correct it. In our case the information for bank loans is stored in a table representing EAV design. Listing 1 below shows the scripts to create the table and insert sample data.

Listing 1

-- EAV table              

CREATE TABLE EAV_Loans (              

  loan_nbr INT NOT NULL,              

  customer_nbr INT NOT NULL,              

  code VARCHAR(30) NOT NULL,              

  value VARCHAR(200),              

  CONSTRAINT pk_eav_loans              

  PRIMARY KEY (loan_nbr, customer_nbr, code));              



INSERT INTO EAV_Loans              

(loan_nbr, customer_nbr, code, value)              

VALUES(1, 1, 'date', '20080110');              

INSERT INTO EAV_Loans              

(loan_nbr, customer_nbr, code, value)              

VALUES(1, 1, 'amount', '1500.00');              

INSERT INTO EAV_Loans              

(loan_nbr, customer_nbr, code, value)              

VALUES(1, 1, 'type', 'personal');              

INSERT INTO EAV_Loans              

(loan_nbr, customer_nbr, code, value)              

VALUES(2, 2, 'date', '20080215');              

INSERT INTO EAV_Loans              

(loan_nbr, customer_nbr, code, value)              

VALUES(2, 2, 'amount', '3500.00');              

INSERT INTO EAV_Loans              

(loan_nbr, customer_nbr, code, value)              

VALUES(2, 2, 'type', 'personal');              



SELECT loan_nbr, customer_nbr, code, value              

FROM EAV_Loans;              



/*              



loan_nbr    customer_nbr code      value               

----------- ------------ --------- ----------               

1           1            amount    1500.00               

1           1            date      20080110               

1           1            type      personal               

2           2            amount    3500.00               

2           2            date      20080215               

2           2            type      personal               





*/

Here the pair of loan number and customer number identifies the entity. The code column stores the attribute (amount, date, and type of loan) and the value column represents that actual value.
A quick demonstration of the problems caused by this design. Let’s write a query to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008. Here is a first attempt to satisfy the request:

Listing 2

SELECT A.loan_nbr,              

       A.customer_nbr,              

       CAST(A.value AS DATETIME) AS loan_date,              

       CAST(B.value AS DECIMAL(15, 2)) AS loan_amount              

FROM EAV_Loans AS A              

INNER JOIN EAV_Loans AS B              

   ON A.loan_nbr = B.loan_nbr              

  AND A.customer_nbr = B.customer_nbr              

INNER JOIN EAV_Loans AS C              

   ON A.loan_nbr = C.loan_nbr              

  AND A.customer_nbr = C.customer_nbr              

WHERE A.code = 'date'              

  AND CAST(A.value AS DATETIME) >= '20080101'              

  AND CAST(A.value AS DATETIME) <  '20080201'              

  AND B.code = 'amount'              

  AND CAST(B.value AS DECIMAL(15, 2)) > 1000.00              

  AND C.code = 'type'              

  AND C.value = 'personal';

One of the first issues to notice is the multiple joins to retrieve each attribute. This is because each attribute is stored in separate row and we need to put back together the customer loan info. Besides that seems the code should be OK… except one little problem! The nature of the SQL is such that nothing guarantees that the predicates in the WHERE clause will be executed in the order listed. They will be expanded in the query plan and the cast to DATETIME or DECIMAL can be applied to the loan type value which will result in conversion error. I described this in more detail in my post Predicates in SQL.
To fix this you may be tempted to try a different approach: using table expressions to retrieve each attribute. Listing 3 shows a version of such query.

Listing 3

SELECT A.loan_nbr,              

       A.customer_nbr,              

       loan_date,              

       loan_amount              

FROM (SELECT loan_nbr, customer_nbr,              

             CAST(value AS DATETIME) AS loan_date              

      FROM EAV_Loans              

      WHERE code = 'date') AS A              

INNER JOIN (SELECT loan_nbr, customer_nbr,              

                   CAST(value AS DECIMAL(15, 2)) AS loan_amount              

            FROM EAV_Loans              

            WHERE code = 'amount') AS B              

   ON A.loan_nbr = B.loan_nbr              

  AND A.customer_nbr = B.customer_nbr              

INNER JOIN (SELECT loan_nbr, customer_nbr,              

                   value AS loan_type              

            FROM EAV_Loans              

            WHERE code = 'type') AS C              

   ON A.loan_nbr = C.loan_nbr              

  AND A.customer_nbr = C.customer_nbr              

WHERE loan_date >= '20080101'              

  AND loan_date <  '20080201'              

  AND loan_amount > 1000.00              

  AND loan_type = 'personal';

However, this query has the same problem because the table expressions are expanded in the query plan and nothing guarantees the order of executing the predicates. One approach to retrieve the requested data successfully is to use a pivoting technique. The query below illustrates that.

Listing 4

SELECT loan_nbr,              

       customer_nbr,             

       loan_date,             

       loan_amount             

FROM (             

SELECT loan_nbr,             

       customer_nbr,             

       MAX(CASE WHEN code = 'date' THEN CAST(value AS DATETIME) END),             

       MAX(CASE WHEN code = 'amount' THEN CAST(value AS DECIMAL(15, 2)) END),             

       MAX(CASE WHEN code = 'type' THEN value END)             

FROM EAV_Loans             

GROUP BY loan_nbr, customer_nbr             

      ) AS L(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)             

WHERE loan_date >= '20080101'             

  AND loan_date <  '20080201'             

  AND loan_amount > 1000.00             

  AND loan_type = 'personal';

Here the CASE expressions guarantee that only the correct values will be casted to the specific data type. But then how efficient is this query for such a trivial request?

The refactoring process
Usually the first step of correcting an EAV design is to create a normalized table for storing the data. In our case the table structure can look like Listing 5 (note that here for clarify the loan type is spelled out as entire word; in real business scenario it may be sufficient to indicate only P or B; alternatively create table to store loan types and reference via a foreign key constraint).

Listing 5

CREATE TABLE Loans (              

  loan_nbr INT NOT NULL,              

  customer_nbr INT NOT NULL,              

  loan_date DATETIME NOT NULL,              

  loan_amount DECIMAL(15, 2) NOT NULL,              

  loan_type VARCHAR(10) NOT NULL,              

  CONSTRAINT ck_loan_type              

  CHECK (loan_type IN ('personal', 'business')),              

  CONSTRAINT pk_loans              

  PRIMARY KEY (loan_nbr));

Now each attribute is in separate column with appropriate data type. This guarantees the integrity of the data as well as now we can define constraints to restrict data values (like the CHECK constraint for loan type).
The next step is to convert and transfer data from the old format to the new table. We already know that a pivoting technique works and can be used here. Here is the query to transfer from EAV to normalized format.

Listing 6

INSERT INTO Loans              

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)              

SELECT loan_nbr,              

       customer_nbr,              

       MAX(CASE WHEN code = 'date' THEN CAST(value AS DATETIME) END),              

       MAX(CASE WHEN code = 'amount' THEN CAST(value AS DECIMAL(15, 2)) END),              

       MAX(CASE WHEN code = 'type' THEN value END)              

FROM EAV_Loans              

GROUP BY loan_nbr, customer_nbr;

The last step in the refactoring process is to replace the old EAV table with view for backward compatibility (if there is code referencing that original table that cannot be changed at this time). There are different methods to accomplish this. For simplicity here we can use unpivoting using UNION and query for each attribute (note that you have to drop the original EAV table first because the view has the same name).

Listing 7

CREATE VIEW EAV_Loans              

(loan_nbr, customer_nbr, code, value)              

AS              

SELECT loan_nbr, customer_nbr,              

       CAST('date' AS VARCHAR(30)),              

       CONVERT(VARCHAR(200), loan_date, 112)              

FROM Loans              

UNION              

SELECT loan_nbr, customer_nbr,              

       CAST('amount' AS VARCHAR(30)),              

       CAST(loan_amount AS VARCHAR(200))              

FROM Loans              

UNION              

SELECT loan_nbr, customer_nbr,              

       CAST('type' AS VARCHAR(30)),              

       CAST(loan_type AS VARCHAR(200))              

FROM Loans;

If there is any old code that references the original EAV table for data modifications you will have to add instead of triggers to the view to handle it.

Now, let’s answer the same data request to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008, this time using the normalized table.

Listing 8

SELECT loan_nbr, customer_nbr, loan_date, loan_amount              

FROM Loans              

WHERE loan_date >= '20080101'              

  AND loan_date <  '20080201'              

  AND loan_amount > 1000.00              

  AND loan_type = 'personal';

Simple and elegant, just as it should be!

Labels: best practices, refactoring, t-sql programming