Friday, August 21, 2009

Column Aggregates

We all know how to use the aggregate functions MIN, MAX, COUNT, etc. to calculate aggregates across rows. For example, using MAX we can determine the maximum value for group of rows. But what if we want to see the maximum value for each row across columns? There is no aggregate function that accepts list of columns…

Let's look at example to illustrate this. Here is sample table with four data columns:

CREATE TABLE Foo (

 
keycol INT NOT NULL PRIMARY KEY,

 
col1 INT NOT NULL,

 
col2 INT NOT NULL,

 
col3 INT NOT NULL,

 
col4 INT NOT NULL);

 

INSERT INTO Foo VALUES(150110); 

INSERT INTO Foo VALUES(20031);

INSERT INTO Foo VALUES(30000);

INSERT INTO Foo VALUES(491228);

INSERT INTO Foo VALUES(58888);

Our goal is to calculate the maximum value for the four data columns for each row. With two or three columns it may be an easy task using a CASE expression. But adding more columns will make a very long and complex CASE expression. Is there an easier way?

One solution that provides a shortcut is utilizing the SQL Server capabilities to generate an XML result and then using XQuery to find the max value. Here is the query:

-- max across columns with XQuery

SELECT c.query('keycol').value('.''INT'AS keycol,

       
c.value('local-name(./*[data(.)=

                max(../*[not(local-name(.)="keycol") ])][1])'


                
'VARCHAR(20)'AS max_col,

       
c.value('max(*[not(local-name(.)="keycol")])''FLOAT'AS max_val

FROM (SELECT keycolcol1col2col3col4

      
FROM Foo

      
FOR XML PATHTYPEAS T(x)

CROSS 
APPLY x.nodes('row'AS R(c);

The tricky part here is the filtering of the key column from the evaluation for max. Here are the results:
keycol  max_col  max_val

------- -------- --------

1       col4     10

2       col3     3

3       col1     0

4       col3     22

5       col1     8

How about counting the number of none zero values across columns? Here is a solution for that using very similar query:
-- count non-zero columns with XQuery

SELECT c.query('keycol').value('.''INT'AS keycol,

       
c.value('count(*[not(local-name(.)="keycol") 

                and not(.=0)])'


               
'INT')  AS cnt_non_zero

FROM (SELECT keycolcol1col2col3col4

      
FROM Foo

      
FOR XML PATHTYPEAS T(x)

CROSS 
APPLY x.nodes('row'AS R(c);

And the results:

keycol  cnt_non_zero

------- ------------

1       3

2       2

3       0

4       4

5       4

This shows the power of XQuery to solve this problem in untraditional way. The catch? Look at the execution plan of the queries...

More efficient method to solve the first problem is to use unpivoting. Here is solution using the UNPIVOT operator (the same can be done with ANSI query using cross join with numbers table):
-- max across columns with UNPIVOT

SELECT keycolcol AS max_colval AS max_val

FROM (

SELECT keycolvalcol

       
ROW_NUMBER() OVER(PARTITION BY keycol ORDER BY val DESCcolAS rk

FROM Foo

UNPIVOT

(val FOR col IN (col1col2col3col4)) AS UAS T

WHERE rk 1;

And very similar solution using unpivoting to solve the second problem:

-- count non-zero columns with UNPIVOT

SELECT keycolCOUNT(NULLIF(val0)) AS cnt_non_zero

FROM Foo

UNPIVOT

(val FOR col IN (col1col2col3col4)) AS U

GROUP BY keycol;

10 comments:

Brian Tkatch said...

A very useful case for UNPIVOT. Thanx Plamen.

--CELKO-- said...

This is the proprietary GREATEST () and LEAST() functions in Oracle. I like it and think it should be standardized. That would solve the problem of how to handle NULLs (drop them or propagate them?).

In SQL FOR SMARTIES I also have a swap pair algorithm to sort the columns on a row. You can easily modify it to pick the greatest value.

--CELKO-- said...

>> How about counting the number of none zero values across ABS(SIGN(columns? <<

SELECT key_col,
(ABS(SIGN(col1)) + ABS(SIGN(col2)) + ABS(SIGN(col3)) + ABS(SIGN(col))) AS non_zero_cnt
FROM Foo;

This is both portable and faster than using proprietary code.

Mohammad said...

Hi Plamen,
I have created a novel and very efficient query for this problem. This method has much lower “query cost” than your pivoting method.


SELECT keycol,
COALESCE(PARSENAME(pre_parsing, 1), 'col1') AS max_col,
COALESCE(PARSENAME(pre_parsing, 2), col1) AS max_val
FROM (SELECT keycol,
CASE WHEN col1 >ALL (SELECT col2 UNION SELECT col3 UNION SELECT col4) THEN CAST(col1 AS VARCHAR(10))+'.col1'
WHEN col2 >ALL (SELECT col1 UNION SELECT col3 UNION SELECT col4) THEN CAST(col2 AS VARCHAR(20))+'.col2'
WHEN col3 >ALL (SELECT col1 UNION SELECT col2 UNION SELECT col4) then CAST(col3 AS VARCHAR(10))+'.col3'
WHEN col4 >ALL (SELECT col1 UNION SELECT col2 UNION SELECT col3) then CAST(col4 AS VARCHAR(10))+'.col4'
END AS pre_parsing,
col1
FROM foo f1
) AS D;
/*
keycol max_col max_val
----------- -------- -----------
1 col4 10
2 col3 3
3 col1 0
4 col3 22
5 col1 8
*/

Plamen Ratchev said...

Hi Mohammad,

Yes, this is interesting approach. Here is another fast solution:

SELECT keycol,
CAST(SUBSTRING(MAX(col), 5, 4) AS VARCHAR(16)) AS max_col,
CAST(SUBSTRING(MAX(col), 1, 4) AS INT) AS max_val
FROM Foo
CROSS APPLY(SELECT CAST(col1 AS BINARY(4)) +
CAST('col1' AS BINARY(4))
UNION ALL
SELECT CAST(col2 AS BINARY(4)) +
CAST('col2' AS BINARY(4))
UNION ALL
SELECT CAST(col3 AS BINARY(4)) +
CAST('col3' AS BINARY(4))
UNION ALL
SELECT CAST(col4 AS BINARY(4)) +
CAST('col4' AS BINARY(4))
) AS C(col)
GROUP BY keycol;

Mohammad said...

>>With two or three columns it may be an easy task using a CASE expression. But adding more columns will make a very long and complex CASE expression. <<
Hi,
I do not agree with this sentence. Because adding more column will NOT make the query complex. Here the suitable solution with 4 columns:
SELECT keycol,
COALESCE(PARSENAME(pre_parsing, 1), 'col1') AS max_col,
COALESCE(PARSENAME(pre_parsing, 2), col1) AS max_val
FROM (SELECT keycol,
CASE WHEN col1 > col2 AND col1 > col3 AND col1 > col4
THEN CAST(col1 AS VARCHAR(10)) + '.col1'
WHEN col2 > col1 AND col2 > col3 AND col2 > col4
THEN CAST(col2 AS VARCHAR(20)) + '.col2'
WHEN col3 > col1 AND col3 > col2 AND col3 > col4
THEN CAST(col3 AS VARCHAR(10)) + '.col3'
WHEN col4 > col1 AND col4 > col2 AND col4 > col3
THEN CAST(col4 AS VARCHAR(10)) + '.col4'
END AS pre_parsing,
col1
FROM foo f1
) AS D;

And here for six data columns:
SELECT keycol,
COALESCE(PARSENAME(pre_parsing, 1), 'col1') AS max_col,
COALESCE(PARSENAME(pre_parsing, 2), col1) AS max_val
FROM (SELECT keycol,
CASE WHEN col1 > col2 AND col1 > col3 AND col1 > col4 AND col1 > col5 AND col1 > col6
THEN CAST(col1 AS VARCHAR(10)) + '.col1'
WHEN col2 > col1 AND col2 > col3 AND col2 > col4 AND col2 > col5 AND col2 > col6
THEN CAST(col2 AS VARCHAR(20)) + '.col2'
WHEN col3 > col1 AND col3 > col2 AND col3 > col4 AND col3 > col5 AND col3 > col6
THEN CAST(col3 AS VARCHAR(10)) + '.col3'
WHEN col4 > col1 AND col4 > col2 AND col4 > col3 AND col4 > col5 AND col4 > col6
THEN CAST(col4 AS VARCHAR(10)) + '.col4'
WHEN col5 > col1 AND col5 > col2 AND col5 > col3 AND col5 > col4 AND col5 > col6
THEN CAST(col5 AS VARCHAR(10)) + '.col5'
WHEN col6 > col1 AND col6 > col2 AND col6 > col3 AND col6 > col4 AND col6 > col5
THEN CAST(col6 AS VARCHAR(10)) + '.col6'
END AS pre_parsing,
col1
FROM newfoo f1
) AS D;

Or maybe I am wrong!

Mohammad said...

Even simpler version with VALUES clause:

SELECT keycol,
CAST(SUBSTRING(MAX(col), 5, 4) AS VARCHAR(16)) AS max_col,
CAST(SUBSTRING(MAX(col), 1, 4) AS INT) AS max_val
FROM Foo
CROSS APPLY(VALUES
(CAST(col1 AS BINARY(4)) + CAST('col1' AS BINARY(4))),
(CAST(col2 AS BINARY(4)) + CAST('col2' AS BINARY(4))),
(CAST(col3 AS BINARY(4)) + CAST('col3' AS BINARY(4))),
(CAST(col4 AS BINARY(4)) + CAST('col4' AS BINARY(4)))
) AS C(col)
GROUP BY keycol;

Mohammad said...

Hi Plamen,
In UNPIVOT technique what if we what maximum and minimum of values across of columns per rows?
Need two ROW_NUMBERs and pivoting technique like following of is simpler?

SELECT keycol,
MAX(CASE WHEN maximum = 1 THEN attribut END) AS max_col,
MAX(CASE WHEN maximum = 1 THEN value END) AS maximum_value,
MAX(CASE WHEN minimum = 1 THEN attribut END) AS min_col,
MAX(CASE WHEN minimum = 1 THEN value END) AS minimum_value
FROM (SELECT *,
ROW_NUMBER() OVER(PARTITION BY keycol
ORDER BY value DESC) AS maximum,
ROW_NUMBER() OVER(PARTITION BY keycol
ORDER BY value ASC) AS minimum
FROM Foo
UNPIVOT (value FOR attribut IN (col1, col2, col3, col4)
) AS U
) AS D
WHERE minimum = 1
OR maximum = 1
GROUP BY keycol;

Plamen Ratchev said...

Hi Mohammad,

Using two ROW_NUMBER functions is probably the easiest method to handle MIN and MAX request.

Mohammad Salimabadi said...

Thank for reply,
Your CROSS APPLY solution can be simpler still. We can concatenate two columns in aggregates functions.

SELECT keycol,
CAST(SUBSTRING(MAX(CAST(value AS BINARY(4)) + CAST(attribute AS BINARY(4))), 5, 4) AS VARCHAR(16)) AS max_col,
CAST(SUBSTRING(MAX(CAST(value AS BINARY(4)) + CAST(attribute AS BINARY(4))), 1, 4) AS INT) AS max_val
FROM Foo
CROSS APPLY
(VALUES (col1, 'col1'), (col2, 'col2'), (col3, 'col3'), (col4, 'col4')
) AS D1(value, attribute)
GROUP BY keycol;