Wednesday, November 12, 2008

Top N by Group

It is a very common request to select the top N items per group. Examples are the top 2 sales agents per region, the last two orders per customer, etc. There are various techniques to accomplish this. On SQL Server 2005 and 2008 this normally involves CROSS APPLY, TOP, and the ranking functions.

Here are a couple examples of solving this problem utilizing the ranking functions in SQL Server 2005/2008. These methods are very simple and efficient, at the same time providing flexibility to manage ties.

CREATE TABLE Loans (

  loan_nbr INT NOT NULL PRIMARY KEY,

  loan_date DATETIME NOT NULL,

  loan_amount DECIMAL(12, 2) NOT NULL

              DEFAULT 0.0,

  customer_nbr INT NOT NULL,

  loan_type CHAR(1) NOT NULL

            DEFAULT 'P'

            CHECK (loan_type IN ('P'-- personal

                                'B')) -- business

  );

 

INSERT INTO Loans

VALUES (1, '20080801', 12000.00, 2, 'P'),

       (2, '20080805', 15700.00, 1, 'B'),

       (3, '20080610', 12000.00, 3, 'P'),

       (4, '20080401', 5000.00, 1, 'P'),

       (5, '20080715', 25000.00, 4, 'B'),

       (6, '20080610', 25000.00, 5, 'P'),

       (7, '20080501', 1000.00, 6, 'P'),

       (8, '20080810', 6000.00, 7, 'B'),

       (9, '20080815', 2000.00, 8, 'B'),

       (10, '20080815', 1000.00, 9, 'P'),

       (11, '20080715', 5500.00, 10, 'P'),

       (12, '20080615', 1000.00, 11, 'B'),

       (13, '20080820', 6000.00, 12, 'P'),

       (14, '20080510', 28000.00, 6, 'B'),

       (15, '20080815', 2000.00, 10, 'P'),

       (16, '20080810', 1500.00, 8, 'P'),

       (17, '20080817', 10000.00, 10, 'B'),

       (18, '20080816', 2500.00, 9, 'P');  

 

-- Top 3 loans by loan type (no ties)

SELECT loan_nbr, loan_type, loan_amount, rk

FROM (SELECT loan_nbr, loan_type, loan_amount,

            ROW_NUMBER() OVER(PARTITION BY loan_type

                               ORDER BY loan_amount DESC) AS rk

      FROM Loans) AS L

WHERE rk <= 3;

 

-- Top 3 loans by loan type (with ties)

SELECT loan_nbr, loan_type, loan_amount, rk

FROM (SELECT loan_nbr, loan_type, loan_amount,

            DENSE_RANK() OVER(PARTITION BY loan_type

                               ORDER BY loan_amount DESC) AS rk

      FROM Loans) AS L

WHERE rk <= 3;

 

-- Latest loan for each customer

SELECT customer_nbr, loan_nbr, loan_type, loan_amount, loan_date

FROM (SELECT loan_nbr, loan_type, loan_amount,

            customer_nbr, loan_date,

            ROW_NUMBER() OVER(PARTITION BY customer_nbr

                               ORDER BY loan_date DESC) AS rk

      FROM Loans) AS L

WHERE rk = 1;

6 comments:

Anonymous said...

Great articles!! Thanks.

Mohammad said...

Hi Plamen Ratchev,
I think following approach is a good alternate for ROW_NUMBER. Also we can see Rec_ID for each group like yours methods.
If you know better methods please send.

[code]
--CROSS APPLY with Standard Ranking
SELECT loan_nbr, loan_type, loan_amount, D.recID
FROM loan_amount l
CROSS APPLY (SELECT COUNT(*) AS recID
FROM loan_amont
WHERE loan_type = l.loan_type
AND loan_amount >= l.loan_amound) D
WHERE D.recID <= 3;
[/code]

Plamen Ratchev said...

Hi Mohammad,
This method using CROSS APPLY will not be very efficient. Using ROW_NUMBER is a much better approach. You can read an excellent discussion on the topic in Itzik Ben-Gan's book T-SQL Querying.

Mohammad said...

Hi Plamen,
Is following approach efficient with lots of data set?


--3 most order for each customer from Orders table (Northwind database)

SELECT O1.CustomerID, O1.OrderID, MAX(O1.OrderDate) AS OrderDate
FROM Orders O1
JOIN Orders O2 --Self Join
ON O1.CustomerID = O2.CustomerID
AND O1.OrderDate <= O2.OrderDate
GROUP BY O1.CustomerID, O1.OrderID
HAVING COUNT(*) <= 3

Plamen Ratchev said...

Mohammad,

The best is to test with your data. But I do not think this approach will be as efficient as the ROW_NUMBER method.

Mohammad said...

Yes, ROW_NUMBER is fastest approach. other methods need to scan more than one time the tables data
other methods like:
top+orderby using apply()
top+orderby using in()
self join
counting