SQL: SELECT DISTINCT on Some Columns

- April 25, 2020

In some cases we want to ensure uniqueness only on a subset of row columns.

There are many different solutions proposed, some of them are simple and others are less. I didn't find a generic one. So let's state the problem first, then analyze it and define a generic solution.

Problem

DISTINCT is not a function, we cannot "apply" it on a subset of columns. Some folks try to substitute it with GROUP BY. But GROUP BY requires that all the SELECTed columns will be either grouped by or aggregated. Let's assume we use a FIRST/MIN/MAX aggregate. In this case all the columns that are not grouped by will have same values. I can hardly believe this is desired.

Analysis

Let's understand better why the original DISTINCT approach does not work. We get too many rows, since all those not important columns are returned with their multiple values, therefore we have duplicates on the important columns. Fortunately there is an OVER clause, which comes to solve this exact problem!

Solution

The OVER clause lets first partition (group) the rows and only then apply the aggregate function. Assuming we want to SELECT fields A, B, C, D and want that all the A-B pairs will be distinct. So we probably want to partition all the rows to groups of unique A-B permutations and within each group peek a random C (or may be the first C ordered by D (or even C)). For example:

Assuming we have a Promotions entity:

public partial class Promotions
{
    public int PromotionId { get; set; }
    public string PromotionName { get; set; }
    public decimal? Discount { get; set; }
    public DateTime StartDate { get; set; }
    public DateTime ExpiredDate { get; set; }
}

We want to partition (group) our promotions by StartDate-ExpiredDate pairs and then return those with the highest Discount:

DbContext.Promotions.Query((Promotions promo) => {
    // partition by StartDate-ExpiredDate and order by Discount
    var window = PARTITION(BY(promo.StartDate), BY(promo.ExpiredDate)).
                     ORDER(BY(promo.Discount).DESC);

    var r = SELECT(DISTINCT<Promotions>(promo.StartDate.@as(), promo.ExpiredDate.@as(),
        // take the FIRST_VALUE which corresponds to the row with the highest discount
        AggregateBy(FIRST_VALUE(promo.Discount)).OVER(window).@as(promo.Discount),
        AggregateBy(FIRST_VALUE(promo.PromotionId)).OVER(window).@as(promo.PromotionId),
        AggregateBy(FIRST_VALUE(promo.PromotionName)).OVER(window).
                                                       @as(promo.PromotionName)));
    FROM(promo);
    return r;
});

Performance Note: in case we don't care which row is returned in a partition, it's better to ORDER BY PK because usually rows are already ordered by it. Also needless to say that it's much faster to PARTITION BY indexed columns.

Summary

With this technique we can require uniqueness of an arbitrary column set and return the "right" row in case some group has multiple rows. A working example is hosted on GitHub in the DistinctOn class.

Search This Blog