SQL: Speed difference when GROUP BY and WINDOW functions are equivalent
SQL Window Functions (archive) are more powerful than just Group By (archive) queries. Yet, they sometimes are equivalent.
Is there a speed difference when this is the case?
1. Test with a simple SQLite example
Consider the following orders
table (here the first 10 rows):
orderid | customerid | total |
---|---|---|
1001 |
4308 |
25.52 |
1002 |
11683 |
35.33 |
1003 |
5676 |
30.79 |
1004 |
3097 |
77.6 |
1005 |
10374 |
109.04 |
1006 |
9241 |
15.13 |
1007 |
7189 |
11.02 |
1008 |
7228 |
33.11 |
1009 |
1125 |
17.19 |
1010 |
7340 |
4.6 |
We want to know the maximum total price that each customer paid.
2. Two ways to do that
We can either write a group-by query:
select
customerid,
max(total) as max
from orders
group by customerid;
or use a window function, and filtering the duplicated with distinct
:
select distinct
customerid,
max(total) over (partition by customerid) as max
from orders;
The results are identical.
3. Efficiency
However, if we ask SQLite to explain the plan of execution for both queries (and there are no indexes), we get very different plans.
Here is the execution plan for the group-by query:
id | parent | notused | detail |
---|---|---|---|
6 |
0 |
0 |
SCAN orders |
8 |
0 |
0 |
USE TEMP B-TREE FOR GROUP BY |
And here is the execution plan for the window-function query:
id | parent | notused | detail |
---|---|---|---|
3 |
0 |
0 |
CO-ROUTINE (subquery-2) |
6 |
3 |
0 |
SCAN orders |
14 |
3 |
0 |
USE TEMP B-TREE FOR ORDER BY |
31 |
0 |
0 |
SCAN (subquery-2) |
88 |
0 |
0 |
USE TEMP B-TREE FOR DISTINCT |
It would be nice if SQLite would also show some cardinality and cost like Oracle SQL database does, but that’s enough to see the gist of it.
4. Why?
Window functions are designed to keep each row first, but you can decide to reduce
them all later: this explains why we have a second SCAN
happening.
Order by can only reduce the results, and so it can be done more efficiently, without having to keep some useless results for each row.
At some point, we could expect the query optimizer (archive) to optimize both to the same execution plan thanks so some equational reasoning.
Until then, it’s up to you to spot useless use of window functions where group by suffice.
Note
|
I used SQLite as a demo here, but this is not the only database engine to not perform this optimization. |