Lecture 10: Window function in SQL#
Learning Objectives#
By the end of this lecture, students should understand:
Window functions
# load the ipython-sql extension
%load_ext sql
%config SqlMagic.displaylimit = 10
import json
import urllib.parse
with open('data/credentials.json') as f:
login = json.load(f)
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
The sql extension is already loaded. To reload it, use:
%reload_ext sql
Let’s connect to your dvdrental database
%sql postgresql://{username}:{password}@{host}:{port}/dvdrental
'Connected: postgres@dvdrental'
Introduction#
Window functions perform calculations across a set of table rows that are somehow related to the current row. They are often used for running totals, moving averages, and ranking.
Common Window Functions#
ROW_NUMBER()
RANK()
DENSE_RANK()
NTILE()
LAG()
LEAD()
SUM()
AVG()
Example Dataset: dvdrental#
1. ROW_NUMBER()#
Description: Assign a unique sequential integer to each rental for each customer, ordered by rental date.
SQL with Window Function:
%%sql
SELECT
customer_id,
rental_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY rental_date) AS row_num
FROM rental;
* postgresql://postgres:***@localhost:5432/dvdrental
16044 rows affected.
customer_id | rental_date | row_num |
---|---|---|
1 | 2005-05-25 11:30:37 | 1 |
1 | 2005-05-28 10:35:23 | 2 |
1 | 2005-06-15 00:54:12 | 3 |
1 | 2005-06-15 18:02:53 | 4 |
1 | 2005-06-15 21:08:46 | 5 |
1 | 2005-06-16 15:18:57 | 6 |
1 | 2005-06-18 08:41:48 | 7 |
1 | 2005-06-18 13:33:59 | 8 |
1 | 2005-06-21 06:24:45 | 9 |
1 | 2005-07-08 03:17:05 | 10 |
Task: Assume you are given the table below on rental transactions made by users. Write a query to obtain the third transaction of every customer. Output the user id, and rental date.
%%sql
SELECT * FROM rental LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
rental_id | rental_date | inventory_id | customer_id | return_date | staff_id | last_update |
---|---|---|---|---|---|---|
2 | 2005-05-24 22:54:33 | 1525 | 459 | 2005-05-28 19:40:33 | 1 | 2006-02-16 02:30:53 |
3 | 2005-05-24 23:03:39 | 1711 | 408 | 2005-06-01 22:12:39 | 1 | 2006-02-16 02:30:53 |
4 | 2005-05-24 23:04:41 | 2452 | 333 | 2005-06-03 01:43:41 | 2 | 2006-02-16 02:30:53 |
5 | 2005-05-24 23:05:21 | 2079 | 222 | 2005-06-02 04:33:21 | 1 | 2006-02-16 02:30:53 |
6 | 2005-05-24 23:08:07 | 2792 | 549 | 2005-05-27 01:32:07 | 1 | 2006-02-16 02:30:53 |
7 | 2005-05-24 23:11:53 | 3995 | 269 | 2005-05-29 20:34:53 | 2 | 2006-02-16 02:30:53 |
8 | 2005-05-24 23:31:46 | 2346 | 239 | 2005-05-27 23:33:46 | 2 | 2006-02-16 02:30:53 |
9 | 2005-05-25 00:00:40 | 2580 | 126 | 2005-05-28 00:22:40 | 1 | 2006-02-16 02:30:53 |
10 | 2005-05-25 00:02:21 | 1824 | 399 | 2005-05-31 22:44:21 | 2 | 2006-02-16 02:30:53 |
11 | 2005-05-25 00:09:02 | 4443 | 142 | 2005-06-02 20:56:02 | 2 | 2006-02-16 02:30:53 |
%%sql
SELECT
customer_id,
rental_date
FROM (
SELECT
customer_id,
rental_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY rental_date) AS row_num
FROM rental
) AS subquery
WHERE row_num <= 3;
* postgresql://postgres:***@localhost:5432/dvdrental
1797 rows affected.
customer_id | rental_date |
---|---|
1 | 2005-05-25 11:30:37 |
1 | 2005-05-28 10:35:23 |
1 | 2005-06-15 00:54:12 |
2 | 2005-05-27 00:09:24 |
2 | 2005-06-17 20:54:58 |
2 | 2005-07-10 06:31:24 |
3 | 2005-05-27 17:17:09 |
3 | 2005-05-29 22:43:55 |
3 | 2005-06-16 01:34:05 |
4 | 2005-06-15 09:31:28 |
2. RANK()#
Description: Rank customers based on their total count of rentals, allowing for gaps in ranking values.
SQL with Window Function:
%%sql
SELECT
customer_id,
rental_count,
RANK() OVER (ORDER BY rental_count DESC) AS rank
FROM (
SELECT
customer_id,
COUNT(*) AS rental_count
FROM
rental
GROUP BY
customer_id
) ;
* postgresql://postgres:***@localhost:5432/dvdrental
599 rows affected.
customer_id | rental_count | rank |
---|---|---|
148 | 46 | 1 |
526 | 45 | 2 |
144 | 42 | 3 |
236 | 42 | 3 |
75 | 41 | 5 |
469 | 40 | 6 |
197 | 40 | 6 |
137 | 39 | 8 |
178 | 39 | 8 |
468 | 39 | 8 |
Alternative Approach:
3. DENSE_RANK()#
Task: Rank each rental for each customer by the total count of rentals, WITHOUT gaps in ranking values.
SQL with Window Function:
%%sql
SELECT
customer_id,
rental_count,
DENSE_RANK() OVER (ORDER BY rental_count DESC) AS dense_rank
FROM (
SELECT
customer_id,
COUNT(*) AS rental_count
FROM
rental
GROUP BY
customer_id
) ;
* postgresql://postgres:***@localhost:5432/dvdrental
599 rows affected.
customer_id | rental_count | dense_rank |
---|---|---|
148 | 46 | 1 |
526 | 45 | 2 |
144 | 42 | 3 |
236 | 42 | 3 |
75 | 41 | 4 |
469 | 40 | 5 |
197 | 40 | 5 |
137 | 39 | 6 |
178 | 39 | 6 |
468 | 39 | 6 |
Differences Between RANK
, DENSE_RANK
, and ROW_NUMBER
#
Function |
Description |
Handling of Ties |
---|---|---|
|
Assigns a unique sequential integer to rows within a partition of a result set. |
No ties, each row gets a unique number. |
|
Assigns a rank to each row within the partition of a result set. |
Ties receive the same rank, with gaps in the ranking sequence. |
|
Similar to |
Ties receive the same rank, without gaps in the ranking sequence. |
This table summarizes the key differences between ROW_NUMBER
, RANK
, and DENSE_RANK
functions, particularly in how they handle ties.
4. NTILE()#
Task: Distribute rentals for each customer into four quartiles based on rental date.
SQL with Window Function:
%%sql
SELECT
customer_id,
rental_date,
NTILE(4) OVER (PARTITION BY customer_id ORDER BY rental_date) AS quartile
FROM rental
WHERE customer_id = 1;
* postgresql://postgres:***@localhost:5432/dvdrental
32 rows affected.
customer_id | rental_date | quartile |
---|---|---|
1 | 2005-05-25 11:30:37 | 1 |
1 | 2005-05-28 10:35:23 | 1 |
1 | 2005-06-15 00:54:12 | 1 |
1 | 2005-06-15 18:02:53 | 1 |
1 | 2005-06-15 21:08:46 | 1 |
1 | 2005-06-16 15:18:57 | 1 |
1 | 2005-06-18 08:41:48 | 1 |
1 | 2005-06-18 13:33:59 | 1 |
1 | 2005-06-21 06:24:45 | 2 |
1 | 2005-07-08 03:17:05 | 2 |
5. LAG()#
Task: Find the time gap between rentals for each customer
SQL with Window Function:
%%sql
SELECT
customer_id,
rental_id,
rental_date,
LAG(rental_date, 5) OVER (PARTITION BY customer_id ORDER BY rental_date) AS prev_rental_date,
rental_date - LAG(rental_date) OVER (PARTITION BY customer_id ORDER BY rental_date) AS time_gap
FROM
rental
WHERE customer_id = 4;
* postgresql://postgres:***@localhost:5432/dvdrental
22 rows affected.
customer_id | rental_id | rental_date | prev_rental_date | time_gap |
---|---|---|---|---|
4 | 1297 | 2005-06-15 09:31:28 | None | None |
4 | 1633 | 2005-06-16 08:08:40 | None | 22:37:12 |
4 | 1707 | 2005-06-16 14:01:27 | None | 5:52:47 |
4 | 1735 | 2005-06-16 15:51:52 | None | 1:50:25 |
4 | 2043 | 2005-06-17 14:31:12 | None | 22:39:20 |
4 | 2642 | 2005-06-19 09:39:01 | 2005-06-15 09:31:28 | 1 day, 19:07:49 |
4 | 7660 | 2005-07-28 02:10:10 | 2005-06-16 08:08:40 | 38 days, 16:31:09 |
4 | 7718 | 2005-07-28 04:37:59 | 2005-06-16 14:01:27 | 2:27:49 |
4 | 8741 | 2005-07-29 18:44:57 | 2005-06-16 15:51:52 | 1 day, 14:06:58 |
4 | 9100 | 2005-07-30 08:46:09 | 2005-06-17 14:31:12 | 14:01:12 |
6. LEAD()#
Task: Retrieve the next rental date for each rental of each customer.
SQL with Window Function:
%%sql
SELECT
customer_id,
rental_date,
LEAD(rental_date, 1) OVER (PARTITION BY customer_id ORDER BY rental_date) AS next_rental_date
FROM rental;
* postgresql://postgres:***@localhost:5432/dvdrental
16044 rows affected.
customer_id | rental_date | next_rental_date |
---|---|---|
1 | 2005-05-25 11:30:37 | 2005-05-28 10:35:23 |
1 | 2005-05-28 10:35:23 | 2005-06-15 00:54:12 |
1 | 2005-06-15 00:54:12 | 2005-06-15 18:02:53 |
1 | 2005-06-15 18:02:53 | 2005-06-15 21:08:46 |
1 | 2005-06-15 21:08:46 | 2005-06-16 15:18:57 |
1 | 2005-06-16 15:18:57 | 2005-06-18 08:41:48 |
1 | 2005-06-18 08:41:48 | 2005-06-18 13:33:59 |
1 | 2005-06-18 13:33:59 | 2005-06-21 06:24:45 |
1 | 2005-06-21 06:24:45 | 2005-07-08 03:17:05 |
1 | 2005-07-08 03:17:05 | 2005-07-08 07:33:56 |
Task: Find the week-to-week growth rate in total payment amount
%%sql
SELECT * FROM payment
* postgresql://postgres:***@localhost:5432/dvdrental
14596 rows affected.
payment_id | customer_id | staff_id | rental_id | amount | payment_date |
---|---|---|---|---|---|
17503 | 341 | 2 | 1520 | 7.99 | 2007-02-15 22:25:46.996577 |
17504 | 341 | 1 | 1778 | 1.99 | 2007-02-16 17:23:14.996577 |
17505 | 341 | 1 | 1849 | 7.99 | 2007-02-16 22:41:45.996577 |
17506 | 341 | 2 | 2829 | 2.99 | 2007-02-19 19:39:56.996577 |
17507 | 341 | 2 | 3130 | 7.99 | 2007-02-20 17:31:48.996577 |
17508 | 341 | 1 | 3382 | 5.99 | 2007-02-21 12:33:49.996577 |
17509 | 342 | 2 | 2190 | 5.99 | 2007-02-17 23:58:17.996577 |
17510 | 342 | 1 | 2914 | 5.99 | 2007-02-20 02:11:44.996577 |
17511 | 342 | 1 | 3081 | 2.99 | 2007-02-20 13:57:39.996577 |
17512 | 343 | 2 | 1547 | 4.99 | 2007-02-16 00:10:50.996577 |
%%sql
WITH weekly_payments AS (
SELECT
DATE_TRUNC('week', payment_date) AS week,
SUM(amount) AS total_amount
FROM
payment
GROUP BY
DATE_TRUNC('week', payment_date)
)
SELECT
week,
total_amount,
LEAD(total_amount) OVER (ORDER BY week) AS next_week_amount,
ROUND((LEAD(total_amount) OVER (ORDER BY week) - total_amount) / total_amount * 100, 2) AS growth_rate
FROM
weekly_payments;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
week | total_amount | next_week_amount | growth_rate |
---|---|---|---|
2007-02-12 00:00:00 | 4923.98 | 3427.86 | -30.38 |
2007-02-19 00:00:00 | 3427.86 | 5358.29 | 56.32 |
2007-02-26 00:00:00 | 5358.29 | 5443.20 | 1.58 |
2007-03-12 00:00:00 | 5443.20 | 13085.07 | 140.39 |
2007-03-19 00:00:00 | 13085.07 | 6562.62 | -49.85 |
2007-04-02 00:00:00 | 6562.62 | 7911.84 | 20.56 |
2007-04-09 00:00:00 | 7911.84 | 8361.11 | 5.68 |
2007-04-23 00:00:00 | 8361.11 | 5723.89 | -31.54 |
2007-04-30 00:00:00 | 5723.89 | 514.18 | -91.02 |
2007-05-14 00:00:00 | 514.18 | None | None |
7. SUM()#
Task: Calculate the cummulative sum of payment amounts for each customer, ordered by payment date.
For example, if a customer made payments of \(10, \)20, and \(30 on three different dates, the running total after each payment would be \)10, \(30, and \)60, respectively.
SQL with Window Function:
%%sql
SELECT
customer_id,
payment_date,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY payment_date) AS running_total
FROM payment;
* postgresql://postgres:***@localhost:5432/dvdrental
14596 rows affected.
customer_id | payment_date | running_total |
---|---|---|
1 | 2007-02-14 23:22:38.996577 | 5.99 |
1 | 2007-02-15 16:31:19.996577 | 6.98 |
1 | 2007-02-15 19:37:12.996577 | 16.97 |
1 | 2007-02-16 13:47:23.996577 | 21.96 |
1 | 2007-02-18 07:10:14.996577 | 26.95 |
1 | 2007-02-18 12:02:25.996577 | 27.94 |
1 | 2007-02-21 04:53:11.996577 | 31.93 |
1 | 2007-03-01 07:19:30.996577 | 36.92 |
1 | 2007-03-02 14:05:18.996577 | 40.91 |
1 | 2007-03-02 16:30:04.996577 | 41.90 |
8. AVG()#
Task: Calculate the cummulative average of payment amounts for each customer, ordered by payment date.
SQL with Window Function:
%%sql
SELECT
customer_id,
payment_date,
ROUND(AVG(amount) OVER (PARTITION BY customer_id ORDER BY payment_date), 2) AS running_avg
FROM
payment
ORDER BY
customer_id,
payment_date;
* postgresql://postgres:***@localhost:5432/dvdrental
14596 rows affected.
customer_id | payment_date | running_avg |
---|---|---|
1 | 2007-02-14 23:22:38.996577 | 5.99 |
1 | 2007-02-15 16:31:19.996577 | 3.49 |
1 | 2007-02-15 19:37:12.996577 | 5.66 |
1 | 2007-02-16 13:47:23.996577 | 5.49 |
1 | 2007-02-18 07:10:14.996577 | 5.39 |
1 | 2007-02-18 12:02:25.996577 | 4.66 |
1 | 2007-02-21 04:53:11.996577 | 4.56 |
1 | 2007-03-01 07:19:30.996577 | 4.62 |
1 | 2007-03-02 14:05:18.996577 | 4.55 |
1 | 2007-03-02 16:30:04.996577 | 4.19 |