Lecture 10: Window function in SQL#

Learning Objectives#

By the end of this lecture, students should understand:

  • Window functions

# load the ipython-sql extension
%load_ext sql
%config SqlMagic.displaylimit = 10
import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
The sql extension is already loaded. To reload it, use:
  %reload_ext sql

Let’s connect to your dvdrental database

%sql postgresql://{username}:{password}@{host}:{port}/dvdrental
'Connected: postgres@dvdrental'

Introduction#

Window functions perform calculations across a set of table rows that are somehow related to the current row. They are often used for running totals, moving averages, and ranking.

Common Window Functions#

  1. ROW_NUMBER()

  2. RANK()

  3. DENSE_RANK()

  4. NTILE()

  5. LAG()

  6. LEAD()

  7. SUM()

  8. AVG()

Example Dataset: dvdrental#

1. ROW_NUMBER()#

Description: Assign a unique sequential integer to each rental for each customer, ordered by rental date.

SQL with Window Function:

%%sql
SELECT 
    customer_id, 
    rental_date, 
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY rental_date) AS row_num
FROM rental;
 * postgresql://postgres:***@localhost:5432/dvdrental
16044 rows affected.
customer_id rental_date row_num
1 2005-05-25 11:30:37 1
1 2005-05-28 10:35:23 2
1 2005-06-15 00:54:12 3
1 2005-06-15 18:02:53 4
1 2005-06-15 21:08:46 5
1 2005-06-16 15:18:57 6
1 2005-06-18 08:41:48 7
1 2005-06-18 13:33:59 8
1 2005-06-21 06:24:45 9
1 2005-07-08 03:17:05 10
16044 rows, truncated to displaylimit of 10

Task: Assume you are given the table below on rental transactions made by users. Write a query to obtain the third transaction of every customer. Output the user id, and rental date.

%%sql
SELECT * FROM rental LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
rental_id rental_date inventory_id customer_id return_date staff_id last_update
2 2005-05-24 22:54:33 1525 459 2005-05-28 19:40:33 1 2006-02-16 02:30:53
3 2005-05-24 23:03:39 1711 408 2005-06-01 22:12:39 1 2006-02-16 02:30:53
4 2005-05-24 23:04:41 2452 333 2005-06-03 01:43:41 2 2006-02-16 02:30:53
5 2005-05-24 23:05:21 2079 222 2005-06-02 04:33:21 1 2006-02-16 02:30:53
6 2005-05-24 23:08:07 2792 549 2005-05-27 01:32:07 1 2006-02-16 02:30:53
7 2005-05-24 23:11:53 3995 269 2005-05-29 20:34:53 2 2006-02-16 02:30:53
8 2005-05-24 23:31:46 2346 239 2005-05-27 23:33:46 2 2006-02-16 02:30:53
9 2005-05-25 00:00:40 2580 126 2005-05-28 00:22:40 1 2006-02-16 02:30:53
10 2005-05-25 00:02:21 1824 399 2005-05-31 22:44:21 2 2006-02-16 02:30:53
11 2005-05-25 00:09:02 4443 142 2005-06-02 20:56:02 2 2006-02-16 02:30:53
%%sql
SELECT 
    customer_id, 
    rental_date
FROM (
    SELECT 
        customer_id, 
        rental_date, 
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY rental_date) AS row_num
    FROM rental
) AS subquery
WHERE row_num <= 3;
 * postgresql://postgres:***@localhost:5432/dvdrental
1797 rows affected.
customer_id rental_date
1 2005-05-25 11:30:37
1 2005-05-28 10:35:23
1 2005-06-15 00:54:12
2 2005-05-27 00:09:24
2 2005-06-17 20:54:58
2 2005-07-10 06:31:24
3 2005-05-27 17:17:09
3 2005-05-29 22:43:55
3 2005-06-16 01:34:05
4 2005-06-15 09:31:28
1797 rows, truncated to displaylimit of 10

2. RANK()#

Description: Rank customers based on their total count of rentals, allowing for gaps in ranking values.

SQL with Window Function:

%%sql
SELECT 
    customer_id, 
    rental_count, 
    RANK() OVER (ORDER BY rental_count DESC) AS rank
FROM (
    SELECT 
        customer_id, 
        COUNT(*) AS rental_count
    FROM 
        rental
    GROUP BY 
        customer_id
) ;
 * postgresql://postgres:***@localhost:5432/dvdrental
599 rows affected.
customer_id rental_count rank
148 46 1
526 45 2
144 42 3
236 42 3
75 41 5
469 40 6
197 40 6
137 39 8
178 39 8
468 39 8
599 rows, truncated to displaylimit of 10

Alternative Approach:

3. DENSE_RANK()#

Task: Rank each rental for each customer by the total count of rentals, WITHOUT gaps in ranking values.

SQL with Window Function:

%%sql
SELECT 
    customer_id, 
    rental_count, 
    DENSE_RANK() OVER (ORDER BY rental_count DESC) AS dense_rank
FROM (
    SELECT 
        customer_id, 
        COUNT(*) AS rental_count
    FROM 
        rental
    GROUP BY 
        customer_id
) ;
 * postgresql://postgres:***@localhost:5432/dvdrental
599 rows affected.
customer_id rental_count dense_rank
148 46 1
526 45 2
144 42 3
236 42 3
75 41 4
469 40 5
197 40 5
137 39 6
178 39 6
468 39 6
599 rows, truncated to displaylimit of 10

Differences Between RANK, DENSE_RANK, and ROW_NUMBER#

Function

Description

Handling of Ties

ROW_NUMBER

Assigns a unique sequential integer to rows within a partition of a result set.

No ties, each row gets a unique number.

RANK

Assigns a rank to each row within the partition of a result set.

Ties receive the same rank, with gaps in the ranking sequence.

DENSE_RANK

Similar to RANK(), but without gaps in ranking values.

Ties receive the same rank, without gaps in the ranking sequence.

This table summarizes the key differences between ROW_NUMBER, RANK, and DENSE_RANK functions, particularly in how they handle ties.

4. NTILE()#

Task: Distribute rentals for each customer into four quartiles based on rental date.

SQL with Window Function:

%%sql
SELECT 
    customer_id, 
    rental_date, 
    NTILE(4) OVER (PARTITION BY customer_id ORDER BY rental_date) AS quartile
FROM rental
WHERE customer_id = 1;
 * postgresql://postgres:***@localhost:5432/dvdrental
32 rows affected.
customer_id rental_date quartile
1 2005-05-25 11:30:37 1
1 2005-05-28 10:35:23 1
1 2005-06-15 00:54:12 1
1 2005-06-15 18:02:53 1
1 2005-06-15 21:08:46 1
1 2005-06-16 15:18:57 1
1 2005-06-18 08:41:48 1
1 2005-06-18 13:33:59 1
1 2005-06-21 06:24:45 2
1 2005-07-08 03:17:05 2
32 rows, truncated to displaylimit of 10

5. LAG()#

Task: Find the time gap between rentals for each customer

SQL with Window Function:

%%sql
SELECT 
    customer_id, 
    rental_id, 
    rental_date, 
    LAG(rental_date, 5) OVER (PARTITION BY customer_id ORDER BY rental_date) AS prev_rental_date,
    rental_date - LAG(rental_date) OVER (PARTITION BY customer_id ORDER BY rental_date) AS time_gap
FROM 
    rental
WHERE customer_id = 4;
 * postgresql://postgres:***@localhost:5432/dvdrental
22 rows affected.
customer_id rental_id rental_date prev_rental_date time_gap
4 1297 2005-06-15 09:31:28 None None
4 1633 2005-06-16 08:08:40 None 22:37:12
4 1707 2005-06-16 14:01:27 None 5:52:47
4 1735 2005-06-16 15:51:52 None 1:50:25
4 2043 2005-06-17 14:31:12 None 22:39:20
4 2642 2005-06-19 09:39:01 2005-06-15 09:31:28 1 day, 19:07:49
4 7660 2005-07-28 02:10:10 2005-06-16 08:08:40 38 days, 16:31:09
4 7718 2005-07-28 04:37:59 2005-06-16 14:01:27 2:27:49
4 8741 2005-07-29 18:44:57 2005-06-16 15:51:52 1 day, 14:06:58
4 9100 2005-07-30 08:46:09 2005-06-17 14:31:12 14:01:12
22 rows, truncated to displaylimit of 10

6. LEAD()#

Task: Retrieve the next rental date for each rental of each customer.

SQL with Window Function:

%%sql
SELECT 
    customer_id, 
    rental_date, 
    LEAD(rental_date, 1) OVER (PARTITION BY customer_id ORDER BY rental_date) AS next_rental_date
FROM rental;
 * postgresql://postgres:***@localhost:5432/dvdrental
16044 rows affected.
customer_id rental_date next_rental_date
1 2005-05-25 11:30:37 2005-05-28 10:35:23
1 2005-05-28 10:35:23 2005-06-15 00:54:12
1 2005-06-15 00:54:12 2005-06-15 18:02:53
1 2005-06-15 18:02:53 2005-06-15 21:08:46
1 2005-06-15 21:08:46 2005-06-16 15:18:57
1 2005-06-16 15:18:57 2005-06-18 08:41:48
1 2005-06-18 08:41:48 2005-06-18 13:33:59
1 2005-06-18 13:33:59 2005-06-21 06:24:45
1 2005-06-21 06:24:45 2005-07-08 03:17:05
1 2005-07-08 03:17:05 2005-07-08 07:33:56
16044 rows, truncated to displaylimit of 10

Task: Find the week-to-week growth rate in total payment amount

%%sql
SELECT * FROM payment
 * postgresql://postgres:***@localhost:5432/dvdrental
14596 rows affected.
payment_id customer_id staff_id rental_id amount payment_date
17503 341 2 1520 7.99 2007-02-15 22:25:46.996577
17504 341 1 1778 1.99 2007-02-16 17:23:14.996577
17505 341 1 1849 7.99 2007-02-16 22:41:45.996577
17506 341 2 2829 2.99 2007-02-19 19:39:56.996577
17507 341 2 3130 7.99 2007-02-20 17:31:48.996577
17508 341 1 3382 5.99 2007-02-21 12:33:49.996577
17509 342 2 2190 5.99 2007-02-17 23:58:17.996577
17510 342 1 2914 5.99 2007-02-20 02:11:44.996577
17511 342 1 3081 2.99 2007-02-20 13:57:39.996577
17512 343 2 1547 4.99 2007-02-16 00:10:50.996577
14596 rows, truncated to displaylimit of 10
%%sql

WITH weekly_payments AS (
    SELECT 
        DATE_TRUNC('week', payment_date) AS week, 
        SUM(amount) AS total_amount
    FROM 
        payment
    GROUP BY 
        DATE_TRUNC('week', payment_date)
)
SELECT 
    week, 
    total_amount, 
    LEAD(total_amount) OVER (ORDER BY week) AS next_week_amount,
    ROUND((LEAD(total_amount) OVER (ORDER BY week) - total_amount) / total_amount * 100, 2) AS growth_rate
FROM 
    weekly_payments;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
week total_amount next_week_amount growth_rate
2007-02-12 00:00:00 4923.98 3427.86 -30.38
2007-02-19 00:00:00 3427.86 5358.29 56.32
2007-02-26 00:00:00 5358.29 5443.20 1.58
2007-03-12 00:00:00 5443.20 13085.07 140.39
2007-03-19 00:00:00 13085.07 6562.62 -49.85
2007-04-02 00:00:00 6562.62 7911.84 20.56
2007-04-09 00:00:00 7911.84 8361.11 5.68
2007-04-23 00:00:00 8361.11 5723.89 -31.54
2007-04-30 00:00:00 5723.89 514.18 -91.02
2007-05-14 00:00:00 514.18 None None

7. SUM()#

Task: Calculate the cummulative sum of payment amounts for each customer, ordered by payment date.

For example, if a customer made payments of \(10, \)20, and \(30 on three different dates, the running total after each payment would be \)10, \(30, and \)60, respectively.

SQL with Window Function:

%%sql
SELECT 
    customer_id, 
    payment_date, 
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY payment_date) AS running_total
FROM payment;
 * postgresql://postgres:***@localhost:5432/dvdrental
14596 rows affected.
customer_id payment_date running_total
1 2007-02-14 23:22:38.996577 5.99
1 2007-02-15 16:31:19.996577 6.98
1 2007-02-15 19:37:12.996577 16.97
1 2007-02-16 13:47:23.996577 21.96
1 2007-02-18 07:10:14.996577 26.95
1 2007-02-18 12:02:25.996577 27.94
1 2007-02-21 04:53:11.996577 31.93
1 2007-03-01 07:19:30.996577 36.92
1 2007-03-02 14:05:18.996577 40.91
1 2007-03-02 16:30:04.996577 41.90
14596 rows, truncated to displaylimit of 10

8. AVG()#

Task: Calculate the cummulative average of payment amounts for each customer, ordered by payment date.

SQL with Window Function:

%%sql
SELECT 
    customer_id, 
    payment_date, 
    ROUND(AVG(amount) OVER (PARTITION BY customer_id ORDER BY payment_date), 2) AS running_avg
FROM 
    payment
ORDER BY 
    customer_id, 
    payment_date;
 * postgresql://postgres:***@localhost:5432/dvdrental
14596 rows affected.
customer_id payment_date running_avg
1 2007-02-14 23:22:38.996577 5.99
1 2007-02-15 16:31:19.996577 3.49
1 2007-02-15 19:37:12.996577 5.66
1 2007-02-16 13:47:23.996577 5.49
1 2007-02-18 07:10:14.996577 5.39
1 2007-02-18 12:02:25.996577 4.66
1 2007-02-21 04:53:11.996577 4.56
1 2007-03-01 07:19:30.996577 4.62
1 2007-03-02 14:05:18.996577 4.55
1 2007-03-02 16:30:04.996577 4.19
14596 rows, truncated to displaylimit of 10