Lecture 5: PostgreSQL part 2#

Learning Objectives#

By the end of this lecture, students should be able to apply and practice:

  1. Understand Aggregation Functions:

    • Learn how to use basic SQL aggregation functions such as COUNT(), SUM(), AVG(), MIN(), and MAX().

    • Explore PostgreSQL-specific aggregation functions like STDDEV and REGR_R2.

  2. Grouping Data:

    • Learn how to group data using the GROUP BY clause.

    • Understand the importance of the order of WHERE, GROUP BY, and ORDER BY clauses in SQL queries.

    • Understand the difference between WHERE and HAVING clauses.

  3. Real-World Examples and Exercises:

    • Apply the learned concepts to real-world examples and exercises provided in the lecture notebook.

    • Practice writing and executing SQL queries using the ipython-sql extension in Jupyter Notebook.

Connect to database#

Make sure you edit the credentials.json file with appropriate username and password as in worksheet1

# load the ipython-sql extension
%load_ext sql

import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

Let’s connect to your dvdrental database

%sql postgresql://{username}:{password}@{host}:{port}/dvdrental
'Connected: postgres@dvdrental'

Functions and Operators#

Introduction#

SQL provides a variety of functions and operators to perform operations on data. These functions and operators can be categorized into several types, including arithmetic, string, date/time, aggregate, and conditional functions.

Arithmetic Functions and Operators#

Function/Operator

Description

Example Usage

+

Addition

SELECT 5 + 3;

-

Subtraction

SELECT 5 - 3;

*

Multiplication

SELECT 5 * 3;

/

Division

SELECT 5 / 3;

%

Modulus (remainder of division)

SELECT 5 % 3;

Numeric Functions#

Function

Description

Example Usage

ABS

Absolute value

SELECT ABS(-5);

CEIL/CEILING

Smallest integer greater than or equal

SELECT CEIL(4.2);

FLOOR

Largest integer less than or equal

SELECT FLOOR(4.8);

ROUND

Rounds a number to a specified number of decimal places

SELECT ROUND(4.567, 2);

POWER

Raises a number to the power of another number

SELECT POWER(2, 3);

SQRT

Square root

SELECT SQRT(16);

EXP

Exponential (e^x)

SELECT EXP(2);

LN

Natural logarithm

SELECT LN(7.389);

LOG

Logarithm

SELECT LOG(10, 100);

PI

Returns the value of pi

SELECT PI();

RANDOM

Returns a random number

SELECT RANDOM();

Example

%%sql

SELECT
    25 * 2,
    ABS(-2^10),
    ROUND(23.24545, 2),
    SQRT(25),
    PI()
;
 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.
?column? abs round sqrt pi
50 1024.0 23.25 5.0 3.141592653589793

String Functions#

Function

Description

Example Usage

CONCAT

Concatenates two or more strings

SELECT CONCAT('Hello', ' ', 'World');

LOWER

Converts a string to lowercase

SELECT LOWER('HELLO');

UPPER

Converts a string to uppercase

SELECT UPPER('hello');

LENGTH

Returns the length of a string

SELECT LENGTH('Hello');

SUBSTRING

Extracts a substring from a string

SELECT SUBSTRING('Hello World', 1, 5);

TRIM

Removes leading and trailing spaces

SELECT TRIM('  Hello  ');

LTRIM

Removes leading spaces

SELECT LTRIM('  Hello');

RTRIM

Removes trailing spaces

SELECT RTRIM('Hello  ');

REPLACE

Replaces all occurrences of a substring

SELECT REPLACE('Hello World', 'World', 'SQL');

POSITION

Finds the position of a substring

SELECT POSITION('World' IN 'Hello World');

INITCAP

Converts the first letter of each word to uppercase

SELECT INITCAP('hello world');

LPAD

Pads the left side of a string

SELECT LPAD('Hello', 10, '*');

RPAD

Pads the right side of a string

SELECT RPAD('Hello', 10, '*');

String Functions#

%%sql
SELECT 
    CONCAT(first_name, ' ', last_name) AS full_name,
    LENGTH(first_name) AS first_name_length,
    LOWER(first_name) AS first_name_lower,
    UPPER(last_name) AS last_name_upper,
    INITCAP(first_name) AS initcap_name    
FROM customer
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
full_name first_name_length first_name_lower last_name_upper initcap_name
Jared Ely 5 jared ELY Jared
Mary Smith 4 mary SMITH Mary
Patricia Johnson 8 patricia JOHNSON Patricia
Linda Williams 5 linda WILLIAMS Linda
Barbara Jones 7 barbara JONES Barbara
Elizabeth Brown 9 elizabeth BROWN Elizabeth
Jennifer Davis 8 jennifer DAVIS Jennifer
Maria Miller 5 maria MILLER Maria
Susan Wilson 5 susan WILSON Susan
Margaret Moore 8 margaret MOORE Margaret
%%sql
SELECT first_name, 
    SUBSTRING(first_name, 1, 3) AS name_substring,
    REPLACE(first_name, 'a', 'A') AS replaced_name,
    LPAD(first_name, 10, '*') AS left_padded_name,
    RPAD(first_name, 10, '*') AS right_padded_name
FROM customer
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
first_name name_substring replaced_name left_padded_name right_padded_name
Jared Jar JAred *****Jared Jared*****
Mary Mar MAry ******Mary Mary******
Patricia Pat PAtriciA **Patricia Patricia**
Linda Lin LindA *****Linda Linda*****
Barbara Bar BArbArA ***Barbara Barbara***
Elizabeth Eli ElizAbeth *Elizabeth Elizabeth*
Jennifer Jen Jennifer **Jennifer Jennifer**
Maria Mar MAriA *****Maria Maria*****
Susan Sus SusAn *****Susan Susan*****
Margaret Mar MArgAret **Margaret Margaret**

Date/Time Functions#

Function

Description

Example Usage

CURRENT_DATE

Returns the current date

SELECT CURRENT_DATE;

CURRENT_TIME

Returns the current time

SELECT CURRENT_TIME;

CURRENT_TIMESTAMP

Returns the current date and time

SELECT CURRENT_TIMESTAMP;

NOW()

Returns the current date and time

SELECT NOW();

EXTRACT

Extracts a part of a date/time

SELECT EXTRACT(YEAR FROM rental_date);

AGE

Calculates the age between two dates

SELECT AGE(NOW(), rental_date);

TO_CHAR

Converts a date/time to a string

SELECT TO_CHAR(rental_date, 'YYYY-MM-DD');

TO_DATE

Converts a string to a date

SELECT TO_DATE('2023-01-01', 'YYYY-MM-DD');

DATE_TRUNC

Truncates a date/time to a specified precision

SELECT DATE_TRUNC('month', rental_date);

INTERVAL

Adds a specified time interval to a date/time

SELECT rental_date + INTERVAL '1 month';

DATE_PART

Extracts a subfield from a date/time

SELECT DATE_PART('day', rental_date);

Examples

%%sql
SELECT 
    CURRENT_DATE AS current_date,
    CURRENT_TIME AS current_time,
    CURRENT_TIMESTAMP AS current_timestamp,
    NOW() AS current_datetime;
 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.
current_date current_time current_timestamp current_datetime
2024-09-18 12:38:36.831023-07:00 2024-09-18 12:38:36.831023-07:00 2024-09-18 12:38:36.831023-07:00
%%sql
SELECT rental_id, rental_date, 
    EXTRACT(YEAR FROM rental_date) AS rental_year,
    AGE(NOW(), rental_date) AS rental_duration
FROM rental
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
rental_id rental_date rental_year rental_duration
2 2005-05-24 22:54:33 2005 7049 days, 13:44:40.574073
3 2005-05-24 23:03:39 2005 7049 days, 13:35:34.574073
4 2005-05-24 23:04:41 2005 7049 days, 13:34:32.574073
5 2005-05-24 23:05:21 2005 7049 days, 13:33:52.574073
6 2005-05-24 23:08:07 2005 7049 days, 13:31:06.574073
7 2005-05-24 23:11:53 2005 7049 days, 13:27:20.574073
8 2005-05-24 23:31:46 2005 7049 days, 13:07:27.574073
9 2005-05-25 00:00:40 2005 7049 days, 12:38:33.574073
10 2005-05-25 00:02:21 2005 7049 days, 12:36:52.574073
11 2005-05-25 00:09:02 2005 7049 days, 12:30:11.574073
%%sql
SELECT rental_id, rental_date, rental_date + INTERVAL '1 year' AS rental_next_year
FROM rental
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
rental_id rental_date rental_next_year
2 2005-05-24 22:54:33 2006-05-24 22:54:33
3 2005-05-24 23:03:39 2006-05-24 23:03:39
4 2005-05-24 23:04:41 2006-05-24 23:04:41
5 2005-05-24 23:05:21 2006-05-24 23:05:21
6 2005-05-24 23:08:07 2006-05-24 23:08:07
7 2005-05-24 23:11:53 2006-05-24 23:11:53
8 2005-05-24 23:31:46 2006-05-24 23:31:46
9 2005-05-25 00:00:40 2006-05-25 00:00:40
10 2005-05-25 00:02:21 2006-05-25 00:02:21
11 2005-05-25 00:09:02 2006-05-25 00:09:02

Aggregation Functions in SQL#

Aggregation functions in SQL are used to perform calculations on multiple rows of a table’s column and return a single value. These functions are often used with the GROUP BY clause to group rows that have the same values in specified columns into summary rows.

Built-in SQL Aggregation Functions#

Function

Description

Example Usage

COUNT(column)

Returns the number of non-NULL values in a column

SELECT COUNT(customer_id) FROM payment;

COUNT(*)

Returns the number of rows

SELECT COUNT(*) FROM payment;

MIN

Returns the minimum value of a column

SELECT MIN(amount) FROM payment;

MAX

Returns the maximum value of a column

SELECT MAX(amount) FROM payment;

SUM

Returns the sum of a numeric column

SELECT SUM(amount) FROM payment;

AVG

Returns the average value of a numeric column

SELECT AVG(amount) FROM payment;

Note

  • Except for COUNT(*), all aggregation functions ignore NULLs

  • In addition to numbers, MIN() and MAX() also work with strings.

Postgres-Specific Aggregation Functions#

In addition to the standard SQL aggregation functions, Postgres also provides a number of functions of the same kind which can be useful for some statistical calculations (find a comprehensive list in the documentations here)

Function

Description

Example Usage

STDDEV

Returns the standard deviation of a numeric column

SELECT STDDEV(amount) FROM payment;

REGR_R2

Returns the coefficient of determination (R²) of a linear regression

SELECT REGR_R2(y, x) FROM table;

%%sql
SELECT 
    COUNT(customer_id) AS customer_count,
    COUNT(*) AS total_rows,
    MIN(amount) AS min_amount,
    MAX(amount) AS max_amount,
    SUM(amount) AS total_amount,
    ROUND(AVG(amount),2) AS avg_amount,
    STDDEV(amount)::DECIMAL(4,3) AS amount_stddev
FROM payment;
 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.
customer_count total_rows min_amount max_amount total_amount avg_amount amount_stddev
14596 14596 0.00 11.99 61312.04 4.20 2.369
%%sql

SELECT AVG(amount), customer_id
FROM payment
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
avg customer_id
3.8233333333333333 1
4.7592307692307692 2
5.4483333333333333 3
3.7172727272727273 4
3.8471428571428571 5
3.3900000000000000 6
4.6685714285714286 7
3.7291304347826087 8
3.9400000000000000 9
3.9483333333333333 10

Note

Important Notes on Aggregation Functions

  1. Multiple Aggregations:

    • It is valid to have multiple aggregations in a SQL query. For example:

      %%sql
      SELECT COUNT(*), AVG(amount), SUM(amount)
      FROM payment;
      
  2. Aggregations and Regular Columns:

    • It is NOT possible to have both aggregations and regular columns in a single query without using the GROUP BY clause. For example, the following query is invalid:

      %%sql
      SELECT customer_id, COUNT(*)
      FROM payment;
      
    • The correct way to include both aggregations and regular columns is to use the GROUP BY clause:

      %%sql
      SELECT customer_id, COUNT(*)
      FROM payment
      GROUP BY customer_id;
      
  3. Aggregations in WHERE Clause:

    • An aggregation function CANNOT be used in the WHERE clause. For example, the following query is invalid:

      %%sql
      SELECT customer_id, COUNT(*)
      FROM payment
      WHERE COUNT(*) > 5;
      
    • Instead, use the HAVING clause to filter groups based on an aggregation:

      %%sql
      SELECT customer_id, COUNT(*)
      FROM payment
      GROUP BY customer_id
      HAVING COUNT(*) > 5;
      

Groupings#

Introduction#

Grouping in SQL is used to aggregate data across multiple rows and group the results based on one or more columns. The GROUP BY clause is used to group rows that have the same values in specified columns into summary rows. The HAVING clause is used to filter groups based on a condition.

Formal Syntax of Grouping Operation#

The formal syntax of the grouping operation in SQL looks like this:

SELECT
    grouping_columns, aggregated_columns
FROM
    table1
WHERE
    condition
GROUP BY
    grouping_columns
HAVING
    condition
ORDER BY
    grouping_columns;

Grouping with GROUP BY#

Basic Grouping#

The GROUP BY clause groups rows that have the same values into summary rows. It is often used with aggregate functions like COUNT, SUM, AVG, MIN, and MAX.

Example: Grouping by a Single Column

Question: How many payments greater than $5 has each customer made in total?

%%sql
SELECT customer_id, COUNT(*) AS payment_count
FROM payment
WHERE amount > 5
GROUP BY customer_id
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id payment_count
1 6
2 11
3 10
4 5
5 5
6 6
7 11
8 4
9 4
10 4

QUESTION: What is the average payment amount per customer, ordered from highest to lowest?

%%sql
SELECT customer_id, ROUND(AVG(amount),2) AS average_payment
FROM payment
GROUP BY customer_id
ORDER BY average_payment DESC
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id average_payment
187 5.62
321 5.52
19 5.49
3 5.45
311 5.39
542 5.30
310 5.30
508 5.29
259 5.16
293 5.13

Filtering Groups with HAVING#

The HAVING clause is used to filter groups based on a condition. It is similar to the WHERE clause, but HAVING is used for groups, whereas WHERE is used for individual rows.

Note

  • WHERE filters rows before grouping

  • HAVING filters rows after (or resulting from) grouping

Example: Using HAVING to Filter Groups

Question: How many customers have made more than 5 payments, each greater than $5

%%sql
SELECT customer_id, COUNT(*) AS payment_count
FROM payment
WHERE amount > 5
GROUP BY customer_id
HAVING COUNT(*) > 5
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id payment_count
1 6
2 11
3 10
6 6
7 11
11 8
12 7
13 9
15 9
16 8

Question: Which customers have made payments totaling more than \(50, with each payment being greater than \)5?

%%sql
SELECT customer_id, SUM(amount) AS total_amount
FROM payment
WHERE amount > 5
GROUP BY customer_id
HAVING SUM(amount) > 50
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id total_amount
2 74.89
3 78.90
7 78.89
11 57.92
13 79.91
15 64.91
16 58.92
17 53.93
19 66.92
21 66.92

Multi-Level Grouping#

Multi-level grouping allows you to group data on multiple levels. This can be achieved by specifying multiple columns in the GROUP BY clause.

Example: Grouping by Multiple Columns

Question: Which customer-staff pairs have made the highest number of payments greater than $5, and what is the count of such payments for the top 10 pairs, ordered from highest to lowest?

%%sql
SELECT customer_id, staff_id, COUNT(*) AS payment_count
FROM payment
WHERE amount > 5
GROUP BY customer_id, staff_id
ORDER BY staff_id, customer_id
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id staff_id payment_count
1 1 4
2 1 4
3 1 4
4 1 4
5 1 3
6 1 4
7 1 5
8 1 4
9 1 1
10 1 2

Note

Importance of the Order of WHERE, GROUP BY, and ORDER BY

  1. WHERE Clause:

    • The WHERE clause is used to filter rows before any grouping is performed. It is applied to individual rows in the table.

    • Example: Filter payments with an amount greater than 5.

      SELECT customer_id, COUNT(*) AS payment_count
      FROM payment
      WHERE amount > 5
      GROUP BY customer_id;
      
  2. GROUP BY Clause:

    • The GROUP BY clause is used to group rows that have the same values in specified columns into summary rows. It is applied after the WHERE clause.

    • Example: Group the filtered payments by customer_id.

      SELECT customer_id, COUNT(*) AS payment_count
      FROM payment
      WHERE amount > 5
      GROUP BY customer_id;
      
  3. HAVING Clause:

    • The HAVING clause is used to filter groups based on a condition. It is applied after the GROUP BY clause.

    • Example: Filter groups where the count of payments is greater than 5.

      SELECT customer_id, COUNT(*) AS payment_count
      FROM payment
      WHERE amount > 5
      GROUP BY customer_id
      HAVING COUNT(*) > 5;
      
  4. ORDER BY Clause:

    • The ORDER BY clause is used to sort the result set. It is applied after the HAVING clause.

    • Example: Sort the result set by customer_id.

      SELECT customer_id, COUNT(*) AS payment_count
      FROM payment
      WHERE amount > 5
      GROUP BY customer_id
      HAVING COUNT(*) > 5
      ORDER BY customer_id;
      

SQL Practice Questions#

  1. Total Rentals by Each Customer:

    • Question: How many rentals has each customer made in total?

%%sql

SELECT customer_id, COUNT(rental_id) AS rental_count
FROM payment
GROUP BY customer_id
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id rental_count
1 30
2 26
3 24
4 22
5 35
6 25
7 28
8 23
9 20
10 24
  1. Average Rental Duration per Customer:

    • Question: What is the average rental duration (i.e., the difference between return_date and rental_date) per customer, ordered from highest to lowest?

%%sql

SELECT customer_id, AVG(AGE(return_date,rental_date )) AS avg_rental_duration
FROM rental
GROUP BY customer_id
ORDER BY avg_rental_duration DESC
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id avg_rental_duration
315 6 days, 14:13:22.500000
187 6 days, 10:58:38.571428
321 6 days, 8:56:32.727273
539 6 days, 7:39:57.272727
436 6 days, 7:09:46
532 6 days, 6:59:34.838709
427 6 days, 5:27:05
555 6 days, 2:48:35.294118
448 6 days, 2:13:51.111111
453 6 days, 1:43:13.043478
  1. Customers with High Rental Counts:

    • Question: How many customers have made more than 5 rentals, and what is the count of such rentals for the first 10 customers?

%%sql

SELECT  customer_id, COUNT(*) AS rental_count 
FROM rental
GROUP BY customer_id
HAVING COUNT(*) > 5 
ORDER BY rental_count
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id rental_count
318 12
61 14
281 14
110 14
136 15
248 15
492 16
464 16
164 16
398 16
  1. Total Rental Duration Greater Than 50 Days:

    • Question: Which customers have rented movies for a total duration (return_date - rental_date) of more than 100 days, and what is the total duration for the first 10 such customers?

%%sql

SELECT customer_id, SUM(AGE(return_date, rental_date)) AS total_rental_duration
FROM rental
WHERE customer_id < 100
GROUP BY customer_id
HAVING SUM(AGE(return_date, rental_date)) > INTERVAL '100 days'
ORDER BY total_rental_duration
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id total_rental_duration
9 100 days, 7:59:00
48 100 days, 9:22:00
99 101 days, 11:03:00
79 101 days, 18:32:00
17 103 days, 12:50:00
94 106 days, 8:07:00
8 107 days, 15:39:00
67 108 days, 9:32:00
83 109 days, 13:45:00
62 110 days, 6:10:00
  1. Top Customer-Staff Pairs by Rental Count:

    • Question: Which customer-staff pairs have the highest number of rentals, and what is the count of such rentals for the top 10 pairs, ordered from highest to lowest?

%%sql

SELECT customer_id, staff_id, COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id, staff_id
ORDER BY rental_count DESC
LIMIT 10;
 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id staff_id rental_count
236 1 27
144 1 24
138 2 24
148 2 24
526 2 24
38 2 24
137 1 24
459 2 23
107 1 23
259 1 23