Lecture 5: PostgreSQL part 2#
Learning Objectives#
By the end of this lecture, students should be able to apply and practice:
Understand Aggregation Functions:
Learn how to use basic SQL aggregation functions such as
COUNT()
,SUM()
,AVG()
,MIN()
, andMAX()
.Explore PostgreSQL-specific aggregation functions like
STDDEV
andREGR_R2
.
Grouping Data:
Learn how to group data using the
GROUP BY
clause.Understand the importance of the order of
WHERE
,GROUP BY
, andORDER BY
clauses in SQL queries.Understand the difference between
WHERE
andHAVING
clauses.
Real-World Examples and Exercises:
Apply the learned concepts to real-world examples and exercises provided in the lecture notebook.
Practice writing and executing SQL queries using the
ipython-sql
extension in Jupyter Notebook.
Connect to database#
Make sure you edit the
credentials.json
file with appropriate username and password as in worksheet1
# load the ipython-sql extension
%load_ext sql
import json
import urllib.parse
with open('data/credentials.json') as f:
login = json.load(f)
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
Let’s connect to your dvdrental database
%sql postgresql://{username}:{password}@{host}:{port}/dvdrental
'Connected: postgres@dvdrental'
Functions and Operators#
Introduction#
SQL provides a variety of functions and operators to perform operations on data. These functions and operators can be categorized into several types, including arithmetic, string, date/time, aggregate, and conditional functions.
Arithmetic Functions and Operators#
Function/Operator |
Description |
Example Usage |
---|---|---|
|
Addition |
|
|
Subtraction |
|
|
Multiplication |
|
|
Division |
|
|
Modulus (remainder of division) |
|
Numeric Functions#
Function |
Description |
Example Usage |
---|---|---|
|
Absolute value |
|
|
Smallest integer greater than or equal |
|
|
Largest integer less than or equal |
|
|
Rounds a number to a specified number of decimal places |
|
|
Raises a number to the power of another number |
|
|
Square root |
|
|
Exponential (e^x) |
|
|
Natural logarithm |
|
|
Logarithm |
|
|
Returns the value of pi |
|
|
Returns a random number |
|
Example
%%sql
SELECT
25 * 2,
ABS(-2^10),
ROUND(23.24545, 2),
SQRT(25),
PI()
;
* postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.
?column? | abs | round | sqrt | pi |
---|---|---|---|---|
50 | 1024.0 | 23.25 | 5.0 | 3.141592653589793 |
String Functions#
Function |
Description |
Example Usage |
---|---|---|
|
Concatenates two or more strings |
|
|
Converts a string to lowercase |
|
|
Converts a string to uppercase |
|
|
Returns the length of a string |
|
|
Extracts a substring from a string |
|
|
Removes leading and trailing spaces |
|
|
Removes leading spaces |
|
|
Removes trailing spaces |
|
|
Replaces all occurrences of a substring |
|
|
Finds the position of a substring |
|
|
Converts the first letter of each word to uppercase |
|
|
Pads the left side of a string |
|
|
Pads the right side of a string |
|
String Functions#
%%sql
SELECT
CONCAT(first_name, ' ', last_name) AS full_name,
LENGTH(first_name) AS first_name_length,
LOWER(first_name) AS first_name_lower,
UPPER(last_name) AS last_name_upper,
INITCAP(first_name) AS initcap_name
FROM customer
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
full_name | first_name_length | first_name_lower | last_name_upper | initcap_name |
---|---|---|---|---|
Jared Ely | 5 | jared | ELY | Jared |
Mary Smith | 4 | mary | SMITH | Mary |
Patricia Johnson | 8 | patricia | JOHNSON | Patricia |
Linda Williams | 5 | linda | WILLIAMS | Linda |
Barbara Jones | 7 | barbara | JONES | Barbara |
Elizabeth Brown | 9 | elizabeth | BROWN | Elizabeth |
Jennifer Davis | 8 | jennifer | DAVIS | Jennifer |
Maria Miller | 5 | maria | MILLER | Maria |
Susan Wilson | 5 | susan | WILSON | Susan |
Margaret Moore | 8 | margaret | MOORE | Margaret |
%%sql
SELECT first_name,
SUBSTRING(first_name, 1, 3) AS name_substring,
REPLACE(first_name, 'a', 'A') AS replaced_name,
LPAD(first_name, 10, '*') AS left_padded_name,
RPAD(first_name, 10, '*') AS right_padded_name
FROM customer
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
first_name | name_substring | replaced_name | left_padded_name | right_padded_name |
---|---|---|---|---|
Jared | Jar | JAred | *****Jared | Jared***** |
Mary | Mar | MAry | ******Mary | Mary****** |
Patricia | Pat | PAtriciA | **Patricia | Patricia** |
Linda | Lin | LindA | *****Linda | Linda***** |
Barbara | Bar | BArbArA | ***Barbara | Barbara*** |
Elizabeth | Eli | ElizAbeth | *Elizabeth | Elizabeth* |
Jennifer | Jen | Jennifer | **Jennifer | Jennifer** |
Maria | Mar | MAriA | *****Maria | Maria***** |
Susan | Sus | SusAn | *****Susan | Susan***** |
Margaret | Mar | MArgAret | **Margaret | Margaret** |
Date/Time Functions#
Function |
Description |
Example Usage |
---|---|---|
|
Returns the current date |
|
|
Returns the current time |
|
|
Returns the current date and time |
|
|
Returns the current date and time |
|
|
Extracts a part of a date/time |
|
|
Calculates the age between two dates |
|
|
Converts a date/time to a string |
|
|
Converts a string to a date |
|
|
Truncates a date/time to a specified precision |
|
|
Adds a specified time interval to a date/time |
|
|
Extracts a subfield from a date/time |
|
Examples
%%sql
SELECT
CURRENT_DATE AS current_date,
CURRENT_TIME AS current_time,
CURRENT_TIMESTAMP AS current_timestamp,
NOW() AS current_datetime;
* postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.
current_date | current_time | current_timestamp | current_datetime |
---|---|---|---|
2024-09-18 | 12:38:36.831023-07:00 | 2024-09-18 12:38:36.831023-07:00 | 2024-09-18 12:38:36.831023-07:00 |
%%sql
SELECT rental_id, rental_date,
EXTRACT(YEAR FROM rental_date) AS rental_year,
AGE(NOW(), rental_date) AS rental_duration
FROM rental
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
rental_id | rental_date | rental_year | rental_duration |
---|---|---|---|
2 | 2005-05-24 22:54:33 | 2005 | 7049 days, 13:44:40.574073 |
3 | 2005-05-24 23:03:39 | 2005 | 7049 days, 13:35:34.574073 |
4 | 2005-05-24 23:04:41 | 2005 | 7049 days, 13:34:32.574073 |
5 | 2005-05-24 23:05:21 | 2005 | 7049 days, 13:33:52.574073 |
6 | 2005-05-24 23:08:07 | 2005 | 7049 days, 13:31:06.574073 |
7 | 2005-05-24 23:11:53 | 2005 | 7049 days, 13:27:20.574073 |
8 | 2005-05-24 23:31:46 | 2005 | 7049 days, 13:07:27.574073 |
9 | 2005-05-25 00:00:40 | 2005 | 7049 days, 12:38:33.574073 |
10 | 2005-05-25 00:02:21 | 2005 | 7049 days, 12:36:52.574073 |
11 | 2005-05-25 00:09:02 | 2005 | 7049 days, 12:30:11.574073 |
%%sql
SELECT rental_id, rental_date, rental_date + INTERVAL '1 year' AS rental_next_year
FROM rental
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
rental_id | rental_date | rental_next_year |
---|---|---|
2 | 2005-05-24 22:54:33 | 2006-05-24 22:54:33 |
3 | 2005-05-24 23:03:39 | 2006-05-24 23:03:39 |
4 | 2005-05-24 23:04:41 | 2006-05-24 23:04:41 |
5 | 2005-05-24 23:05:21 | 2006-05-24 23:05:21 |
6 | 2005-05-24 23:08:07 | 2006-05-24 23:08:07 |
7 | 2005-05-24 23:11:53 | 2006-05-24 23:11:53 |
8 | 2005-05-24 23:31:46 | 2006-05-24 23:31:46 |
9 | 2005-05-25 00:00:40 | 2006-05-25 00:00:40 |
10 | 2005-05-25 00:02:21 | 2006-05-25 00:02:21 |
11 | 2005-05-25 00:09:02 | 2006-05-25 00:09:02 |
Aggregation Functions in SQL#
Aggregation functions in SQL are used to perform calculations on multiple rows of a table’s column and return a single value. These functions are often used with the GROUP BY
clause to group rows that have the same values in specified columns into summary rows.
Built-in SQL Aggregation Functions#
Function |
Description |
Example Usage |
---|---|---|
|
Returns the number of non-NULL values in a column |
|
|
Returns the number of rows |
|
|
Returns the minimum value of a column |
|
|
Returns the maximum value of a column |
|
|
Returns the sum of a numeric column |
|
|
Returns the average value of a numeric column |
|
Note
Except for COUNT(*), all aggregation functions ignore NULLs
In addition to numbers, MIN() and MAX() also work with strings.
Postgres-Specific Aggregation Functions#
In addition to the standard SQL aggregation functions, Postgres also provides a number of functions of the same kind which can be useful for some statistical calculations (find a comprehensive list in the documentations here)
Function |
Description |
Example Usage |
---|---|---|
|
Returns the standard deviation of a numeric column |
|
|
Returns the coefficient of determination (R²) of a linear regression |
|
%%sql
SELECT
COUNT(customer_id) AS customer_count,
COUNT(*) AS total_rows,
MIN(amount) AS min_amount,
MAX(amount) AS max_amount,
SUM(amount) AS total_amount,
ROUND(AVG(amount),2) AS avg_amount,
STDDEV(amount)::DECIMAL(4,3) AS amount_stddev
FROM payment;
* postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.
customer_count | total_rows | min_amount | max_amount | total_amount | avg_amount | amount_stddev |
---|---|---|---|---|---|---|
14596 | 14596 | 0.00 | 11.99 | 61312.04 | 4.20 | 2.369 |
%%sql
SELECT AVG(amount), customer_id
FROM payment
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
avg | customer_id |
---|---|
3.8233333333333333 | 1 |
4.7592307692307692 | 2 |
5.4483333333333333 | 3 |
3.7172727272727273 | 4 |
3.8471428571428571 | 5 |
3.3900000000000000 | 6 |
4.6685714285714286 | 7 |
3.7291304347826087 | 8 |
3.9400000000000000 | 9 |
3.9483333333333333 | 10 |
Note
Important Notes on Aggregation Functions
Multiple Aggregations:
It is valid to have multiple aggregations in a SQL query. For example:
%%sql SELECT COUNT(*), AVG(amount), SUM(amount) FROM payment;
Aggregations and Regular Columns:
It is NOT possible to have both aggregations and regular columns in a single query without using the
GROUP BY
clause. For example, the following query is invalid:%%sql SELECT customer_id, COUNT(*) FROM payment;
The correct way to include both aggregations and regular columns is to use the
GROUP BY
clause:%%sql SELECT customer_id, COUNT(*) FROM payment GROUP BY customer_id;
Aggregations in WHERE Clause:
An aggregation function CANNOT be used in the
WHERE
clause. For example, the following query is invalid:%%sql SELECT customer_id, COUNT(*) FROM payment WHERE COUNT(*) > 5;
Instead, use the
HAVING
clause to filter groups based on an aggregation:%%sql SELECT customer_id, COUNT(*) FROM payment GROUP BY customer_id HAVING COUNT(*) > 5;
Groupings#
Introduction#
Grouping in SQL is used to aggregate data across multiple rows and group the results based on one or more columns. The GROUP BY
clause is used to group rows that have the same values in specified columns into summary rows. The HAVING
clause is used to filter groups based on a condition.
Formal Syntax of Grouping Operation#
The formal syntax of the grouping operation in SQL looks like this:
SELECT
grouping_columns, aggregated_columns
FROM
table1
WHERE
condition
GROUP BY
grouping_columns
HAVING
condition
ORDER BY
grouping_columns;
Grouping with GROUP BY
#
Basic Grouping#
The GROUP BY
clause groups rows that have the same values into summary rows. It is often used with aggregate functions like COUNT
, SUM
, AVG
, MIN
, and MAX
.
Example: Grouping by a Single Column
Question: How many payments greater than $5 has each customer made in total?
%%sql
SELECT customer_id, COUNT(*) AS payment_count
FROM payment
WHERE amount > 5
GROUP BY customer_id
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | payment_count |
---|---|
1 | 6 |
2 | 11 |
3 | 10 |
4 | 5 |
5 | 5 |
6 | 6 |
7 | 11 |
8 | 4 |
9 | 4 |
10 | 4 |
QUESTION: What is the average payment amount per customer, ordered from highest to lowest?
%%sql
SELECT customer_id, ROUND(AVG(amount),2) AS average_payment
FROM payment
GROUP BY customer_id
ORDER BY average_payment DESC
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | average_payment |
---|---|
187 | 5.62 |
321 | 5.52 |
19 | 5.49 |
3 | 5.45 |
311 | 5.39 |
542 | 5.30 |
310 | 5.30 |
508 | 5.29 |
259 | 5.16 |
293 | 5.13 |
Filtering Groups with HAVING
#
The HAVING
clause is used to filter groups based on a condition. It is similar to the WHERE
clause, but HAVING
is used for groups, whereas WHERE
is used for individual rows.
Note
WHERE
filters rows before groupingHAVING
filters rows after (or resulting from) grouping
Example: Using HAVING
to Filter Groups
Question: How many customers have made more than 5 payments, each greater than $5
%%sql
SELECT customer_id, COUNT(*) AS payment_count
FROM payment
WHERE amount > 5
GROUP BY customer_id
HAVING COUNT(*) > 5
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | payment_count |
---|---|
1 | 6 |
2 | 11 |
3 | 10 |
6 | 6 |
7 | 11 |
11 | 8 |
12 | 7 |
13 | 9 |
15 | 9 |
16 | 8 |
Question: Which customers have made payments totaling more than \(50, with each payment being greater than \)5?
%%sql
SELECT customer_id, SUM(amount) AS total_amount
FROM payment
WHERE amount > 5
GROUP BY customer_id
HAVING SUM(amount) > 50
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | total_amount |
---|---|
2 | 74.89 |
3 | 78.90 |
7 | 78.89 |
11 | 57.92 |
13 | 79.91 |
15 | 64.91 |
16 | 58.92 |
17 | 53.93 |
19 | 66.92 |
21 | 66.92 |
Multi-Level Grouping#
Multi-level grouping allows you to group data on multiple levels. This can be achieved by specifying multiple columns in the GROUP BY
clause.
Example: Grouping by Multiple Columns
Question: Which customer-staff pairs have made the highest number of payments greater than $5, and what is the count of such payments for the top 10 pairs, ordered from highest to lowest?
%%sql
SELECT customer_id, staff_id, COUNT(*) AS payment_count
FROM payment
WHERE amount > 5
GROUP BY customer_id, staff_id
ORDER BY staff_id, customer_id
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | staff_id | payment_count |
---|---|---|
1 | 1 | 4 |
2 | 1 | 4 |
3 | 1 | 4 |
4 | 1 | 4 |
5 | 1 | 3 |
6 | 1 | 4 |
7 | 1 | 5 |
8 | 1 | 4 |
9 | 1 | 1 |
10 | 1 | 2 |
Note
Importance of the Order of WHERE
, GROUP BY
, and ORDER BY
WHERE
Clause:The
WHERE
clause is used to filter rows before any grouping is performed. It is applied to individual rows in the table.Example: Filter payments with an amount greater than 5.
SELECT customer_id, COUNT(*) AS payment_count FROM payment WHERE amount > 5 GROUP BY customer_id;
GROUP BY
Clause:The
GROUP BY
clause is used to group rows that have the same values in specified columns into summary rows. It is applied after theWHERE
clause.Example: Group the filtered payments by
customer_id
.SELECT customer_id, COUNT(*) AS payment_count FROM payment WHERE amount > 5 GROUP BY customer_id;
HAVING
Clause:The
HAVING
clause is used to filter groups based on a condition. It is applied after theGROUP BY
clause.Example: Filter groups where the count of payments is greater than 5.
SELECT customer_id, COUNT(*) AS payment_count FROM payment WHERE amount > 5 GROUP BY customer_id HAVING COUNT(*) > 5;
ORDER BY
Clause:The
ORDER BY
clause is used to sort the result set. It is applied after theHAVING
clause.Example: Sort the result set by
customer_id
.SELECT customer_id, COUNT(*) AS payment_count FROM payment WHERE amount > 5 GROUP BY customer_id HAVING COUNT(*) > 5 ORDER BY customer_id;
SQL Practice Questions#
Total Rentals by Each Customer:
Question: How many rentals has each customer made in total?
%%sql
SELECT customer_id, COUNT(rental_id) AS rental_count
FROM payment
GROUP BY customer_id
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | rental_count |
---|---|
1 | 30 |
2 | 26 |
3 | 24 |
4 | 22 |
5 | 35 |
6 | 25 |
7 | 28 |
8 | 23 |
9 | 20 |
10 | 24 |
Average Rental Duration per Customer:
Question: What is the average rental duration (i.e., the difference between return_date and rental_date) per customer, ordered from highest to lowest?
%%sql
SELECT customer_id, AVG(AGE(return_date,rental_date )) AS avg_rental_duration
FROM rental
GROUP BY customer_id
ORDER BY avg_rental_duration DESC
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | avg_rental_duration |
---|---|
315 | 6 days, 14:13:22.500000 |
187 | 6 days, 10:58:38.571428 |
321 | 6 days, 8:56:32.727273 |
539 | 6 days, 7:39:57.272727 |
436 | 6 days, 7:09:46 |
532 | 6 days, 6:59:34.838709 |
427 | 6 days, 5:27:05 |
555 | 6 days, 2:48:35.294118 |
448 | 6 days, 2:13:51.111111 |
453 | 6 days, 1:43:13.043478 |
Customers with High Rental Counts:
Question: How many customers have made more than 5 rentals, and what is the count of such rentals for the first 10 customers?
%%sql
SELECT customer_id, COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id
HAVING COUNT(*) > 5
ORDER BY rental_count
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | rental_count |
---|---|
318 | 12 |
61 | 14 |
281 | 14 |
110 | 14 |
136 | 15 |
248 | 15 |
492 | 16 |
464 | 16 |
164 | 16 |
398 | 16 |
Total Rental Duration Greater Than 50 Days:
Question: Which customers have rented movies for a total duration (return_date - rental_date) of more than 100 days, and what is the total duration for the first 10 such customers?
%%sql
SELECT customer_id, SUM(AGE(return_date, rental_date)) AS total_rental_duration
FROM rental
WHERE customer_id < 100
GROUP BY customer_id
HAVING SUM(AGE(return_date, rental_date)) > INTERVAL '100 days'
ORDER BY total_rental_duration
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | total_rental_duration |
---|---|
9 | 100 days, 7:59:00 |
48 | 100 days, 9:22:00 |
99 | 101 days, 11:03:00 |
79 | 101 days, 18:32:00 |
17 | 103 days, 12:50:00 |
94 | 106 days, 8:07:00 |
8 | 107 days, 15:39:00 |
67 | 108 days, 9:32:00 |
83 | 109 days, 13:45:00 |
62 | 110 days, 6:10:00 |
Top Customer-Staff Pairs by Rental Count:
Question: Which customer-staff pairs have the highest number of rentals, and what is the count of such rentals for the top 10 pairs, ordered from highest to lowest?
%%sql
SELECT customer_id, staff_id, COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id, staff_id
ORDER BY rental_count DESC
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
customer_id | staff_id | rental_count |
---|---|---|
236 | 1 | 27 |
144 | 1 | 24 |
138 | 2 | 24 |
148 | 2 | 24 |
526 | 2 | 24 |
38 | 2 | 24 |
137 | 1 | 24 |
459 | 2 | 23 |
107 | 1 | 23 |
259 | 1 | 23 |