Lecture 9: Joins & subqueries in SQL#
Learning Objectives#
By the end of this lecture, students should understand:
What JOIN statements are and why they are used.
The different types of JOINs: INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
How to deal with ambiguous column names in JOINs.
How to join more than two tables.
What subqueries are and why they are used.
The different types of subqueries: single-row, multiple-row, and correlated subqueries.
# load the ipython-sql extension
%load_ext sql
%config SqlMagic.displaylimit = 10
import json
import urllib.parse
with open('data/credentials.json') as f:
login = json.load(f)
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
Let’s connect to your dvdrental database
%sql postgresql://{username}:{password}@{host}:{port}/dvdrental
'Connected: postgres@dvdrental'
JOIN Statements#
JOIN statements are used in SQL to combine rows from two or more tables based on a related column between them. This is essential for querying data that is spread across multiple tables.
Types of JOINs#
INNER JOIN: Returns records that have matching values in both tables.
LEFT JOIN (or LEFT OUTER JOIN): Returns all records from the left table, and the matched records from the right table. The result is NULL from the right side if there is no match.
RIGHT JOIN (or RIGHT OUTER JOIN): Returns all records from the right table, and the matched records from the left table. The result is NULL from the left side when there is no match.
FULL OUTER JOIN: Returns all records when there is a match in either left or right table. Records not matching in either table will have NULLs.
Example Tables#
For this lecture, we will create two simple tables: student
and course
.
Creating the student
Table#
%%sql
DROP TABLE IF EXISTS student CASCADE;
CREATE TABLE student (
student_id SERIAL PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50)
);
INSERT INTO student (first_name, last_name) VALUES
('John', 'Doe'),
('Jane', 'Smith'),
('Alice', 'Johnson'),
('Bob', 'Brown')
RETURNING *;
* postgresql://postgres:***@localhost:5432/dvdrental
Done.
Done.
4 rows affected.
student_id | first_name | last_name |
---|---|---|
1 | John | Doe |
2 | Jane | Smith |
3 | Alice | Johnson |
4 | Bob | Brown |
Creating the course
Table#
%%sql
DROP TABLE IF EXISTS course CASCADE;
CREATE TABLE course (
course_id SERIAL PRIMARY KEY,
student_id INT,
course_name VARCHAR(100),
FOREIGN KEY (student_id) REFERENCES student(student_id)
);
INSERT INTO course (student_id, course_name) VALUES
(1, 'Mathematics'),
(2, 'Science'),
(3, 'History'),
(1, 'Literature')
RETURNING *;
* postgresql://postgres:***@localhost:5432/dvdrental
Done.
Done.
4 rows affected.
course_id | student_id | course_name |
---|---|---|
1 | 1 | Mathematics |
2 | 2 | Science |
3 | 3 | History |
4 | 1 | Literature |
Before performing any JOIN operations, let’s display the first few rows of the student
and course
tables.
%%sql
SELECT * FROM student;
* postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.
student_id | first_name | last_name |
---|---|---|
1 | John | Doe |
2 | Jane | Smith |
3 | Alice | Johnson |
4 | Bob | Brown |
%%sql
SELECT * FROM course;
* postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.
course_id | student_id | course_name |
---|---|---|
1 | 1 | Mathematics |
2 | 2 | Science |
3 | 3 | History |
4 | 1 | Literature |
INNER JOIN#
The INNER JOIN
keyword selects records that have matching values in both tables.
%%sql
SELECT student.student_id, first_name, last_name, course_name
FROM student
INNER JOIN course
ON student.student_id = course.student_id;
* postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.
student_id | first_name | last_name | course_name |
---|---|---|---|
1 | John | Doe | Mathematics |
2 | Jane | Smith | Science |
3 | Alice | Johnson | History |
1 | John | Doe | Literature |
You can also use alias to give a table a temporary name to make the query more readable. In the example below we give student
table an alias called s
and the course
table an alias called c
:
%%sql
SELECT *
FROM student AS s
INNER JOIN course AS c
ON s.student_id = c.student_id;
* postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.
student_id | first_name | last_name | course_id | student_id_1 | course_name |
---|---|---|---|---|---|
1 | John | Doe | 1 | 1 | Mathematics |
2 | Jane | Smith | 2 | 2 | Science |
3 | Alice | Johnson | 3 | 3 | History |
1 | John | Doe | 4 | 1 | Literature |
Dealing with Ambiguous Column Names#
When joining multiple tables, you may encounter ambiguous column names. This happens when two or more tables have columns with the same name. To resolve this, you can use table aliases or fully qualify the column names.
Example with Ambiguous Column Names
Consider the student
and course
tables both having a column named student_id
.
%%sql
SELECT student_id, first_name, last_name, course_name
FROM student
INNER JOIN course ON student.student_id = course.student_id
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
(psycopg2.errors.AmbiguousColumn) column reference "student_id" is ambiguous
LINE 1: SELECT student_id, first_name, last_name, course_name
^
[SQL: SELECT student_id, first_name, last_name, course_name
FROM student
INNER JOIN course ON student.student_id = course.student_id
LIMIT 10;]
(Background on this error at: https://sqlalche.me/e/14/f405)
Problem: The column
student_id
is ambiguous because it exists in both tables.
Resolving Ambiguity with Table Aliases
You can use table aliases to differentiate between columns from different tables.
%%sql
SELECT s.student_id, s.first_name, s.last_name, c.course_name
FROM student AS s
INNER JOIN course AS c ON s.student_id = c.student_id
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.
student_id | first_name | last_name | course_name |
---|---|---|---|
1 | John | Doe | Mathematics |
2 | Jane | Smith | Science |
3 | Alice | Johnson | History |
1 | John | Doe | Literature |
LEFT JOIN#
The LEFT JOIN
keyword returns all records from the left table (student), and the matched records from the right table (course). The result is NULL from the right side if there is no match.
%%sql
SELECT * FROM course;
* postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.
course_id | student_id | course_name |
---|---|---|
1 | 1 | Mathematics |
2 | 2 | Science |
3 | 3 | History |
4 | 1 | Literature |
%%sql
SELECT s.student_id, s.first_name, s.last_name, c.course_name
FROM student AS s
LEFT JOIN course AS c
ON s.student_id = c.student_id;
* postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.
student_id | first_name | last_name | course_name |
---|---|---|---|
1 | John | Doe | Mathematics |
2 | Jane | Smith | Science |
3 | Alice | Johnson | History |
1 | John | Doe | Literature |
4 | Bob | Brown | None |
RIGHT JOIN#
The RIGHT JOIN
keyword returns all records from the right table (course), and the matched records from the left table (student). The result is NULL from the left side when there is no match.
%%sql
SELECT s.student_id, s.first_name, s.last_name, c.course_name
FROM student AS s
RIGHT JOIN course AS c
ON s.student_id = c.student_id;
* postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.
student_id | first_name | last_name | course_name |
---|---|---|---|
1 | John | Doe | Mathematics |
2 | Jane | Smith | Science |
3 | Alice | Johnson | History |
1 | John | Doe | Literature |
FULL OUTER JOIN#
The FULL OUTER JOIN
keyword returns all records when there is a match in either left (student) or right (course) table. Records not matching in either table will have NULLs.
%%sql
SELECT s.student_id, s.first_name, s.last_name, c.course_name
FROM student AS s
FULL OUTER JOIN course AS c ON s.student_id = c.student_id
;
* postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.
student_id | first_name | last_name | course_name |
---|---|---|---|
1 | John | Doe | Mathematics |
2 | Jane | Smith | Science |
3 | Alice | Johnson | History |
1 | John | Doe | Literature |
4 | Bob | Brown | None |
Joining More Than Two Tables#
You can join more than two tables by chaining multiple JOIN statements. Let’s join the student
, course
, and enrollment
tables to get the student’s name, course name, and enrollment date.
Creating the enrollment
Table
%%sql
DROP TABLE IF EXISTS enrollment CASCADE;
CREATE TABLE enrollment (
enrollment_id SERIAL PRIMARY KEY,
student_id INT,
course_id INT,
enrollment_date DATE,
FOREIGN KEY (student_id) REFERENCES student(student_id),
FOREIGN KEY (course_id) REFERENCES course(course_id)
);
INSERT INTO enrollment (student_id, course_id, enrollment_date) VALUES
(1, 1, '2023-01-01'),
(2, 2, '2023-01-02'),
(3, 3, '2023-01-03'),
(1, 4, '2023-01-04');
* postgresql://postgres:***@localhost:5432/dvdrental
Done.
Done.
4 rows affected.
[]
%%sql
SELECT * FROM enrollment;
* postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.
enrollment_id | student_id | course_id | enrollment_date |
---|---|---|---|
1 | 1 | 1 | 2023-01-01 |
2 | 2 | 2 | 2023-01-02 |
3 | 3 | 3 | 2023-01-03 |
4 | 1 | 4 | 2023-01-04 |
Joining student
, course
, and enrollment
Tables
%%sql
SELECT s.first_name, s.last_name, s.student_id, c.course_name, c.course_id, e.enrollment_date
FROM student AS s
INNER JOIN enrollment AS e ON s.student_id = e.student_id
INNER JOIN course AS c ON e.course_id = c.course_id
;
* postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.
first_name | last_name | student_id | course_name | course_id | enrollment_date |
---|---|---|---|---|---|
John | Doe | 1 | Mathematics | 1 | 2023-01-01 |
Jane | Smith | 2 | Science | 2 | 2023-01-02 |
Alice | Johnson | 3 | History | 3 | 2023-01-03 |
John | Doe | 1 | Literature | 4 | 2023-01-04 |
Summary#
INNER JOIN: Returns only the matching rows.
LEFT JOIN: Returns all rows from the left table, and the matched rows from the right table.
RIGHT JOIN: Returns all rows from the right table, and the matched rows from the left table.
FULL OUTER JOIN: Returns all rows when there is a match in either left or right table.
Dealing with Ambiguous Column Names: Use table aliases or fully qualify column names to resolve ambiguity.
Joining More Than Two Tables: Chain multiple JOIN statements to combine data from more than two tables.
By understanding and using these JOIN statements and techniques, you can effectively query and combine data from multiple tables in your database.
Subqueries#
A subquery, also known as an inner query or nested query, is a query within another SQL query. Subqueries can be used to perform operations that require multiple steps, such as filtering data based on the results of another query.
Types of Subqueries#
Single-Row Subqueries: Return a single row and are used with comparison operators like
=
,<
,>
, etc.Multiple-Row Subqueries: Return multiple rows and are used with operators like
IN
,ANY
,ALL
.Correlated Subqueries: Refer to columns in the outer query and are evaluated once for each row processed by the outer query.
%%sql
SELECT * FROM payment
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
payment_id | customer_id | staff_id | rental_id | amount | payment_date |
---|---|---|---|---|---|
17503 | 341 | 2 | 1520 | 7.99 | 2007-02-15 22:25:46.996577 |
17504 | 341 | 1 | 1778 | 1.99 | 2007-02-16 17:23:14.996577 |
17505 | 341 | 1 | 1849 | 7.99 | 2007-02-16 22:41:45.996577 |
17506 | 341 | 2 | 2829 | 2.99 | 2007-02-19 19:39:56.996577 |
17507 | 341 | 2 | 3130 | 7.99 | 2007-02-20 17:31:48.996577 |
17508 | 341 | 1 | 3382 | 5.99 | 2007-02-21 12:33:49.996577 |
17509 | 342 | 2 | 2190 | 5.99 | 2007-02-17 23:58:17.996577 |
17510 | 342 | 1 | 2914 | 5.99 | 2007-02-20 02:11:44.996577 |
17511 | 342 | 1 | 3081 | 2.99 | 2007-02-20 13:57:39.996577 |
17512 | 343 | 2 | 1547 | 4.99 | 2007-02-16 00:10:50.996577 |
Single-Row Subqueries#
Single-row subqueries return a single row and are used with comparison operators like =
, <
, >
, etc.
Example: Find all the rentals with the price that is above the average rental price#
This query looks simple. You might be tempted to try
%%sql
SELECT rental_id
FROM payment
WHERE amount > AVG(amount);
* postgresql://postgres:***@localhost:5432/dvdrental
(psycopg2.errors.GroupingError) aggregate functions are not allowed in WHERE
LINE 3: WHERE amount > AVG(amount);
^
[SQL: SELECT rental_id
FROM payment
WHERE amount > AVG(amount);]
(Background on this error at: https://sqlalche.me/e/14/f405)
This doesn’t work because we cannot use an aggregate function within the WHERE
clause.
To answer this question, we would need to querry the data twice:
To retrive the average
amount
To retrive the
rental_id
where theamount
is larger than the averageamount
from the first querry
However, we don’t need to do that in two separate queries and manually take the data coming from the first query and use it in the second. We can use a subquery to do that for us.
First, let’s compute the average amount
in payment:
%%sql
SELECT AVG(amount) FROM payment;
* postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.
avg |
---|
4.2006056453822965 |
We can use this intermediate information in our original query by embedding the above query in the WHERE
clause of the original query:
%%sql
SELECT rental_id, amount
FROM payment
WHERE amount > (SELECT AVG(amount) FROM payment)
ORDER BY amount
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
rental_id | amount |
---|---|
1780 | 4.99 |
3331 | 4.99 |
1649 | 4.99 |
1678 | 4.99 |
2766 | 4.99 |
1547 | 4.99 |
3026 | 4.99 |
1457 | 4.99 |
1475 | 4.99 |
2275 | 4.99 |
Note
A subquery can be used in the SELECT
, FROM
, WHERE
, and HAVING
clauses. Subqueries are most commonly used in the WHERE
clause.
Example: Return the name of all customers whose rental price is larger than $11.00#
%%sql
SELECT customer_id
FROM payment
WHERE amount > 11.00;
* postgresql://postgres:***@localhost:5432/dvdrental
8 rows affected.
customer_id |
---|
362 |
204 |
116 |
195 |
237 |
591 |
592 |
13 |
%%sql
SELECt * FROM customer;
* postgresql://postgres:***@localhost:5432/dvdrental
599 rows affected.
customer_id | store_id | first_name | last_name | address_id | activebool | create_date | last_update | active | |
---|---|---|---|---|---|---|---|---|---|
524 | 1 | Jared | Ely | jared.ely@sakilacustomer.org | 530 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
1 | 1 | Mary | Smith | mary.smith@sakilacustomer.org | 5 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
2 | 1 | Patricia | Johnson | patricia.johnson@sakilacustomer.org | 6 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
3 | 1 | Linda | Williams | linda.williams@sakilacustomer.org | 7 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
4 | 2 | Barbara | Jones | barbara.jones@sakilacustomer.org | 8 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
5 | 1 | Elizabeth | Brown | elizabeth.brown@sakilacustomer.org | 9 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
6 | 2 | Jennifer | Davis | jennifer.davis@sakilacustomer.org | 10 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
7 | 1 | Maria | Miller | maria.miller@sakilacustomer.org | 11 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
8 | 2 | Susan | Wilson | susan.wilson@sakilacustomer.org | 12 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
9 | 2 | Margaret | Moore | margaret.moore@sakilacustomer.org | 13 | True | 2006-02-14 | 2013-05-26 14:49:45.738000 | 1 |
%%sql
SELECT first_name, last_name
FROM customer
WHERE customer_id IN (
SELECT customer_id
FROM payment
WHERE amount > 11.00
);
* postgresql://postgres:***@localhost:5432/dvdrental
8 rows affected.
first_name | last_name |
---|---|
Karen | Jackson |
Victoria | Gibson |
Vanessa | Sims |
Rosemary | Schmidt |
Tanya | Gilbert |
Nicholas | Barfield |
Kent | Arsenault |
Terrance | Roush |
%%sql
SELECT first_name, last_name, amount
FROM customer
INNER JOIN (SELECT customer_id, amount FROM payment WHERE amount > 11.00) AS p
ON customer.customer_id = p.customer_id;
* postgresql://postgres:***@localhost:5432/dvdrental
8 rows affected.
first_name | last_name | amount |
---|---|---|
Nicholas | Barfield | 11.99 |
Rosemary | Schmidt | 11.99 |
Victoria | Gibson | 11.99 |
Vanessa | Sims | 11.99 |
Tanya | Gilbert | 11.99 |
Kent | Arsenault | 11.99 |
Terrance | Roush | 11.99 |
Karen | Jackson | 11.99 |
Multiple-Row Subqueries#
Multiple-row subqueries return multiple rows and are used with operators like IN
, ANY
, ALL
.
Example with IN
#
Return the name of all customers whose rental price is larger than the average rental price
%%sql
SELECT first_name, last_name
FROM customer
WHERE customer_id IN (
SELECT customer_id
FROM payment
WHERE amount > (SELECT AVG(amount) FROM payment)
ORDER BY amount DESC
)
LIMIT 10;
* postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.
first_name | last_name |
---|---|
Vivian | Ruiz |
Wanda | Patterson |
Dan | Paine |
Priscilla | Lowe |
Guy | Brownlee |
Alice | Stewart |
Chris | Brothers |
Kay | Caldwell |
Christina | Ramirez |
Yolanda | Weaver |
Example with ANY
#
The ANY
operator is used to compare a value to any value in a list or subquery.
Find customers who have made a payment greater than any payment made by customer with customer_id
= 1
%%sql
SELECT amount
FROM payment
WHERE customer_id = 1;
* postgresql://postgres:***@localhost:5432/dvdrental
30 rows affected.
amount |
---|
5.99 |
0.99 |
9.99 |
4.99 |
4.99 |
0.99 |
3.99 |
4.99 |
3.99 |
0.99 |
%%sql
SELECT first_name, last_name
FROM customer
WHERE customer_id IN (
SELECT customer_id
FROM payment
WHERE amount > ANY (
SELECT amount
FROM payment
WHERE customer_id = 1
)
);
* postgresql://postgres:***@localhost:5432/dvdrental
599 rows affected.
first_name | last_name |
---|---|
Jared | Ely |
Mary | Smith |
Patricia | Johnson |
Linda | Williams |
Barbara | Jones |
Elizabeth | Brown |
Jennifer | Davis |
Maria | Miller |
Susan | Wilson |
Margaret | Moore |
Example: Using ALL
#
The ALL
operator is used to compare a value to all values in a list or subquery.
Find customers who have made a payment greater than all payments made by customer with customer_id
= 1
%%sql
SELECT DISTINCT customer_id
FROM payment
WHERE amount > ALL (
SELECT amount
FROM payment
WHERE customer_id = 1
)
ORDER BY customer_id;
* postgresql://postgres:***@localhost:5432/dvdrental
101 rows affected.
customer_id |
---|
2 |
3 |
12 |
13 |
21 |
29 |
33 |
45 |
49 |
50 |
%%sql
SELECT first_name, last_name
FROM customer
WHERE customer_id IN (
SELECT customer_id
FROM payment
WHERE amount > ALL (
SELECT amount
FROM payment
WHERE customer_id = 1
)
);
* postgresql://postgres:***@localhost:5432/dvdrental
101 rows affected.
first_name | last_name |
---|---|
Patricia | Johnson |
Linda | Williams |
Nancy | Thomas |
Karen | Jackson |
Michelle | Clark |
Angela | Hernandez |
Anna | Hill |
Janet | Phillips |
Joyce | Edwards |
Diane | Collins |
We can also achive the same result by filter the amount by the maximum amount paid by customer_id
= 1
%%sql
SELECT first_name, last_name
FROM customer
WHERE customer_id IN (
SELECT customer_id
FROM payment
WHERE amount > (
SELECT MAX(amount)
FROM payment
WHERE customer_id = 1
)
);
* postgresql://postgres:***@localhost:5432/dvdrental
101 rows affected.
first_name | last_name |
---|---|
Patricia | Johnson |
Linda | Williams |
Nancy | Thomas |
Karen | Jackson |
Michelle | Clark |
Angela | Hernandez |
Anna | Hill |
Janet | Phillips |
Joyce | Edwards |
Diane | Collins |