Sunday, April 4, 2010

The Basics of SQL Joins in MySQL

When making your first forays into relational database development, you can use simple SQL statements to mine your data easily enough.

However, as your data grows in both size and breadth, you'll need to begin employing more sophisticated strategies for exploiting increasingly complex data relations.

After all, pulling data from a single table is easy, but what if you need to query for data spanning three, four, or even eight different tables?

Retrieving interrelated data stored within multiple tables is most effectively accomplished using a SQL JOIN clause, of which there are several variants.

The JOIN clause relies upon related fields found in two tables to determine the commonality of the data stored within each, producing a data set that you can then easily save or further manipulate.

In this article, I introduce three of the most commonly used JOIN variants: the INNER JOIN, OUTER JOIN, and SELF JOIN, and provides MySQL examples for using them.

Although these examples will be MySQL-specific, you'll be able to use what you learn here within most -- if not all -- other relational database solutions.


The Inner Join
Suppose you were tasked with creating reports that extracted data from a customer relationship management (CRM) application.

Specifically, the sales team wanted a list of all customers and their associated professions. Simplified versions of the customers and professions tables look like this:


CREATE TABLE customers (
 -> id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
 -> name VARCHAR(255) NOT NULL,
 -> profession_id INTEGER UNSIGNED NOT NULL
);

CREATE TABLE professions (
 -> id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
 -> name VARCHAR(255) NOT NULL
);

The shared commonality between these two tables is a profession ID, represented by the id column in the professions table, and the profession_id column in the customers table.

Knowing this, you can use an INNER JOIN to retrieve each customer name and the associated profession name like this:

SELECT * FROM customers INNER JOIN professions ON customers.profession_id = professions.id;

Executing this will produce a result set containing all of the values found in each interrelated row:


+----+------------+---------------+----+---------------+
| id | name       | profession_id | id | name          |
+----+------------+---------------+----+---------------+
|  1 | Acme, Inc. |             1 |  1 | Manufacturing | 
+----+------------+---------------+----+---------------+

Because you're interested only in the customer and profession names, you can revise the query to look like this:

SELECT customers.name, professions.name FROM customers 
  INNER JOIN professions 
    ON customers.profession_id = professions.id;

Executing the revised query produces the following output:

+------------+---------------+
| name       | name          |
+------------+---------------+
| Acme, Inc. | Manufacturing | 
+------------+---------------+

Filtering Records
You can attach other SQL clauses to JOINs to produce sorted or filtered output. For instance, to retrieve only customers whose professions are "Technical writers," you would use this query:

SELECT customers.name, FROM customers 
  INNER JOIN professions 
    ON customers.profession_id = professions.id
  WHERE professions.name = "Technical writer";

The OUTER JOIN
The INNER JOIN will return rows only when a matching value is found in both tables.

However, what if you discovered some data inconsistencies, which arose due to the mass import of customers from another CRM solution, and you needed to know which customers have not yet been assigned a profession?

You can use an OUTER JOIN for that job, of which there are two types: the LEFT OUTER JOIN (or LEFT JOIN) and RIGHT OUTER JOIN (RIGHT JOIN).

The LEFT OUTER JOIN will retrieve all rows in the table located on the "left" side of the JOIN, regardless of whether a related row is found in the table located on the JOIN's "right" side.

In cases where no matching row is found, NULL will serve as a placeholder. For instance, the following LEFT JOIN will produce a list of all customers and professions, even if no profession assignment has been made:

SELECT customers.name, professions.name FROM customers 
LEFT JOIN professions 
  ON customers.profession_id = professions.id;

Suppose the customer "Taylor Made Teapots" lacked a corresponding profession. Executing a LEFT JOIN would produce the following output:

+---------------------+---------------+
| name                | name          |
+---------------------+---------------+
| Acme, Inc.          | Manufacturing | 
| Taylor Made Teapots | NULL          | 
+---------------------+---------------+

What if you wanted to retrieve just a list of customers lacking a profession designation? You can use the WHERE clause in conjunction with the IS NULL predicate, like this:

SELECT customers.name, professions.name FROM customers 
LEFT JOIN professions 
  ON customers.profession_id = professions.id;
WHERE professions.name IS NULL;

The RIGHT JOIN works identically to the LEFT JOIN, except that all rows on the right side of the JOIN will be returned, regardless of whether a shared row is found in the table residing on the left side of the JOIN.

For instance, you could use a RIGHT JOIN to determine which professions are not represented within the customer database:

SELECT professions.name FROM customers 
RIGHT JOIN professions 
  ON customers.profession_id = professions.id 
WHERE customers.name IS NULL;

Executing this RIGHT JOIN produces output similar to the following:

+---------------+
| name          |
+---------------+
| Plumber       | 
| Airline Pilot | 
+---------------+

The SELF JOIN
Believe it or not, it's also possible to join a table to itself. For 
instance, suppose your company instituted a customer referral program, 
providing customers with a cash incentive for inviting other 
organizations to do business with you.

You'd naturally want to track 
referral histories, which you could do easily enough by adding a column 
named referrer_id to the customers table:
CREATE TABLE customers (
 -> id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
 -> referrer_id INTEGER UNSIGNED NOT NULL,
 -> name VARCHAR(255) NOT NULL,
 -> profession_id INTEGER UNSIGNED NOT NULL
);
For each customer, the referrer_id cell will be set to either 0, meaning nobody referred the customer, or an integer representing another customer's primary key:
+----+-------------+-------------------------+---------------+
| id | referrer_id | name                    | profession_id |
+----+-------------+-------------------------+---------------+
|  1 |           0 | Acme, Inc.              |             1 | 
|  2 |           1 | Taylor Made Teapots     |             4 | 
|  3 |           0 | Robin's Egg Farm        |             1 | 
|  4 |           2 | Paul's Plumbing         |             3 | 
|  5 |           2 | Charlie's Chicken Coops |             5 | 
+----+-------------+-------------------------+---------------+
Using the SELF JOIN, you can create a list of customers and their referring counterpart:
SELECT c.name AS "Customer", r.name AS "Referred By" 
FROM customers c, customers r 
WHERE c.referrer_id = r.id;
Executing this SELF JOIN produces output similar to the following:
+-------------------------+---------------------+
| Customer                | Referred By         |
+-------------------------+---------------------+
| Taylor Made Teapots     | Acme, Inc.          | 
| Paul's Plumbing         | Taylor Made Teapots | 
| Charlie's Chicken Coops | Taylor Made Teapots | 
+-------------------------+---------------------+

Conclusion
Mastering JOINs will elevate your ability to effectively manage large datasets. Just like learning to ride a bicycle though, it takes some practice to get used to the unfamiliar syntax. However, after some time you'll wonder how you ever got along without this powerful feature!

No comments:

Post a Comment