Database Basics with Joins and Unions

Most of the people coming from different technical and educational background when they start their Data Science journey they may face problems in understanding databases. For example, what is database? What is Primary key? What are different types of tables like Transaction table and Reference table? What are different types of  joins? What are different types of Data? What is Data Warehousing? What is Data Lake? What is  Data Mining, etc.?

This is a series of articles where we will have crisp and concise contents to cover all these aspects so that you are able to start your Data Science journey

with basic knowledge of database.


In this particular article, we will go through with the basics of database – 


What is Database
What is Structured Data
Primary key as single column vs Composite Primary key
How to Create Table in Database
How Create Table query changes when working with a Primary Key and Foreign Key
Insert Data into the Table
Types of Table 
What is Join and Types of Join
What is Union and how it differs from Join

What is Database

Let’s first understand what is database? You can think of a database as a placeholder for data where data is organized in some particular way. Depending on type of data, i.e whether it is Structured, Semi-Structured or Unstructured data – there can be different ways in which data can be arranged. 

We will first talk about how Structured data is organized. 


What is Structured Data?

Structured data is one which adheres to a schema. A Schema is a set of rules with which data is organized in the database. 


A database can have multiple tables. In tables, data is organized in rows and columns. You can think of excel where data is organized in rows and columns. In tables also, we have data organized in rows and columns with some set of rules.

Now, consider a scenario of a Bank database. In the Bank database, we have considered 2 tables for now – Customer table and Accounts table. Customer table has Customer_ID as the Primary key.

Primary key as single column vs Composite Primary key

A Primary key is a non-null column in the table which uniquely identifies a row in the table. The values in the Primary key column is unique, so the values can’t be repeated in the Primary key column. A Primary key can be a single column or a combination of multiple columns, which is called a Composite Primary key. You can refer Primary key as Composite key at the end of this article – to get more details on it.


How to Create Table in Database

The Customer table contains other columns like  Customer_Name, Account_Number, Account_Type and the Balance_Amount to reflect the amount available in the account.


Customer Table   

Customer_ID 
Customer_Name
Account_Number 
Account_Type
Balance_Amount
Cust100
John Taylor
Acc100
Savings
60000
Cust101
David Williams
Acc101
Current
50000
Cust102
Tracy Smith
Acc 102
Savings
55000
Cust103
Emma Jones
Acc103
52000
Below is the SQL Command to create the Customer table – 
CREATE TABLE Customer (
    Customer_ID varchar(10),
    Customer_Name varchar(40),
    Account_Number varchar(10),
    Account_Type varchar(20),
    Balance_Amount int );
In the above SQL command, we have mentioned the table name and column names along with the data type of the column.
We have another table called Accounts which has Account_Type as Primary key. It has other columns Credit_Limit and Minimum_Balance needed to maintain in a particular Account Type.
Accounts Table
Account_Type
Credit_Limit
Minimum_Balance
Savings
10000
500
Current
20000
2000
Car Loan
500000
20000
Mortgages
1000000
90000

Below is the SQL Command to create the Accounts table – 
CREATE TABLE Accounts (
Account_Type varchar(20),
Credit_Limit int,
Minimum_Balance int );
Now, couple of things to observe here –
1. While creating tables, we have not defined any column or columns as Primary Key in both the tables.
2. The two tables ‘Customer‘ and ‘Accounts‘ are created separately and they have no relation mentioned between them.
3. The data type of Account_Type column in both the tables is defined as varchar(20).

How Create Table query changes when working with a Primary Key and Foreign Key

In the Accounts table, there is a column named Account_Type. This Account_Type column is a Primary key in Accounts table, because if you specify any value of this column, it will uniquely identify a row in the Accounts table.
So, you can create Accounts table by specifying the Account_Type as the Primary key. Your SQL command to create table will change as below –
CREATE TABLE Accounts (
Account_Type varchar(20),
Credit_Limit int,
Minimum_Balance int,
PRIMARY KEY (Account_Type) );

In the Customer table Customer_ID is the Primary key. So, while creating the table, we can specify this column as the Primary key.

Now, notice that in the Customer table, there is a column named Account_Type as well. So, when one table has a column that refers another table’s primary key – then that column is a foreign key. So, in this case, Account_Type column in Customer table is a foreign key.
The way we created two tables earlier, there was no relation between them. But if you want to have a relationship defined between tables, then you can do so while creating tables. 
Use below SQL query to create Customer table by defining its primary and foreign keys.
CREATE TABLE Customer (
Customer_ID varchar(10), 
Customer_Name varchar(40) NOT NULL,
Account_Number varchar(10),
Account_Type varchar(20),
Balance_Amount int,
PRIMARY KEY (Customer_ID),
FOREIGN KEY (Account_Type) REFERENCES Accounts(Account_Type)
);

Now, observe couple of changes made in the SQL Query –
1. A column Customer_ID is defined as a Primary key.
2. Using Foreign Key keyword, an association is created between the Customer and Accounts tables.
3. Customer_Name column is now specified as Not Null. So, this column can’t have any null value in it. This is a constraint which we added on this column.

When a column is defined as a Primary key, there are few restrictions already added on it, without explicitly specifying –
a) it can not have any null value.
b) it must be unique across all the rows.
c) it must uniquely identify a row in the table
Whereas a Foreign key column may have null values in it.

Insert Data into the Table

When you specified Accounts table’s column as a Foreign key for Customers table, it means your Customer table data will refer to Accounts table data. So, you should first insert data into the Accounts table and then into the Customer table.
Add data in the Accounts table using below INSERT INTO command –
INSERT INTO Accounts
Values (‘Savings’, 10000, 500);
Insert data into the Customer table using below –
INSERT INTO Customer
VALUES (‘Cust100’, ‘John Taylor’, ‘Acc100’, ‘Savings’, 60000);

Types of Table

Now, observe the data inside the two tables.
Customer Table

Customer_ID 
Customer_Name
Account_Number 
Account_Type
Balance_Amount
Cust100
John Taylor
Acc100
Savings
60000
Cust101
David Williams
Acc101
Current
50000
Cust102
Tracy Smith
Acc 102
Savings
55000
Cust103
Emma Jones
Acc103
52000

The Customer table is a Transaction table. What is a Transaction table? The data changes frequently in a Transaction table. In the customer table, you have Balance_Amount column. The value in the column changes whenever the customer withdraws or deposits money to his account.
Accounts Table
Account_Type
Credit_Limit
Minimum_Balance
Savings
10000
500
Current
20000
2000
Car Loan
500000
20000
Mortgages
1000000
90000

Whereas the Accounts table is a Reference table. The values in Credit_Limit and Minimum_Balance in a particular type of account do not change frequently. This table is referred to get more details of a customer account based on its account type.

What is Join and Types of Join

Now, let’s understand when would you need to Join data from two or more tables.
For example, if you want to know credit limit of each customer, then you need to join the two tables as one single table won’t be able to give you this answer.
In this example, we have two tables – Customer and Accounts. The Customer table has a foreign key column “Account_Type” which is primary key of table Accounts. So, we have one common column in these two tables.

When we have a common column between two tables, we can use Join to get rows from two tables based on the given criteria.
There are different kinds of joins – 
  1. Inner Join
  2. Left Join
  3. Right Join
  4. Full Outer Join

Inner Join

Inner Join will return only those rows where the common column value matches in both the tables. The visualization can be seen using Venn diagram as below-

Use below SQL query to Inner Join two tables. Here, Table 1 is Customer table and Table 2 is Accounts table.
SELECT * FROM Customer
INNER JOIN Accounts
ON Customer.Account_Type = Accounts.Account_Type;

So, in this case, if we do Inner Join, the result should be –
Customer table columns                                                           Accounts table columns
Now, observe couple of things from the result –

1. There are two Account_Type columns appearing in the result.
2. The left Account_Type column is from the Customer table as Customer table is written in the left (Table1).
3. The right Account_Type column is from the Accounts table as Accounts table is written in the right (Table2).
4. In the result, only those rows are selected where both the tables data matches for column Account_Type.
5. From the Customer table, Customer_ID = ‘Cust103‘ row is not included here as there was no Account_Type mentioned against it.
6. From the Accounts table, Account_Type = ‘Car Loan’ and ‘Mortgages’ rows are not selected as there was no such Account_Type in the Customer table.

Left Join

Left Join will return all the rows from the left table and matching rows from the right table.
The visualization can be seen using Venn diagram as below in which all the rows from Table 1 and matching rows from Table 2 will be resulted as output.

Use below SQL query to Left Join two tables. Here, Table 1 is Customer table and Table 2 is Accounts table.
SELECT * FROM Customer
LEFT JOIN Accounts
ON Customer.Account_Type = Accounts.Account_Type;
So, in this case, if we do Left Join, the result should be –
Customer table columns                                                      Accounts table columns


Now, observe couple of things from the result –
1. There are two Account_Type columns appearing in the result. In the Inner Join result also, we observed that common column was appearing from both the tables.
2. The left Account_Type column is from the Customer table as Customer table is written in the left (Table1).
3. The right Account_Type column is from the Accounts table as Accounts table is written in the right (Table2).
4. In the result, all rows are selected from Customer table which is Table1 (left table).
5. From the Accounts table, Account_Type = ‘Car Loan’ and ‘Mortgages’ rows are not selected as there was no such Account_Type in the Customer table.

Right Join

Right Join will return all the rows from the right table and matching rows from the left table.
The visualization can be seen using Venn diagram as below in which all the rows from Table 2 and matching rows from Table 1 will be resulted as output.

Use below SQL query to Right Join two tables. Here, Table 1 is Customer table and Table 2 is Accounts table.
SELECT * FROM Customer
RIGHT JOIN Accounts
ON Customer.Account_Type = Accounts.Account_Type;
So, in this case, if we do Left Join, the result should be –

Customer table columns                                                          Accounts table columns

Now, observe couple of things from the result –
1. There are two Account_Type columns appearing in the result.
2. The left Account_Type column is from the Customer table as Customer table is written in the left (Table1).
3. The right Account_Type column is from the Accounts table as Accounts table is written in the right (Table2).
4. In the result, all rows are selected from Accounts table which is Table2 (right table).
5. From the Customer table, Customer_ID = ‘Cust103‘ row is not included here as there was no Account_Type mentioned against it.

Full Outer Join

Full Outer Join will return all the rows from the left table and right table.
The visualization can be seen using Venn diagram as below in which all the rows from Table 1 and Table 2 will be resulted as output.

Use below SQL query to Full Outer Join two tables. Here, Table 1 is Customer table and Table 2 is Accounts table.

SELECT * FROM Customer
FULL OUTER JOIN Accounts
ON Customer.Account_Type = Accounts.Account_Type;

So, in this case, if we do Full Outer Join, the result should be –
Customer table columns                                                       Accounts table columns

Now, observe couple of things from the result –
1. There are two Account_Type columns appearing in the result.
2. The left Account_Type column is from the Customer table as Customer table is written in the left (Table1).
3. The right Account_Type column is from the Accounts table as Accounts table is written in the right (Table2).
4. In the result, all rows are selected from Customer and Accounts table which are Table 1 (left table) and Table2 (right table).

What is Union and How it differs from Joins

Union combines all the rows from the two select statements.

For example, we have two Select statements below. Union combines the result of two SELECT statements.

Select Account_Type from Customer where Balance_Amount >51000
Union
Select Account_Type from Accounts where Credit_Limit >5000
Result here will be –
ACCOUNT_TYPE
Car Loan
Current
Mortgages
Savings

Now, observe couple of points here –
1. There is only one column returned as result.
2. In the result – data retrieved from both the tables are added vertically in a single column. So, we can say that in Union, data scales vertically.
3. The data type and number of columns from both the select statements should be the same else this kind of vertical summation won’t be possible.

Execute Full Outer Join With the same criteria

To see the difference between Union and Full Outer Join, we can execute the Full Outer Join statement with the same criteria with which we executed the Union
Select Customer.Account_Type, Accounts.Account_Type from Customer
Full Outer Join Accounts  
on Customer.Account_Type = Accounts.Account_Type
where Accounts.Credit_Limit >5000 or Balance_Amount >51000;
Result here will be –
Customer table Accounts table
ACCOUNT_TYPE
ACCOUNT_TYPE
Savings
Savings
Savings
Savings
Current
Current
Car Loan
Mortgages

Execute Inner Join With the same criteria

To see the difference between Union and Inner Join, we can execute the Inner Join statement with the same criteria with which we executed the Union
Select Customer.Account_Type, Accounts.Account_Type from Customer
Inner Join Accounts  
on Customer.Account_Type = Accounts.Account_Type
where Accounts.Credit_Limit >5000 or Balance_Amount >51000;
the result here will be –

Customer table Accounts table
ACCOUNT_TYPE
ACCOUNT_TYPE
Savings
Savings
Savings
Savings
Current
Current

Execute Left Join With the same criteria

To see the difference between Union and Left Join, we can execute the Left Join statement with the same criteria with which we executed the Union
Select Customer.Account_Type, Accounts.Account_Type from Customer
Left Join Accounts  
on Customer.Account_Type = Accounts.Account_Type
where Accounts.Credit_Limit >5000 or Balance_Amount >51000;
The result here will be –

Customer table Accounts table
ACCOUNT_TYPE
ACCOUNT_TYPE
Savings
Savings
Current
Current
Savings
Savings

Execute Right Join With the same criteria

To see the difference between Union and Right Join, we can execute the Right Join statement with the same criteria with which we executed the Union
Select Customer.Account_Type, Accounts.Account_Type from Customer
Right Join Accounts  
on Customer.Account_Type = Accounts.Account_Type
where Accounts.Credit_Limit >5000 or Balance_Amount >51000;
Result here will be –
Customer table Accounts table
ACCOUNT_TYPE
ACCOUNT_TYPE
Savings
Savings
Current
Current
Savings
Savings
Mortgages
Car Loan

Composite Primary Key

This is where more than one column forms the PRIMARY KEY and uniquely identifies a row in the table.

This is our Customer table which we discussed earlier.

Customer Table 

Customer_ID 
Customer_Name
Account_Number 
Account_Type
Balance_Amount
Cust100
John Taylor
Acc100
Savings
60000
Cust101
David Williams
Acc101
Current
50000
Cust102
Tracy Smith
Acc 102
Savings
55000
Cust103
Emma Jones
Acc103
52000

In the Customer table, consider a scenario that one customer has more than 1 account with the bank. So, our table will look something like below – 
Customer_ID 
Customer_Name
Account_Number 
Account_Type
Balance_Amount
Cust100
John Taylor
Acc100
Savings
60000
Cust101
David Williams
Acc101
Current
50000
Cust102
Tracy Smith
Acc 102
Savings
55000
Cust103
Emma Jones
Acc103
52000
Cust100
John Taylor
Acc104
Car Loan
100000
Cust101
David Williams
Acc105
Savings
40000
In this case if we execute the below SQL Query –
Select * from Customer where Customer_ID = ‘Cust100’
Then it will return 2 rows. So now, the Customer_ID column is not sufficient to uniquely identify the row in the table. In such case, if we consider  Customer_ID + Account_Number then it will uniquely identify a row in the table. So, Customer_ID + Account_Number cumulative becomes the Primary key of the Customer table. This is called Composite Primary key.
With a Composite key, a table can be created as below –
CREATE TABLE Customer (
Customer_ID varchar(10), 
Customer_Name varchar(40) NOT NULL,
Account_Number varchar(10),
Account_Type varchar(20),
Balance_Amount int,
PRIMARY KEY (Customer_ID, Account_Number),
FOREIGN KEY (Account_Type) REFERENCES Accounts(Account_Type)
);

So, this was the basic database walkthrough with details on Join and Union.
The next article in the series is “Basic SQL commands” which you must know if you are planning to start your Data Science journey.
You can comment on the blog if you want details on any other topic of Database or Data Science. I can create a blog on that as well.

Leave a Comment

Your email address will not be published.