Building a Relational Database from a CSV file using PostgreSQL
3 min read

Building a Relational Database from a CSV file using PostgreSQL

Building a Relational Database from a CSV file using PostgreSQL
Photo by Tobias Fischer / Unsplash

Dataset

The data is from the Superstore Sales dataset, which can be found here. This dataset is stored as a csv file and contains example sales transaction data. Metadata is also shared with the file.

Metadata

Row ID: Unique ID for each row.
Order ID: Unique Order ID for each Customer.
Order Date: Order Date of the product.
Ship Date: Shipping Date of the Product.
Ship Mode:  Shipping Mode specified by the Customer.
Customer ID: Unique ID to identify each Customer.
Customer Name: Name of the Customer.
Segment: The segment where the Customer belongs.
Country: Country of residence of the Customer.
City: City of residence of the Customer.
State: State of residence of the Customer.
Postal Code: Postal Code of every Customer.
Region: Region where the Customer belongs.
Product ID: Unique ID of the Product.
Category: Category of the product ordered.
Subcategory: Subcategory of the product ordered.
Product Name: Name of the Product
Sales: Sales of the Product.
Quantity: Quantity of the Product.
Discount: Discount provided.
Profit: Profit/Loss incurred.

Creating a database

CREATE DATABASE superstore_db;

Loading the data to the PostgreSQL

Using the internal Query Tool in pgAdmin, we create a table with an SQL query. Then we populate the table by importing the csv file.

CREATE TABLE superstore(
    Row_ID INT,
    Order_ID VARCHAR(30),
    Order_Date DATE,
    Ship_Date DATE,
    Ship_Mode VARCHAR(20),
    Customer_ID VARCHAR(20),
    Customer_Name VARCHAR(50),
    Segment VARCHAR(20),
    Country VARCHAR(100),
    City VARCHAR(100),
    State VARCHAR(100),
    Postal_Code INT,
    Region VARCHAR(10),
    Product_ID VARCHAR(20),
    Category VARCHAR(30),
    Subcategory VARCHAR(30),
    Product_Name VARCHAR(300),
    Sales NUMERIC,
    Quantity INT,
    Discount NUMERIC,
    Profit NUMERIC);

Creating an Entity Relationship Diagram (ERD)

Basically, the information contained in the data can be divided into four entities: Customers, Products, Shipment, and Sales.

While examining the address information, we see that there is more than one shipping address for each customer. Since there aren't any data logs about the updates on the customer data, we create a separate table for shipment details including the address information for each order. Besides we create a separate table with the customer_id, customer_name, and segment columns.

Click on the image to display the chart in GitMind

Creating the tables

CREATE TABLE customers (
    customer_id VARCHAR(10),
    customer_name VARCHAR(50),
    segment VARCHAR(20),
    CONSTRAINT pk_customer_id PRIMARY KEY(customer_id)
);

CREATE TABLE shipment (
    order_id VARCHAR(20),
    ship_date DATE,
    ship_mode VARCHAR(20),
    country VARCHAR(100),
    city VARCHAR(100),
    state VARCHAR(100),
    postal_code INT,
    region VARCHAR(10),
    customer_id VARCHAR(10),
    CONSTRAINT pk_order_id PRIMARY KEY(order_id),
    CONSTRAINT fk_customer_id FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

CREATE TABLE products (
    product_id VARCHAR(20),
    product_name VARCHAR(300),
    category VARCHAR(30),
    subcategory VARCHAR(30),
    CONSTRAINT pk_product_id PRIMARY KEY(Product_ID)
);

CREATE TABLE sales (
    order_id VARCHAR(20),
    product_id VARCHAR(20),
    order_date DATE,
    payment NUMERIC,
    quantity INT,
    discount NUMERIC,
    profit NUMERIC
);

Inserting records into the tables

Populating the customers table :

INSERT INTO customers
SELECT
    s.customer_id, 
    customer_name
    segment
    FROM superstore AS s
    INNER JOIN 
    ( SELECT 
        customer_id, max(row_id) AS row_id
      FROM superstore
      GROUP BY customer_id) AS sub
        ON s.row_id = sub.row_id;

Populating the shipment table :

INSERT INTO shipment
SELECT
    order_id,
    ship_date,
    ship_mode,
    country,
    city,
    state,
    postal_code,
    region,
    customer_id
FROM superstore
GROUP BY
    order_id,
    ship_date,
    ship_mode,
    country,
    city,
    state,
    postal_code,
    region,
    customer_id
ORDER BY
    order_id,
    country,
    city,
    state;

Populating the products table :

INSERT INTO products
SELECT
    product_id,
    product_name,
    category,
    subcategory
FROM superstore
GROUP BY
    product_id,
    product_name,
    category,
    subcategory;

Populating the sales table :

INSERT INTO sales
SELECT
    order_id,
    product_id,
    order_date,
    sales AS payment,
    quantity,
    discount,
    profit
FROM superstore;

Adding a surrogate key to the sales table

We'll add a surrogate key to the sales table because the existing attributes are not really suited as the primary key.

ALTER TABLE sales
ADD COLUMN sales_id serial,
ADD CONSTRAINT sales_id_pk PRIMARY KEY(sales_id),
ADD CONSTRAINT order_id_pk FOREIGN KEY(order_id) REFERENCES shipment(order_id),
ADD CONSTRAINT product_id_pk FOREIGN KEY(product_id) REFERENCES products(product_id);