Star vs. Snowflake Schema: SQL Examples & Python Clustering

Snowflake Schema Example

In a snowflake schema, dimension tables are further normalized into sub-dimension tables (here, city_dim and model_dim). The SQL below sketches that structure:

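-- Sub-dimension tables: the normalized "snowflake" branches
CREATE TABLE city_dim (
    city VARCHAR(10) PRIMARY KEY,
    state VARCHAR(10),
    zipcode INT
);
CREATE TABLE model_dim (
    model_id INT PRIMARY KEY,
    year INT,
    items_sold INT
);
-- Dimension tables, each pointing at its sub-dimension where one exists
CREATE TABLE customer_dim (
    customer_id INT PRIMARY KEY,
    name VARCHAR(20),
    address VARCHAR(25),
    phone_no INT,
    city VARCHAR(10) REFERENCES city_dim (city)
);
CREATE TABLE car_dim (
    car_id INT PRIMARY KEY,
    name VARCHAR(10),
    year INT,
    model_id INT REFERENCES model_dim (model_id)
);
CREATE TABLE date_dim (
    date_id INT PRIMARY KEY,
    day_of_week VARCHAR(10),
    month VARCHAR(10),
    year INT
);
-- Fact table: key types match the dimension keys they reference
CREATE TABLE sales_fact (
    customer_id INT REFERENCES customer_dim (customer_id),
    car_id INT REFERENCES car_dim (car_id),
    date_id INT REFERENCES date_dim (date_id)
);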
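
To see what the extra normalization means for queries, here is a minimal sketch using Python's built-in sqlite3 module. It assumes that city is the key shared by customer_dim and city_dim, trims the tables to the columns the query needs, and uses made-up sample rows; none of that comes from the schema above.

```python
import sqlite3

# In-memory database; using city as the join key is an assumption of this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city_dim (city TEXT PRIMARY KEY, state TEXT, zipcode INTEGER);
CREATE TABLE customer_dim (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT REFERENCES city_dim (city)
);
CREATE TABLE sales_fact (customer_id INTEGER REFERENCES customer_dim (customer_id));

-- Hypothetical sample rows
INSERT INTO city_dim VALUES ('Austin', 'TX', 73301), ('Denver', 'CO', 80201);
INSERT INTO customer_dim VALUES (1, 'Ann', 'Austin'), (2, 'Bob', 'Denver');
INSERT INTO sales_fact VALUES (1), (1), (2);
""")

# Fact -> dimension -> sub-dimension: two joins are needed to reach the state.
sales_by_state = conn.execute("""
    SELECT ci.state, COUNT(*) AS sales
    FROM sales_fact AS f
    JOIN customer_dim AS cu ON cu.customer_id = f.customer_id
    JOIN city_dim AS ci ON ci.city = cu.city
    GROUP BY ci.state
    ORDER BY ci.state
""").fetchall()
print(sales_by_state)
```

The second join is the cost of normalization: state lives only in city_dim, so any report by state has to pass through customer_dim first.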

Star Schema Example

In a star schema, the fact table joins directly to denormalized dimension tables, with no sub-dimensions. The SQL below sketches that structure:

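-- Dimension tables
CREATE TABLE car_dim (
    car_id INT PRIMARY KEY,
    name VARCHAR(10),
    year INT,
    model VARCHAR(10)
);
CREATE TABLE customer_dim (
    customer_id INT PRIMARY KEY,
    name VARCHAR(20),
    address VARCHAR(25),
    phone_no INT
);
CREATE TABLE date_dim (
    date_id INT PRIMARY KEY,
    date DATE,
    day_of_week INT,
    year INT,
    month VARCHAR(10)
);
-- Fact table: key types match the dimension keys they reference
CREATE TABLE sales_fact (
    customer_id INT REFERENCES customer_dim (customer_id),
    car_id INT REFERENCES car_dim (car_id),
    date_id INT REFERENCES date_dim (date_id),
    amount DECIMAL(10, 2)  -- MONEY is vendor-specific; DECIMAL is standard SQL
);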
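
A typical star-schema query needs only one join per dimension. The sketch below illustrates this with Python's built-in sqlite3 module; the sample rows are made up, the tables are trimmed to the columns used, and amount is stored as REAL purely for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE car_dim (car_id INTEGER PRIMARY KEY, name TEXT, year INTEGER, model TEXT);
CREATE TABLE sales_fact (
    car_id INTEGER REFERENCES car_dim (car_id),
    amount REAL
);

-- Hypothetical sample rows
INSERT INTO car_dim VALUES (1, 'Civic', 2020, 'LX'), (2, 'Corolla', 2021, 'SE');
INSERT INTO sales_fact VALUES (1, 20000.0), (1, 21000.0), (2, 19000.0);
""")

# Fact -> dimension: a single join reaches every attribute of the car.
revenue_by_model = conn.execute("""
    SELECT c.model, SUM(f.amount) AS revenue
    FROM sales_fact AS f
    JOIN car_dim AS c ON c.car_id = f.car_id
    GROUP BY c.model
    ORDER BY c.model
""").fetchall()
print(revenue_by_model)
```

Because model is stored directly on car_dim, grouping by it needs no second hop, at the price of repeating model details on every car row.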

K-Means Clustering with Python

This Python script demonstrates K-Means clustering on the Iris dataset using scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Apply k-means clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # explicit n_init keeps results consistent across scikit-learn versions
kmeans.fit(X)

# Get cluster labels and cluster centers
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Reduce the dimensionality for visualization (using PCA)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot the clusters
plt.figure(figsize=(8, 6))

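# The centers live in the original 4-D feature space, so project them
# into the same PCA space as the data before plotting
centers_pca = pca.transform(centers)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', edgecolor='k', s=80)
plt.scatter(centers_pca[:, 0], centers_pca[:, 1], c='red', marker='X', s=200, label='Cluster Centers')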

plt.title('K-Means Clustering of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
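
For intuition about what the fit step does, here is a bare-bones, standard-library sketch of the assign-then-update loop (Lloyd's iteration) that underlies K-Means. It is not scikit-learn's implementation: the kmeans function, its naive "first k points" initialization, and the toy data are all assumptions made for illustration, and scikit-learn adds smarter initialization and optimizations on top.

```python
def kmeans(points, k, n_iter=100):
    """Plain Lloyd's iteration on tuples of floats; returns (centers, labels)."""
    centers = list(points[:k])  # naive init: first k points (illustrative only)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center
        # (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return centers, labels


# Toy data: two well-separated groups of 2-D points
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, labels = kmeans(points, k=2)
print(labels)
```

Even though both initial centers start in the same group here, the loop separates the two groups after a couple of iterations. On harder data, plain Lloyd's iteration can settle in a poor local optimum, which is one reason to run several initializations.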

Installation

Install the necessary Python libraries using pip:

pip install numpy matplotlib scikit-learn

Data Warehouse Table Examples

Dimension Table: Time

An example of a time dimension table:

time_id  date        day_of_week  month    quarter  year
1        2023-01-01  Sunday       January  Q1       2023
2        2023-01-02  Monday       January  Q1       2023

Column Descriptions:

  • time_id: Primary key for the time dimension.
  • date: The specific date.
  • day_of_week: Day of the week.
  • month: Month of the year.
  • quarter: Quarter of the year.
  • year: Year.
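
Rows like these are usually generated rather than typed by hand. The following is a minimal sketch using only the standard library; the build_time_dim function and its parameters are hypothetical names introduced for illustration.

```python
from datetime import date, timedelta

# Explicit name lists avoid any dependence on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]


def build_time_dim(start, n_days):
    """Return one time-dimension row (as a dict) per day, starting at `start`."""
    rows = []
    for offset in range(n_days):
        d = start + timedelta(days=offset)
        rows.append({
            "time_id": offset + 1,                      # surrogate key, 1-based
            "date": d.isoformat(),
            "day_of_week": DAY_NAMES[d.weekday()],      # weekday(): Monday is 0
            "month": MONTH_NAMES[d.month - 1],
            "quarter": f"Q{(d.month - 1) // 3 + 1}",
            "year": d.year,
        })
    return rows


rows = build_time_dim(date(2023, 1, 1), 2)
for row in rows:
    print(row)
```

Calling build_time_dim(date(2023, 1, 1), 2) reproduces the two sample rows in the table above.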

Fact Table: HospitalFacts

An example of a fact table for hospital data:

patient_id  doctor_id  time_id  admission_count  diagnosis_code  treatment_code  cost
101         201        1        2                A               X               500
102         202        2        1                B               Y               300

Column Descriptions:

  • patient_id: Foreign key to the patient dimension.
  • doctor_id: Foreign key to the doctor dimension.
  • time_id: Foreign key to the time dimension.
  • admission_count: Number of times a patient is admitted.
  • diagnosis_code: Code representing the diagnosis.
  • treatment_code: Code representing the treatment.
  • cost: Cost associated with the admission, diagnosis, and treatment.
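
To show how the fact table and the time dimension work together, the sketch below loads the sample rows into an in-memory SQLite database and totals admissions and cost per quarter. The table names time_dim and hospital_facts and the column types are assumptions for this example; the row values are read from the sample tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_id INTEGER PRIMARY KEY, date TEXT, day_of_week TEXT,
    month TEXT, quarter TEXT, year INTEGER
);
CREATE TABLE hospital_facts (
    patient_id INTEGER, doctor_id INTEGER,
    time_id INTEGER REFERENCES time_dim (time_id),
    admission_count INTEGER, diagnosis_code TEXT, treatment_code TEXT, cost REAL
);
INSERT INTO time_dim VALUES
    (1, '2023-01-01', 'Sunday', 'January', 'Q1', 2023),
    (2, '2023-01-02', 'Monday', 'January', 'Q1', 2023);
INSERT INTO hospital_facts VALUES
    (101, 201, 1, 2, 'A', 'X', 500),
    (102, 202, 2, 1, 'B', 'Y', 300);
""")

# Join each fact row to its time row, then aggregate the measures per quarter.
summary = conn.execute("""
    SELECT t.year, t.quarter, SUM(f.admission_count), SUM(f.cost)
    FROM hospital_facts AS f
    JOIN time_dim AS t ON t.time_id = f.time_id
    GROUP BY t.year, t.quarter
""").fetchall()
print(summary)
```

The fact table holds only keys and measures; descriptive attributes such as quarter come from the dimension at query time.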
