Star vs. Snowflake Schema: SQL Examples & Python Clustering
Snowflake Schema Example
Below are SQL examples demonstrating the structure of a Snowflake schema:
CREATE TABLE sales_fact (
customer_id VARCHAR(10),
car_id VARCHAR(10),
date_id DATE
);
CREATE TABLE customer_dim (
customer_id INT PRIMARY KEY,
name VARCHAR(20),
address VARCHAR(25),
phone_no INT,
city VARCHAR(10)
);
CREATE TABLE car_dim (
car_id INT PRIMARY KEY,
name VARCHAR(10),
year INT,
model VARCHAR(10)
);
CREATE TABLE date_dim (
date_id INT PRIMARY KEY,
day_of_week VARCHAR(10),
month VARCHAR(10),
year INT
);
CREATE TABLE city_dim (
state VARCHAR(10),
zipcode INT
);
CREATE TABLE model_dim (
model_id INT,
year INT,
items_sold INT
);
Star Schema Example
Below are SQL examples demonstrating the structure of a Star schema:
CREATE TABLE sales_fact (
customer_id VARCHAR(10),
car_id INT,
date_id DATE,
amount MONEY
);
CREATE TABLE car_dim (
car_id INT PRIMARY KEY,
name VARCHAR(10),
year INT,
model VARCHAR(10)
);
CREATE TABLE customer_dim (
customer_id INT PRIMARY KEY,
name VARCHAR(20),
address VARCHAR(25),
phone_no INT
);
CREATE TABLE date_dim (
date_id INT PRIMARY KEY,
date DATE,
day_of_week INT,
year INT,
month VARCHAR(10)
);
K-Means Clustering with Python
This Python script demonstrates K-Means clustering on the Iris dataset using scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Apply k-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
# Get cluster labels and cluster centers
labels = kmeans.labels_
centers = kmeans.cluster_centers_
# Reduce the dimensionality for visualization (using PCA)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plot the clusters
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', edgecolor='k', s=80)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Cluster Centers')
plt.title('K-Means Clustering of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
Installation
Install the necessary Python libraries using pip:
pip install numpy matplotlib scikit-learn
Data Warehouse Table Examples
Dimension Table: Time
An example of a time dimension table:
time_id | date | day_of_week | month | quarter | year |
---|---|---|---|---|---|
1 | 2023-01-01 | Sunday | January | Q1 | 2023 |
2 | 2023-01-02 | Monday | January | Q1 | 2023 |
… | … | … | … | … | … |
Column Descriptions:
time_id
: Primary key for the time dimension.date
: The specific date.day_of_week
: Day of the week.month
: Month of the year.quarter
: Quarter of the year.year
: Year.
Fact Table: HospitalFacts
An example of a fact table for hospital data:
patient_id | doctor_id | time_id | admission_count | diagnosis_code | treatment_code | cost |
---|---|---|---|---|---|---|
101 | 201 | 1 | 2 | A | X | 500 |
102 | 202 | 2 | 1 | B | Y | 300 |
… | … | … | … | … | … | … |
Column Descriptions:
patient_id
: Foreign key to the patient dimension.doctor_id
: Foreign key to the doctor dimension.time_id
: Foreign key to the time dimension.admission_count
: Number of times a patient is admitted.diagnosis_code
: Code representing the diagnosis.treatment_code
: Code representing the treatment.cost
: Cost associated with the admission, diagnosis, and treatment.