Database Concepts: Structure, Management, and SQL Queries
Understanding Databases
A database is an organized collection of data that is stored, managed, and accessed electronically. It allows users to efficiently store, retrieve, update, and manipulate data. Databases are commonly used in applications, websites, and businesses to manage structured information.
Database Types
- Relational Databases (RDBMS): Use structured tables with rows and columns (e.g., MySQL, PostgreSQL, SQL Server).
- NoSQL Databases: Designed for unstructured or semi-structured data (e.g., MongoDB, Cassandra, Redis).
- Cloud Databases: Hosted on cloud platforms for scalability (e.g., Amazon RDS, Google Firebase).
- Graph Databases: Store relationships between entities efficiently (e.g., Neo4j).
- Time-Series Databases: Optimized for time-stamped data (e.g., InfluxDB).
- Key-Value Databases: Store data as key-value pairs (e.g., Redis, DynamoDB).
Key Database Components
- Tables (for relational databases)
- Records (rows) and Fields (columns)
- Indexes for faster searches
- Queries for data retrieval
- Transactions to ensure data integrity
Database Management Systems (DBMS)
A DBMS (Database Management System) is software that allows users to create, manage, and interact with databases. It provides tools for storing, retrieving, updating, and managing data efficiently while ensuring security, integrity, and consistency.
DBMS Functions
- Data Storage & Retrieval: Efficiently stores and retrieves data.
- Data Security: Protects data with authentication and authorization.
- Data Integrity: Ensures accuracy and consistency of data.
- Concurrency Control: Manages multiple users accessing data simultaneously.
- Backup & Recovery: Prevents data loss with regular backups.
DBMS Types
- Relational DBMS (RDBMS): Uses tables (e.g., MySQL, PostgreSQL, Oracle).
- NoSQL DBMS: Handles unstructured data (e.g., MongoDB, Cassandra).
- Hierarchical DBMS: Uses a tree-like structure (e.g., IBM IMS).
- Network DBMS: Uses a graph-like structure for relationships (e.g., IDMS).
Database Normalization Explained
Normalization is the process of organizing data in a database to minimize redundancy and improve data integrity. It involves breaking down large tables into smaller, related tables and establishing relationships between them.
Goals of Normalization
- Eliminate Data Redundancy: Avoid storing duplicate data.
- Ensure Data Integrity: Prevent anomalies during insert, update, or delete operations.
- Improve Data Consistency: Keep data accurate and consistent across tables.
- Optimize Query Performance: Reduce the complexity of data retrieval.
Normal Forms (NF)
Normalization is achieved through a series of stages called Normal Forms:
- First Normal Form (1NF): Eliminates duplicate columns and ensures each column contains atomic (indivisible) values.
- Second Normal Form (2NF): Ensures 1NF and removes partial dependencies (i.e., all non-key attributes depend entirely on the primary key).
- Third Normal Form (3NF): Ensures 2NF and removes transitive dependencies (i.e., non-key attributes must not depend on other non-key attributes).
- Boyce-Codd Normal Form (BCNF): A stricter version of 3NF, ensuring that every determinant is a candidate key.
- Fourth Normal Form (4NF): Ensures BCNF and removes multi-valued dependencies.
- Fifth Normal Form (5NF): Ensures 4NF and resolves join dependencies.
First Normal Form (1NF)
Rule:
- Eliminate duplicate columns.
- Ensure each column contains atomic (indivisible) values.
- Each row must have a unique identifier (Primary Key).
Example (Before 1NF – Unnormalized Table):
Student ID | Name | Subject |
---|---|---|
101 | Farheed | English, Math |
102 | Saroj | English, Nepali |
Problem: The “Subject” column contains multiple values (not atomic).
After applying 1NF:
Student ID | Name | Subject |
---|---|---|
101 | Farheed | English |
101 | Farheed | Math |
102 | Saroj | English |
102 | Saroj | Nepali |
Now, the table follows 1NF since each column contains atomic values.
Second Normal Form (2NF)
Rule:
- Must be in 1NF.
- Remove partial dependencies (a column should depend on the whole primary key, not part of it).
- If a composite primary key exists, all non-key attributes must depend on the entire key.
Example (Before 2NF – Violating Partial Dependency):
(Assuming composite key: Student_ID, Course)
Student ID | Course | Instructor | Instructor Phone No |
---|---|---|---|
101 | Math | Mr. Saroj | 9803252836 |
101 | Science | Mr. Farheed | 9818471478 |
Problem: “Instructor” and “Instructor Phone No” depend only on “Course”, which is part of the composite key, not the entire key.
After applying 2NF:
Student-Course Table:
Student ID | Course |
---|---|
101 | Math |
101 | Science |
Course-Instructor Table:
Course | Instructor | Instructor Phone No |
---|---|---|
Math | Mr. Saroj | 9803252836 |
Science | Mr. Farheed | 9818471478 |
Now, every non-key column depends on the whole primary key of its respective table, ensuring 2NF.
Third Normal Form (3NF)
Rule:
- Must be in 2NF.
- Remove transitive dependencies (when a non-key column depends on another non-key column rather than the primary key).
Example (Before 3NF – Transitive Dependency):
Student ID | Name | Course | Department | Dept Head |
---|---|---|---|---|
101 | Kamal | Math | Science Dept | Mr. Saroj |
102 | Lalit | English | Arts Dept | Mr. Farheed |
Problem: “Dept Head” depends on “Department” (a non-key attribute), which in turn depends on “Student ID” (the primary key). This is a transitive dependency.
After applying 3NF:
Student-Course Table:
Student ID | Name | Course | Department |
---|---|---|---|
101 | Kamal | Math | Science Dept |
102 | Lalit | English | Arts Dept |
Department Table:
Department | Dept Head |
---|---|
Science Dept | Mr. Saroj |
Arts Dept | Mr. Farheed |
Now, there are no transitive dependencies, ensuring 3NF.
Fourth Normal Form (4NF)
Rule:
- Must be in 3NF (or BCNF).
- Eliminate multi-valued dependencies (when two or more independent attributes depend on the same primary key).
Example (Before 4NF – Multi-Valued Dependency):
Student ID | Course | Hobby |
---|---|---|
101 | Math | Football |
101 | Math | Cricket |
101 | English | Football |
101 | English | Cricket |
Problem: “Course” and “Hobby” are independent multi-valued attributes that both depend on “Student ID,” causing redundancy (e.g., the fact that student 101 takes Math is listed twice, once for each hobby).
After applying 4NF:
Student-Course Table:
Student ID | Course |
---|---|
101 | Math |
101 | English |
Student-Hobby Table:
Student ID | Hobby |
---|---|
101 | Football |
101 | Cricket |
Now, the multi-valued dependency is removed by splitting the table, ensuring 4NF.
Why is Normalization Important?
- Eliminates Data Redundancy:
- Redundant data wastes storage space and makes updates more complex.
- Example: Without normalization, a Customer table might store the same address multiple times if the customer places multiple orders.
- Solution: Split into separate Customer and Address tables, linked appropriately.
- Improves Data Integrity and Consistency:
- Ensures that changes in one place are reflected consistently.
- Example: If a customer’s phone number is stored in multiple places, an update might be missed in some locations, leading to inconsistency.
- Solution: Store the phone number in one central customer table and reference it where needed.
- Simplifies Data Modification (Insert, Update, Delete):
- Without normalization: Updating a value (like a department head’s name) might require changes in multiple rows where that department appears.
- With normalization: Data is updated in one place (e.g., the Department table), reducing errors and effort.
- Example: If “Publisher Name” is repeated in a Books table, a name change requires modifying multiple book rows.
- Solution: Create a separate Publisher table and link it via a foreign key in the Books table.
- Improves Query Performance:
- Smaller, well-structured tables generally allow for faster searches, indexing, and joins.
- Example: Searching for all books by “Pearson Publishers” is quicker if there’s an indexed Publisher table linked to the Books table, rather than scanning a potentially huge Books table containing repeated publisher names.
- Maintains Data Integrity with Relationships:
- Enforces Referential Integrity using Foreign Keys.
- Ensures that relationships between tables are valid and prevents orphaned records (e.g., an order record referring to a customer ID that no longer exists).
- Example: A BookIssue table should not allow an entry for a BookId that does not exist in the Book table.
What is a Data Model?
A data model is a conceptual framework that defines how data is structured, stored, and manipulated in a database. It provides a blueprint for designing a database by specifying entities, attributes, relationships, constraints, and rules for organizing data.
Understanding Stored Procedures
A Stored Procedure is a precompiled set of SQL statements that can be stored in the database and executed multiple times by name. It helps improve performance (by reducing network traffic and precompiling execution plans), maintainability (by centralizing logic), and security (by granting execute permissions instead of direct table access) in database operations.
Database Triggers Explained
A Trigger is a special type of stored procedure that automatically executes (or “fires”) when a specified event (INSERT
, UPDATE
, DELETE
) occurs on a specific table or view. Triggers are often used to enforce complex business rules, maintain audit logs, ensure data consistency across related tables, or automate actions based on data modifications.
Union Compatibility in Databases
Union Compatibility refers to a condition in relational algebra and SQL where two tables (or the results of two queries) can be combined using set operations like UNION
, INTERSECT
, and EXCEPT
(or MINUS
). For two relations to be union-compatible, they must satisfy two conditions:
- They must have the same number of attributes (columns).
- The data types of corresponding attributes must be compatible or implicitly convertible.
Data Definition Language (DDL)
DDL (Data Definition Language) refers to a subset of SQL commands used to define, modify, and manage the structure of database objects, such as tables, indexes, views, schemas, users, and permissions. DDL commands do not manipulate the data itself but instead focus on the database schema or structure. Common DDL commands include CREATE
, ALTER
, and DROP
.
Data Manipulation Language (DML)
DML (Data Manipulation Language) refers to a subset of SQL commands used to manipulate and manage the data stored within database tables. Unlike DDL, which deals with the structure, DML allows you to insert, update, delete, and retrieve data from tables. Common DML commands include SELECT
, INSERT
, UPDATE
, and DELETE
.
Database Security Fundamentals
Database Security refers to the collective measures, controls, and practices used to protect a database from unauthorized access, misuse, modification, disclosure, or destruction. It aims to ensure the confidentiality (preventing unauthorized disclosure), integrity (preventing unauthorized modification or deletion), and availability (ensuring authorized users can access data when needed) of the data stored in the database system. Key aspects include authentication, authorization (permissions/privileges), encryption, auditing, and backup/recovery strategies.
Relational Algebra Concepts
Relational Algebra is a procedural query language based on set theory and algebra. It provides a formal foundation for relational database operations and is used to specify how to retrieve and manipulate data stored in relations (tables). It defines a set of operations that take one or more relations as input and produce a new relation as output. Key operations include Selection (σ), Projection (π), Union (∪), Set Difference (-), Cartesian Product (×), Rename (ρ), and Join (⋈).
Functional Dependencies in Databases
A Functional Dependency (FD) is a constraint between two sets of attributes in a relation (table) from a database. For a given relation R, an attribute Y is functionally dependent on an attribute X (denoted as X → Y
) if and only if each X value in R is associated with precisely one Y value in R. In simpler terms, the value of X uniquely determines the value of Y.
Notation:
If attribute set X determines attribute set Y, it is written as:
X → Y
This means that for each unique value (or combination of values) of X, there is only one corresponding value (or combination of values) of Y.
Types of Functional Dependencies
- Trivial Functional Dependency:
A functional dependency
X → Y
is trivial if Y is a subset of X (Y ⊆ X).Example:
{A, B} → A
Since A is already part of the determinant {A, B}, knowing {A, B} trivially determines A. - Non-Trivial Functional Dependency:
A functional dependency
X → Y
is non-trivial if Y is not a subset of X (Y <0xE2><0x8A><0x88> X).Example:
Student_ID → Student_Name
Here, Student_ID determines Student_Name, and “Student_Name” is not part of the determinant “Student_ID.” This is the most common and useful type of FD. - Partial Functional Dependency:
A partial dependency occurs when a non-prime attribute (an attribute not part of any candidate key) is functionally dependent on only a part of a composite candidate key.
Example: Consider a table
Enrollment(Student_ID, Course_ID, Course_Name)
where{Student_ID, Course_ID}
is the primary key.Student ID Course ID Course Name 101 CSE101 Database 102 CSE102 Network Here, we have the FD:
{Student_ID, Course_ID} → Course_Name
(as it’s the primary key).
However, we also likely have:Course_ID → Course_Name
.
Since Course_Name (a non-prime attribute) depends only on Course_ID (part of the primary key), this is a partial dependency.
Problem: Leads to data redundancy and violates 2NF. - Transitive Functional Dependency:
A transitive dependency occurs when there is an indirect relationship that causes a functional dependency. If we have
X → Y
andY → Z
, then we also haveX → Z
transitively, where X is a candidate key (or superkey), and Y is not a candidate key, and Z is a non-prime attribute.Example: Consider a table
StudentDept(Student_ID, Department, Dept_Head)
Student ID Department Dept Head 101 Science Dr. Saroj 102 Arts Dr. Farheed Here:
Student_ID → Department
(A student belongs to one department).
And:Department → Dept_Head
(A department has one head).
Thus,Student_ID → Dept_Head
exists transitively via Department.
Problem: Leads to data redundancy and update anomalies, violating 3NF. - Multivalued Dependency (MVD):
A multivalued dependency exists when the presence of one row in a table implies the presence of one or more other rows in that same table. It occurs when two or more independent, multi-valued attributes are dependent on a single key attribute. Denoted as
X →→ Y
.Example: Consider
EmployeeSkills(Employee, Skill, Hobby)
Employee Skill Hobby John SQL Chess John Python Chess John SQL Football John Python Football Here, for employee ‘John’, the skills {SQL, Python} and hobbies {Chess, Football} are independent. This leads to MVDs:
Employee →→ Skill
andEmployee →→ Hobby
. To represent all combinations, redundancy is introduced.
Problem: Leads to data anomalies and violates 4NF.
Relational Algebra and SQL Query Examples
Given the schema as below:
Sailors(sid, sname, rating, age)
Boats(bid, bname, color)
Reserves(sid, bid, reserve_date)
A. Find name, rating, and age of all sailors
Relational Algebra:
π sname, rating, age (Sailors)
SQL Query:
SELECT sname, rating, age
FROM Sailors;
B. Find the name and color of boats having color ‘blue’
Relational Algebra:
π bname, color (σ color='blue' (Boats))
SQL Query:
SELECT bname, color
FROM Boats
WHERE color = 'blue';
C. Find the name of the boats reserved on date ‘2021-07-20’
Relational Algebra:
π bname (σ reserve_date='2021-07-20' (Reserves ⋈ Boats))
SQL Query:
SELECT b.bname
FROM Boats b
JOIN Reserves r ON b.bid = r.bid
WHERE r.reserve_date = '2021-07-20';
D. Find the names of boats having color ‘green’
Relational Algebra:
π bname (σ color='green' (Boats))
SQL Query:
SELECT bname
FROM Boats
WHERE color = 'green';
E. Use LEFT and RIGHT OUTER JOIN between Sailors and Reserves
LEFT OUTER JOIN (Show all sailors, and their reservations if any):
SELECT s.sid, s.sname, r.bid, r.reserve_date
FROM Sailors s
LEFT OUTER JOIN Reserves r ON s.sid = r.sid;
RIGHT OUTER JOIN (Show all reservations, and the sailor details if available):
SELECT s.sid, s.sname, r.bid, r.reserve_date
FROM Sailors s
RIGHT OUTER JOIN Reserves r ON s.sid = r.sid;
SQL Query Examples: Book Database
Consider the following table schemas:
Book(BookId, Title, Publisher, PublishedDate, Price)
Book_Authors(BookId, AuthorName)
Borrower(CardNo, Name, Address, Contact)
BookIssue(BookId, CardNo, IssuedDate, DueDate)
A. Find the name of the books taken by “Ramesh Sharma”
SQL Query:
SELECT B.Title
FROM Book B
JOIN BookIssue BI ON B.BookId = BI.BookId
JOIN Borrower BR ON BI.CardNo = BR.CardNo
WHERE BR.Name = 'Ramesh Sharma';
B. Find the issued book name and the borrower’s name who have issued date before 15th May 2023 and due date after 15th June 2023
SQL Query:
SELECT B.Title, BR.Name
FROM Book B
JOIN BookIssue BI ON B.BookId = BI.BookId
JOIN Borrower BR ON BI.CardNo = BR.CardNo
WHERE BI.IssuedDate < '2023-05-15'
AND BI.DueDate > '2023-06-15';
C. Find the borrower’s name who has taken the book written by “Laxmi Prashad Devkota”
SQL Query:
SELECT DISTINCT BR.Name
FROM Borrower BR
JOIN BookIssue BI ON BR.CardNo = BI.CardNo
JOIN Book_Authors BA ON BI.BookId = BA.BookId
WHERE BA.AuthorName = 'Laxmi Prashad Devkota';
D. Find the number of books issued on the date 25th April 2023
SQL Query:
SELECT COUNT(*) AS Number_of_Books
FROM BookIssue
WHERE IssuedDate = '2023-04-25';
E. Find the borrower’s name who has taken a book having a price greater than the average price.
SQL Query:
SELECT DISTINCT BR.Name
FROM Borrower BR
JOIN BookIssue BI ON BR.CardNo = BI.CardNo
JOIN Book B ON BI.BookId = B.BookId
WHERE B.Price > (SELECT AVG(Price) FROM Book);