Database Normalization, Transactions, and Concurrency Control

Normalization

Normalization consists of evaluating and correcting table structures to minimize data redundancy, reduce data anomalies, and assign attributes to tables based on determinants. It involves applying formal normal forms (1NF, 2NF, 3NF). Higher normal forms are generally preferred over lower ones.

Normalization is best performed early in the overall design process, for example, when designing a new database structure.

Potential Issues Leading to Low Normal Forms

  • Primary keys containing NULL values
  • Data inconsistencies (e.g., from mistyping)
  • Insertion and deletion anomalies
  • Data redundancy (often manifested as repeating groups)

Normal Form Goals

  • 1NF: Eliminates repeating groups.
  • 2NF: Eliminates redundant data by removing partial dependencies.
  • 3NF: Eliminates fields not dependent on the primary key by removing transitive dependencies.
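The 3NF goal above can be sketched concretely. In this minimal, hypothetical example (the table and column names are invented for illustration), a flat order table would repeat each customer's city on every order row; removing the transitive dependency (order → customer_id → customer_city) yields two 3NF tables:

```python
import sqlite3

# Hypothetical schema: customer_city depends on customer_id, not on the
# order key, so it is moved into its own table (3NF decomposition).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_city TEXT NOT NULL      -- depends only on the key
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
""")
con.execute("INSERT INTO customers VALUES (1, 'Oslo')")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 99.0), (11, 1, 25.0)])
# The city is stored once, however many orders the customer places.
print(con.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 1
print(con.execute("SELECT COUNT(*) FROM orders").fetchone()[0])     # 2
```

In the flat design, mistyping the city on one order row would create exactly the kind of data inconsistency listed earlier.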

Denormalization

Denormalization deliberately moves a design to a lower normal form, trading increased data redundancy for faster reads. It is sometimes used strategically to optimize read-heavy workloads.

Transaction Management

In transaction management, the ACID properties express the desirable characteristics of a database transaction:

  • Atomicity: A transaction is an atomic unit; it is either fully completed or fully aborted.
  • Consistency: A transaction must transform the database from one consistent state to another.
  • Isolation: Transactions execute independently of one another; the effects of an incomplete transaction are not visible to other transactions.
  • Durability: Once a transaction is committed, its changes are permanent and survive any subsequent system failure.
  • Serializability: Closely related to isolation (though not itself one of the four ACID properties), serializability ensures that concurrently executing transactions produce the same result as if they were executed sequentially.

Transaction Sequence Termination

A transaction sequence continues until one of the following occurs:

  • A COMMIT statement is reached (successful completion).
  • A ROLLBACK statement is reached (transaction aborted).
  • The end of the program is reached (implicit commit or rollback, depending on the system).
  • The program is abnormally terminated (rollback typically occurs).
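The COMMIT and ROLLBACK outcomes above can be demonstrated with Python's sqlite3 module (the account table here is a hypothetical example):

```python
import sqlite3

# Sketch of explicit COMMIT and ROLLBACK against an in-memory database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO account VALUES (1, 100.0)")
con.commit()

# Successful transaction: the change becomes permanent.
con.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
con.commit()                   # COMMIT: reach a new consistent state

# Aborted transaction: the uncommitted change is undone.
con.execute("UPDATE account SET balance = balance - 999 WHERE id = 1")
con.rollback()                 # ROLLBACK: restore the last committed state

balance = con.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(balance)   # 70.0
```

Note that sqlite3 opens a transaction implicitly on the first data-modifying statement, which is why the explicit commit() and rollback() calls delimit the transaction sequence.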

Concurrency Control

Concurrency control ensures that simultaneous transaction executions in a multi-user database system occur correctly and maintain data integrity.

Common Concurrency Problems

  • Lost Update: Occurs when two concurrent transactions update the same data item, and one update overwrites the other without considering it.
  • Uncommitted Data (Dirty Read): Occurs when one transaction reads data that has been modified by another transaction that has not yet committed. If the modifying transaction rolls back, the first transaction has read invalid data.
  • Inconsistent Retrievals: Occurs when a transaction reads several values, but another transaction modifies some of them during the first transaction’s execution, leading to inconsistent results within the reading transaction.
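The lost-update problem is easy to reproduce outside a DBMS. In this illustrative sketch (plain Python threads on a shared value, no database involved), two concurrent read-modify-write sequences run without any locking, and one update overwrites the other:

```python
import threading
import time

# Shared "data item" with no concurrency control.
balance = {"value": 100}

def add(amount):
    read = balance["value"]            # 1. read the current value
    time.sleep(0.05)                   # 2. the other transaction runs meanwhile
    balance["value"] = read + amount   # 3. write back, clobbering the other update

t1 = threading.Thread(target=add, args=(10,))
t2 = threading.Thread(target=add, args=(20,))
t1.start(); t2.start()
t1.join(); t2.join()

# A serial execution would yield 130; here one update is lost,
# so the result is 110 or 120 depending on which write lands last.
print(balance["value"])
```

Both threads read 100 before either writes, so whichever writes second silently discards the other's update.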

A serializable schedule ensures that the outcome of concurrent transactions is equivalent to some serial execution of those transactions. Achieving this requires a concurrency control method: locking, time stamping, or optimistic techniques.

Locking Methods

The locking method facilitates the isolation of data items used in concurrently executing transactions. A lock guarantees exclusive or shared use of a data item to a current transaction.

  • Pessimistic Locking: Assumes conflicts are likely and locks data items before access to prevent conflicts.
  • Lock Manager: A system component responsible for assigning and managing the locks used by transactions.

Levels of Locking

  • Database-level lock
  • Table-level lock
  • Page-level lock (a disk page)
  • Row-level lock (a specific table row)
  • Field-level lock (a specific attribute value)

Types of Locks

  • Binary Lock: Has two states: locked (1) or unlocked (0).
  • Shared Lock (S): Allows concurrent transactions to read the same data item. Multiple shared locks can exist simultaneously on the same item.
  • Exclusive Lock (X): Reserves access exclusively for the transaction that holds the lock (typically for writing). Only one exclusive lock can exist on an item at a time.

Common lock modes include: Unlocked, Shared (Read), and Exclusive (Write).
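The shared/exclusive rules above reduce to a compatibility matrix: a requested lock is granted only if it is compatible with every lock other transactions already hold on the item. A minimal sketch:

```python
# Lock compatibility matrix: (held mode, requested mode) -> grantable?
COMPATIBLE = {
    ("S", "S"): True,    # many readers may share an item
    ("S", "X"): False,   # a writer must wait for readers
    ("X", "S"): False,   # readers must wait for a writer
    ("X", "X"): False,   # only one writer at a time
}

def can_grant(requested, held_modes):
    """Grant the requested mode only if compatible with all held locks."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))   # True  : shared locks coexist
print(can_grant("X", ["S"]))        # False : exclusive conflicts with shared
print(can_grant("X", []))           # True  : item unlocked, grant exclusive
```

This is essentially the decision a lock manager makes for every lock request, at whatever granularity (database, table, page, row, or field) the system supports.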

Two-Phase Locking (2PL)

2PL defines how transactions acquire and relinquish locks. It guarantees serializability but does not prevent deadlocks.

  1. Growing Phase: The transaction acquires all required locks without releasing any.
  2. Shrinking Phase: The transaction releases its locks and cannot acquire any new ones.

In essence, 2PL guarantees that a transaction holds every lock it needs for its conflicting operations before releasing any of them, which is what makes the resulting schedule serializable.
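The two-phase rule can be enforced mechanically: once the first lock is released, any further lock request is a protocol violation. A minimal sketch (lock bookkeeping reduced to a set of item names, no actual blocking):

```python
class TwoPhaseTransaction:
    """Tracks the growing/shrinking phases of the 2PL protocol."""

    def __init__(self):
        self.held = set()
        self.shrinking = False   # flips to True on the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after first unlock")
        self.held.add(item)      # growing phase: acquire freely

    def unlock(self, item):
        self.shrinking = True    # shrinking phase begins with the first release
        self.held.discard(item)

t = TwoPhaseTransaction()
t.lock("A"); t.lock("B")   # growing phase: acquire everything needed
t.unlock("A")              # shrinking phase starts here
try:
    t.lock("C")            # illegal under 2PL
except RuntimeError as e:
    print(e)
```

A real lock manager would also block or queue conflicting requests; this sketch only enforces the phase ordering that gives 2PL its name.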

Deadlocks

A deadlock occurs when two or more transactions wait indefinitely for each other to release locks they need. This is also known as a deadly embrace.

Techniques for Handling Deadlocks

  • Deadlock Prevention: Prevents deadlocks by restricting how transactions request locks.
  • Deadlock Detection: Allows deadlocks to occur, detects them, and aborts one or more transactions to resolve the deadlock.
  • Deadlock Avoidance: Transactions obtain all locks they need in advance; if any lock is unavailable, the transaction waits.
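Deadlock detection is commonly described in terms of a wait-for graph: each node is a transaction, and an edge T1 → T2 means T1 is waiting for a lock T2 holds. A cycle in this graph is a deadlock. A minimal detection sketch using depth-first search:

```python
def has_cycle(wait_for):
    """Return True if the wait-for graph (dict: txn -> list of txns it
    waits for) contains a cycle, i.e. a deadlock."""
    visited, on_path = set(), set()

    def visit(node):
        if node in on_path:              # back edge: cycle found
            return True
        if node in visited:
            return False
        visited.add(node)
        on_path.add(node)
        if any(visit(n) for n in wait_for.get(node, [])):
            return True
        on_path.remove(node)
        return False

    return any(visit(t) for t in wait_for)

# T1 waits for T2 and T2 waits for T1: a deadly embrace.
print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))   # True
print(has_cycle({"T1": ["T2"], "T2": []}))       # False
```

On detecting a cycle, the system picks a victim transaction in the cycle and rolls it back, which breaks the wait.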

Time Stamping Method

Time stamping assigns a unique, global timestamp to each transaction, and conflicts are resolved by comparing timestamps (for example, older transactions may be given priority). This method avoids deadlocks, but stamping and checking every read and write adds memory and processing overhead.
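One common timestamp-based conflict policy is the wait-die scheme, sketched here: an older transaction (smaller timestamp) may wait for a younger one, but a younger transaction requesting a lock held by an older one is rolled back ("dies") and later restarts with its original timestamp, so it never waits on an older transaction and no cycle of waits can form:

```python
def wait_die(requester_ts, holder_ts):
    """Decide the fate of a lock request under the wait-die scheme.
    Smaller timestamp = older transaction."""
    if requester_ts < holder_ts:
        return "wait"    # older requester is allowed to wait
    return "die"         # younger requester is aborted and restarted

print(wait_die(5, 9))    # 'wait' : requester (ts=5) is older than the holder
print(wait_die(9, 5))    # 'die'  : younger requester is rolled back
```

Because waits only ever point from older to younger transactions, the wait-for graph can never contain a cycle, which is why the method is deadlock-free.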

Optimistic Concurrency Control (OCC)

Optimistic Concurrency Control (OCC) assumes conflicts are rare and operates in phases:

  1. Read: The transaction reads database values and performs computations, storing changes in a private workspace.
  2. Validate: Before committing, the transaction checks if its operations conflict with other committed transactions that executed concurrently.
  3. Write: If validation succeeds, the transaction’s changes are made permanent. If validation fails, the transaction is rolled back.
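A common way to implement the three OCC phases is with per-record version numbers, sketched below (the Record class and helper names are invented for illustration): the read phase notes the version, work happens privately, and the write phase succeeds only if validation finds the version unchanged.

```python
class Record:
    """A data item with a version counter bumped on every committed write."""
    def __init__(self, value):
        self.value, self.version = value, 0

def begin_read(record):
    # Read phase: remember the version seen, work privately on the value.
    return record.version, record.value

def validate_and_write(record, seen_version, new_value):
    # Validate phase: any committed write since our read changes the version.
    if record.version != seen_version:
        return False                  # conflict: roll back, caller may retry
    # Write phase: make the private result permanent.
    record.value = new_value
    record.version += 1
    return True

r = Record(100)
ts, v = begin_read(r)
print(validate_and_write(r, ts, v + 10))   # True : no concurrent change

ts, v = begin_read(r)
validate_and_write(r, r.version, 999)      # a concurrent transaction commits
print(validate_and_write(r, ts, v + 10))   # False: validation fails, roll back
```

Real systems validate against read and write sets rather than a single record, but the version check captures the core idea: conflicts are detected at commit time instead of being prevented with locks.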

The JOIN Operation

The JOIN operation is fundamental to relational databases. It allows data, which has been normalized and stored efficiently across multiple tables, to be combined based on related columns. Without JOIN, querying across related tables would be impractical, forcing a return to redundant, denormalized flat tables.
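A minimal JOIN sketch via sqlite3 (the two tables are hypothetical), recombining normalized data through the shared dept_id column:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees  (emp_id  INTEGER PRIMARY KEY, name TEXT,
                             dept_id INTEGER REFERENCES departments(dept_id));
    INSERT INTO departments VALUES (1, 'Sales'), (2, 'IT');
    INSERT INTO employees  VALUES (10, 'Ada', 2), (11, 'Ben', 1);
""")
# Recombine the normalized tables on the related column.
rows = con.execute("""
    SELECT e.name, d.name
    FROM employees e
    JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()
print(rows)   # [('Ada', 'IT'), ('Ben', 'Sales')]
```

Each department name is stored once, yet every query that needs it can reattach it on demand.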

Dependency Diagrams

Dependency Diagrams visualize dependencies between attributes within a table. A well-structured diagram (indicating good normalization) typically shows clear dependencies flowing from keys. Poorly structured diagrams may reveal partial or transitive dependencies, often indicating lower normal forms and potential data anomalies.

Key Database Design Principles

  • Explicitly define how NULL values should be handled instead of relying on system defaults.
  • Employ standardized date formats (e.g., ISO 8601) for cross-platform consistency.
  • Implement data validation to prevent storing incorrect or invalid values.
  • Data Integrity is Crucial: Avoid inaccurate data entry and improper default values, as these can lead to significant real-world problems.
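Several of these principles can be enforced declaratively in the schema itself. A sketch using sqlite3 (the invoices table is hypothetical; the GLOB check only enforces the ISO 8601 shape YYYY-MM-DD, not full calendar validity):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE invoices (
        invoice_id INTEGER PRIMARY KEY,
        issued_on  TEXT NOT NULL     -- explicit NULL policy, ISO 8601 shape
                   CHECK (issued_on GLOB
                          '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'),
        amount     REAL NOT NULL CHECK (amount > 0)   -- reject invalid values
    )
""")
con.execute("INSERT INTO invoices VALUES (1, '2024-05-01', 250.0)")   # accepted
try:
    # Non-ISO date format and a negative amount: validation rejects it.
    con.execute("INSERT INTO invoices VALUES (2, '05/01/2024', -5)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Pushing validation into NOT NULL and CHECK constraints means bad values are rejected at the database boundary rather than relying on every application to remember the rules.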