CPU Pipeline Hazards, Cache Optimization & ISA Concepts

Pipeline Hazards Explained

Pipeline hazards prevent the next instruction from executing during its designated clock cycle. There are three main types:

Structural Hazards

Issue Description: Arise from hardware resource conflicts when the hardware cannot support all possible combinations of instructions executing in parallel in the pipeline. This commonly occurs with uneven service rates in pipeline stages or when resources like read/write ports to the register file are not sufficiently duplicated.

Possible Solutions:

  • Stall: Pause the pipeline.
  • Refactor Pipeline: Redesign the pipeline stages or pipeline a complex stage.
  • Duplicate Resource: Add more hardware, such as splitting Instruction/Data caches (I/D caches) to alleviate memory pressure.
  • Instruction Buffers: Add buffers to hold instructions and alleviate memory pressure.

Data Hazards

Issue Description: Arise when an instruction depends on the result of a previous instruction that has not yet completed and written back that result: an instruction in an early pipeline stage tries to read an operand before the older instruction that produces it, still in a later stage, has written it. Out-of-order execution makes these dependences harder to track.
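
For instance, in the C fragment below (an illustrative example, not taken from any particular pipeline), the second statement cannot compute its result until the first has produced a; in a classic five-stage pipeline the second instruction would try to read a before the first has written it back.

    int raw_example(int b, int c, int e) {
        int a = b + c;   /* producer: writes a                                    */
        int d = a + e;   /* consumer: reads a one instruction later (RAW hazard)  */
        return d;
    }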

Possible Solutions:

  • Stall: Pause the dependent instruction until the result is available.
  • Data Forwarding: Allow the result from a later pipeline stage to be fed directly to an earlier stage needing it, bypassing the register file. This is also called bypassing or short-circuiting.
  • Combined Approach: Sometimes, both forwarding and stalling are necessary to resolve all data conflicts.

Control Hazards

Issue Description: Arise from branches or other instructions that change the program counter (PC). The presence of a (conditional) branch alters the sequential flow, and the correct path is often not known until the branch outcome is resolved in a later pipeline stage.

Possible Solutions:

  • Stall: Pause fetching new instructions until the branch outcome is resolved.
  • Delayed Branch: Redefine the branch instruction so it takes effect only after one or more following instructions (in the branch delay slot) have executed.
  • Branch Prediction: Predict the outcome of the branch (statically or dynamically) and speculatively fetch instructions from the predicted path. Correct if wrong.
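
As a rough illustration of dynamic prediction (a minimal sketch, not modeled on any specific processor), a predictor can be a small table of 2-bit saturating counters indexed by low-order bits of the PC; the table size and indexing below are illustrative assumptions.

    #include <stdint.h>

    /* Hypothetical 2-bit saturating-counter branch predictor.
     * Counter values 0-1 predict not taken, 2-3 predict taken. */
    #define BHT_ENTRIES 1024
    static uint8_t bht[BHT_ENTRIES];        /* branch history table, counters start at 0 */

    int predict_taken(uint32_t pc) {
        return bht[(pc >> 2) % BHT_ENTRIES] >= 2;   /* index by low-order PC bits */
    }

    void train(uint32_t pc, int taken) {
        uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
        if (taken  && *c < 3) (*c)++;       /* saturate at 3 (strongly taken)     */
        if (!taken && *c > 0) (*c)--;       /* saturate at 0 (strongly not taken) */
    }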

Cache Lookup Logic

Cache lookup procedures vary based on organization:

  • Fully Associative: Check if the tag part of the address (n) matches the tag of any block in the cache. If a match is found (at index alpha), return the requested word (d) from that block (cache[alpha].word[d]). Otherwise, it’s a cache miss.
  • Direct Mapped: Use the index part of the address (k) to locate a specific cache line. Check if the tag of this line (cache[k].tag) matches the tag part of the address (n). If it matches, return the requested word (d) from that block (cache[k].word[d]). Otherwise, it’s a cache miss.
  • Set Associative: Use the index part of the address (k) to select a specific set. Check if the tag part of the address (n) matches the tag of any block within that set. If a match is found, return the requested word (d) from the matching block. Otherwise, it’s a cache miss.
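
The C sketch below covers the general set-associative case, which subsumes direct mapped (one way per set) and fully associative (a single set). The sizes and bit widths are illustrative assumptions, not taken from the notes above; n, k, and d match the tag, index, and word names used there.

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS  4      /* blocks per set (1 = direct mapped)      */
    #define SETS  64     /* number of sets (1 = fully associative)  */
    #define WORDS 8      /* 4-byte words per block (32-byte blocks) */

    struct block { bool valid; uint32_t tag; uint32_t word[WORDS]; };
    static struct block cache[SETS][WAYS];

    /* Returns true on a hit and writes the word to *data; false on a miss. */
    bool lookup(uint32_t addr, uint32_t *data) {
        uint32_t d = (addr >> 2) % WORDS;        /* word offset within the block */
        uint32_t k = (addr >> 5) % SETS;         /* index: selects the set       */
        uint32_t n = addr >> 11;                 /* tag: remaining high bits     */
        for (int w = 0; w < WAYS; w++)           /* compare the tag in every way */
            if (cache[k][w].valid && cache[k][w].tag == n) {
                *data = cache[k][w].word[d];
                return true;                     /* hit                          */
            }
        return false;                            /* miss                         */
    }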

CPU Performance Metrics

Understanding CPU performance involves several key formulas:

  • CPU Execution Time = Instruction Count × CPI × Clock Cycle Time
  • CPU Execution Time = Instruction Count × (CPI_execution + Memory Stall Cycles per Instruction) × Clock Cycle Time
  • CPU Execution Time = (CPU Clock Cycles + Memory Stall Cycles) × Clock Cycle Time
  • CPU Execution Time = Instruction Count × (CPI_execution + Memory Accesses per Instruction × Miss Rate × Miss Penalty) × Clock Cycle Time
  • CPU Execution Time = Instruction Count × (CPI_execution + Misses per Instruction × Miss Penalty) × Clock Cycle Time
  • CPU Clock Cycles = Instruction Count × CPI
  • Memory Stall Cycles = Number of Misses × Miss Penalty
  • Memory Stall Cycles = Instruction Count × Misses per Instruction × Miss Penalty
  • Memory Stall Cycles = Instruction Count × Memory Accesses per Instruction × Miss Rate × Miss Penalty
  • Memory Stall Cycles = (Instruction Count × Memory Reads per Instruction × Read Miss Rate × Read Miss Penalty) + (Instruction Count × Memory Writes per Instruction × Write Miss Rate × Write Miss Penalty)
  • AMAT (Average Memory Access Time) = Hit Time + (Miss Rate × Miss Penalty)
  • Misses per Instruction = Memory Accesses per Instruction × Miss Rate
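
A small worked example (with made-up machine parameters, purely illustrative) shows how these formulas compose:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical machine parameters -- illustrative numbers only.   */
        double ic            = 1e9;    /* instruction count                */
        double cpi_exec      = 1.2;    /* base CPI, ignoring memory stalls */
        double mem_per_instr = 1.4;    /* memory accesses per instruction  */
        double miss_rate     = 0.03;   /* cache miss rate                  */
        double miss_penalty  = 100.0;  /* cycles per miss                  */
        double cycle_time_ns = 0.5;    /* clock cycle time in nanoseconds  */
        double hit_time      = 1.0;    /* cycles per hit                   */

        /* Memory Stall Cycles per instruction = accesses/instr × miss rate × penalty */
        double stall_per_instr = mem_per_instr * miss_rate * miss_penalty;

        /* CPU Execution Time = IC × (CPI_execution + stalls/instr) × cycle time */
        double exec_time_ns = ic * (cpi_exec + stall_per_instr) * cycle_time_ns;

        /* AMAT = Hit Time + Miss Rate × Miss Penalty (in cycles here) */
        double amat = hit_time + miss_rate * miss_penalty;

        printf("stall cycles/instr = %.2f\n", stall_per_instr);      /* 4.20       */
        printf("execution time     = %.2f ms\n", exec_time_ns / 1e6);/* 2700.00 ms */
        printf("AMAT               = %.2f cycles\n", amat);          /* 4.00       */
        return 0;
    }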

Cache Memory Optimizations

Basic Cache Optimizations

  • Larger Block Sizes: Reduces the compulsory miss rate by exploiting spatial locality. However: increases the miss penalty (larger transfers) and, for a fixed cache size, can increase capacity and conflict misses because fewer blocks fit in the cache, so useful blocks are evicted sooner.
  • Larger Caches: Reduces capacity miss rate. However: Increases hit time (longer search) and power consumption.
  • Higher Associativity: Reduces conflict miss rate. However: Increases hit time (more complex comparison) and power consumption.
  • Multi-level Caches: Reduces overall memory access time by adding a small, fast L1 cache backed by larger, slower L2/L3 caches, which reduces the effective miss penalty. Formula: Effective AMAT ≈ Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2); a worked numeric sketch follows this list. Reduces power consumption compared to a single large cache.
  • Prioritize Read Misses Over Writes: Reduces miss penalty since read misses often stall the CPU directly. Write misses can often be buffered. Little effect on power consumption.
  • Avoid Address Translation During Cache Indexing: Using virtual addresses for cache indexing (VIVT, VIPT) can reduce hit time by avoiding the TLB lookup on the critical path.
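
For the multi-level formula above, a quick numeric sketch with illustrative (non-sourced) values:

    #include <stdio.h>

    int main(void) {
        /* Illustrative two-level cache numbers, not from the source. */
        double hit_l1 = 1.0,  miss_rate_l1 = 0.05;   /* L1: 1-cycle hit, 5% misses */
        double hit_l2 = 10.0, miss_rate_l2 = 0.20;   /* L2 local miss rate         */
        double penalty_l2 = 200.0;                   /* an L2 miss goes to memory  */

        /* Effective AMAT = Hit_L1 + MissRate_L1 × (Hit_L2 + MissRate_L2 × Penalty_L2) */
        double amat = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * penalty_l2);
        printf("effective AMAT = %.2f cycles\n", amat);  /* 1 + 0.05×(10+40) = 3.50 */
        return 0;
    }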

Categories of Optimization

  • Reducing Hit Time: Small and simple caches, way prediction. (Generally decreases power consumption).
  • Increasing Cache Bandwidth: Pipelined caches, multi-banked caches, non-blocking caches. (Varying impact on power).
  • Reducing Miss Penalty: Critical word first, merging write buffers. (Little impact on power).
  • Reducing Miss Rate: Compiler optimizations.
  • Reducing Miss Penalty or Miss Rate via Parallelism: Hardware prefetching, compiler prefetching. (Generally increases power consumption).

Ten Advanced Cache Optimizations

  1. Small and Simple L1 Caches: Reduces hit time and power. The critical timing path involves tag memory access, tag comparison, and set selection. Direct-mapped caches can overlap tag comparison and data transmission. Lower associativity reduces power (fewer lines accessed). Access time increases with size, but hit ratio improves. Energy per read also varies with size and associativity; higher associativity might be better for energy efficiency with larger caches if it significantly reduces misses.
  2. Way Prediction: Reduces hit time by predicting which way (block) of a set will contain the data, so only that block is tried first. Predictor bits select the block to try first; a misprediction increases the hit time. Prediction accuracy is typically high (>90% for 2-way, >80% for 4-way set associative) and is usually better for I-caches than D-caches. It keeps the lower conflict-miss rate of a set-associative cache while approaching the hit time of a direct-mapped one.
  3. Pipelined Cache Access: Increases cache bandwidth (throughput) by allowing multiple accesses to be in progress simultaneously, though the latency of a single access may grow by a cycle or more. Examples of cache access latency: Pentium (1 cycle), Pentium Pro through Pentium III (2 cycles), Pentium 4 through Core i7 (4 cycles). Faster clocks offset the added cycles. Pipelining makes higher associativity easier to implement but increases the branch misprediction penalty.
  4. Non-blocking Caches: Increases cache bandwidth by allowing the cache to continue servicing hits while a miss is being handled (“hit under miss”, “hit under multiple miss”). Requires L2 cache support. Processors can often hide L1 miss penalty but not L2 miss penalty effectively.
  5. Multi-banked Caches: Increases cache bandwidth by organizing the cache into independent banks that can handle accesses simultaneously. Banks are often interleaved based on block addresses. Examples: ARM Cortex-A8 (1-4 L2 banks), Intel Core i7 (4 L1 banks, 8 L2 banks).
  6. Critical Word First and Early Restart: Reduces miss penalty. The processor often needs only one word from a cache block initially.
    • Critical Word First: Request the missed word from memory first. Send it to the processor immediately upon arrival. The processor continues execution while the rest of the block is loaded into the cache.
    • Early Restart: Request block words in normal order. Send the missed word to the processor as soon as it arrives. The processor continues while the rest of the block is loaded.
    Effectiveness depends on block size and likelihood of accessing other words in the block soon.
  7. Merging Write Buffers: Reduces miss penalty associated with writes. If a store targets a block already pending in the write buffer, update the existing buffer entry instead of creating a new one. The processor continues while the buffer handles writing to memory. If the buffer is full, the pipeline may stall. Not applicable to I/O addresses.
  8. Compiler Optimizations: Software techniques to improve cache performance.
    • Loop Interchange: Reorder nested loops to access memory sequentially (improving spatial locality). Example: Change from column-major access to row-major access in C for a 2D array.
    • Blocking (Tiling): Instead of processing entire rows or columns of large matrices, operate on smaller submatrices (blocks) that fit in the cache, improving temporal and spatial locality. A short C sketch of both transformations follows this list.
  9. Hardware Prefetching: Reduces miss penalty or miss rate by fetching data into the cache before the processor requests it. Example: Fetch the next sequential block along with the requested block on a miss.
  10. Compiler Controlled Prefetching: Reduces miss penalty or miss rate by having the compiler insert special prefetch instructions ahead of the actual data use.
    • Register Prefetch: Loads data into a register.
    • Cache Prefetch: Loads data into the cache only. (A sketch using a compiler prefetch hint follows below.)
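
The sketch below illustrates the two compiler optimizations from item 8; the array size and tile size are illustrative choices, not tuned recommendations.

    #define N 1024
    double a[N][N], b[N][N];

    /* Loop interchange: C stores a[i][j] row-major, so making j the inner
     * loop walks memory sequentially and improves spatial locality. */
    void interchange(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)     /* inner loop strides by 1 element */
                a[i][j] = 2.0 * a[i][j];
    }

    /* Blocking (tiling): work on B x B submatrices so each tile of a and b
     * stays cache-resident while it is reused (here: a naive transpose). */
    #define B 32
    void blocked_transpose(void) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        b[j][i] = a[i][j];
    }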

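For item 10, GCC and Clang expose a cache-prefetch hint, __builtin_prefetch; the sketch below uses it with an arbitrary prefetch distance (an illustrative guess, not a tuned value).

    #include <stddef.h>

    /* Sum an array while hinting future elements into the cache.
     * __builtin_prefetch(addr, rw, locality) is a non-binding, non-faulting
     * cache prefetch; 16 elements ahead is an assumed, untuned distance. */
    double sum_with_prefetch(const double *x, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&x[i + 16], 0 /* read */, 1 /* low temporal locality */);
            s += x[i];
        }
        return s;
    }
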
Instruction Set Architectures (ISA) Comparison

Stack Architecture

  • Pros: Good code density (implicit operand addressing); Low hardware requirements; Simple compiler design.
  • Cons: Limited parallelism/pipelining; Stack top bottleneck (data not always ready, needs extra instructions like TOP/SWAP); Difficult for optimizing compilers.

Accumulator Architecture

  • Pros: Very low hardware requirements; Easy to design and understand.
  • Cons: Accumulator is a bottleneck; Limited parallelism/pipelining; High memory traffic.

Memory-Memory Architecture

  • Pros: Requires fewer instructions (especially with 3 operands); Easy compiler design (especially with 3 operands).
  • Cons: Very high memory traffic (especially with 3 operands); Variable clocks per instruction (CPI varies); More complex hardware control.

Memory-Register Architecture

  • Pros: Data can stay in registers; Easy instruction format encoding (using addressing modes); Good code density.
  • Cons: Variable clocks per instruction; encoding a memory address in each instruction leaves fewer bits for register specifiers, which can limit the number of general-purpose registers.

Load-Store (Register-Register) Architecture

  • Pros: Simple, fixed-length instruction encoding; Instructions take similar cycle counts; Relatively easy to pipeline; Promotes compiler optimization.
  • Cons: Higher instruction count; Not all instructions naturally need three register operands; Performance heavily depends on a good compiler.

ISA Notes

Memory-Register and Load-Store are the most common ISAs in modern processors. Load-Store (also called Register-Register for arithmetic operations) is generally the fastest due to pipelining efficiency. Registers are specified by small addresses (e.g., 3-6 bits for 8-64 registers). Load/Store instructions typically have one memory address and one register address. Arithmetic instructions often have three register addresses.

Role of Registers

Advantages

  • Faster access than cache (no complex addressing modes or tags needed).
  • Deterministic access time (no misses).
  • Can be replicated to provide multiple read ports for parallelism.
  • Identified by short identifiers (e.g., 3 bits for 8 registers, 8 bits for 256 registers).
  • Reduce memory traffic significantly.

Disadvantages

  • Must be saved and restored by software during procedure calls and context switches (overhead).
  • Cannot typically take the address of a register (cannot be pointed to directly).
  • Fixed size (cannot efficiently store large structures or strings).
  • Require explicit management by the compiler.