Advanced Computer Architecture: Optimizing Performance and Efficiency

Amdahl’s Law

  • Formula: S = 1 / ((1 – f) + (f / S_enhanced)), where f is the fraction of execution time that benefits from the enhancement, S_enhanced is the speedup of that fraction, and S is the overall system speedup.
    • Perfect parallelism is limited by serial components.
    • Small sequential portions dominate performance for large N.
    • Example: For f = 0.9 and S_enhanced = 10, S = 1 / (0.1 + 0.09) ≈ 5.26.
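The formula and the example above can be checked with a few lines of Python (a minimal sketch; the function name is my own):

```python
def amdahl_speedup(f, s_enhanced):
    """Overall speedup when a fraction f of execution time is sped up by s_enhanced."""
    return 1.0 / ((1.0 - f) + f / s_enhanced)

# Example from the notes: f = 0.9, S_enhanced = 10
print(round(amdahl_speedup(0.9, 10), 2))   # 5.26

# The serial fraction dominates: even with an infinite enhancement,
# speedup is capped at 1 / (1 - f) = 10 here.
print(round(amdahl_speedup(0.9, 1e12), 2))
```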

Response Time

  • Definition: Time elapsed from the start to completion of a task.
  • Includes:
    • Disk access, memory latency, I/O, and OS overhead.
  • Formula: T_response = T_service + T_queuing

Instruction Set Design

  • CISC:
    • Characteristics: Complex multi-cycle instructions, variable-length formats, memory-to-memory operand access.
    • Pros: Compact code, easier for programmers.
    • Cons: Harder to pipeline, higher CPI.
  • RISC:
    • Characteristics: Single-cycle instructions, fixed-length formats, load/store architecture.
    • Pros: Optimized for pipelining, lower CPI.
    • Cons: Larger code size, compiler-dependent optimization.

Pipelining

  • Concurrency:
    • Definition: Overlapping instruction execution to improve throughput.
    • Instruction Cycle Phases:
      1. Instruction Fetch (IF): Fetch opcode and increment PC.
      2. Instruction Decode (ID): Decode opcode, identify operands.
      3. Execute (EX): Perform operation.
      4. Memory Access (MEM): Load/store operations.
      5. Write Back (WB): Update destination register.
Performance:
  • Speedup: S = Latency_unpipelined / Latency_pipelined
  • Example: For n instructions on a k-stage pipeline (in cycles): T_unpipelined = n × k vs. T_pipelined = k + (n – 1)
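The cycle counts above can be sketched directly (function names are illustrative):

```python
def unpipelined_cycles(n, k):
    # Each of n instructions occupies all k stage-times sequentially.
    return n * k

def pipelined_cycles(n, k):
    # k cycles to fill the pipeline, then one instruction completes per cycle.
    return k + (n - 1)

n, k = 100, 5
print(unpipelined_cycles(n, k))   # 500
print(pipelined_cycles(n, k))     # 104
# Speedup approaches k as n grows large
print(round(unpipelined_cycles(n, k) / pipelined_cycles(n, k), 2))  # 4.81
```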
Pipeline Hazards:
  • Structural: Limited resources cause conflicts.
    • Example: Two instructions needing the same ALU simultaneously.
    • Solution: Add more functional units.
  • Data Hazards:
    • RAW (Read After Write): An instruction tries to read a value before an earlier instruction has written it (true dependence).
    • WAR (Write After Read): An instruction writes a location before an earlier instruction has read it (anti-dependence).
    • WAW (Write After Write): Two instructions write the same location, and the writes must complete in program order (output dependence).
    • Solutions: Forwarding, stalling.
  • Control Hazards:
    • Example: Branch instruction causes pipeline flush.
    • Solutions: Branch prediction, delay slots, speculative execution.
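The three data-dependence types can be spotted mechanically by comparing destination and source registers of nearby instructions. A toy sketch (the tuple encoding and the `window` parameter are my own simplification, not a real pipeline model):

```python
def classify_hazards(instrs, window=2):
    """Report RAW/WAR/WAW pairs among instructions close enough to overlap.
    Each instruction is (dest_register, [source_registers])."""
    hazards = []
    for i, (d1, srcs1) in enumerate(instrs):
        for j in range(i + 1, min(i + window + 1, len(instrs))):
            d2, srcs2 = instrs[j]
            if d1 in srcs2:
                hazards.append((i, j, "RAW"))   # j reads what i writes
            if d2 in srcs1:
                hazards.append((i, j, "WAR"))   # j overwrites what i reads
            if d1 == d2:
                hazards.append((i, j, "WAW"))   # both write the same register
    return hazards

prog = [("r1", ["r2", "r3"]),   # r1 = r2 + r3
        ("r4", ["r1", "r5"]),   # r4 = r1 - r5  -> RAW on r1
        ("r2", ["r6", "r6"])]   # r2 = r6 * r6  -> WAR on r2 vs. instruction 0
print(classify_hazards(prog))   # [(0, 1, 'RAW'), (0, 2, 'WAR')]
```

In hardware, the RAW case is what forwarding paths and stalls resolve; WAR and WAW disappear once registers are renamed.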
Multifunction Systems:
  • Support dynamic reconfiguration for different operations.
  • Example: The TI-ASC pipeline switches between arithmetic and logic functions.

Memory Organization

  • Key Metrics:
    • Latency: Time between request and response.
      • Decomposed into: T_access + T_transfer.
    • Bandwidth: Number of requests satisfied per unit time.
      • Formula: BW = Requests / Time.
  • Memory Hierarchy:
    • Design Goals:
      • Maximize temporal and spatial locality.
      • Minimize latency through caching.
    • Levels:
      | Level        | Size        | Latency  | Cost/bit |
      |--------------|-------------|----------|----------|
      | Registers    | < 1 KB      | < 1 ns   | $$$$     |
      | L1 Cache     | 16–64 KB    | 1–5 ns   | $$$      |
      | L2 Cache     | 256 KB–4 MB | 10–20 ns | $$       |
      | Main Memory  | GB–TB       | 100+ ns  | $        |
      | Disk Storage | TB+         | 10 ms+   | ~0       |
Cache Mapping:
  • Direct Mapping:
    • Simple but prone to conflicts.
    • Example: Cache index = (memory block number) mod (number of cache blocks).
  • Associative Mapping:
    • Flexible but hardware-intensive.
    • Any memory block can map to any cache block.
  • Set-Associative Mapping:
    • Trade-off between flexibility and simplicity.
    • Memory block maps to one set, but can use any line within the set.
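The three schemes differ only in how a block address is reduced to a cache location. A minimal sketch of the index arithmetic (function names and cache sizes are example choices):

```python
def direct_mapped_index(block_addr, num_blocks):
    # Each memory block maps to exactly one cache block.
    return block_addr % num_blocks

def set_associative_set(block_addr, num_sets):
    # The block maps to one set but may occupy any way (line) within it.
    return block_addr % num_sets

# 8-block direct-mapped cache: memory blocks 12 and 20 conflict on index 4.
print(direct_mapped_index(12, 8), direct_mapped_index(20, 8))  # 4 4
# Same cache organized 2-way set-associative (4 sets): both map to set 0,
# but can now coexist in the set's two ways instead of evicting each other.
print(set_associative_set(12, 4), set_associative_set(20, 4))  # 0 0
```

Fully associative mapping is the `num_sets = 1` limiting case: every block competes for every line, at the cost of comparing all tags in parallel.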

Superscalar Organization

  • Pipeline Width: Ability to fetch and execute multiple instructions per cycle.
  • Out-of-Order Execution:
    • Decouples fetch, decode, and execute stages.
    • Maintains program order through Reorder Buffers (ROB).
  • Template:
    • Instruction Fetch:
      • Fetch n instructions per cycle.
      • Use I-cache with aligned cache blocks.
    • Instruction Decode:
      • Pre-decoding for immediate operand recognition.
    • Reservation Stations:
      • Hold instructions until dependencies are resolved.
    • Execution:
      • Integer units, floating-point units, load/store units.
    • Retirement:
      • Commit results to registers in program order.
      • Handle precise interrupts for exceptions.
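The retirement rule above — complete in any order, commit strictly in program order — can be sketched with a queue (a toy model; a real ROB also holds results, exception bits, and rename state):

```python
from collections import deque

class ReorderBuffer:
    """Minimal sketch: instructions enter in program order, may finish
    out of order, but results commit only from the head of the queue."""
    def __init__(self):
        self.entries = deque()

    def issue(self, tag):
        entry = {"tag": tag, "done": False}
        self.entries.append(entry)   # allocate at the tail, in program order
        return entry

    def complete(self, entry):
        entry["done"] = True         # execution may finish in any order

    def commit(self):
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["tag"])
        return committed

rob = ReorderBuffer()
a, b, c = (rob.issue(t) for t in "abc")
rob.complete(b)          # b finishes first, out of order
print(rob.commit())      # [] -- a is not done, so nothing may retire yet
rob.complete(a)
rob.complete(c)
print(rob.commit())      # ['a', 'b', 'c'] in program order
```

Because nothing past an incomplete entry ever commits, an exception at the head can discard all younger entries, which is what makes interrupts precise.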

Improving Performance

Maximizing Throughput

  • Branch Prediction:
    • Static: Fixed outcome (e.g., always taken).
    • Dynamic:
      • Based on history.
      • 2-Bit Predictors:
        • State machine: Strongly/weakly taken or not taken.
        • Improves accuracy over simple prediction.
  • Register Renaming:
    • Solves WAW/WAR hazards.
    • Allocates unique physical registers for logical names.
  • Tomasulo Algorithm:
    • Hardware mechanism for dynamic scheduling.
    • Resolves dependencies using reservation stations and common data buses.
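The 2-bit saturating counter mentioned above can be sketched directly (a toy model; real predictors index a table of such counters by branch address, and the initial state here is assumed):

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: states 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self, state=3):
        self.state = state   # assume warmed up to strongly-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]   # loop branch with one exit
hits = 0
for t in outcomes:
    hits += (p.predict() == t)
    p.update(t)
print(hits, "of", len(outcomes), "predicted correctly")   # 4 of 5
```

The single not-taken outcome only moves the counter from strongly- to weakly-taken, so the predictor still guesses taken afterwards; a 1-bit predictor would flip and mispredict the following iteration as well (3 of 5 here).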

Reducing Access Gap

  • Load/Store Bypassing:
    • Let loads bypass earlier, non-conflicting stores, and forward buffered store data directly to dependent loads.
  • Memory Interleaving:
    • Split memory into multiple banks to allow simultaneous access.
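With low-order interleaving, consecutive words land in consecutive banks, so a burst of sequential accesses can proceed in parallel. A sketch of the address arithmetic (bank count and word size are example values):

```python
def bank_of(addr, num_banks, word_bytes=4):
    # Low-order interleaving: the bank is selected by the low bits
    # of the word address, so word i goes to bank i mod num_banks.
    return (addr // word_bytes) % num_banks

# Sequential word addresses spread round-robin across 4 banks, so a
# 4-word burst touches each bank once and the accesses can overlap.
addrs = [0, 4, 8, 12, 16]
print([bank_of(a, 4) for a in addrs])   # [0, 1, 2, 3, 0]
```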

Case Studies

  • PowerPC 620:
    • Features:
      • 64-bit architecture.
      • Out-of-order execution with speculative branch prediction.
      • 32 general-purpose and 32 floating-point registers.
      • Separate L1 instruction and data caches.
  • Intel P6:
    • Innovations:
      • Dynamic execution (speculative and out-of-order).
      • Advanced branch prediction with branch target buffers (BTBs) and two-level adaptive predictors.
      • Superpipelining with more than 10 stages.