Advanced Computer Architecture: Optimizing Performance and Efficiency
Posted on Dec 16, 2024 in Computers
Amdahl’s Law
- Formula: S = 1 / ((1 – f) + (f / Senhanced)), where f is the fraction of execution time that benefits from the enhancement, S is the overall system speedup, and Senhanced is the speedup of the enhanced portion.
- Overall speedup is limited by the serial (unenhanced) portion: as Senhanced → ∞, S approaches 1 / (1 – f).
- Even a small sequential portion dominates performance for large enhancement factors (e.g., large processor counts).
- Example: For f = 0.9 and Senhanced = 10, S = 5.26.
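The example above can be checked directly; here is a minimal sketch (the function name is illustrative):

```python
def amdahl_speedup(f, s_enhanced):
    """Overall speedup when a fraction f of execution time
    is accelerated by a factor of s_enhanced (Amdahl's Law)."""
    return 1.0 / ((1.0 - f) + (f / s_enhanced))

# The example from the notes: 90% of the work sped up 10x.
print(round(amdahl_speedup(0.9, 10), 2))    # 5.26
# Even a near-infinite enhancement is capped by the serial 10%:
print(round(amdahl_speedup(0.9, 1e12), 2))  # 10.0
```

Note how the second call shows the limiting behavior: the serial 10% caps the speedup at 1 / 0.1 = 10.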
Response Time
- Definition: Time elapsed from the start to completion of a task.
- Includes:
- Disk access, memory latency, I/O, and OS overhead.
- Formula: Tresponse = Tservice + Tqueuing
Instruction Set Design
- CISC:
- Characteristics: Complex, multi-cycle instructions; variable-length formats; memory operands (memory-to-memory execution).
- Pros: Compact code, easier for programmers.
- Cons: Harder to pipeline, higher CPI.
- RISC:
- Characteristics: Single-cycle instructions, fixed-length formats, load/store architecture.
- Pros: Optimized for pipelining, lower CPI.
- Cons: Larger code size, compiler-dependent optimization.
Pipelining
- Concurrency:
- Definition: Overlapping instruction execution to improve throughput.
- Instruction Cycle Phases:
- Instruction Fetch (IF): Fetch opcode and increment PC.
- Instruction Decode (ID): Decode opcode, identify operands.
- Execute (EX): Perform operation.
- Memory Access (MEM): Load/store operations.
- Write Back (WB): Update destination register.
Performance:
- Speedup S = LatencyUnpipelined / LatencyPipelined
- Example: For n instructions and k-stage pipeline: Tunpipelined = n × k vs. Tpipelined = k + (n – 1)
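The cycle counts above can be verified numerically; this sketch assumes an ideal pipeline with no stalls:

```python
def pipeline_cycles(n, k):
    """Cycles for n instructions: unpipelined vs. ideal k-stage pipeline."""
    unpipelined = n * k       # each instruction occupies all k stages serially
    pipelined = k + (n - 1)   # fill the pipe once, then one completion per cycle
    return unpipelined, pipelined

u, p = pipeline_cycles(100, 5)
print(u, p, round(u / p, 2))  # 500 104 4.81
```

For large n the speedup approaches k, which is why deeper pipelines raise throughput (until hazards intervene).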
Pipeline Hazards:
- Structural: Limited resources cause conflicts.
- Example: Two instructions needing the same ALU simultaneously.
- Solution: Add more functional units.
- Data Hazards:
- RAW (Read After Write): Instruction reads data before it’s written.
- WAR (Write After Read): Instruction writes data before it’s read.
- WAW (Write After Write): Two instructions write to the same location.
- Solutions: Forwarding, stalling.
- Control Hazards:
- Example: Branch instruction causes pipeline flush.
- Solutions: Branch prediction, delay slots, speculative execution.
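The RAW case can be illustrated with a toy dependence checker. The instruction encoding and the `distance` window below are invented for illustration, not a real ISA:

```python
# Each instruction is (dest_register, [source_registers]).
def find_raw_hazards(instrs, distance=2):
    """Flag RAW hazards: a read of a register written fewer than
    `distance` instructions earlier, i.e. before the write would
    complete without forwarding or stalling."""
    hazards = []
    for i, (dest, _) in enumerate(instrs):
        for j in range(i + 1, min(i + 1 + distance, len(instrs))):
            _, srcs = instrs[j]
            if dest in srcs:
                hazards.append((i, j, dest))
    return hazards

prog = [("r1", ["r2", "r3"]),   # r1 = r2 + r3
        ("r4", ["r1", "r5"]),   # reads r1 one slot later -> RAW
        ("r6", ["r7", "r8"])]
print(find_raw_hazards(prog))   # [(0, 1, 'r1')]
```

A real pipeline resolves such pairs with forwarding paths or by inserting stall cycles, as listed above.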
Multifunction Systems:
- Support dynamic reconfiguration for different operations.
- Example: The TI-ASC pipeline switches between arithmetic and logic functions.
Memory Organization
- Key Metrics:
- Latency: Time between request and response.
- Decomposed into: Taccess + Ttransfer.
- Bandwidth: Number of requests satisfied per unit time.
- Formula: BW = Requests / Time.
- Memory Hierarchy:
- Design Goals:
- Maximize temporal and spatial locality.
- Minimize latency through caching.
- Levels:
Level | Size | Latency | Cost/bit
--- | --- | --- | ---
Registers | < 1 KB | < 1 ns | $$$$
L1 Cache | 16–64 KB | 1–5 ns | $$$
L2 Cache | 256 KB–4 MB | 10–20 ns | $$
Main Memory | GB–TB | 100+ ns | $
Disk Storage | TB+ | 10 ms+ | ~0
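To see why caching minimizes average latency, here is a rough expected-latency sketch over such a hierarchy. The hit rates and latencies are illustrative numbers, not measurements:

```python
def average_access_time(levels):
    """Expected access latency for a hierarchy given as
    [(hit_rate, latency_ns), ...] from fastest to slowest;
    the last level is assumed to always hit."""
    t, reach = 0.0, 1.0
    for hit_rate, latency in levels:
        t += reach * hit_rate * latency   # fraction of accesses served here
        reach *= (1.0 - hit_rate)         # fraction that falls through
    return t

# Illustrative: 95% L1 hits at 2 ns, 90% of the remainder hit L2
# at 15 ns, the rest goes to main memory at 100 ns.
print(round(average_access_time([(0.95, 2), (0.90, 15), (1.0, 100)]), 3))  # 3.075
```

With good locality, the average stays close to L1 latency even though main memory is ~50x slower.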
Cache Mapping:
- Direct Mapping:
- Simple but prone to conflicts.
- Example: Cache Line = (Memory Block Number) mod (Number of Cache Lines).
- Associative Mapping:
- Flexible but hardware-intensive.
- Any memory block can map to any cache block.
- Set-Associative Mapping:
- Trade-off between flexibility and simplicity.
- Memory block maps to one set, but can use any line within the set.
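A minimal sketch of the index calculations for the three schemes, assuming block-number addressing (offset bits already stripped); fully associative caches need no index at all:

```python
def direct_mapped_line(block, num_lines):
    return block % num_lines   # exactly one legal line per block

def set_associative_set(block, num_sets):
    return block % num_sets    # one set, but any way within it

NUM_LINES = 8
# Direct mapping: blocks 3 and 11 conflict in an 8-line cache...
print(direct_mapped_line(3, NUM_LINES), direct_mapped_line(11, NUM_LINES))  # 3 3
# ...but with 4 sets of 2 ways each, they share set 3 and can coexist.
print(set_associative_set(3, 4), set_associative_set(11, 4))  # 3 3
```

This is the trade-off stated above: set-associativity keeps the cheap modulo index while tolerating a limited number of conflicting blocks per set.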
Superscalar Organization
- Pipeline Width: Ability to fetch and execute multiple instructions per cycle.
- Out-of-Order Execution:
- Decouples fetch, decode, and execute stages.
- Maintains program order through Reorder Buffers (ROB).
- Template:
- Instruction Fetch:
- Fetch n instructions per cycle.
- Use I-cache with aligned cache blocks.
- Instruction Decode:
- Pre-decoding for immediate operand recognition.
- Reservation Stations:
- Hold instructions until dependencies are resolved.
- Execution:
- Integer units, floating-point units, load/store units.
- Retirement:
- Commit results to registers in program order.
- Handle precise interrupts for exceptions.
Improving Performance
Maximizing Throughput
- Branch Prediction:
- Static: Fixed outcome (e.g., always taken).
- Dynamic:
- Based on history.
- 2-Bit Predictors:
- State machine: Strongly/weakly taken or not taken.
- Improves accuracy over simple prediction.
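The 2-bit scheme can be sketched as a toy saturating-counter class (names are illustrative):

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0 = strongly not-taken ... 3 = strongly taken.
    Two consecutive mispredictions are needed to flip the prediction."""
    def __init__(self):
        self.state = 1  # weakly not-taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]  # loop branch with one early exit
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "/", len(outcomes))  # 3 / 5
```

Note that the single not-taken outcome does not flip the prediction, which is exactly the hysteresis that improves accuracy on loop branches over a 1-bit predictor.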
- Register Renaming:
- Solves WAW/WAR hazards.
- Allocates unique physical registers for logical names.
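A toy renaming pass (the (dest, sources) instruction encoding is invented for illustration): every write gets a fresh physical register, so WAW/WAR dependences between architectural names disappear:

```python
def rename(instrs):
    """instrs: list of (dest, [sources]) over architectural registers.
    Returns the same program over unique physical registers."""
    mapping = {}    # architectural name -> current physical register
    next_phys = 0
    out = []
    for dest, srcs in instrs:
        phys_srcs = [mapping.get(s, s) for s in srcs]  # read current names
        mapping[dest] = f"p{next_phys}"                # fresh name per write
        next_phys += 1
        out.append((mapping[dest], phys_srcs))
    return out

# r1 is written twice; after renaming the two writes target p0 and p2,
# so only the true (RAW) dependence p0 -> p1 remains.
prog = [("r1", ["r2"]), ("r3", ["r1"]), ("r1", ["r4"])]
print(rename(prog))  # [('p0', ['r2']), ('p1', ['p0']), ('p2', ['r4'])]
```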
- Tomasulo Algorithm:
- Hardware mechanism for dynamic scheduling.
- Resolves dependencies using reservation stations and common data buses.
Reducing Access Gap
- Load/Store Bypassing:
- Directly forward results to dependent instructions.
- Memory Interleaving:
- Split memory into multiple banks to allow simultaneous access.
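Low-order interleaving can be sketched in a few lines (the bank count is illustrative):

```python
# Consecutive block addresses rotate across banks, so a sequential
# access stream keeps all banks busy simultaneously.
def bank_of(block_addr, num_banks):
    return block_addr % num_banks

NUM_BANKS = 4
# A sequential stream touches banks 0,1,2,3,0,1,... so up to
# NUM_BANKS requests can be in flight at once.
print([bank_of(a, NUM_BANKS) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```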
Case Studies
- PowerPC 620:
- Features:
- 64-bit architecture.
- Out-of-order execution with speculative branch prediction.
- 32 general-purpose and 32 floating-point registers.
- Separate L1 caches for instructions and data.
- Intel P6:
- Innovations:
- Dynamic execution (speculative and out-of-order).
- Advanced branch prediction with BTBs and 2-level predictors.
- Superpipelining with more than 10 stages.