Advanced Computer Architecture: Optimizing Performance and Efficiency
Posted on Dec 16, 2024 in Computers
Amdahl’s Law
- Formula: S = 1 / ((1 – f) + (f / Senhanced)), where f is the fraction of execution time that benefits from the enhancement, S is the overall system speedup, and Senhanced is the speedup of the enhanced portion.
- Overall speedup is limited by the serial (unenhanced) portion: as Senhanced → ∞, S approaches 1 / (1 – f).
- Even a small sequential portion dominates performance for large enhancement factors (e.g., large processor counts).
- Example: For f = 0.9 and Senhanced = 10, S = 5.26.
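The example above can be checked directly; here is a minimal sketch (the function name is illustrative):

```python
def amdahl_speedup(f, s_enhanced):
    """Overall speedup when a fraction f of execution time
    is accelerated by a factor of s_enhanced (Amdahl's Law)."""
    return 1.0 / ((1.0 - f) + (f / s_enhanced))

# The example from the notes: 90% of the work sped up 10x.
print(round(amdahl_speedup(0.9, 10), 2))    # 5.26
# Even a near-infinite enhancement is capped by the serial 10%:
print(round(amdahl_speedup(0.9, 1e12), 2))  # 10.0
```

Note how the second call shows the limiting behavior: the serial 10% caps the speedup at 1 / 0.1 = 10.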
Response Time
- Definition: Time elapsed from the start to completion of a task.
- Includes:
- Disk access, memory latency, I/O, and OS overhead.
- Formula: Tresponse = Tservice + Tqueuing
Instruction Set Design
- CISC:
- Characteristics: Complex, multi-cycle instructions; variable-length formats; memory operands (memory-to-memory execution).
- Pros: Compact code, easier for programmers.
- Cons: Harder to pipeline, higher CPI.
- RISC:
- Characteristics: Single-cycle instructions, fixed-length formats, load/store architecture.
- Pros: Optimized for pipelining, lower CPI.
- Cons: Larger code size, compiler-dependent optimization.
Pipelining
- Concurrency:
- Definition: Overlapping instruction execution to improve throughput.
- Instruction Cycle Phases:
- Instruction Fetch (IF): Fetch opcode and increment PC.
- Instruction Decode (ID): Decode opcode, identify operands.
- Execute (EX): Perform operation.
- Memory Access (MEM): Load/store operations.
- Write Back (WB): Update destination register.
Performance:
- Speedup S = LatencyUnpipelined / LatencyPipelined
- Example: For n instructions and k-stage pipeline: Tunpipelined = n × k vs. Tpipelined = k + (n – 1)
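The cycle counts above can be verified numerically; this sketch assumes an ideal pipeline with no stalls:

```python
def pipeline_cycles(n, k):
    """Cycles for n instructions: unpipelined vs. ideal k-stage pipeline."""
    unpipelined = n * k       # each instruction occupies all k stages serially
    pipelined = k + (n - 1)   # fill the pipe once, then one completion per cycle
    return unpipelined, pipelined

u, p = pipeline_cycles(100, 5)
print(u, p, round(u / p, 2))  # 500 104 4.81
```

For large n the speedup approaches k, which is why deeper pipelines raise throughput (until hazards intervene).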
Pipeline Hazards:
- Structural: Limited resources cause conflicts.
- Example: Two instructions needing the same ALU simultaneously.
- Solution: Add more functional units.
- Data Hazards:
- RAW (Read After Write): Instruction reads data before it’s written.
- WAR (Write After Read): Instruction writes data before it’s read.
- WAW (Write After Write): Two instructions write to the same location.
- Solutions: Forwarding, stalling.
- Control Hazards:
- Example: Branch instruction causes pipeline flush.
- Solutions: Branch prediction, delay slots, speculative execution.
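The RAW case can be illustrated with a toy dependence checker. The instruction encoding and the `distance` window below are invented for illustration, not a real ISA:

```python
# Each instruction is (dest_register, [source_registers]).
def find_raw_hazards(instrs, distance=2):
    """Flag RAW hazards: a read of a register written fewer than
    `distance` instructions earlier, i.e. before the write would
    complete without forwarding or stalling."""
    hazards = []
    for i, (dest, _) in enumerate(instrs):
        for j in range(i + 1, min(i + 1 + distance, len(instrs))):
            _, srcs = instrs[j]
            if dest in srcs:
                hazards.append((i, j, dest))
    return hazards

prog = [("r1", ["r2", "r3"]),   # r1 = r2 + r3
        ("r4", ["r1", "r5"]),   # reads r1 one slot later -> RAW
        ("r6", ["r7", "r8"])]
print(find_raw_hazards(prog))   # [(0, 1, 'r1')]
```

A real pipeline resolves such pairs with forwarding paths or by inserting stall cycles, as listed above.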
Multifunction Systems:
- Support dynamic reconfiguration for different operations.
- Example: The TI-ASC pipeline switches between arithmetic and logic functions.
Memory Organization
- Key Metrics:
- Latency: Time between request and response.
- Decomposed into: Taccess + Ttransfer.
- Bandwidth: Number of requests satisfied per unit time.
- Formula: BW = Requests / Time.
- Memory Hierarchy:
- Design Goals:
- Maximize temporal and spatial locality.
- Minimize latency through caching.
- Levels:
Level | Size | Latency | Cost/bit
--- | --- | --- | ---
Registers | < 1 KB | < 1 ns | $$$$
L1 Cache | 16–64 KB | 1–5 ns | $$$
L2 Cache | 256 KB–4 MB | 10–20 ns | $$
Main Memory | GB–TB | 100+ ns | $
Disk Storage | TB+ | 10 ms+ | ~0
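To see why caching minimizes average latency, here is a rough expected-latency sketch over such a hierarchy. The hit rates and latencies are illustrative numbers, not measurements:

```python
def average_access_time(levels):
    """Expected access latency for a hierarchy given as
    [(hit_rate, latency_ns), ...] from fastest to slowest;
    the last level is assumed to always hit."""
    t, reach = 0.0, 1.0
    for hit_rate, latency in levels:
        t += reach * hit_rate * latency   # fraction of accesses served here
        reach *= (1.0 - hit_rate)         # fraction that falls through
    return t

# Illustrative: 95% L1 hits at 2 ns, 90% of the remainder hit L2
# at 15 ns, the rest goes to main memory at 100 ns.
print(round(average_access_time([(0.95, 2), (0.90, 15), (1.0, 100)]), 3))  # 3.075
```

With good locality, the average stays close to L1 latency even though main memory is ~50x slower.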
Cache Mapping:
- Direct Mapping:
- Simple but prone to conflicts.
- Example: Cache Line = (Memory Block Number) mod (Number of Cache Lines).
- Associative Mapping:
- Flexible but hardware-intensive.
- Any memory block can map to any cache block.
- Set-Associative Mapping:
- Trade-off between flexibility and simplicity.
- Memory block maps to one set, but can use any line within the set.
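A minimal sketch of the index calculations for the three schemes, assuming block-number addressing (offset bits already stripped); fully associative caches need no index at all:

```python
def direct_mapped_line(block, num_lines):
    return block % num_lines   # exactly one legal line per block

def set_associative_set(block, num_sets):
    return block % num_sets    # one set, but any way within it

NUM_LINES = 8
# Direct mapping: blocks 3 and 11 conflict in an 8-line cache...
print(direct_mapped_line(3, NUM_LINES), direct_mapped_line(11, NUM_LINES))  # 3 3
# ...but with 4 sets of 2 ways each, they share set 3 and can coexist.
print(set_associative_set(3, 4), set_associative_set(11, 4))  # 3 3
```

This is the trade-off stated above: set-associativity keeps the cheap modulo index while tolerating a limited number of conflicting blocks per set.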
Superscalar Organization
- Pipeline Width: Ability to fetch and execute multiple instructions per cycle.
- Out-of-Order Execution:
- Decouples fetch, decode, and execute stages.
- Maintains program order through Reorder Buffers (ROB).
- Template:
- Instruction Fetch:
- Fetch n instructions per cycle.
- Use I-cache with aligned cache blocks.
- Instruction Decode:
- Pre-decoding for immediate operand recognition.
- Reservation Stations:
- Hold instructions until dependencies are resolved.
- Execution:
- Integer units, floating-point units, load/store units.
- Retirement:
- Commit results to registers in program order.
- Handle precise interrupts for exceptions.
Improving Performance
Maximizing Throughput
- Branch Prediction:
- Static: Fixed outcome (e.g., always taken).
- Dynamic:
- Based on history.
- 2-Bit Predictors:
- State machine: Strongly/weakly taken or not taken.
- Improves accuracy over simple prediction.
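The 2-bit scheme can be sketched as a toy saturating-counter class (names are illustrative):

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0 = strongly not-taken ... 3 = strongly taken.
    Two consecutive mispredictions are needed to flip the prediction."""
    def __init__(self):
        self.state = 1  # weakly not-taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]  # loop branch with one early exit
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "/", len(outcomes))  # 3 / 5
```

Note that the single not-taken outcome does not flip the prediction, which is exactly the hysteresis that improves accuracy on loop branches over a 1-bit predictor.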
- Register Renaming:
- Solves WAW/WAR hazards.
- Allocates unique physical registers for logical names.
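A toy renaming pass (the (dest, sources) instruction encoding is invented for illustration): every write gets a fresh physical register, so WAW/WAR dependences between architectural names disappear:

```python
def rename(instrs):
    """instrs: list of (dest, [sources]) over architectural registers.
    Returns the same program over unique physical registers."""
    mapping = {}    # architectural name -> current physical register
    next_phys = 0
    out = []
    for dest, srcs in instrs:
        phys_srcs = [mapping.get(s, s) for s in srcs]  # read current names
        mapping[dest] = f"p{next_phys}"                # fresh name per write
        next_phys += 1
        out.append((mapping[dest], phys_srcs))
    return out

# r1 is written twice; after renaming the two writes target p0 and p2,
# so only the true (RAW) dependence p0 -> p1 remains.
prog = [("r1", ["r2"]), ("r3", ["r1"]), ("r1", ["r4"])]
print(rename(prog))  # [('p0', ['r2']), ('p1', ['p0']), ('p2', ['r4'])]
```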
- Tomasulo Algorithm:
- Hardware mechanism for dynamic scheduling.
- Resolves dependencies using reservation stations and common data buses.
Reducing Access Gap
- Load/Store Bypassing:
- Directly forward results to dependent instructions.
- Memory Interleaving:
- Split memory into multiple banks to allow simultaneous access.
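Low-order interleaving can be sketched in a few lines (the bank count is illustrative):

```python
# Consecutive block addresses rotate across banks, so a sequential
# access stream keeps all banks busy simultaneously.
def bank_of(block_addr, num_banks):
    return block_addr % num_banks

NUM_BANKS = 4
# A sequential stream touches banks 0,1,2,3,0,1,... so up to
# NUM_BANKS requests can be in flight at once.
print([bank_of(a, NUM_BANKS) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```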
Case Studies
- PowerPC 620:
- Features:
- 64-bit architecture.
- Out-of-order execution with speculative branch prediction.
- 32 general-purpose and 32 floating-point registers.
- Separate L1 caches for instructions and data.
- Intel P6:
- Innovations:
- Dynamic execution (speculative and out-of-order).
- Advanced branch prediction with BTBs and 2-level predictors.
- Superpipelining with more than 10 stages.