Computer Architecture: Concepts and Evolution
Chapters on Computer Architecture
Chapter 1
1. The Computer Revolution
Progress in Technology: Enables novel applications like genomics, web services, and search engines.
Applications: Embedded in automobiles, smartphones, and more.
Pervasiveness: Computers are integral to various aspects of life.
2. Classes of Computers
Personal Computers: General-purpose, cost-performance tradeoffs.
Server Computers: Network-based, high capacity and reliability.
Supercomputers: High-end calculations, small market fraction.
Embedded Computers: Hidden in systems, strict cost and power constraints.
3. The Post-PC Era
Personal Mobile Devices (PMDs): Battery-operated, internet-connected devices (e.g., smartphones, tablets).
Cloud Computing: Uses Warehouse-Scale Computers (WSC).
Software as a Service (SaaS): Splits software between PMDs and the cloud.
4. Key Concepts and Learning Goals
Translate programs into machine language.
Execute programs and optimize performance.
Design hardware for improved performance and parallel processing.
5. Understanding Performance
Algorithm: Determines the number of operations.
Programming Language & Compiler: Affect the number of instructions.
Processor/Memory: Influence instruction execution speed.
I/O System: Determines I/O operation speed.
6. Seven Great Ideas
1. Use abstraction to simplify design.
2. Optimize for common cases.
3. Leverage parallelism.
4. Utilize pipelining.
5. Employ prediction.
6. Build hierarchical memories.
7. Ensure dependability via redundancy.
7. Computer Components
Standard Components: Input/output devices, processors, memory, and storage.
Below Your Program:
Application Software: High-level languages.
System Software: Compilers and operating systems.
Hardware: Executes machine-level instructions.
8. Inside the Processor
Datapath: Executes data operations.
Control: Manages sequencing of tasks.
Cache Memory: Small, fast memory for immediate data access.
9. Performance Metrics
Response Time: Time to complete a task.
Throughput: Tasks completed per unit time.
Relative Performance: Performance = 1 / Execution Time; X is n times faster than Y when Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n.
10. CPU Performance
CPU Time: CPU Time = Clock Cycles x Clock Cycle Time.
Optimized by reducing cycles or increasing clock rate.
CPI (Cycles Per Instruction): CPI = Clock Cycles / Instruction Count.
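Worked example (illustrative numbers): a program of 10^9 instructions with CPI = 2 on a 2 GHz clock (0.5 ns cycle time) takes
Clock Cycles = 10^9 x 2 = 2 x 10^9
CPU Time = 2 x 10^9 x 0.5 ns = 1 second.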
11. Trends and Benchmarks
Technology Trends: Semiconductor advancements increase capacity and performance.
Power limitations (Power Wall) drive innovation.
Benchmarks: SPEC CPU benchmarks measure performance with standard workloads.
12. Key Pitfalls and Fallacies
Amdahl’s Law: An optimization’s overall impact is limited by the fraction of execution time it does not improve (worked example after this list).
Low Power at Idle: Systems consume substantial power even at low workloads.
MIPS Metric Fallacy: MIPS (millions of instructions per second) is misleading because instruction counts and complexity vary across instruction sets.
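Worked Amdahl’s Law example (assumed numbers): a program runs in 100 s, 80 s of which is multiplication. Making multiplication 4x faster gives T_improved = 80/4 + 20 = 40 s, an overall speedup of 2.5x; the 20 s unaffected portion caps the possible speedup at 5x however fast the multiply becomes.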
13. Concluding Remarks
Advancements: Hierarchical abstraction drives performance and design.
Execution Time: The ultimate metric for performance.
Future: Parallelism and energy efficiency will dominate innovations.
Chapter 2
1. Instruction Sets
Definition: Repertoire of computer instructions.
MIPS Instruction Set: Used for teaching; widely applied in consumer electronics, networking, & storage.
2. Arithmetic Operations
Design Principle 1: Simplicity favors regularity.
Example: add t0, g, h -> t0 = g + h.
C Code Example:
f = (g + h) - (i + j);
Compiled MIPS:
add t0, g, h
add t1, i, j
sub f, t0, t1
3. Registers & Operands
Registers: 32 registers, numbered $0-$31; register $zero ($0) is hard-wired to 0.
Temporary: $t0-$t9, Saved: $s0-$s7.
Design Principle 2: Smaller is faster.
Example: add $t0, $s1, $s2 # t0 = s1 + s2
4. Memory Operands
Byte Addressed: Each address identifies an 8-bit byte.
Load/Store Example: g = h + A[8]; (A’s base address in $s3, h in $s2, g in $s1)
lw $t0, 32($s3) # $t0 = A[8]; byte offset = 8 x 4 = 32
add $s1, $s2, $t0 # g = h + A[8]
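A store is symmetric; a minimal sketch for A[12] = h + A[8] under the same register assignments (an assumed mapping for illustration):
lw $t0, 32($s3) # $t0 = A[8]
add $t0, $s2, $t0 # $t0 = h + A[8]
sw $t0, 48($s3) # A[12] = $t0 (byte offset 12 x 4 = 48)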
5. Immediate Operands
Immediate Example: addi $s3, $s3, 4 # $s3 = $s3 + 4
6. Binary Numbers
Unsigned Range: 0 to 2^n-1.
Signed Range (2’s Complement): -2^(n-1) to 2^(n-1)-1.
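Example for n = 8: unsigned values span 0 to 255; two’s-complement values span -128 to +127.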
7. Logical Operations
Bitwise Operations:
AND: and $t0, $t1, $t2
OR: or $t0, $t1, $t2
NOT: nor $t0, $t1, $zero
sll (Shift Left Logical): Multiplies by 2^i.
srl (Shift Right Logical): Divides unsigned by 2^i.
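Shifts and AND combine for field extraction. A minimal sketch (the field position, bits 10:8 of $t1, is an assumption for illustration):
srl $t0, $t1, 8 # move the field to the low end
andi $t0, $t0, 0x7 # keep only the low 3 bits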
8. Control Flow
Conditional Branches: beq $s1, $s2, Label # branch if (s1 == s2); bne branches when operands differ
Unconditional Jump: j Label
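Branches and jumps together compile loops. A sketch of while (save[i] == k) i += 1; assuming i in $s3, k in $s5, and save’s base address in $s6 (an illustrative register mapping):
Loop: sll $t1, $s3, 2 # $t1 = i x 4 (byte offset)
add $t1, $t1, $s6 # $t1 = address of save[i]
lw $t0, 0($t1) # $t0 = save[i]
bne $t0, $s5, Exit # leave the loop if save[i] != k
addi $s3, $s3, 1 # i += 1
j Loop # test again
Exit: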
9. Procedures
Steps:
1. Place arguments in $a0-$a3.
2. Call using jal.
3. Return with jr $ra.
Stack Frame:
Allocate: addi $sp, $sp, -8 (adjust $sp before saving).
Save: sw $ra, 4($sp).
Restore: lw $ra, 4($sp), then addi $sp, $sp, 8 to pop the frame.
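Putting the steps together, a sketch of a leaf procedure computing f = (g + h) - (i + j), with arguments in $a0-$a3, the result in $v0, and $s0 saved because the callee overwrites it:
leaf_example:
addi $sp, $sp, -4 # make room on the stack
sw $s0, 0($sp) # save $s0 for the caller
add $t0, $a0, $a1 # $t0 = g + h
add $t1, $a2, $a3 # $t1 = i + j
sub $s0, $t0, $t1 # $s0 = (g + h) - (i + j)
add $v0, $s0, $zero # place result in $v0
lw $s0, 0($sp) # restore $s0
addi $sp, $sp, 4 # pop the stack
jr $ra # return to caller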
10. Instruction Representation
R-Format Fields: op (6 bits), rs (5), rt (5), rd (5), shamt (5), funct (6).
Example: add $t0, $s1, $s2 -> op = 0, rs = 17 ($s1), rt = 18 ($s2), rd = 8 ($t0), shamt = 0, funct = 0x20; machine word 0x02324020.
11. Addressing Modes
Register: Operands in registers.
Immediate: Constants in instructions.
PC-Relative: Offset added to PC.
Base: Offset added to base address.
12. Synchronization
Atomic Swap Example (swap $s4 with the word at 0($s1)):
try: add $t0, $zero, $s4 # copy exchange value
ll $t1, 0($s1) # load linked
sc $t0, 0($s1) # store conditional; $t0 = 0 on failure
beq $t0, $zero, try # retry if the store failed
add $s4, $zero, $t1 # put the loaded value in $s4
13. Compiling High-Level Code
If Statement:
if (i == j) f = g + h; else f = g - h; (i in $s3, j in $s4, f in $s0, g in $s1, h in $s2)
bne $s3, $s4, Else # skip then-clause if i != j
add $s0, $s1, $s2 # f = g + h
j Exit # skip else-clause
Else: sub $s0, $s1, $s2 # f = g - h
Exit:
14. Optimization Insights
Compiler Optimizations:
Reduce instruction count & CPI.
Optimize algorithms.
Chapter 3
1. Integer Operations
Addition: Overflow occurs if the result exceeds the signed range.
Positive + Positive -> Overflow if sign bit = 1.
Negative + Negative -> Overflow if sign bit = 0.
Subtraction: Performed as addition of the negated operand.
Overflow occurs if the result exceeds the signed range.
Positive - Negative -> Overflow if sign bit = 1.
Negative - Positive -> Overflow if sign bit = 0.
2. Dealing with Overflow
Languages like C: Ignore overflow using addu, subu.
Languages like Ada, Fortran: Raise an exception using add, addi, sub.
Handling Overflow: Use exception handlers.
Save PC in EPC register, jump to handler.
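Where overflow is ignored by the hardware (addu), it can still be detected in software. A minimal sketch for the signed sum $t1 + $t2 (labels are illustrative):
addu $t0, $t1, $t2 # sum without trapping
xor $t3, $t1, $t2 # sign bit set if operand signs differ
bltz $t3, no_ovf # operands of different sign never overflow
xor $t3, $t0, $t1 # sign bit set if result sign flipped
bltz $t3, overflow # same-sign operands, flipped result: overflow
no_ovf: # continue normally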
3. Arithmetic for Multimedia
SIMD (Single Instruction, Multiple Data): Operates on vectors (e.g., 8×8-bit, 4×16-bit).
Uses partitioned carry chain for vector operations.
Saturating Arithmetic: On overflow, returns max representable value.
4. Multiplication
MIPS Instructions:
mult rs, rt: 64-bit product in HI (MSBs) & LO (LSBs).
mfhi rd, mflo rd: Move results to rd.
mul rd, rs, rt: Outputs least-significant 32 bits.
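A short sketch of retrieving the full 64-bit product (register choices are illustrative):
mult $s0, $s1 # HI:LO = $s0 x $s1
mflo $t0 # $t0 = low 32 bits
mfhi $t1 # $t1 = high 32 bits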
5. Division
Steps:
Check for divisor = 0.
Long division approach: Generate quotient bits one at a time.
Restore the remainder (add the divisor back) if a subtraction step leaves it negative.
Use absolute values for signed division; adjust signs of quotient/remainder.
MIPS Instructions:
div rs, rt: Result stored in HI (remainder) & LO (quotient).
Use mfhi, mflo to access results.
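Sketch of $s0 / $s1 (illustrative registers):
div $s0, $s1 # LO = quotient, HI = remainder
mflo $t0 # $t0 = quotient
mfhi $t1 # $t1 = remainder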
6. Floating-Point Representation
IEEE 754 Standard:
Single Precision (32-bit): 1 sign, 8 exponent, 23 fraction bits.
Double Precision (64-bit): 1 sign, 11 exponent, 52 fraction bits.
Exponent bias: Single = 127, Double = 1023.
Format: (-1)^S * (1.Fraction) * 2^(Exponent - Bias).
Ranges:
Single Precision: ±1.2 x 10^-38 to ±3.4 x 10^38.
Double Precision: ±2.2 x 10^-308 to ±1.8 x 10^308.
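Worked encoding example: -0.75 = -1.1_2 x 2^-1. In single precision: sign = 1; exponent field = -1 + 127 = 126 = 01111110_2; fraction = 1000...0_2.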
7. Parallelism & Subword Operations
Subword Parallelism: SIMD for graphics/audio processing.
Examples: 16×8-bit, 8×16-bit, 4×32-bit operations.
Matrix Multiplication: Optimized using SIMD & pipelining for performance.
8. Floating-Point Accuracy
Rounding: IEEE 754 defines rounding modes (e.g., round to nearest even); hardware uses guard, round, and sticky bits to round results accurately.
Options to fine-tune numerical precision.
Programming Considerations: Balance hardware complexity, performance, & accuracy.
Floating-point addition is not associative, so validate parallel computations whose operation order differs from the sequential order.
Chapter 4
1. CPU Performance Factors
Instruction Count: Determined by ISA and compiler.
CPI (Cycles per Instruction) and Cycle Time: Determined by CPU hardware.
2. Instruction Execution
1. Fetch instruction from memory.
2. Decode and read registers.
3. Perform operation via ALU.
4. Access memory if needed.
5. Write results to the register file.
3. Key Processor Components
Datapath: Registers, ALU, multiplexers, memory.
Control Unit: Directs operations and control signals.
Multiplexers: Select among alternative data sources feeding the datapath.
4. Building a Datapath
Instruction Fetch: Increment PC by 4.
R-Format Instructions: Perform arithmetic/logical operations.
Load/Store: Use ALU to compute memory address.
Branch Instructions: Calculate target address, check Zero flag.
5. Pipelining
Stages:
1. IF: Fetch instruction.
2. ID: Decode and read registers.
3. EX: Execute or calculate address.
4. MEM: Access memory.
5. WB: Write back result.
Pipeline Performance: Improves throughput; CPI approaches 1.
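Worked example (assumed stage latencies): if the five stages take 200, 100, 200, 200, and 100 ps, a single-cycle design needs an 800 ps clock, while the pipelined clock is set by the slowest stage at 200 ps. Once full, the pipeline completes one instruction every 200 ps, about a 4x throughput gain.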
6. Hazards in Pipelining
Structural Hazards: Resource conflicts (resolved with separate instruction/data memories).
Data Hazards: Dependency on previous instruction results.
Solutions: Forwarding, stalls, reordering instructions (sketch after this list).
Control Hazards: Branch prediction (static or dynamic).
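A minimal data-hazard sketch (illustrative registers): the sub reads $s0 one instruction after the add produces it, and forwarding routes the ALU result back so no stall is needed:
add $s0, $t0, $t1 # $s0 produced in the EX stage
sub $t2, $s0, $t3 # $s0 consumed; forwarded from the add’s EX/MEM result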
7. Branch Prediction
Static: Predict ‘not taken’, or predict from typical loop behavior (backward branches usually taken).
Dynamic: Use hardware to track branch history and update predictions.
Penalty: Incorrect predictions stall the pipeline.
8. Exception Handling
Exceptions: Events requiring change in control flow (e.g., overflow).
Interrupts: External events requiring service.
Handling: Save PC in EPC register.
Identify cause (Cause register).
Jump to exception handler.
9. Instruction-Level Parallelism (ILP)
Multiple Issue:
Static: Compiler schedules instructions.
Dynamic: CPU schedules instructions, avoiding hazards.
Speculation: Execute instructions ahead of control dependencies, roll back if incorrect.
10. Performance Optimizations
Pipeline Depth: More stages reduce per-stage work but increase hazards.
Loop Unrolling: Expose parallelism by replicating loop bodies.
Register Renaming: Avoid data hazards with temporary registers.
11. Key Challenges
Dependencies: Data, structural, control.
Power Efficiency: Trade-offs between complexity and power consumption.
Instruction Set Design: Must align with pipeline needs.
12. Conclusion
Pipelining: Enhances throughput via parallelism.
Hazards: Addressed with advanced techniques like forwarding, prediction, and dynamic scheduling.
Future Directions: Focus on power-efficient designs and maximizing ILP.
Chapter 5
1. Principle of Locality
Temporal Locality: Recently accessed items are likely to be accessed again soon.
Spatial Locality: Items near recently accessed ones are likely to be accessed soon.
2. Memory Hierarchy Levels
Cache Memory: Fast, close to CPU (SRAM).
Main Memory: DRAM.
Disk Storage: Large, slow.
Key Metrics:
Hit: Data found in upper level.
Miss: Data not found in the upper level; the block is copied in from the lower level.
Miss Penalty: Time to retrieve a block from a lower level.
3. Cache Basics
Direct Mapped Cache: One location per block.
Associative Cache: Block can be stored in multiple locations.
Set Associative Cache: A block maps to one set and may go in any of that set’s N locations (N-way).
Replacement Policies:
Least Recently Used (LRU): Replace the block unused for the longest time.
Random: Simpler, approximates LRU.
4. Improving Cache Performance
Average Memory Access Time (AMAT): AMAT = Hit Time + (Miss Rate x Miss Penalty); worked example after this list.
Techniques: Increase cache size, block size, and associativity.
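Worked AMAT example (assumed numbers): hit time 1 cycle, miss rate 5%, miss penalty 20 cycles gives AMAT = 1 + 0.05 x 20 = 2 cycles per access.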
5. Write Strategies
Write-Through: Update cache and lower levels.
Write-Back: Update lower levels only when block is replaced.
Write Allocation: Fetch block on a write miss.
6. Multilevel Caches
Primary Cache (L1): Small, fast, minimal hit time.
Secondary Cache (L2): Larger, reduces main memory access.
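Worked example (assumed numbers): a 4 GHz CPU with base CPI 1, a 2% L1 miss rate, and 100 ns (400-cycle) main memory has effective CPI = 1 + 0.02 x 400 = 9 with L1 alone. Adding a 5 ns (20-cycle) L2 that cuts misses to main memory to 0.5% gives CPI = 1 + 0.02 x 20 + 0.005 x 400 = 3.4, roughly a 2.6x speedup.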
7. Virtual Memory
Uses main memory as a cache for secondary storage.
Pages: Fixed-size blocks in virtual memory.
Page Fault: Occurs when data is not in main memory.
8. Dependability
Fault Types:
Soft Errors: Non-permanent.
Hard Errors: Permanent.
Error-Correcting Codes (ECC): Detect and correct bit errors.
9. Cache Coherence in Multiprocessors
Snooping Protocols: Monitor shared memory operations.
Directory-Based Protocols: Use a directory to track block sharing.
10. Advanced Techniques
Blocking: Optimize memory accesses in algorithms.
Prefetching: Load data before it is needed.
Non-Blocking Caches: Allow multiple cache misses to overlap.
11. Key Challenges
Power Consumption: Balancing performance and efficiency.
Memory Access Patterns: Optimizing software for cache usage.
12. Concluding Remarks
Caching: Key to achieving fast, large memories.
Principle of Locality: Core to memory hierarchy design.
Important Keywords and Definitions
Memory Hierarchy
Temporal Locality: The concept that recently accessed memory locations are likely to be accessed again soon.
Spatial Locality: The concept that memory locations near recently accessed addresses are likely to be accessed in the near future.
Cache Memory
Cache Memory: A small, fast memory located close to the CPU to store frequently accessed data.
Direct Mapped Cache: A cache structure where each block of main memory maps to only one location in the cache.
Associative Cache: A cache structure where a block can be placed in any location, providing flexibility but increasing complexity.
Set Associative Cache: A hybrid cache organization that divides the cache into sets, where a block can be mapped to any location within a set.
Least Recently Used (LRU): A cache replacement policy that removes the least recently accessed block.
Write Strategies
Write-Through: A cache write strategy where data is written to both the cache and main memory simultaneously.
Write-Back: A cache write strategy where data is written to main memory only when it is evicted from the cache.
Write Allocation: A policy where a block is loaded into the cache on a write miss before writing to it.
Virtual Memory
Virtual Memory: A memory management technique where secondary storage (e.g., disk) is used to extend the apparent size of main memory.
Page Fault: An event that occurs when a program accesses a page not currently in main memory, requiring it to be loaded from secondary storage.
Dependability
Error-Correcting Codes (ECC): Techniques used to detect and correct errors in memory or data transmission.
Cache Coherence
Snooping Protocols: Methods used in multiprocessor systems to monitor and maintain consistency of shared cache lines.
Performance Techniques
Blocking: A technique to optimize memory accesses in algorithms by processing blocks of data that fit into the cache.
Prefetching: The process of loading data into the cache before it is requested by the processor.
Non-Blocking Caches: Caches that allow multiple memory operations to proceed simultaneously.
Processor Performance
Instruction-Level Parallelism (ILP): A measure of how many instructions can be executed simultaneously in a pipeline.
Multicore Processors: CPUs with multiple processing cores on a single chip to enhance parallelism and performance.
Pipeline Optimizations
Branch Prediction: A technique to guess the outcome of a branch instruction to reduce pipeline stalls.
Pipeline Stalls: Delays in the pipeline caused by dependencies or unresolved instructions.
Forwarding: A technique to resolve data hazards by using intermediate results directly from the pipeline.
Answer to the Questions:
Fastest and Slowest Memory in Traditional Computer Architecture:
- Fastest Memory: Cache Memory (especially L1 cache) because it is small and located close to the CPU, designed for rapid data retrieval.
- Slowest Memory: Disk Storage (e.g., Hard Disk Drives) due to its mechanical nature and significant latency compared to electronic memories like RAM or cache.
Difference Between Direct Mapping and Associative Mapping of Memory:
- Direct Mapping: Each block of main memory maps to exactly one cache line. It is simple and fast but can lead to frequent conflicts if multiple blocks map to the same line.
- Associative Mapping: A block of main memory can map to any cache line. This approach reduces conflicts and improves cache utilization but increases the complexity of searching for a block in the cache.
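MIPS Program: Triangle Area
The program below reads a base (B) and a height (H), computes A = (B x H) / 2, and prints the result, using the SPIM/MARS syscall conventions.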
.data
promptB: .asciiz "Enter the base (B) of the triangle: "
promptH: .asciiz "Enter the height (H) of the triangle: "
result: .asciiz "The area of the triangle is: "
.text
.globl main
main:
# Prompt the user for the base (B)
li $v0, 4 # syscall for printing string
la $a0, promptB # load address of promptB into $a0
syscall
li $v0, 5 # syscall for reading an integer
syscall
move $t0, $v0 # store the base (B) in $t0
# Prompt the user for the height (H)
li $v0, 4 # syscall for printing string
la $a0, promptH # load address of promptH into $a0
syscall
li $v0, 5 # syscall for reading an integer
syscall
move $t1, $v0 # store the height (H) in $t1
# Calculate the area: A = (B * H) / 2
mul $t2, $t0, $t1 # $t2 = B * H
srl $t3, $t2, 1 # $t3 = (B * H) / 2; shift right by 1 halves the product (valid for the non-negative values assumed here)
# Print the result
li $v0, 4 # syscall for printing string
la $a0, result # load address of result message into $a0
syscall
li $v0, 1 # syscall for printing integer
move $a0, $t3 # move the calculated area to $a0
syscall
# Exit program
li $v0, 10 # syscall for exiting program
syscall