Essential Concepts in Computer Architecture and Systems

Essential Concepts in Computer Architecture

ISA: Instruction Set Architecture. The set of instructions a CPU supports. Common ISAs: x86, MIPS, ARM, POWER, SPARC.

CISC: Complex Instruction Set, various instruction sizes. Example: x86.

RISC: Reduced Instruction Set, primitive operations. Example: MIPS.

The time for a circuit to stabilize is determined by circuit complexity, transistor properties, wire length, and materials used.

Cost per die = (Cost per wafer) / (Dies per wafer * Yield).    Dies per wafer = Wafer area / Die area.

Yield = 1 / (1 + die area * defect rate)^N, where N reflects process complexity (N = 1 in the simplest model).
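Worked example (all numbers assumed): a $5,000 wafer with 70,000 mm^2 of usable area and 100 mm^2 dies gives 700 dies per wafer; with a defect rate of 0.01 defects/mm^2 and N = 1, yield = 1 / (1 + 100 * 0.01) = 0.5, so cost per die = 5000 / (700 * 0.5) ≈ $14.29.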

Instructions per second (IPS) = clock frequency / CPI = 1 / (CPI * clock period). When CPI = 1, IPS equals the clock frequency (1 / clock period).

A Cycle is the period between two rising clock edges.

Clock Period (or Cycle time) is the length of time for one cycle.

Clock Rate (or frequency) is the number of cycles per second or 1 / clock period.

A basic CPU can issue one instruction per cycle.

Performance_X = 1 / Execution time_X.   Speedup of X over Y = Performance_X / Performance_Y = Execution time_Y / Execution time_X.

Execution Time = Clock Cycles * Clock Period

Energy Consumption Formula: E = P * t, where E is energy, P is power, and t is time.

Clock Cycles = Clock Frequency * Execution Time or Number of Instructions * CPI
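Worked example (assumed numbers): a program of 10^9 instructions with CPI = 2 on a 2 GHz clock takes 2 * 10^9 cycles; execution time = (2 * 10^9 cycles) * (0.5 ns/cycle) = 1 s, and IPS = clock frequency / CPI = 10^9.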

Moore’s Law: Transistor density will double approximately every two years. 

Dennard Scaling: As transistors shrink, power density remains constant.                                                                           

Dynamic Power: Activity * Capacitance * Voltage^2 * Frequency.
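Worked example (assumed numbers): lowering voltage from 1.0 V to 0.8 V at the same activity, capacitance, and frequency scales dynamic power by 0.8^2 = 0.64, a 36% reduction.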

Response Time: How long it takes to complete one task.

Throughput: How many tasks are completed per unit of time.

Amdahl’s Law: A design change that speeds up X% of a workload can at most reduce a response time by X%.

Amdahl’s Law: T_opt = T_affected / Improvement factor + T_unaffected.
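Worked example (assumed numbers): a 100 s program with 80 s affected by a 4x improvement gives T_opt = 80 / 4 + 20 = 40 s, a speedup of 100 / 40 = 2.5x, even though the affected part sped up 4x.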

Most Significant Bit (MSB) is on the left, and Least Significant Bit (LSB) is on the right. 

1 Byte = 8 bits.       The max value an unsigned byte can hold is 255.

Binary place values start at 2^0 and go 2^1, 2^2, 2^3, 2^4, 2^5…

This would be an example of 8-bit binary (here all bits are 0, so the value is 0):

Place value:  128   64   32   16    8    4    2    1
Bit:            0    0    0    0    0    0    0    0
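For a nonzero example: 01100100 = 64 + 32 + 4 = 100.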

Two’s Complement: If MSB is 0, the number is positive. If the MSB is 1, then the number is negative.
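Example (8 bits): 5 = 00000101; to negate, invert the bits and add 1: 11111010 + 1 = 11111011 = -5. All ones (11111111) is -1.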

Floating Point: the bits are divided into a sign bit, exponent, and fraction: (-1)^Sign * Fraction * 2^Exponent

Single Precision (32 bits): 1 sign bit (S), 8 exponent bits (E), 23 fraction bits (F): (-1)^S * 1.F * 2^(E - 127)

Double Precision (64 bits): 1 sign bit (S), 11 exponent bits (E), 52 fraction bits (F): (-1)^S * 1.F * 2^(E - 1023)
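Worked example (single precision): 0x40400000 = 0 | 10000000 | 10000000000000000000000 –> sign 0, E = 128, F = 0.5, so the value is (1 + 0.5) * 2^(128 - 127) = 3.0.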

MIPS Instructions and Architecture

An Instruction is a simple directive that the CPU can execute.

add –> Add      sub –> Subtract      slt –> Set Less Than      addi –> Add Immediate      subi –> Subtract Immediate          

MIPS Arithmetic: add a,b,c –> a = b + c AND sub a,b,c –> a = b – c

You cannot perform more than one add per instruction; for example, x + y + z needs two instructions, as in the sketch below.
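A minimal sketch, assuming x, y, and z live in $s0, $s1, and $s2:

    add $t0, $s0, $s1   # $t0 = x + y
    add $t0, $t0, $s2   # $t0 = (x + y) + z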

MIPS has 32 registers, including $t0 – $t9 (temporaries), $s0 – $s7 (saved registers), and $zero (always holds 0).

An Immediate is a constant value hard-coded into an instruction. Example:

addi a,b,imm –> a = b + imm AND subi a,b,imm –> a = b – imm (where a and b must still be registers).

Array Size = Element Size * Number of Elements. Example: int arr[6] = 24 bytes (4 * 6) AND double arr[2] = 16 bytes (8 * 2)

The address of element index i is: base address + i * element size. Example: starting at address 40: arr[3] = 40 + (4 * 3) = 52

Load Word (4 bytes from Memory –> Register):    lw r, address        # r = memory[address]

Store Word (4 bytes from Register –> Memory):    sw r, address        # memory[address] = r

General Format of LW: lw destination(register), offset(constant)(base(register))

Example of lw: Load Memory at address 8 + $t0, put into $s1: lw $s1, 8($t0)

General Format of SW: sw source(register), offset(constant)(base(register))

Example of sw: Put $s1’s value into memory at address 8 + $t0: sw $s1, 8($t0)

MIPS Example:

arr[3] += 1;   in MIPS: arr is type int[] and $s3 holds address of arr

lw $t0, 12($s3)  # load arr[3] (3×4 byte offset)

addi $t0, $t0, 1 # add 1 to it

sw $t0, 12($s3) # store it back

Data Segment: Stores data.     Text Segment: Stores Instructions.

Example:
     .data
X:        .word 65
str:      .asciiz "Hello"

     .text
          lw .....
          add ....

Boolean Logic and Bitwise Operations

Boolean Logic: Treat 0 as False and 1 as True

[Figure: summary of the common Boolean logic gates with their symbols]

Bitwise Operations. Bitwise negation (not): not(0101) = 1010. Likewise, and(0101, 0011) = 0001, or(0101, 0011) = 0111, and xor(0101, 0011) = 0110.


Shamt (shift amount) is a constant between 0 and 31 (a 5-bit field).

Example: int X = -5249578 and int Y = X & X. Then Y = -5249578, since x & x = x.

Example: int X = 12; // binary = 1100 and int Y = 5; // binary = 0101. Then X | Y = 1100 | 0101 = 1101, which is 13.

A large immediate (one that does not fit in the 16-bit immediate field) means you need to use multiple instructions.

Example: int x = -1430532899; // 0xAABBCCDD;
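A common pattern is a lui/ori pair (a sketch; the register choice is arbitrary): lui sets the upper 16 bits, and ori fills in the lower 16:

    lui $t0, 0xAABB        # $t0 = 0xAABB0000
    ori $t0, $t0, 0xCCDD   # $t0 = 0xAABBCCDD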


MIPS Control Flow Examples

MIPS IF statement example (Java: int x = 5; int y = 10; if (x < y) { x = 1; } else if (x == y) { x = 2; }):

# Initialize x = 5 and y = 10
    li $t0, 5          # $t0 holds x
    li $t1, 10         # $t1 holds y
# First condition: if (x < y)
    blt $t0, $t1, if_block      # If x < y, jump to if_block
# Second condition: else if (x == y)
    beq $t0, $t1, else_if_block # If x == y, jump to else_if_block
# End of conditions
    j end_if_else
# if block: x = 1
if_block:
    li $t0, 1          # x = 1
    j end_if_else      # Skip to end
# else if block: x = 2
else_if_block:
    li $t0, 2          # x = 2
# End of if-else if chain
end_if_else:

MIPS FOR loop (count $s0 up from 0 until it equals $s1):

    li $s0, 0          # initialize $s0 (x) to 0
top:                   # loop label
    beq $s0, $s1, done # if $s0 (x) == $s1 (y), exit loop
    addi $s0, $s0, 1   # loop body: x = x + 1
    j top              # jump back to top
done:                  # loop exit point


The ASCII value is a numerical representation of a character in the ASCII (American Standard Code for Information Interchange) encoding system. 

32 = Blank Space       48 – 57 = 0 – 9      65 – 90 = A – Z (Upper Case)     97 – 122 = a – z (Lower Case)

Strings are just arrays of characters (1 byte each); by convention, strings end with a null terminator (a zero byte).

This is how programs know where the strings end. 

Ex: “hello” ends in a zero byte to show the computer where the string is done:

Index:   1    2    3    4    5    6
Byte:  104  101  108  108  111    0
Char:    h    e    l    l    o   \0

Each character is 1 byte (not a full int), so the string “hello” takes 6 bytes total; the zero terminator counts as a byte.


Single-Cycle CPU: Pros: simple, few transistors. Cons: lots of components sit idle most of the time.

Pipelining: Pipelining is a form of parallelism – simultaneously working on multiple tasks.

5-Stage Pipeline: instruction fetch (IF, instruction memory), register read/decode (ID), ALU (EX), data memory (MEM), register write-back (WB).


Control lines are signals used in digital circuits, particularly in microprocessors and control units, to manage and direct the operation of various components within the system. They help to orchestrate the flow of data and control signals between different parts of a processor, memory, and other components, ensuring that the system operates as intended.

A data dependency occurs when one instruction in a sequence depends on the result of a previous instruction. In a pipeline, this dependency can cause delays (or hazards) because the processor has to wait for data to be produced by a previous instruction before it can proceed with the next one. Ways to handle data dependencies include stalling, forwarding, out-of-order execution, register renaming, and adding a nop.

A data hazard occurs in a pipelined processor when there is a conflict between instructions that depend on each other’s data, potentially causing incorrect results or delays in execution. In essence, data hazards arise when one instruction requires data that is being produced by a previous instruction, but that data is not yet available because the earlier instruction is still being processed. Hazards must be eliminated for correct execution of a program.


Example of how a RAW (read-after-write) data hazard is handled in a pipelined processor with data forwarding (no stall): I1’s result is forwarded from the EX/MEM pipeline register straight into I2’s EX stage, so I2 never has to wait.

I1: R1 = R2 + R3 // Instruction 1 writes to R1

I2: R4 = R1 + R5 // Instruction 2 reads from R1

Cycle |  I1  |  I2  |  I3  |  I4  |  I5  |  I6  |  I7
  1   |  IF  |      |      |      |      |      |
  2   |  ID  |  IF  |      |      |      |      |
  3   |  EX  |  ID  |  IF  |      |      |      |
  4   |  MEM |  EX  |  ID  |  IF  |      |      |
  5   |  WB  |  MEM |  EX  |  ID  |  IF  |      |
  6   |      |  WB  |  MEM |  EX  |  ID  |  IF  |
  7   |      |      |  WB  |  MEM |  EX  |  ID  |  IF
  8   |      |      |      |  WB  |  MEM |  EX  |  ID

Two terms to help identify hazards:

Point of Production (PoP) – the point in time at which a value is produced.

Point of Consumption (PoC) – the point in time at which a value is needed.

Identifying Data Hazards

1. Identify all data dependencies (not all are necessarily hazards).

2. Identify the PoP and the PoC; both depend heavily on the type of instruction:

     add consumes before the ALU and produces after the ALU

     lw consumes before the ALU and produces after the MEM

     sw has two PoCs (before ALU and before MEM) and no PoP

3. If the PoC is before the PoP, it’s a hazard (see the load-use sketch below).
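A minimal load-use sketch (registers chosen arbitrarily):

    lw  $t0, 0($s0)     # PoP: $t0 is produced after MEM
    add $t1, $t0, $t2   # PoC: $t0 is consumed before the ALU

add’s PoC comes before lw’s PoP, so this is a hazard; even with forwarding, one stall cycle is needed.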

Data Hazard: occurs when an instruction depends on the result of a previous instruction that hasn’t completed yet. In simpler terms, it’s a situation where data required by an instruction is not ready in time because another instruction is still working on it. 

Structural Hazard: occurs when multiple instructions need to use the same hardware resource (like a register file, memory, or ALU) at the same time. This creates a resource conflict because the hardware can’t handle the requests simultaneously. 

Control Hazard: occurs when the processor doesn’t know which instruction to fetch next because the decision depends on the outcome of a previous instruction (often a branch or jump instruction). The processor may need to wait until it knows which direction to take for the next instruction. 

FIXES FOR THEM:

Data Hazards: Use data forwarding or pipeline stalls (NOPs). Data forwarding is more efficient but may not be available in all architectures.

Structural Hazards: Solve by adding more resources (e.g., extra memory or ALUs) or pipeline scheduling to ensure instructions don’t compete for the same resource.

Control Hazards: Use branch prediction, delayed branching, or pipeline stalls (NOPs). Branch prediction is the most effective but requires sophisticated hardware support.

In MIPS, delayed branching is a technique used to handle control hazards that occur due to branch instructions. In a traditional pipeline, when a branch instruction is encountered, the processor must wait to determine whether the branch will be taken or not before fetching the next instruction. This introduces a delay because the instruction following the branch is not immediately executed, creating a pipeline “bubble.” With delayed branching, MIPS avoids this delay by executing the instruction immediately following the branch instruction (the “delay slot”) regardless of whether the branch is taken. This instruction is executed even before the branch decision is made, which helps keep the pipeline busy and minimizes idle cycles. 
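A sketch of a branch with its delay slot (registers and label arbitrary):

    beq $t0, $t1, target   # branch
    add $t2, $t3, $t4      # delay slot: executes whether or not the branch is taken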

The main goal of branch prediction is to reduce the performance penalties associated with branch instructions, where the processor doesn’t know whether the branch will be taken or not until it evaluates the condition of the branch. Prediction is about 96% accurate on average.

Instruction-Level Parallelism (ILP) is a type of parallelism in which we process multiple instructions (from the same program) at once; pipelining is a prime example. Our 5-stage pipeline exhibits 5-way parallelism.

Two instruction sequences are equivalent if they end in the same state.

The compiler cannot always find the best ordering: it cannot predict branches, it cannot predict stall lengths, and there are too many different microarchitectures to target.

Compilers are limited because they operate on static information (info known before the program runs)

The CPU can operate on dynamic information, such as recent branch history.

The Reorder Buffer (ROB) is a key component in modern pipelined processors, particularly in out-of-order execution architectures. It plays a critical role in ensuring that the processor can execute instructions out of order while maintaining the correct program order and data integrity when writing results back to registers or memory.

ROB Ex:

add $t0, $t1, $t2    # $t0 = $t1 + $t2

mul $t3, $t4, $t5    # $t3 = $t4 * $t5

sub $t6, $t7, $t8    # $t6 = $t7 - $t8

These instructions are independent, so they can be executed out of order, but the results must be written back (committed) in program order.



A superscalar architecture is a type of CPU design that allows the execution of multiple instructions per clock cycle by using multiple execution units. This design aims to increase the instruction-level parallelism (ILP) of a processor, making it capable of executing several instructions simultaneously, as opposed to a traditional scalar processor that executes one instruction per clock cycle.

The main difference between a scalar and a superscalar processor is that while a scalar processor executes one instruction at a time in sequence, a superscalar processor has multiple execution units (like ALUs, floating-point units, load/store units) that can process multiple instructions in parallel.

Cycles per element (CPE) is a measure used to evaluate the performance of a processor or system, particularly in the context of executing a given instruction or operation. It indicates the number of clock cycles required to process a single element, such as an instruction, data element, or task. CPE can be used to assess how efficiently the processor executes specific operations, taking into account its architecture, pipeline design, and various stages of processing. For example, a loop that takes 800 cycles to process 400 array elements has a CPE of 2.

Bottleneck refers to any component or stage in a system that limits the overall performance because it restricts the flow of work or data. In the case of pipelining, superscalar architectures, and cycles per element (CPE), bottlenecks can occur at various points in the pipeline or execution flow. Identifying and addressing these bottlenecks is crucial for improving system performance.

Loop unrolling is an optimization technique used to increase the execution speed of a loop by decreasing the overhead of loop control and increasing the instruction-level parallelism. It works by manually expanding the loop body and performing multiple iterations within a single iteration of the loop.

Ex of Loop Unrolling:

Original loop:

for (int i = 0; i < N; i++) {
    A[i] = B[i] + C[i];
}

Unrolled by a factor of 4 (assuming N is a multiple of 4):

for (int i = 0; i < N; i += 4) {
    A[i]   = B[i]   + C[i];
    A[i+1] = B[i+1] + C[i+1];
    A[i+2] = B[i+2] + C[i+2];
    A[i+3] = B[i+3] + C[i+3];
}

DRAM: Main memory is Dynamic Random Access Memory, made of 1 capacitor + 1 transistor per bit; it requires a different fabrication process than logic.

On-chip memories are made of Static Random Access Memory (SRAM). Built from just transistors, SRAM is much faster than DRAM.

SRAM: Fast, Bulky, Costly ($)

DRAM: Slow, Dense, Cheap ($)

Data caches are a type of memory cache used to store data that is frequently accessed by the CPU. They help improve performance by reducing the time it takes to fetch data from main memory (RAM), which is much slower than the processor’s cache. Caches act as intermediaries between the fast CPU and the slower main memory, storing copies of the data that the processor is likely to use again.

Caches are designed around two phenomena:

Temporal Locality: recently used data is likely to be used again soon (e.g., a loop counter).

Spatial Locality: nearby data is likely to be accessed next (e.g., sequential array elements).


Hit Rate is the percentage of accesses that hit; it is a general measure of the cache’s effectiveness (e.g., 95 hits out of 100 accesses is a 95% hit rate).

In a pipelined processor, multiple stages of instruction execution are overlapped. This means that while one instruction is in the execution stage, another can be in the decode stage, and another can be fetched. Each stage of the pipeline works on a different instruction, allowing new instructions to be started on every clock cycle, thus increasing throughput.

In a single-cycle processor, each instruction is executed in one cycle, meaning only one instruction can be processed per cycle. Therefore, new instructions cannot begin until the previous one has completed.

A pipelined processor is an efficient way to increase instruction throughput by overlapping the execution of multiple instructions. While it offers significant performance improvements, it introduces complexity due to pipeline hazards, increased latency for individual instructions, and the need for advanced techniques like forwarding, branch prediction, and out-of-order execution to mitigate these issues. Pipelining is a fundamental concept in modern processor design, allowing processors to perform a large number of instructions per second while managing complex execution tasks.

In general, pipelining improves instruction throughput, not the latency of any individual instruction.

Data-Level Parallelism (DLP): DLP refers to performing the same operation on multiple data elements simultaneously. This type of parallelism is useful when there are many data elements that can be processed in the same way, such as in vectorized operations or array computations. 

Task-Level Parallelism (TLP): TLP refers to the parallelism that comes from dividing a program into separate tasks (or threads) that can run concurrently. Each task may perform different operations, but they can be executed simultaneously on different cores or processors. 

Instruction-Level Parallelism (ILP): ILP refers to the ability to execute multiple instructions simultaneously within a single processor. This type of parallelism exploits the fact that many instructions in a program can be executed independently of each other.


Hex | Binary     Hex | Binary
 0  | 0000        8  | 1000
 1  | 0001        9  | 1001
 2  | 0010        A  | 1010
 3  | 0011        B  | 1011
 4  | 0100        C  | 1100
 5  | 0101        D  | 1101
 6  | 0110        E  | 1110
 7  | 0111        F  | 1111

SIMD: Single Instruction Multiple Data is one implementation of data-level parallelism

Instruction inputs/outputs are vectors instead of single values: e.g., addv a, x, y adds the vectors x and y element-wise and puts the result in a. SIMD grayscale example, 4 pixels at a time: total = red; total += green; total += blue; gray = total / 3. gray is now a vector containing the grayscale value for 4 different pixels, and this is about 4x faster than the non-SIMD version. The huge advantage of SIMD is that we don’t need to parallelize everything: fetching and decoding a single instruction can perform, say, 32 adds, which saves lots of transistors (die area).

Many cache coherence protocols are based on “snooping”: each cache watches the others’ memory requests, since everyone is hooked up to a common bus.

MESI (Modified, Exclusive, Shared, Invalid) is a widely-used cache coherence protocol that ensures that copies of data in different caches of a multiprocessor system remain consistent with each other. It is used in systems with multiple processors (or cores), each having its own cache. The MESI protocol manages the states of cache lines to maintain the consistency of data between caches and memory, preventing issues like stale data or data corruption.

The MESI protocol defines four states for each cache line in a cache, which help to control the flow of data in the system:

1. Modified (M)
   Description: The cache line is modified and contains data that differs from main memory: the cache holds an updated copy that has not yet been written back.
   Implication: The data is only in this cache; no other cache has a copy, so the owning processor has the only up-to-date data.
   Action: Before the cache line is evicted, it must be written back to main memory so memory has the updated value.
2. Exclusive (E)
   Description: The cache line is not modified and is exclusive to this cache; no other cache has the line, and it is identical to main memory.
   Implication: This cache has the only copy, not yet modified; a temporary state between a read miss and a potential write.
   Action: If the processor writes to the cache line, it transitions to the Modified (M) state.
3. Shared (S)
   Description: The cache line is shared among multiple caches, and the data is consistent with main memory; more than one processor has a copy, but none has modified it.
   Implication: The data is valid and up to date; any processor can read it, but none can write to it without a state change.
   Action: If one processor writes to the line, it transitions to Modified (M) in the writer’s cache and Invalid (I) in the others.
4. Invalid (I)
   Description: The cache line does not contain valid data; it may have been invalidated by another processor’s write or never filled.
   Implication: The processor must read from main memory or another cache to retrieve fresh data.
   Action: When needed, the line is refilled with data from memory.
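An illustrative walk-through (two cores, A and B, sharing line X): A reads X and holds it Exclusive; B then reads X, and both copies become Shared; A writes X, so A’s copy becomes Modified while a snooped invalidation marks B’s copy Invalid; if B reads X again, A supplies the up-to-date data (writing it back) and both copies return to Shared.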