Performance Analysis of Single-Cycle and Multi-Cycle Processors
1. Single-Cycle and Multi-Cycle Processor Implementations
Program A | Program B |
lw $6, 80($1) sw $3, 100($6) add $4, $3, $4 beq $3, $2, 50 add $5, $5, $5 lw $6, 400($5) | sub $3, $2, $1 beq $4, $3, 36 add $4, $3, $4 beq $3, $2, 20 add $5, $5, $5 beq $2, $3, 128 |
Two programs, A and B, shown in the above table, are going to be executed on two different machines. One of the machines is a single-cycle machine (Figure 1), and the other is a multi-cycle machine (Figure 2).
Assume that operational delay is 4ns for memory, 3ns for the instruction register, 4ns for registers (register file), 4ns for ALUs, and zero for all the other components in the architecture. Also, assume that none of the branches in programs A and B are taken.
4.a. Execution Time of Programs A and B on a Single-Cycle Machine
CPU Execution time = Instruction count x CPI x Clock Cycle Time
Clock-cycle time = delay produced by the longest instruction, i.e., lw, which uses all the functional units. So, clock-cycle time for a single cycle = 4+4+4+4+4 = 20 ns.
Each program uses 6 instructions, and the CPI is always one for a single-cycle machine.
CPU Execution time A = CPU Execution time B = 6 x 20 = 120 ns.
(NOTE: Clock cycle time depends on the longest instruction that the machine can perform. For this single-cycle machine, it is 20 ns for lw. The cycle time is not dependent on the program; for program B, though there are no lw instructions, the cycle time is still 20 ns.)
Program A | Program B | |
Execution Time (in ns) on single-cycle machine | 120 ns | 120 ns |
1.b. Execution Time of Programs A and B on a Multi-Cycle Machine
CPU Execution time = ∑ Instruction count type x CPI x Clock Cycle Time
Clock-cycle time = delay produced by the longest functional unit, i.e., 4 ns for memory or ALUs or registers.
CPI for R-type instructions: 4
CPI for sw: 4
CPI for lw: 5
CPI for beq: 3
Program A has 2 R-type, 1 sw, 2 lw & 1 beq = 4 x (2×4+1×4+2×5+1×3) = 100 ns
Program B has 3 R-type & 3 beq. = 4 x (3×4+3×3) = 84 ns
(NOTE: beq requires 3 cycles to be implemented – fetch, decode, and execution. In the EX stage, the ALU compares the two register contents.)
Program A | Program B | |
Execution Time (in ns) on multi-cycle machine | 100 ns | 84 ns |
1.c. Machine Performance Comparison and Speedup
Program A | Program B | |
Which machine is faster? (single-cycle or multi-cycle) | Multi-cycle | Multi-cycle |
Speedup of multi-cycle | 120 / 100 = 1.2 | 120 / 84 = 1.42 |
2. Processor Pipeline
Consider the following instructions executed on the pipelined processor of Figure 6.17 in the text (reprinted below).
sub $2, $1, $1
lw $1, 840($3)
add $3, $1, $2
beq $5, $3, 20
sw $2, 600($3)
2.a. Instruction Operands
Instruction | rd | rs | rt | Immediate operand (in decimal) |
sub $2, $1, $1 | $2 | $1 | $1 | X |
lw $1, 840($3) | X | $3 | $1 | 840 |
add $3, $1, $2 | $3 | $1 | $2 | X |
beq $5, $3, 20 | X | $5 | $3 | 20 |
sw $2, 600($3) | X | $3 | $2 | 600 |
Grading: 3points for trying, +2pt for each correct row in the table.
2.b. (9 points) Point out and classify (data or structural or control) the hazards in the above code.
Two data hazards:
- Dependency between second and third instructions because of $1
- Dependency between fourth and third instructions because of $3
One control hazard:
- Dependency between fifth and fourth instructions because execution of the fifth instruction depends on the result of the fourth instruction
Grading: 3 points for trying. For each hazard, +1pt if the dependency is identified correctly, +1pt if it is classified correctly.
2.c. (11 points) Suppose that each pipeline stage takes one unit of time. Also assume that a non-pipelined implementation takes 5 units of time per instruction. For the above code, calculate the speedup gained by the pipelined architecture compared to the non-pipelined implementation if the following assumptions hold:
- Instructions are stalled just sufficiently to avoid hazards;
- The “beq” on line 4 is not taken;
- The register file cannot be read and written in the same cycle.
Fill in this table to find the answer. The first row has been filled for you:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | |
sub $2, $1, $1 | F | D | E | M | W | |||||||||||||||
lw $1, 840($3) | F | D | E | M | W | |||||||||||||||
add $3, $1, $2 | F | D | D | D | D | E | M | W | ||||||||||||
beq $5, $3, 20 | F | D | D | D | D | E | M | W | ||||||||||||
sw $2, 600($3) | F | D | E | M | W |
No pipeline: 25 unit times
With pipeline: 15 unit times
Speed gain: 25/15 = 1.67
Grading: 11 for perfect answer, 3pt for trying, +3 points for correct execution time, +3pts for correct execution time with pipeline.
3. (33 points) Cache Systems. A computer has a 32-bit physical address space, and it is byte addressed. It has a direct-mapped 1MB cache with 32-byte blocks. The cache is mapped to physical addresses.
3.a. (3 points) How many bits are required for the block byte offset?
Block size = 32 bytes. So, 5 bits are needed to specify one byte within a 32-byte block.
Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.
3.b. (3 points) How many blocks does the cache hold?
Total cache size = 1 MB = 220 bytes.
1 block = 32 bytes.
So, number of blocks = 220/32 = 220/25 = 215 = 32 K blocks.
Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.
3.c. (3 points) How many bits are required for the cache index?
Since we have 32 K blocks, a direct mapped cache has one “row” per block.
Now, 215 = 32 K, so 15 bits are needed to specify a row. This is the cache index.
Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.
3.d. (5 points) How many bits are required for each tag?
We make ourselves a little diagram like the one below. We have a total of 32 bits to work with. The bits left over after allocating byte offset and index are the tag bits. Answer = 12 bits.
32 – 15 – 5 = 12 bits for Tag | 15 bits index (from part 2c) | 5 bits byte offset (from Part 2a) |
Grading: 3 points for clearly explained correct answer, 2 or 1 points for right idea but wrong answer, 0 for only incorrect answer.
3.e. (5 points) How many total bits of storage are required to build the cache? Assume that each cache entry has 1 valid bit, and 1 dirty bit.
Each “row” in our cache contains one block (32 bytes = 32 x 8 = 256 bits), plus 12 bits of tag, plus 1 valid bit, plus 1 dirty bit. This adds up to (256 + 12 + 1 + 1) = 270 bits.
We have 32K = 215 rows in our cache, so the total number of bits = 215 x 270 = 8847360 bits.
Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.
3.f. (3 points) What is the block byte offset for memory address C0C0 261016?
The byte offset is the rightmost (i.e., least significant) 5 bits. In this case, they are 100002 or 1016.
Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.
3.g. (3 points) What is the cache index for address C0C0 261016?
The cache index is the 15 bits to the left of the byte offset (see diagram above in part 2d). In this case, the answer is 0000010011000002 = 026016.
Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.
3.h. (3 points) What is the tag for address C0C0 261016?
The tag bits are the leftmost 12 bits (i.e., 3 hex digits, also known as “nibbles”). The answer, therefore, is 1100000011002 = C0C16.
Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.
3.i. (5 points) If the 1MB cache were organized as a 8-way set-associative cache of 16-byte blocks, what would the tag be for address C0C0 261016?
In this case, we have 1/8th as many rows to work with, compared to the direct-mapped cache. So the number of rows = 32K/8 = 4K. Since 4K = 212, we need 12 bits for cache index. That leaves us with 32 – 12 – 5 = 15 bits for the tags. For the given address, we pull out the leftmost 15 bits as the tag. So the answer is C0C016/2 = 110 0000 0110 0000 = 606016 = 24,67210.
Grading: 3 points for clearly explained correct answer, 1 or 2 points for right idea but wrong answer, 0 for only incorrect answer.