Performance Analysis of Single-Cycle and Multi-Cycle Processors

Posted on Jan 12, 2025 in Computers

1. Single-Cycle and Multi-Cycle Processor Implementations

Program A

Program B

lw $6, 80($1)

sw $3, 100($6)

add $4, $3, $4

beq $3, $2, 50

add $5, $5, $5

lw $6, 400($5)

sub $3, $2, $1

beq $4, $3, 36

add $4, $3, $4

beq $3, $2, 20

add $5, $5, $5

beq $2, $3, 128

Two programs, A and B, shown in the above table, are going to be executed on two different machines. One of the machines is a single-cycle machine (Figure 1), and the other is a multi-cycle machine (Figure 2).

Assume that operational delay is 4ns for memory, 3ns for the instruction register, 4ns for registers (register file), 4ns for ALUs, and zero for all the other components in the architecture. Also, assume that none of the branches in programs A and B are taken.

4.a. Execution Time of Programs A and B on a Single-Cycle Machine

CPU Execution time = Instruction count x CPI x Clock Cycle Time

Clock-cycle time = delay produced by the longest instruction, i.e., lw, which uses all the functional units. So, clock-cycle time for a single cycle = 4+4+4+4+4 = 20 ns.

Each program uses 6 instructions, and the CPI is always one for a single-cycle machine.

CPU Execution time A = CPU Execution time B = 6 x 20 = 120 ns.

(NOTE: Clock cycle time depends on the longest instruction that the machine can perform. For this single-cycle machine, it is 20 ns for lw. The cycle time is not dependent on the program; for program B, though there are no lw instructions, the cycle time is still 20 ns.)

	Program A	Program B
Execution Time (in ns) on single-cycle machine	120 ns	120 ns

1.b. Execution Time of Programs A and B on a Multi-Cycle Machine

CPU Execution time = ∑ Instruction count type x CPI x Clock Cycle Time

Clock-cycle time = delay produced by the longest functional unit, i.e., 4 ns for memory or ALUs or registers.

CPI for R-type instructions: 4

CPI for sw: 4

CPI for lw: 5

CPI for beq: 3

Program A has 2 R-type, 1 sw, 2 lw & 1 beq = 4 x (2×4+1×4+2×5+1×3) = 100 ns

Program B has 3 R-type & 3 beq. = 4 x (3×4+3×3) = 84 ns

(NOTE: beq requires 3 cycles to be implemented – fetch, decode, and execution. In the EX stage, the ALU compares the two register contents.)

	Program A	Program B
Execution Time (in ns) on multi-cycle machine	100 ns	84 ns

1.c. Machine Performance Comparison and Speedup

	Program A	Program B
Which machine is faster? (single-cycle or multi-cycle)	Multi-cycle	Multi-cycle
Speedup of multi-cycle	120 / 100 = 1.2	120 / 84 = 1.42

2. Processor Pipeline

Consider the following instructions executed on the pipelined processor of Figure 6.17 in the text (reprinted below).

sub $2, $1, $1

lw $1, 840($3)

add $3, $1, $2

beq $5, $3, 20

4UqEEV6lCF+cmcavKoSe2mUlfIVKe+7KnBoGlUqV

sw $2, 600($3)

2.a. Instruction Operands

Instruction	rd	rs	rt	Immediate operand (in decimal)
sub $2, $1, $1	$2	$1	$1	X
lw $1, 840($3)	X	$3	$1	840
add $3, $1, $2	$3	$1	$2	X
beq $5, $3, 20	X	$5	$3	20
sw $2, 600($3)	X	$3	$2	600

Grading: 3points for trying, +2pt for each correct row in the table.

2.b. (9 points) Point out and classify (data or structural or control) the hazards in the above code.

Two data hazards:

Dependency between second and third instructions because of $1
Dependency between fourth and third instructions because of $3

One control hazard:

Dependency between fifth and fourth instructions because execution of the fifth instruction depends on the result of the fourth instruction

Grading: 3 points for trying. For each hazard, +1pt if the dependency is identified correctly, +1pt if it is classified correctly.

2.c. (11 points) Suppose that each pipeline stage takes one unit of time. Also assume that a non-pipelined implementation takes 5 units of time per instruction. For the above code, calculate the speedup gained by the pipelined architecture compared to the non-pipelined implementation if the following assumptions hold:

Instructions are stalled just sufficiently to avoid hazards;
The “beq” on line 4 is not taken;
The register file cannot be read and written in the same cycle.

Fill in this table to find the answer. The first row has been filled for you:

sub $2, $1, $1

lw $1, 840($3)

add $3, $1, $2

beq $5, $3, 20

sw $2, 600($3)

No pipeline: 25 unit times

With pipeline: 15 unit times

Speed gain: 25/15 = 1.67

Grading: 11 for perfect answer, 3pt for trying, +3 points for correct execution time, +3pts for correct execution time with pipeline.

3. (33 points) Cache Systems. A computer has a 32-bit physical address space, and it is byte addressed. It has a direct-mapped 1MB cache with 32-byte blocks. The cache is mapped to physical addresses.

3.a. (3 points) How many bits are required for the block byte offset?

Block size = 32 bytes. So, 5 bits are needed to specify one byte within a 32-byte block.

Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.

3.b. (3 points) How many blocks does the cache hold?

Total cache size = 1 MB = 2²⁰ bytes.

1 block = 32 bytes.

So, number of blocks = 2²⁰/32 = 2²⁰/2⁵ = 2¹⁵ = 32 K blocks.

Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.

3.c. (3 points) How many bits are required for the cache index?

Since we have 32 K blocks, a direct mapped cache has one “row” per block.

Now, 2¹⁵ = 32 K, so 15 bits are needed to specify a row. This is the cache index.

Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.

3.d. (5 points) How many bits are required for each tag?

We make ourselves a little diagram like the one below. We have a total of 32 bits to work with. The bits left over after allocating byte offset and index are the tag bits. Answer = 12 bits.

32 – 15 – 5 = 12 bits for Tag

15 bits index

(from part 2c)

5 bits byte offset

(from Part 2a)

Grading: 3 points for clearly explained correct answer, 2 or 1 points for right idea but wrong answer, 0 for only incorrect answer.

3.e. (5 points) How many total bits of storage are required to build the cache? Assume that each cache entry has 1 valid bit, and 1 dirty bit.

Each “row” in our cache contains one block (32 bytes = 32 x 8 = 256 bits), plus 12 bits of tag, plus 1 valid bit, plus 1 dirty bit. This adds up to (256 + 12 + 1 + 1) = 270 bits.

We have 32K = 2¹⁵ rows in our cache, so the total number of bits = 2¹⁵x 270 = 8847360 bits.

Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.

3.f. (3 points) What is the block byte offset for memory address C0C0 2610₁₆?

The byte offset is the rightmost (i.e., least significant) 5 bits. In this case, they are 10000₂ or 10₁₆.

Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.

3.g. (3 points) What is the cache index for address C0C0 2610₁₆?

The cache index is the 15 bits to the left of the byte offset (see diagram above in part 2d). In this case, the answer is 000001001100000₂ = 0260₁₆.

Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.

3.h. (3 points) What is the tag for address C0C0 2610₁₆?

The tag bits are the leftmost 12 bits (i.e., 3 hex digits, also known as “nibbles”). The answer, therefore, is 110000001100₂ = C0C₁₆.

Grading: 2 points for clearly explained correct answer, 1 point for right idea but wrong answer, 0 for only incorrect answer.

3.i. (5 points) If the 1MB cache were organized as a 8-way set-associative cache of 16-byte blocks, what would the tag be for address C0C0 2610₁₆?

In this case, we have 1/8^thas many rows to work with, compared to the direct-mapped cache. So the number of rows = 32K/8 = 4K. Since 4K = 2¹², we need 12 bits for cache index. That leaves us with 32 – 12 – 5 = 15 bits for the tags. For the given address, we pull out the leftmost 15 bits as the tag. So the answer is C0C0₁₆/2 = 110 0000 0110 0000 = 6060₁₆ = 24,672₁₀.

Grading: 3 points for clearly explained correct answer, 1 or 2 points for right idea but wrong answer, 0 for only incorrect answer.

Performance Analysis of Single-Cycle and Multi-Cycle Processors

1. Single-Cycle and Multi-Cycle Processor Implementations

4.a. Execution Time of Programs A and B on a Single-Cycle Machine

1.b. Execution Time of Programs A and B on a Multi-Cycle Machine

1.c. Machine Performance Comparison and Speedup

2. Processor Pipeline

2.a. Instruction Operands

Recent Notes

Subjects

Publicidad