Memory Interleaving and Cache Memory Optimization

(2.4) Interleaved Memory: Idea: divide the memory into independent modules so that several words in different modules can be accessed simultaneously (interleaving). • Efficiency condition: memory references must be distributed evenly among the modules, according to some distribution key. • Ideal situation: the memory bandwidth is multiplied by the number of modules. • Total memory = N = 2^n words / / Number of modules = M = 2^m. Schemes – there are 2 schemes, and with either of them M words can be obtained in parallel on each memory access. A memory conflict occurs when several requests need access to the same module, which reduces the effective access speed. Memory conflicts are more frequent with higher-order interleaving because programs tend to access memory sequentially. In multiprocessor systems, higher-order interleaving can be better than lower-order interleaving when the tasks are disjoint or interact little (which is not always true). Lower-order interleaving is the most commonly used.

1) Higher-Order Interleaving (Consecutive): Physical address: n bits. • The i-th module holds the consecutive addresses from i·2^(n-m) to (i+1)·2^(n-m) − 1. • The m most significant bits identify the module, and the remaining n−m bits are the offset within the module. • Advantages: expandability and reliability (a failure is restricted to a localized area of the address space).
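As a minimal sketch (not part of the notes), the module/offset split for higher-order interleaving can be written directly on the physical address; the values n = 16 and m = 2 are example assumptions.

#include <stdio.h>

/* Higher-order (consecutive) interleaving: the m most significant bits of
 * the n-bit physical address select the module, the remaining n-m bits
 * are the offset inside that module.  n = 16, m = 2 are example values. */
enum { N_BITS = 16, M_BITS = 2 };

static unsigned ho_module(unsigned addr) { return addr >> (N_BITS - M_BITS); }
static unsigned ho_offset(unsigned addr) { return addr & ((1u << (N_BITS - M_BITS)) - 1); }

int main(void) {
    unsigned addr = 0xC123;                     /* arbitrary 16-bit address */
    printf("addr 0x%04X -> module %u, offset 0x%X\n",
           addr, ho_module(addr), ho_offset(addr));
    return 0;
}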

2) Lower-Order Interleaving (Cyclic): the i-th module stores the addresses of the form k·M + i, with k = 0, 1, …, 2^(n-m) − 1 (spacing of M between them). • The m least significant bits select the module, and the remaining bits give the position within the module. (2a) With latches at the output: on each access, M consecutive words are read -> k·M + i with i = 0, 1, …, M − 1. They are stored in latches and transferred serially through a MUX, while the next M words can be read in the following cycle -> look-ahead mechanism. In the best case, the effective access time is divided by M. Ideal for sequential memory access; low efficiency in non-sequential programs (jumps), for which systems with latches at the input can be designed. (2b) With latches at the input: each module can receive a different relative address. A memory controller is needed, and requests are processed sequentially one by one; if the requested module is still busy with a previous request held in its latch, the new request is delayed.
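The corresponding sketch for lower-order interleaving, again with assumed values m = 2 (M = 4 modules), also shows how one access with output latches covers M consecutive addresses in different modules.

#include <stdio.h>

/* Lower-order (cyclic) interleaving: the m least significant bits select
 * the module, the remaining bits give the position within the module.
 * Address k*M + i lives in module i.  m = 2 (M = 4) are example values. */
enum { M_BITS = 2, M = 1 << M_BITS };

static unsigned lo_module(unsigned addr)   { return addr & (M - 1); }
static unsigned lo_position(unsigned addr) { return addr >> M_BITS; }

int main(void) {
    /* One access with output latches: the M consecutive words k*M + i,
     * i = 0..M-1, come from M different modules in parallel. */
    unsigned base = 0x0040;                     /* multiple of M */
    for (unsigned i = 0; i < M; i++) {
        unsigned a = base + i;
        printf("addr 0x%04X -> module %u, position %u\n",
               a, lo_module(a), lo_position(a));
    }
    return 0;
}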

(T.3) Cache Memory – Objective: to give the illusion of a fast and unlimited memory. • Principle of locality: programs access a relatively small part of their address space at any given time. -> a) Temporal locality: if an element is referenced, it will tend to be referenced again soon (loops). b) Spatial locality: if an item is referenced, items with nearby addresses will tend to be referenced soon. • By exploiting the principle of locality, memory is implemented as a hierarchy: the user is presented with as much memory as the cheapest technology allows, while access is provided at the speed of the fastest memory. • In our memory hierarchy, the highest level will be the cache memory; the registers are left aside. A level closer to the processor is a subset of the levels further away, and all data is stored at the lowest level, with the information placed at each level according to its probability of use.

(3.1) Introduction to the Cache: the term is due to Conti, Pitkowsky and Gibson (1968). First commercial computer with a cache: IBM 360/85 (1968). • The cache is a small, fast memory located between the processor and main memory. • It stores the information currently in use from main memory. • Cache management is similar to memory management. • The virtually certain success of the cache (its high hit rate) justifies its use. • BLOCK (cache line): minimum unit of information transfer between the upper and lower levels. • Hit: the required information is found in the upper level. • Miss: the information is not in the upper level and an access to the lower level is needed. • Hit rate: fraction of memory accesses found in the upper level. • Miss rate: fraction of memory accesses not found in the upper level (1 − hit rate). • Miss penalty: time required to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver the block to the processor. It is divided into: a) access time -> time to access the first word of the block on a miss; b) transfer time -> additional time to transfer the remaining words of the block. • Hit time < miss penalty.

(3.2) Operation of a Cache System: a block in the cache is called a cache line. Structure of a cache: Directory -> contains information on the lines held in the storage area. Storage area -> contains copies of some lines of main memory. • The directory contains a set of tags used to identify whether a word in the cache corresponds to the word being searched for (part of the address), plus some way to recognize whether a cache line holds valid information (validity bit: a 0 indicates invalid information). Location of lines – 3 categories of cache organization (n lines):

1) Direct Mapping: each line has a unique place where it can reside in the cache (address fields -> tag, cache line index (log2 n bits), word within the line, byte within the word). • 1 way, n sets.

2) Fully Associative: each line can be placed anywhere in the cache (address fields -> tag, word within the line, byte within the word). • n ways, 1 set.

3) Set Associative (m-way): each line maps to a set of m cache lines (number of sets = n / m = c) • address fields -> tag, set index (log2 c bits), word within the line, byte within the word (see the address-split sketch below).
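A minimal sketch of the address breakdown common to the three organizations: with m = 1 it is direct mapping, and with a single set it is fully associative. The cache sizes used are assumptions for illustration, not values from the notes.

#include <stdio.h>

/* Address breakdown for an m-way set-associative cache (m = 1 gives direct
 * mapping; a single set gives fully associative).  Sizes are assumptions. */
enum {
    LINE_BYTES  = 64,                 /* bytes per cache line        */
    NUM_LINES   = 256,                /* n: total lines in the cache */
    WAYS        = 4,                  /* m: associativity            */
    NUM_SETS    = NUM_LINES / WAYS,   /* c = n / m = 64 sets         */
    OFFSET_BITS = 6,                  /* log2(LINE_BYTES)            */
    INDEX_BITS  = 6                   /* log2(NUM_SETS)              */
};

static unsigned field_offset(unsigned addr) { return addr & (LINE_BYTES - 1); }
static unsigned field_index(unsigned addr)  { return (addr >> OFFSET_BITS) & (NUM_SETS - 1); }
static unsigned field_tag(unsigned addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    unsigned addr = 0x12345678;
    printf("tag 0x%X, set %u, byte offset %u\n",
           field_tag(addr), field_index(addr), field_offset(addr));
    return 0;
}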

Search for a line – Index: selects a set of the cache. • Tag: compared simultaneously with all the tags of the selected set. • The validity bit indicates whether the entry contains a valid address. • Increasing the associativity increases the size of the tag.

Replacement of lines – a criterion is needed for choosing which cache line to replace. • In direct mapping there is only one candidate. • In set associative or fully associative caches there is a choice among the lines of a set or among all the lines of the cache, respectively. • Main strategies: 1) Random: makes a random choice (easy to implement). 2) Least Recently Used (LRU): the line that has not been used for the longest time is replaced.

Write operation – 2 basic options for writing to the cache: 1) Write-through: the information is written both to the cache and to the lower-level memory. Advantages: easy to implement (CPU stalls on writes can be hidden with a write buffer), read misses are less costly, and the lower-level memory is always up to date. 2) Write-back: the write is done only to the cache, and the line is written to memory when it is replaced. Advantages: writes proceed at cache speed, and multiple writes to a line require only one write of the complete line to memory, making effective use of the memory bandwidth. • Modified (dirty) bit: a status bit used in write-back to indicate whether a line has been changed. It reduces the frequency of writes to memory on line replacements: if the line is unchanged, it does not need to be written back. • Options on a write miss: 1) Write allocate (fetch on write): the line is loaded into the cache and then the write is performed (usual with write-back). Subsequent writes to the line do not cause misses. 2) No-write allocate (write around): the line is modified in the lower level and not loaded into the cache (usual with write-through). Subsequent writes to the line will still go to memory.

(3.3) Cache Performance: AverageMemoryAccessTime = HitTime + (MissRate x MissPenalty) • CPUTime = (CPUExecutionCycles + MemoryStallCycles) x ClockCycleTime • For simplicity, we assume that all stalls are due to the cache and that the CPU execution cycles include the time for hits -> MemoryStallCycles = [(Reads / Program) x ReadMissRate x ReadMissPenalty] + [(Writes / Program) x WriteMissRate x WriteMissPenalty] • Combining reads and writes and simplifying -> MemoryStallCycles = (MemoryAccesses / Program) x MissRate x MissPenalty • Factoring out the instruction count (IC) -> CPUTime = IC x [CPIExecution + (MemoryAccesses / Instruction) x MissRate x MissPenalty] x ClockCycleTime • Expressing the miss rate as misses per instruction -> CPUTime = IC x [CPIExecution + (Misses / Instruction) x MissPenalty] x ClockCycleTime • The behavior of the cache has a great influence on performance. In a CPU with a low CPI and a fast clock the impact is twofold: 1) the lower the CPI, the greater the relative impact of the stall cycles; 2) for two computers with identical memories, the CPU with the higher clock frequency spends more clock cycles on each miss. • In the equations above we have assumed that the hit time is not a determining factor in performance. Techniques for improving cache performance aim to minimize the AverageMemoryAccessTime; however, the ultimate goal is to reduce CPU time. A worked numerical example follows.
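The following evaluates the formulas above with illustrative numbers; all parameters (hit time, miss rate, miss penalty, CPI, accesses per instruction) are assumptions, not values given in the notes.

#include <stdio.h>

/* Worked example of AverageMemoryAccessTime and effective CPI.
 * All numeric parameters below are illustrative assumptions. */
int main(void) {
    double hit_time           = 1.0;   /* cycles                              */
    double miss_rate          = 0.05;  /* 5% of memory accesses miss          */
    double miss_penalty       = 40.0;  /* cycles to bring the line from below */
    double cpi_execution      = 1.2;   /* CPI without memory stalls           */
    double accesses_per_instr = 1.3;   /* instruction fetch + data references */

    double amat = hit_time + miss_rate * miss_penalty;
    double cpi  = cpi_execution + accesses_per_instr * miss_rate * miss_penalty;

    printf("AverageMemoryAccessTime = %.2f cycles\n", amat);  /* 1 + 0.05*40 = 3.00 */
    printf("Effective CPI           = %.2f\n", cpi);          /* 1.2 + 2.6  = 3.80 */
    return 0;
}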

(3.4) Sources of Misses: A) Compulsory: the first access to a line can never find it in the cache (also called ‘cold start misses’ or ‘first reference misses’). B) Capacity: the cache does not have sufficient capacity to contain all the lines required during program execution, so lines are discarded and later retrieved. C) Conflict: in set associative or direct mapped caches, a line is discarded and later retrieved because too many lines map to its set (also called ‘collision misses’); a small sketch follows this paragraph. • Limitations of this model: it gives a view of the average behavior, does not explain individual misses, and ignores the replacement policy. • Many techniques that reduce the miss rate can increase the hit time or the miss penalty.
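A minimal simulation of a conflict miss, under assumed sizes: two line addresses that share an index in a tiny direct mapped cache evict each other on every access, even though most of the cache is empty.

#include <stdio.h>
#include <stdbool.h>

/* Conflict (collision) misses in a direct mapped cache: lines 3 and 11 both
 * map to set 3 (11 % 8 == 3) and keep evicting each other.  Sizes assumed. */
enum { NUM_SETS = 8 };

static unsigned stored_tag[NUM_SETS];
static bool     valid[NUM_SETS];

static bool access_line(unsigned line_addr) {      /* returns true on a hit */
    unsigned set = line_addr % NUM_SETS;
    unsigned tag = line_addr / NUM_SETS;
    if (valid[set] && stored_tag[set] == tag)
        return true;
    valid[set] = true;                              /* miss: fill the line */
    stored_tag[set] = tag;
    return false;
}

int main(void) {
    unsigned trace[] = { 3, 11, 3, 11, 3, 11 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("line %2u -> %s\n", trace[i], access_line(trace[i]) ? "hit" : "miss");
    return 0;                                       /* every access misses */
}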

(3.5) Caches for Data and Instructions: unified (mixed) caches contain both instructions and data and can become a bottleneck. • The CPU knows whether it is issuing an instruction address or a data address, so separate ports, and separate caches, can be used. Advantages of separate caches: 1) almost double the bandwidth between the CPU and the memory hierarchy; 2) each cache can be optimized separately; 3) conflicts between instruction and data misses are eliminated. Disadvantages of separate caches: 1) fixed partition of space between instructions and data -> possibly worse hit rate; 2) consistency problems: instructions and data in the same line, self-modifying code.

(3.6) Miss Rate Reduction: Increase the line size – reduces compulsory misses, can increase capacity misses, and increases the miss penalty; the choice of line size depends on the latency and bandwidth of the lower level of the memory hierarchy. Increase associativity – reduces conflict misses. In practice, an 8-way set associative cache is as effective as a fully associative one, and a direct mapped cache has approximately the same miss rate as a 2-way set associative cache of half the size. The hit time can increase (lengthening the clock cycle). Victim cache – a small fully associative cache that keeps the lines evicted from the cache on a miss; if an evicted line is required again, it is exchanged with a line of the cache. Especially useful for small direct mapped data caches. Pseudo-associative cache – on a miss, before going to the lower level of the memory hierarchy, another cache line (the ‘pseudo-set’) is checked. There are two hit times; lines must be placed correctly so as not to degrade performance, and it can complicate the design of pipelined CPUs. Prefetching of instructions and data – the goal is to overlap the prefetch with execution. Performance may be reduced if the prefetch interferes with demand misses. It can be done in 2 ways: 1) Hardware: directly into the cache or into an external buffer. 2) Controlled by the compiler: the prefetched data can be placed in a register or in the cache. Compile-time optimization – by rearranging the code or relocating the data, the miss rate can be reduced. Examples of techniques: merging of arrays, loop interchange, loop fusion, or blocking (see the loop-interchange sketch below).
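An illustrative compile-time optimization (loop interchange): since C stores arrays row-major, making the column index the inner loop walks memory with stride 1 and exploits spatial locality within cache lines. The array size is an arbitrary example.

/* Loop interchange: same computation, different traversal order. */
enum { N = 512 };
static double a[N][N];

void poor_locality(void) {          /* inner loop strides by N doubles */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] *= 2.0;
}

void good_locality(void) {          /* inner loop strides by 1 double */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] *= 2.0;
}

int main(void) { good_locality(); return 0; }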

(3.7) Miss Penalty Reduction: Give priority to reads over writes – in a write-through cache: add a write buffer of adequate size; the complication is that the updated value of a location may still be in the write buffer when a read miss occurs, so either wait until the buffer is emptied or check the contents of the write buffer (see the sketch below). • In write-back: add a buffer to hold the modified (dirty) block being replaced; the CPU is stalled only if a new miss arrives before the buffer has been emptied. Sub-block placement (in direct mapped caches) – each sub-block has its own validity bit. This reduces the miss penalty, reduces the size of the tag, and helps on writes by always writing the word.
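A minimal sketch of the “check the write buffer on a read miss” option mentioned above for write-through caches: before the miss goes to memory, the pending writes are searched for the requested address. The structure and its size are assumptions for illustration.

#include <stdio.h>
#include <stdbool.h>

/* Pending writes waiting to reach the lower-level memory.  Size assumed. */
enum { WB_ENTRIES = 4 };

struct wb_entry { unsigned addr; unsigned data; bool valid; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* Returns true (and the data) if the missing address is still pending in
 * the write buffer; otherwise the read must go to the lower level. */
static bool read_miss_in_buffer(unsigned addr, unsigned *data) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data = write_buffer[i].data;
            return true;
        }
    return false;
}

int main(void) {
    write_buffer[0] = (struct wb_entry){ .addr = 0x100, .data = 42, .valid = true };
    unsigned value;
    if (read_miss_in_buffer(0x100, &value))
        printf("read miss satisfied from the write buffer: %u\n", value);
    return 0;
}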