Cache prefetching

Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before they are actually needed (hence the term 'prefetch').[1] Most modern computer processors have fast local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, cache memories are typically much faster to access than main memory, so prefetching data and then accessing it from the cache is usually one to two orders of magnitude faster than accessing it directly from main memory.

Metrics of cache prefetching

There are three main metrics for judging cache prefetching:[2]

Coverage

Coverage is the fraction of total misses that are eliminated because of prefetching, i.e.

    Coverage = (cache misses eliminated by prefetching) / (total cache misses),

where the total cache misses are the misses the program would incur without any prefetching.
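
For example (with illustrative numbers, not taken from the cited source): if a program would incur 100 cache misses without prefetching and the prefetcher eliminates 60 of them, the coverage is 60/100 = 60%.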

Accuracy

Accuracy is the fraction of total prefetches that were useful, i.e. the ratio of the number of prefetched memory addresses that were actually referenced by the program to the total number of prefetches issued.
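
Written as a ratio, following the definition above:

    Prefetch accuracy = (prefetched addresses actually referenced) / (total prefetches issued)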

While it may appear that perfect accuracy implies there are no misses, this is not the case. The prefetches themselves can result in new misses if the prefetched blocks are placed directly into the cache and evict useful lines. Although these may be a small fraction of the total number of misses seen without any prefetching, their number is non-zero.

Timeliness

The qualitative definition of timeliness is how early a block is prefetched versus when it is actually referenced. An example to further explain timeliness is as follows:

Consider a for loop where each iteration takes 3 cycles to execute and the 'prefetch' operation takes 12 cycles. This implies that for the prefetched data to be useful, we must start the prefetch 12/3 = 4 iterations prior to its usage to maintain timeliness.
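
In general (a common rule of thumb rather than a formula from the cited source), the prefetch must be issued at least

    (prefetch latency) / (cycles per iteration)

iterations ahead of the use, which for the numbers above gives 12/3 = 4.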

Types of cache prefetching

There are two main types of cache prefetching: hardware based and software based.[2]

Hardware based prefetching is typically accomplished by having a dedicated hardware mechanism in the processor that watches the stream of instructions or data being requested by the executing program, recognizes the next few elements that the program might need based on this stream, and prefetches them into the processor's cache.[3]

Software based prefetching is typically accomplished by having the compiler analyze the code and insert additional "prefetch" instructions in the program during compilation itself.[4]

Methods of hardware prefetching

Stream buffers

A typical stream buffer setup as originally proposed.[6]

Stream buffers were proposed as a way to prefetch sequential streams of cache lines without polluting the cache.[6] On a cache miss, a stream buffer is allocated and begins fetching the cache lines that sequentially follow the missed line, holding them in a small FIFO queue separate from the cache. The head of the buffer is checked in parallel with the cache lookup; on a hit in the buffer, the line is moved into the cache and the buffer continues fetching further down the stream.
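
The following is a minimal sketch, in C, of the allocate-on-miss and head-only lookup behaviour of a single stream buffer. The sizes, names and single-buffer simplification are illustrative assumptions, not details taken from [6]:

#include <stdbool.h>
#include <stdint.h>

#define SB_DEPTH  4   /* entries per buffer; the original proposal used four */
#define LINE_SIZE 64  /* assumed cache line size in bytes */

/* One stream buffer: a FIFO of sequential cache line addresses. */
typedef struct {
    uintptr_t line[SB_DEPTH];
    bool      valid[SB_DEPTH];
    int       head;             /* oldest entry, the only one compared */
} StreamBuffer;

/* On a cache miss to miss_line, (re)allocate the buffer so it starts
   fetching the lines that sequentially follow the missed one. */
void sb_allocate(StreamBuffer *sb, uintptr_t miss_line)
{
    for (int i = 0; i < SB_DEPTH; i++) {
        sb->line[i]  = miss_line + (uintptr_t)(i + 1) * LINE_SIZE;
        sb->valid[i] = true;    /* in hardware, set as each fetch completes */
    }
    sb->head = 0;
}

/* Probed in parallel with the cache lookup. On a hit, the line is moved
   into the cache and the freed slot is reused to fetch the next line
   further down the stream. */
bool sb_lookup(StreamBuffer *sb, uintptr_t line_addr)
{
    if (sb->valid[sb->head] && sb->line[sb->head] == line_addr) {
        sb->line[sb->head] += (uintptr_t)SB_DEPTH * LINE_SIZE;
        sb->head = (sb->head + 1) % SB_DEPTH;
        return true;
    }
    return false;
}

The original design also allowed several such buffers in parallel (multi-way stream buffers), so that interleaved streams do not keep flushing a single buffer.[6]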

Another pattern of prefetching is to prefetch addresses that are s addresses ahead in the sequence. It is mainly used when the consecutive blocks to be prefetched are s addresses apart, where s is known as the stride.[2] This is termed strided prefetching; a sketch of how hardware can detect such strides follows below.
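
A minimal sketch, in C, of the per-instruction stride detection used by hardware strided prefetchers (in the spirit of the reference prediction tables of [7]); the table size, indexing and names are illustrative assumptions:

#include <stdint.h>

#define RPT_SIZE 256  /* number of table entries (assumed) */

/* One entry of a reference prediction table, indexed by load PC. */
typedef struct {
    uintptr_t last_addr;  /* address of this load's previous access */
    intptr_t  stride;     /* last observed delta between accesses */
    int       confirmed;  /* nonzero once the same delta repeats */
} RptEntry;

static RptEntry rpt[RPT_SIZE];

/* Called on every load with its program counter and data address.
   Returns the address to prefetch, or 0 when no stable stride yet. */
uintptr_t stride_prefetch(uintptr_t pc, uintptr_t addr)
{
    RptEntry *e = &rpt[pc % RPT_SIZE];
    intptr_t delta = (intptr_t)(addr - e->last_addr);

    e->confirmed = (delta == e->stride);  /* same delta as last time? */
    e->stride    = delta;
    e->last_addr = addr;

    /* Prefetch only a confirmed stride, to avoid useless prefetches
       on irregular access patterns. */
    return e->confirmed ? addr + (uintptr_t)delta : 0;
}

Indexing by the load's program counter gives each instruction its own stride, so several arrays traversed with different strides can be prefetched simultaneously.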

Methods of software prefetching

Compiler directed prefetching

Compiler directed prefetching is widely used within loops with a large number of iterations. In this technique, the compiler predicts future cache misses and inserts a prefetch instruction based on the miss penalty and execution time of the instructions.

These prefetches are non-blocking memory operations, i.e. these memory accesses do not interfere with actual memory accesses. They do not change the state of the processor or cause page faults.

One main advantage of software prefetching is that it reduces the number of compulsory cache misses.[2]

The following example shows how a prefetch instruction can be added into code to improve cache performance.

Consider a for loop as shown below:

for (int i=0; i<1024; i++) {
    array1[i] = 2 * array1[i];
}

At each iteration, the ith element of the array "array1" is accessed. Therefore, we can prefetch the elements that are going to be accessed in future iterations by inserting a "prefetch" instruction as shown below:

for (int i=0; i<1024; i++) {
    prefetch(&array1[i + k]);    /* fetch the element needed k iterations ahead */
    array1[i] = 2 * array1[i];
}

Here, the prefetch distance k depends on two factors: the cache miss penalty and the time it takes to execute a single iteration of the for loop. For instance, if one iteration of the loop takes 7 cycles to execute and the cache miss penalty is 49 cycles, then we should have k = 49/7 = 7, which means that we prefetch 7 elements ahead. With the first iteration, i will be 0, so we prefetch the 7th element. With this arrangement, the first 7 accesses (i = 0 to 6) will still be misses, but that cannot be avoided.
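
For reference, the following is a compilable version of the loop above, assuming GCC or Clang, where prefetching is exposed through the __builtin_prefetch intrinsic (the function name scale_array and the bounds check are additions for illustration):

#include <stddef.h>

void scale_array(int *array1, size_t n)
{
    const size_t k = 7;  /* prefetch distance: 49-cycle miss penalty / 7 cycles per iteration */
    for (size_t i = 0; i < n; i++) {
        if (i + k < n)                           /* stay within the array */
            __builtin_prefetch(&array1[i + k]);  /* non-faulting read hint */
        array1[i] = 2 * array1[i];
    }
}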

Comparison of hardware and software prefetching

While software prefetching requires programmer or compiler intervention, hardware prefetching requires special hardware mechanisms in the processor.[2] Software prefetching works well only with loops where there is regular array access, since the prefetch instructions must be inserted at compile time, whereas hardware prefetchers work dynamically based on the program's behavior at runtime.[2] Hardware prefetching also has less CPU overhead than software prefetching, whose prefetch instructions consume fetch and execute bandwidth.[2]

References

  1. Smith, Alan Jay (1982-09-01). "Cache Memories". ACM Comput. Surv. 14 (3): 473–530. doi:10.1145/356887.356892. ISSN 0360-0300.
  2. Solihin, Yan (2016). Fundamentals of parallel multicore architecture. Boca Raton, FL: CRC Press, Taylor & Francis Group. p. 163. ISBN 1482211181.
  3. Baer, Jean-Loup; Chen, Tien-Fu (1991-01-01). "An Effective On-chip Preloading Scheme to Reduce Data Access Penalty". Proceedings of the 1991 ACM/IEEE Conference on Supercomputing. Supercomputing '91. New York, NY, USA: ACM: 176–186. doi:10.1145/125826.125932. ISBN 0897914597.
  4. Porterfield, Allan Kennedy (1989-01-01). Software methods for improvement of cache performance on supercomputer applications (Thesis). Rice University.
  5. Mittal, Sparsh (2016-08-01). "A Survey of Recent Prefetching Techniques for Processor Caches". ACM Comput. Surv. 49 (2): 35:1–35:35. doi:10.1145/2907071. ISSN 0360-0300.
  6. 1 2 3 "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers".
  7. Chen, Tien-Fu; Baer, Jean-Loup (1995-05-01). "Effective hardware-based data prefetching for high-performance processors". IEEE Transactions on Computers. 44 (5): 609–623. doi:10.1109/12.381947. ISSN 0018-9340.
  8. Palacharla, S.; Kessler, R. E. (1994-01-01). "Evaluating Stream Buffers As a Secondary Cache Replacement". Proceedings of the 21st Annual International Symposium on Computer Architecture. ISCA '94. Los Alamitos, CA, USA: IEEE Computer Society Press: 24–33. doi:10.1145/191995.192014. ISBN 0818655100.
  9. "Storage Efficient Hardware Prefetching using Delta-Correlating Prediction Tables".
  10. Callahan, David; Kennedy, Ken; Porterfield, Allan (1991-01-01). "Software Prefetching". Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS IV. New York, NY, USA: ACM: 40–52. doi:10.1145/106972.106979. ISBN 0897913809.