we run a simple benchmark accessing a constant number (4,096) of addresses with an increasing stride, i.e., distance, between the accessed addresses.
We implemented this benchmark by iterating through an array, chasing a pointer. The array is filled with structs, so that following the pointers of the elements creates a cycle through the complete array. Structs are data structures that allow the creation of user-defined aggregate data types grouping multiple individual variables together. Each struct consists of a pointer and an additional data attribute realizing the padding in memory, so that following the pointer-chained list results in memory accesses with the desired stride.
In the case of a sequential array, the pointer of element i points to element i + 1, and the pointer of the last element references the first element so that the cycle is closed. In the case of a random array, the pointer of each element points to a random element of the array, while ensuring that every element is referenced exactly once.
Figure 8.1 outlines the created sequential and random array.
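The following C sketch shows one way such a benchmark could be implemented. The struct layout follows the description above, but all names (element_t, PAD, link_sequential, link_random, chase) and the Fisher-Yates shuffle used for the random layout are our own illustration, not taken from the original benchmark. Timing the chase call while varying PAD (and hence the stride) yields curves like those in Fig. 8.2.

```c
#include <stdio.h>
#include <stdlib.h>

enum { N = 4096 };          /* constant number of accessed addresses      */
#define PAD 56              /* padding bytes; stride = sizeof(element_t)  */
                            /* 8-byte pointer + 56 bytes = 64-byte stride */
typedef struct element {
    struct element *next;   /* pointer forming the cycle          */
    char padding[PAD];      /* realizes the stride in memory      */
} element_t;

/* sequential layout: element i points to i + 1, last closes the cycle */
static void link_sequential(element_t *a) {
    for (size_t i = 0; i < N; i++)
        a[i].next = &a[(i + 1) % N];
}

/* random layout: one cycle in shuffled order, so every element
 * is referenced exactly once */
static void link_random(element_t *a) {
    size_t perm[N];
    for (size_t i = 0; i < N; i++) perm[i] = i;
    for (size_t i = N - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        a[perm[i]].next = &a[perm[(i + 1) % N]];
}

/* follow the pointer chain once through all N elements; time this */
static element_t *chase(element_t *p) {
    for (size_t i = 0; i < N; i++)
        p = p->next;
    return p;
}

int main(void) {
    element_t *a = malloc(N * sizeof *a);
    link_sequential(a);                 /* or link_random(a)          */
    printf("%p\n", (void *)chase(a));   /* print to keep the loop live */
    free(a);
    return 0;
}
```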
If the assumption holds and random memory access costs are constant, then the size of the padding in the array and the array layout (sequential or random) should make no difference when iterating over the array. Figure 8.2 shows the result of iterating through a list of 4,096 elements while following the pointers inside the elements and increasing the padding between the elements. As we can clearly see, the access costs are not constant and increase with an increasing stride. We also see multiple points of discontinuity in the curves; e.g., the random access times increase steeply up to a stride of 64 bytes and then continue increasing with a smaller slope.
Figure 8.3 indicates that an increasing number of cache misses causes the increase in access times. The first point of discontinuity in Fig. 8.2 corresponds almost exactly to the size of the cache lines of the test system. The steep increase is due to the fact that, with a stride smaller than 64 bytes, multiple list elements are located on one cache line, and the overhead of loading one line is amortized over these elements. For example, with a 16-byte stride, four elements share one 64-byte cache line, so only every fourth access has to load a new line.
Fig. 8.1 Sequential versus random array layout (each element consists of a pointer and padding; the pointers form a cycle through the array)
Fig. 8.2 Cycles for cache accesses with increasing stride (CPU cycles per element versus stride in bytes, 2^0 to 2^18, for sequential and random access; the cache line size and the page size are marked)
For strides greater than 64 bytes, we would expect one cache miss for every single list element and no further increase in access times. However, as the stride gets larger, the array is spread over multiple pages in memory and more TLB misses occur, as the virtual addresses on the new pages have to be translated into physical addresses. The number of TLB misses increases up to the page size of 4 KB and then stays at its worst case of one miss per element. With strides greater than the page size, the TLB misses can induce additional cache misses when translating a virtual to a physical address. These cache misses are due to accesses to the paging structures, which reside in main memory [BCR10, SS95].
To summarize, the performance of main memory accesses can differ greatly depending on the access patterns. In order to improve application performance, main memory access should be optimized to exploit the caches.
8.1.2 The Size Experiment
In a second experiment, we access a constant number of addresses in main memory with a constant stride of 64 bytes and vary the size of the working set, i.e., the accessed area in memory. A run with n memory accesses and a working set size of s bytes iterates n / (s/64) = 64n/s times through the array, which is created as described earlier for the stride experiment in Sect. 8.1.1.
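To make the iteration count concrete, the following fragment continues the sketch above (the function name and parameters are our own illustration; it assumes sizeof(element_t) == 64, i.e., PAD = 56 with 8-byte pointers). Timing this function for varying s produces curves like those in Fig. 8.4.

```c
/* perform n memory accesses over a working set of s bytes: one pass
 * through the cyclic array touches s/64 elements, so n accesses need
 * n / (s/64) = 64n/s passes */
static element_t *size_experiment(element_t *a, size_t n, size_t s) {
    size_t elements = s / 64;          /* 64-byte elements in the set */
    size_t passes   = n / elements;    /* = 64n/s full iterations     */
    element_t *p = a;
    for (size_t pass = 0; pass < passes; pass++)
        for (size_t i = 0; i < elements; i++)
            p = p->next;               /* each pass touches s bytes   */
    return p;
}
```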
Figure 8.4a shows that the access costs differ by up to a factor of 100, depending on the working set size. The points of discontinuity correlate with the sizes of the caches in the system. As long as the working set is smaller than the L1 cache, only the first iteration results in cache misses, and all other accesses can be answered out of the cache. As the working set size increases, the accesses in one iteration start to evict the earlier accessed addresses, resulting in cache misses in the next iteration.
Fig. 8.3 Cache misses for cache accesses with increasing stride. (a) Sequential access. (b) Random access
Figure 8.4b shows the individual cache misses with increasing working set sizes. Up to working sets of 32 KB, the L1 cache misses rise to one per element; the L2 cache misses reach their plateau at the L2 cache size of 256 KB, and the L3 cache misses at 12 MB.
As we can see, the larger the accessed area in main memory, the more capacity cache misses occur, resulting in poorer application performance. Therefore, it is advisable to process data in cache-friendly chunks if possible, as the sketch below illustrates.
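As a purely illustrative sketch (the function, the CHUNK constant, and the 32 KB block size are our assumptions, not from the text), a computation that needs two passes over the data can perform both passes block by block, so that the second pass still finds the block resident in cache:

```c
#include <stddef.h>

#define CHUNK (32 * 1024 / sizeof(double))   /* roughly L1-sized block */

/* two passes over the data, done chunk by chunk: pass 1 loads the
 * block's cache lines, pass 2 re-reads them while they are still hot */
double process(const double *data, size_t len) {
    double acc = 0.0;
    for (size_t start = 0; start < len; start += CHUNK) {
        size_t end = start + CHUNK < len ? start + CHUNK : len;
        double local = 0.0;
        for (size_t i = start; i < end; i++)   /* pass 1: warms cache */
            local += data[i];
        for (size_t i = start; i < end; i++)   /* pass 2: hits cache  */
            local += data[i] * data[i];
        acc += local;
    }
    return acc;
}
```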
Fig. 8.4 Cycles and cache misses for cache accesses with increasing working sets (CPU cycles per element versus the size of the accessed area, 16 KB to 256 MB, for sequential and random access; the L1, L2, and L3 cache sizes are marked). (a) Sequential access. (b) Random access

8.2 Row and Columnar Layouts

Let us consider a simple example to illustrate the two mentioned approaches for representing a relational table in memory. For simplicity, we assume that all values are stored as strings directly in memory and that we do not need to store any additional data. As an example, let us look at the simple world population table:
Id   Name          Country     City
1    Paul Smith    Australia   Sydney
2    Lena Jones    USA         Washington
3    Marc Winter   Germany     Berlin
As discussed above, the database must transform its two-dimensional table into a one-dimensional series of bytes for the operating system to write to memory. The classical and obvious approach is a row- or record-based layout. In this case, all attributes of a tuple are stored consecutively and sequentially in memory. In other words, the data is stored tuple-wise. Considering our example table, the data