Fig. 4.3 Conceptual view of the memory hierarchy
To fully utilize the speed of flash-based storage, the interfaces and drivers have to be adapted accordingly.
On top of flash is the main memory, which is directly accessible. The next levels are the CPU caches (L3, L2, and L1) with different characteristics. Finally, the top level of the memory hierarchy is formed by the registers of the CPU, where the actual calculations take place.
When accessing data from a disk, there are usually four layers between the accessed disk and the registers of the CPU which only transport information. In the end, every operation takes place inside the CPU, and therefore the data has to be in the registers.
Table 4.1 shows some of the latencies that come into play in the memory hierarchy. Latency is the time delay experienced by the system from the moment data is requested from a storage medium until it is available in a CPU register. The L1 cache latency is 0.5 ns. In contrast, a main memory reference takes 100 ns and a simple disk seek takes 10 ms.
In the end, there is nothing special about "in-memory" computing: all computing ever done was in memory, because it can only take place in the CPU.
Table 4.1 Latency numbers

Action                                      Time in nanoseconds (ns)    Time
L1 cache reference (cached data word)       0.5
Branch mispredict                           5
L2 cache reference                          7
Mutex lock / unlock                         25
Main memory reference                       100                         0.1 µs
Send 2,000 byte over 1 Gb/s network         20,000                      20 µs
SSD random read                             150,000                     150 µs
Read 1 MB sequentially from memory          250,000                     250 µs
Disk seek                                   10,000,000                  10 ms
Send packet CA to Netherlands to CA         150,000,000                 150 ms
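
To put these orders of magnitude into perspective, the following small C program computes how many cache or main memory references fit into a single SSD read or disk seek. It is only a back-of-the-envelope sketch; the constants are copied directly from Table 4.1.

  #include <stdio.h>

  int main(void) {
      /* Latency numbers taken from Table 4.1, in nanoseconds */
      const double l1_cache_ns  = 0.5;
      const double main_mem_ns  = 100.0;
      const double ssd_read_ns  = 150000.0;
      const double disk_seek_ns = 10000000.0;

      printf("L1 references per disk seek:      %.0f\n", disk_seek_ns / l1_cache_ns);
      printf("Memory references per disk seek:  %.0f\n", disk_seek_ns / main_mem_ns);
      printf("Memory references per SSD read:   %.0f\n", ssd_read_ns / main_mem_ns);
      return 0;
  }

A single disk seek thus costs as much as roughly 100,000 main memory references, which illustrates why avoiding disk access pays off so dramatically.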
Assuming a bandwidth-bound application, the performance is determined by how fast the data can be transferred through the hierarchy to the CPU. In order to estimate the runtime of an algorithm, it is possible to roughly estimate the amount of data which has to be transferred to the CPU. A very simple operation that a CPU can do is a comparison, for example filtering on an attribute. Let us assume a calculation speed of 2 MB per millisecond for this operation using one core, so one core of a CPU can digest 2 MB per millisecond. This number scales with the number of cores: ten cores can scan 20 GB per second, and ten nodes with ten cores each already reach 200 GB per second.
Considering a large multi-node system like that, having ten nodes and 40 CPUs per node where the data is distributed across the nodes, it is hard to write an algorithm which needs more than one second, even on large amounts of data. The previously mentioned 200 GB are highly compressed data, so they correspond to a much larger amount of plain character data. To sum this up, the number to remember is 2 MB per millisecond per core. If an algorithm shows a completely different result, it is worth looking into it, as there is probably something going wrong. This could be an issue in the SQL statement, such as an overly complicated join or a nested loop.
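
The following sketch turns this rule of thumb into a small calculation. The figure of 2 MB per millisecond per core is the one given above; the node count, core count, and data volume are merely hypothetical example values.

  #include <stdio.h>

  int main(void) {
      /* Rule of thumb from the text: one core scans 2 MB per millisecond */
      const double mb_per_ms_per_core = 2.0;

      /* Hypothetical example configuration */
      const int nodes          = 10;
      const int cores_per_node = 10;
      const double data_mb     = 200.0 * 1024;   /* 200 GB of compressed data */

      double total_cores = (double)nodes * cores_per_node;
      double scan_ms = data_mb / (mb_per_ms_per_core * total_cores);

      printf("Estimated scan time: %.0f ms\n", scan_ms);
      return 0;
  }

With ten nodes of ten cores each, the estimate arrives at roughly one second for 200 GB of compressed data, in line with the numbers discussed above.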
4.7 Non-Uniform Memory Architecture
As the development in modern computer systems goes from multi-core to many-core systems and the amount of main memory continues to increase, the central memory interface in Uniform Memory Architecture (UMA) systems becomes a bottleneck and introduces heavy challenges in hardware design to connect all cores and memory. Non-Uniform Memory Architectures (NUMA) attempt to solve this problem by introducing local memory locations which are cheap to access for local processors.
Fig. 4.4 (a) Shared FSB, (b) Intel QuickPath Interconnect [Int09]
Figure 4.4 pictures an overview of a UMA and a NUMA system. In a UMA
system, every processor observes the same speed when accessing an arbitrary memory address, as the complete memory is accessed through a central memory interface, as shown in Fig. 4.4a. In contrast, in NUMA systems every processor has its primarily used local memory as well as remote memory supplied by the other processors. This setup is shown in Fig. 4.4b. From a processor's point of view, these different kinds of memory introduce different memory access times for local memory (adjacent slots) and remote memory that is adjacent to other processing units.
Additionally, systems can be classified into cache-coherent NUMA (ccNUMA) and non-cache-coherent NUMA systems. ccNUMA systems provide each CPU cache with the same view of the complete memory and enforce coherency by a hardware-implemented protocol. Non-cache-coherent NUMA systems require software layers to handle memory conflicts accordingly. Although non-ccNUMA hardware is easier and cheaper to build, most of today's available standard hardware provides ccNUMA, since non-ccNUMA hardware is more difficult to program.
To fully exploit the potential of NUMA, applications have to be made aware of the different memory latencies and should primarily load data from the memory slots locally attached to a processor. Memory-bound applications may suffer a degradation of up to 25 % of their maximal performance if remote memory is accessed instead of local memory.
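
On Linux, one common way to make this locality explicit is the libnuma library. The following sketch, which assumes a Linux system with libnuma installed and is linked with -lnuma, allocates a buffer on the NUMA node of the calling thread so that subsequent scans over it hit local memory. It only illustrates the idea of preferring local allocations and is not a complete NUMA-aware memory manager.

  #include <stdio.h>
  #include <stdlib.h>
  #include <numa.h>    /* libnuma, link with -lnuma */

  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "NUMA is not available on this system\n");
          return EXIT_FAILURE;
      }

      /* Allocate 64 MB on the node the calling thread currently runs on,
         so that subsequent scans over the buffer access local memory. */
      size_t size = 64UL * 1024 * 1024;
      void *local_buf = numa_alloc_local(size);
      if (local_buf == NULL) {
          fprintf(stderr, "local allocation failed\n");
          return EXIT_FAILURE;
      }

      printf("allocated %zu bytes of node-local memory\n", size);
      numa_free(local_buf, size);
      return 0;
  }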
By introducing NUMA, the central bottleneck of the FSB can be avoided and
memory bandwidth can be increased. Benchmark results have shown that a
throughput of more than 72 GB per second is possible on an Intel XEON 7560
(Nehalem EX) system with four processors [Fuj10].
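
To get a rough feeling for such bandwidth numbers on a given machine, a very simplified measurement is to time a sequential read over a large buffer, as in the following sketch (assuming a POSIX system). Published benchmarks such as the one cited above use many threads pinned to different sockets, so this single-core sketch only yields a lower bound.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void) {
      /* Sequentially read a 1 GiB buffer with a single core
         and report the achieved bandwidth. */
      size_t n = 1UL << 27;                 /* 2^27 doubles = 1 GiB */
      double *buf = malloc(n * sizeof(double));
      if (buf == NULL) return EXIT_FAILURE;
      for (size_t i = 0; i < n; i++) buf[i] = 1.0;

      struct timespec start, end;
      clock_gettime(CLOCK_MONOTONIC, &start);
      volatile double sum = 0.0;            /* volatile keeps the loop from being optimized away */
      for (size_t i = 0; i < n; i++) sum += buf[i];
      clock_gettime(CLOCK_MONOTONIC, &end);

      double secs = (end.tv_sec - start.tv_sec) +
                    (end.tv_nsec - start.tv_nsec) / 1e9;
      printf("single-core sequential read: %.1f GB/s (checksum %.0f)\n",
             n * sizeof(double) / 1e9 / secs, sum);

      free(buf);
      return 0;
  }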
4.8 Scaling Main Memory Systems
An example system that consists of multiple nodes can be seen in Fig. 4.5. Each node has eight CPUs with eight cores each, so one node provides 64 cores, and there are four nodes. Each node has a terabyte of RAM and SSDs for persistence. Everything below DRAM is used for logging, archiving, and for the emergency reconstruction of data, that is, for reloading the data after the power supply was turned off.
The networks which connect the nodes are continuously increasing in speed. In the example shown in Fig. 4.5, a 10 Gb Ethernet network connects the four nodes. Computers with 40 Gb InfiniBand are already on the market, and switch manufacturers are talking about 100 Gb switches which even contain logic that allows smart switching. This is another place where optimization can take place, on a low level and very effectively for applications. It can be leveraged to improve joins, where calculations often go across multiple nodes.
Fig. 4.5 A system consisting of multiple blades
4.9 Remote Direct Memory Access
Shared memory is another interesting way to directly access memory across multiple nodes. The nodes are connected via an InfiniBand network and create a shared memory region. The main idea is to automatically access data which resides on a different node without explicitly shipping it, that is, to access it directly instead of transferring it and processing it on the other side. Research in this direction has been done at Stanford University in cooperation with the HPI using a RAM cluster. It is very promising, as it could basically offer direct access to a seemingly unlimited amount of memory from a program's perspective.
4.10 Self Test Questions
1. Speed per Core
What is the speed of a single core when processing a simple scan operation
(under optimal conditions)?
(a) 2 GB/ms/core
(b) 2 MB/ms/core