
after one another are also close to each other in memory.

Figure 3.14: Advantage of Larger L2/L3 Caches. (Plot: cycles per list element over the working set size in bytes; one curve each for P4/32k/1M, P4/16k/512k/2M, and Core2/32k/4M.)

There are two important points to note in Figure 3.15. The first is the large number of cycles needed for growing working set sizes. The machine makes it possible to access the main memory in 200-300 cycles but here we reach 450 cycles and more. We have seen this phenomenon before (compare Figure 3.11). The automatic prefetching is actually working to a disadvantage here.

The second interesting point is that the curve is not flattening at various plateaus as it has been for the sequential access cases. The curve keeps on rising. To explain this we can measure the L2 accesses of the program for the various working set sizes. The result can be seen in Figure 3.16 and Table 3.2.

The figure shows that, when the working set size is larger than the L2 size, the cache miss ratio (L2 misses / L2 accesses) starts to grow. The curve has a similar form to the one in Figure 3.15: it rises quickly, declines slightly, and starts to rise again. There is a strong correlation with the cycles per list element graph. The L2 miss rate will grow until it eventually reaches close to 100%. Given a large enough working set (and RAM) the probability that any of the randomly picked cache lines is in L2 or is in the process of being loaded can be reduced arbitrarily.

Figure 3.15: Sequential vs Random Read, NPAD=0. (Plot: cycles per list element, 50 to 500, over the working set size, 2^10 to 2^28 bytes; one curve each for sequential and random access.)
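The access pattern behind these curves is the pointer chasing introduced earlier in the paper: each list element stores a pointer to the element to be accessed next, and the test does nothing but follow these pointers, either in memory order or in a randomized order. The following is only a rough sketch of such a walk, assuming the struct l element type with an NPAD-word pad array used throughout this section; it is not the paper's actual benchmark code.

#include <stddef.h>

#define NPAD 7                 /* NPAD=7: 64-byte elements on 64-bit systems;
                                  NPAD=0 gives the pointer-only elements of
                                  Figure 3.15 and Table 3.2                   */
struct l {
  struct l *n;                 /* pointer to the next element to visit        */
  long int pad[NPAD];          /* padding that controls the element size      */
};

/* Follow the cyclic list for a fixed number of accesses.  With the elements
   linked in memory order the prefetcher can help; with a randomized linking
   order nearly every step can miss the caches and, for large working sets,
   the TLB.                                                                   */
static struct l *walk(struct l *head, size_t accesses)
{
  struct l *p = head;
  while (accesses-- > 0)
    p = p->n;
  return p;                    /* returned so the loop cannot be optimized away */
}

The cycles-per-list-element values plotted above are essentially the total time of such a walk divided by the number of accesses performed.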
                              Sequential                                 |                        Random
Set Size    L2 Hit   L2 Miss   #Iter   Ratio Miss/Hit   L2 Accesses/Iter |  L2 Hit   L2 Miss   #Iter   Ratio Miss/Hit   L2 Accesses/Iter
2^20        88,636       843  16,384            0.94%                5.5 |  30,462     4,721   1,024           13.42%               34.4
2^21        88,105     1,584   8,192            1.77%               10.9 |  21,817    15,151     512           40.98%               72.2
2^22        88,106     1,600   4,096            1.78%               21.9 |  22,258    22,285     256           50.03%              174.0
2^23        88,104     1,614   2,048            1.80%               43.8 |  27,521    26,274     128           48.84%              420.3
2^24        88,114     1,655   1,024            1.84%               87.7 |  33,166    29,115      64           46.75%              973.1
2^25        88,112     1,730     512            1.93%              175.5 |  39,858    32,360      32           44.81%            2,256.8
2^26        88,112     1,906     256            2.12%              351.6 |  48,539    38,151      16           44.01%            5,418.1
2^27        88,114     2,244     128            2.48%              705.9 |  62,423    52,049       8           45.47%           14,309.0
2^28        88,120     2,939      64            3.23%            1,422.8 |  81,906    87,167       4           51.56%           42,268.3
2^29        88,137     4,318      32            4.67%            2,889.2 | 119,079   163,398       2           57.84%          141,238.5

Table 3.2: L2 Hits and Misses for Sequential and Random Walks, NPAD=0
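Assuming the derived columns are computed directly from the raw counter values, the 2^20 random-walk row works out as follows: 30,462 L2 hits plus 4,721 L2 misses are 35,183 L2 accesses; spread over 1,024 iterations that is about 34.4 accesses per iteration, and 4,721/35,183 ≈ 13.42%, the value in the ratio column (so the ratio relates misses to all L2 accesses, not to hits alone). Note also that for sequential access the accesses per iteration roughly double from row to row (5.5, 10.9, 21.9, ...), in line with the doubling of the working set, while for random access they more than double.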
The increasing cache miss rate alone explains some of the costs. But there is another factor. Looking at Table 3.2 we can see in the L2 Accesses/Iter columns that the total number of L2 uses per iteration of the program is growing. Each working set is twice as large as the one before. So, without caching we would expect double the main memory accesses. With caches and (almost) perfect predictability we see the modest increase in the L2 use shown in the data for sequential access. The increase is due to the increase of the working set size and nothing else.

Figure 3.16: L2d Miss Ratio. (Plot: miss ratio, 0% to 60%, over the working set size, 2^10 to 2^28 bytes; one curve each for sequential and random access.)
For random access the access time per iteration more than doubles for each doubling of the working set size. This means the average access time per list element increases, since the number of list elements per iteration only doubles. The reason behind this is a rising rate of TLB misses. In Figure 3.17 we see the cost for random accesses for NPAD=7. Only this time the randomization is modified. While in the normal case the entire list is randomized as one block (indicated by the label ∞) the other 11 curves show randomizations which are performed in smaller blocks. For the curve labeled 60, each set of 60 pages (245,760 bytes) is randomized individually. That means all list elements in the block are traversed before going over to an element in the next block. This has the effect that the number of TLB entries which are used at any one time is limited.

Figure 3.17: Page-Wise Randomization, NPAD=7. (Plot: cycles per list element, 50 to 550, over the working set size, 2^10 to 2^28 bytes; one curve per randomization block size of 60, 120, 240, 480, 960, 1920, 3840, 7680, 15360, 30720, and 61440 pages, plus ∞ for the whole list.)
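The block-wise randomization can be pictured as follows. This is only a hypothetical sketch (the contiguous array layout, the function name link_blockwise_random, and the use of rand() are assumptions, not the paper's benchmark code): the elements sit in one array, and only the order in which they are linked is shuffled, restricted to blocks of a given number of elements. With 4,096-byte pages and the 64-byte NPAD=7 elements, a block of 60 pages corresponds to 3,840 elements.

#include <stdlib.h>
#include <stddef.h>

struct l {                       /* 64-byte element, as for NPAD=7 */
  struct l *n;
  long int pad[7];
};

/* Link the elements of `arr` into one cyclic list whose order is random
   within each block of `block_elems` elements but which visits the blocks
   themselves in order.  A walk therefore touches all elements of a block,
   and thus only a limited number of pages and TLB entries, before moving
   on to the next block.  Using block_elems == nelems degenerates to the
   fully randomized list (the curve labeled ∞).                            */
static void link_blockwise_random(struct l *arr, size_t nelems, size_t block_elems)
{
  size_t *order = malloc(nelems * sizeof(*order));
  for (size_t i = 0; i < nelems; ++i)
    order[i] = i;

  /* Fisher-Yates shuffle, restricted to the indices of each block. */
  for (size_t start = 0; start < nelems; start += block_elems) {
    size_t end = start + block_elems < nelems ? start + block_elems : nelems;
    for (size_t i = end - 1; i > start; --i) {
      size_t j = start + (size_t) rand() % (i - start + 1);
      size_t tmp = order[i];
      order[i] = order[j];
      order[j] = tmp;
    }
  }

  /* Chain the elements in the resulting order and close the cycle. */
  for (size_t i = 0; i < nelems; ++i)
    arr[order[i]].n = &arr[order[(i + 1) % nelems]];

  free(order);
}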
The element size for NPAD=7 is 64 bytes, which corresponds to the cache line size. Due to the randomized order of the list elements it is unlikely that the hardware prefetcher has any effect, most certainly not for more than a handful of elements. This means the L2 cache miss rate does not differ significantly from the randomization of the entire list in one block. The performance of the test with increasing block size approaches asymptotically the curve for the one-block randomization. This means the performance of this latter test case is significantly influenced by the TLB misses. If the TLB misses can be lowered the performance increases significantly
(in one test we will see later up to 38%).

3.3.3 Write Behavior

Before we start looking at the cache behavior when multiple execution contexts (threads or processes) use the same memory we have to explore a detail of cache implementations. Caches are supposed to be coherent and this coherency is supposed to be completely transparent for the userlevel code. Kernel code is a different story; it occasionally requires cache flushes.

This specifically means that, if a cache line is modified, the result for the system after this point in time is the same as if there were no cache at all and the main memory location itself had been modified. This can be implemented in two ways or policies:

  • write-through cache implementation;
  • write-back cache implementation.

line is needed. In the next section we will see how this is currently implemented.

Before we get to this there are two more cache policies to mention:

  • write-combining; and
  • uncacheable.

Both these policies are used for special regions of the address space which are not backed by real RAM. The kernel sets up these policies for the address ranges (on x86 processors using the Memory Type Range Registers, MTRRs) and the rest happens automatically. The MTRRs are also usable to select between write-through and write-back policies.
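On Linux the MTRR setup can be inspected from user level through the legacy /proc/mtrr interface, provided the kernel exposes it. The following minimal sketch simply dumps that file; each line describes a physical address range together with its memory type (write-back, write-combining, uncachable, and so on).

#include <stdio.h>

/* Print the MTRR ranges the kernel has set up.  Requires an x86 kernel with
   the /proc/mtrr interface enabled; otherwise the open simply fails.        */
int main(void)
{
  FILE *f = fopen("/proc/mtrr", "r");
  if (f == NULL) {
    perror("/proc/mtrr");
    return 1;
  }

  char line[256];
  while (fgets(line, sizeof(line), f) != NULL)
    fputs(line, stdout);

  fclose(f);
  return 0;
}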
Write-combining is a limited caching optimization more often used for RAM on devices such as graphics cards. Since the transfer costs to the devices are much higher than the local RAM access it is even more important to avoid doing too many transfers. Transferring an en-
