What Every Programmer Should Know About Memory

Ulrich Drepper
Red Hat, Inc.
drepper@redhat.com

November 21, 2007


Abstract

As CPU cores become both faster and more numerous, the limiting factor for most programs is now, and will be for some time, memory access. Hardware designers have come up with ever more sophisticated memory handling and acceleration techniques, such as CPU caches, but these cannot work optimally without some help from the programmer. Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers. This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.

1 Introduction

In the early days computers were much simpler. The various components of a system, such as the CPU, memory, mass storage, and network interfaces, were developed together and, as a result, were quite balanced in their performance. For example, the memory and network interfaces were not (much) faster than the CPU at providing data.

This situation changed once the basic structure of computers stabilized and hardware developers concentrated on optimizing individual subsystems. Suddenly the performance of some components of the computer fell significantly behind and bottlenecks developed. This was especially true for mass storage and memory subsystems which, for cost reasons, improved more slowly relative to other components.

The slowness of mass storage has mostly been dealt with using software techniques: operating systems keep the most often used (and most likely to be used) data in main memory, which can be accessed at a rate orders of magnitude faster than the hard disk. Cache storage was added to the storage devices themselves, which requires no changes in the operating system to increase performance.¹ For the purposes of this paper, we will not go into more details of software optimizations for mass storage access.

¹ Changes are needed, however, to guarantee data integrity when using storage device caches.

Unlike storage subsystems, removing the main memory as a bottleneck has proven much more difficult and almost all solutions require changes to the hardware. Today these changes mainly come in the following forms:

  • RAM hardware design (speed and parallelism).
  • Memory controller designs.
  • CPU caches.
  • Direct memory access (DMA) for devices.

For the most part, this document will deal with CPU caches and some effects of memory controller design. In the process of exploring these topics, we will explore DMA and bring it into the larger picture. However, we will start with an overview of the design for today's commodity hardware. This is a prerequisite to understanding the problems and the limitations of efficiently using memory subsystems. We will also learn about, in some detail, the different types of RAM and illustrate why these differences still exist.

This document is in no way all inclusive and final. It is limited to commodity hardware and further limited to a subset of that hardware. Also, many topics will be discussed in just enough detail for the goals of this paper. For such topics, readers are recommended to find more detailed documentation.

When it comes to operating-system-specific details and solutions, the text exclusively describes Linux. At no time will it contain any information about other OSes. The author has no interest in discussing the implications for other OSes. Readers who think they have to use a different OS should go to their vendors and demand they write documents similar to this one.

One last comment before the start. The text contains a number of occurrences of the term "usually" and other, similar qualifiers. The technology discussed here exists in many, many variations in the real world and this paper only addresses the most common, mainstream versions. It is rare that absolute statements can be made about this technology, thus the qualifiers.

Copyright © 2007 Ulrich Drepper. All rights reserved. No redistribution allowed.
Document Structure

This document is mostly for software developers. It does not go into enough technical details of the hardware to be useful for hardware-oriented readers. But before we can go into the practical information for developers a lot of groundwork must be laid.

To that end, the second section describes random-access memory (RAM) in technical detail. This section's content is nice to know but not absolutely critical to be able to understand the later sections. Appropriate back references to the section are added in places where the content is required so that the anxious reader can skip most of this section at first.

The third section goes into a lot of detail about CPU cache behavior. Graphs have been used to keep the text from being as dry as it would otherwise be. This content is essential for an understanding of the rest of the document.

Section 4 describes briefly how virtual memory is implemented. This is also required groundwork for the rest.

Section 5 goes into a lot of detail about Non Uniform Memory Access (NUMA) systems.

Section 6 is the central section of this paper. It brings together all the previous sections' information and gives programmers advice on how to write code which performs well in the various situations. The very impatient reader could start with this section and, if necessary, go back to the earlier sections to freshen up the knowledge of the underlying technology.

Section 7 introduces tools which can help the programmer do a better job. Even with a complete understanding of the technology it is far from obvious where in a non-trivial software project the problems are. Some tools are necessary.

In section 8 we finally give an outlook on technology which can be expected in the near future or which might just simply be good to have.

Reporting Problems

The author intends to update this document for some time. This includes updates made necessary by advances in technology but also corrections of mistakes. Readers willing to report problems are encouraged to send email to the author. They are asked to include exact version information in the report. The version information can be found on the last page of the document.

Thanks

I would like to thank Johnray Fuller and the crew at LWN (especially Jonathan Corbet) for taking on the daunting task of transforming the author's form of English into something more traditional. Markus Armbruster provided a lot of valuable input on problems and omissions in the text.

About this Document

The title of this paper is an homage to David Goldberg's classic paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic" [12]. That paper is still not widely known, although it should be a prerequisite for anybody daring to touch a keyboard for serious programming.

One word on the PDF: xpdf draws some of the diagrams rather poorly. It is recommended that the document be viewed with evince or, if really necessary, Adobe's programs. If you use evince be advised that hyperlinks are used extensively throughout the document even though the viewer does not indicate them like others do.
2 Commodity Hardware Today

It is important to understand commodity hardware because specialized hardware is in retreat. Scaling these days is most often achieved horizontally instead of vertically, meaning today it is more cost-effective to use many smaller, connected commodity computers instead of a few really large and exceptionally fast (and expensive) systems. This is the case because fast and inexpensive network hardware is widely available. There are still situations where the large specialized systems have their place and these systems still provide a business opportunity, but the overall market is dwarfed by the commodity hardware market. Red Hat, as of 2007, expects that for future products, the "standard building blocks" for most data centers will be a computer with up to four sockets, each filled with a quad core CPU that, in the case of Intel CPUs, will be hyper-threaded.² This means the standard system in the data center will have up to 64 virtual processors. Bigger machines will be supported, but the quad socket, quad CPU core case is currently thought to be the sweet spot and most optimizations are targeted for such machines.

² Hyper-threading enables a single processor core to be used for two or more concurrent executions with just a little extra hardware.

Large differences exist in the structure of computers built of commodity parts. That said, we will cover more than 90% of such hardware by concentrating on the most important differences. Note that these technical details tend to change rapidly, so the reader is advised to take the date of this writing into account.

Over the years personal computers and smaller servers standardized on a chipset with two parts: the Northbridge and Southbridge. Figure 2.1 shows this structure.

[Figure 2.1: Structure with Northbridge and Southbridge. Two CPUs share the Front Side Bus (FSB) to the Northbridge, which connects to the RAM; the Southbridge hangs off the Northbridge and provides the PCI-E, SATA, and USB buses.]

All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer. Different types of RAM, such as DRAM, Rambus, and SDRAM, require different memory controllers.

To reach all other system devices, the Northbridge must communicate with the Southbridge. The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses. Today the PCI, PCI Express, SATA, and USB buses are of most importance, but PATA, IEEE 1394, serial, and parallel ports are also supported by the Southbridge. Older systems had AGP slots which were attached to the Northbridge. This was done for performance reasons related to insufficiently fast connections between the Northbridge and Southbridge. However, today the PCI-E slots are all connected to the Southbridge.

Such a system structure has a number of noteworthy consequences:

  • All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
  • All communication with RAM must pass through the Northbridge.
  • The RAM has only a single port.³
  • Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.

³ We will not discuss multi-port RAM in this document as this type of RAM is not found in commodity hardware, at least not in places where the programmer has access to it. It can be found in specialized hardware such as network routers which depend on utmost speed.

A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost). Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account.

A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels, as they are called for DDR2, see page 8), which doubles the available bandwidth. The Northbridge interleaves memory access across the channels. More recent memory technologies (FB-DRAM, for instance) add more channels.

With limited bandwidth available, it is important for performance to schedule memory access in ways that minimize delays. As we will see, processors are much faster and must wait to access memory, despite the use of CPU caches. If multiple hyper-threads, cores, or processors access memory at the same time, the wait times for memory access are even longer. This is also true for DMA operations.

There is more to accessing memory than concurrency, however. Access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels. In section 2.2 we will cover more details of RAM access patterns.
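The effect of access patterns can be made visible with a few lines of code. The following minimal sketch (an illustration only; the buffer size and the 4096-byte stride are arbitrary choices) touches the same buffer twice, performing exactly the same number of byte loads each time, once sequentially and once with a large stride. On typical hardware the strided walk is several times slower, purely because of the memory-subsystem effects discussed in this section. Build on Linux with, e.g., gcc -O2 stride.c.

    /* Minimal sketch: sequential vs. strided traversal of one buffer.
       Both calls perform exactly SIZE volatile byte loads; only the
       order of the accesses differs. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SIZE ((size_t)64 * 1024 * 1024)

    static double walk(const volatile unsigned char *p, size_t stride) {
      struct timespec t0, t1;
      unsigned long sum = 0;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (size_t start = 0; start < stride; ++start)   /* visit every byte...   */
        for (size_t i = start; i < SIZE; i += stride)   /* ...stride bytes apart */
          sum += p[i];
      clock_gettime(CLOCK_MONOTONIC, &t1);
      (void)sum;                                        /* result itself unused  */
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void) {
      unsigned char *a = malloc(SIZE);
      if (a == NULL)
        return 1;
      for (size_t i = 0; i < SIZE; ++i)
        a[i] = 1;                                       /* fault all pages in    */
      printf("sequential: %.3fs\n", walk(a, 1));
      printf("strided:    %.3fs\n", walk(a, 4096));     /* new page almost every
                                                           access                */
      free(a);
      return 0;
    }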

On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead the Northbridge can be connected to a number of external memory controllers (in the following example, four of them).

[Figure 2.2: Northbridge with External Controllers. Two CPUs on the FSB connect to the Northbridge, which is attached to four external memory controllers (MC1 to MC4), each with its own RAM; the Southbridge provides PCI-E, SATA, and USB.]

The advantage of this architecture is that more than one memory bus exists and therefore total available bandwidth increases. This design also supports more memory. Concurrent memory access patterns reduce delays by simultaneously accessing different memory banks. This is especially true when multiple processors are directly connected to the Northbridge, as in Figure 2.2. For such a design, the primary limitation is the internal bandwidth of the Northbridge, which is phenomenal for this architecture (from Intel).⁴

⁴ For completeness it should be mentioned that such a memory controller arrangement can be used for other purposes such as "memory RAID" which is useful in combination with hotplug memory.

Using multiple external memory controllers is not the only way to increase memory bandwidth. One other increasingly popular way is to integrate memory controllers into the CPUs and attach memory to each CPU. This architecture was made popular by SMP systems based on AMD's Opteron processor. Figure 2.3 shows such a system. Intel will have support for the Common System Interface (CSI) starting with the Nehalem processors; this is basically the same approach: an integrated memory controller with the possibility of local memory for each processor.

[Figure 2.3: Integrated Memory Controller. Four CPUs (CPU1 to CPU4) are connected to each other by interconnects and each has its own local RAM; the Southbridge (PCI-E, SATA, USB) is attached to one of the CPUs.]

With an architecture like this there are as many memory banks available as there are processors. On a quad-CPU machine the memory bandwidth is quadrupled without the need for a complicated Northbridge with enormous bandwidth. Having a memory controller integrated into the CPU has some additional advantages; we will not dig deeper into this technology here.

There are disadvantages to this architecture, too. First of all, because the machine still has to make all the memory of the system accessible to all processors, the memory is not uniform anymore (hence the name NUMA, Non-Uniform Memory Architecture, for such an architecture). Local memory (memory attached to a processor) can be accessed with the usual speed. The situation is different when memory attached to another processor is accessed. In this case the interconnects between the processors have to be used. To access memory attached to CPU2 from CPU1 requires communication across one interconnect. When the same CPU accesses memory attached to CPU4 two interconnects have to be crossed. Each such communication has an associated cost. We talk about "NUMA factors" when we describe the extra time needed to access remote memory. The example architecture in Figure 2.3 has two levels for each CPU: immediately adjacent CPUs and one CPU which is two interconnects away. With more complicated machines the number of levels can grow significantly. There are also machine architectures (for instance IBM's x445 and SGI's Altix series) where there is more than one type of connection. CPUs are organized into nodes; within a node the time to access the memory might be uniform or have only small NUMA factors. The connection between nodes can be very expensive, though, and the NUMA factor can be quite high.

Commodity NUMA machines exist today and will likely play an even greater role in the future. It is expected that, from late 2008 on, every SMP machine will use NUMA. The costs associated with NUMA make it important to recognize when a program is running on a NUMA machine. In section 5 we will discuss more machine architectures and some technologies the Linux kernel provides for these programs.
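The NUMA factor can also be measured directly. The following is a minimal sketch, assuming a Linux system with at least two memory nodes and the libnuma library (build with gcc -O2 numafactor.c -lnuma); it pins execution to node 0 and then walks one buffer allocated locally and one allocated on node 1. Error handling is kept to a minimum.

    #include <numa.h>
    #include <stdio.h>
    #include <time.h>

    #define SIZE ((size_t)256 * 1024 * 1024)

    static double walk(const volatile unsigned char *p) {
      struct timespec t0, t1;
      unsigned long sum = 0;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (size_t i = 0; i < SIZE; i += 64)       /* one load per cache line */
        sum += p[i];
      clock_gettime(CLOCK_MONOTONIC, &t1);
      (void)sum;
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void) {
      if (numa_available() < 0 || numa_max_node() < 1) {
        fputs("need a NUMA machine with at least two nodes\n", stderr);
        return 1;
      }
      numa_run_on_node(0);                        /* execute on node 0       */
      unsigned char *local  = numa_alloc_onnode(SIZE, 0);
      unsigned char *remote = numa_alloc_onnode(SIZE, 1);
      for (size_t i = 0; i < SIZE; i += 4096)     /* actually allocate the   */
        local[i] = remote[i] = 1;                 /* pages on their nodes    */
      printf("local:  %.3fs\n", walk(local));
      printf("remote: %.3fs\n", walk(remote));    /* slower by roughly the
                                                     NUMA factor             */
      numa_free(local, SIZE);
      numa_free(remote, SIZE);
      return 0;
    }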
Beyond the technical details described in the remainder of this section, there are several additional factors which influence the performance of RAM. They are not controllable by software, which is why they are not covered in this section. The interested reader can learn about some of these factors in section 2.1. They are really only needed to get a more complete picture of RAM technology and possibly to make better decisions when purchasing computers.

The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. Programmers will likely find this information enlightening since these details explain why RAM access works the way it does. It is optional knowledge, though, and the reader anxious to get to topics with more immediate relevance for everyday life can jump ahead to section 2.2.5.
2.1 RAM Types

There have been many types of RAM over the years and each type varies, sometimes significantly, from the others. The older types are today really only interesting to historians. We will not explore the details of those. Instead we will concentrate on modern RAM types; we will only scrape the surface, exploring some details which are visible to the kernel or application developer through their performance characteristics.

The first interesting details center around the question why there are different types of RAM in the same machine. More specifically, why are there both static RAM (SRAM⁵) and dynamic RAM (DRAM)? The former is much faster and provides the same functionality. Why is not all RAM in a machine SRAM? The answer is, as one might expect, cost. SRAM is much more expensive to produce and to use than DRAM. Both these cost factors are important, the second one increasing in importance more and more. To understand these differences we look at the implementation of a bit of storage for both SRAM and DRAM.

⁵ In other contexts SRAM might mean "synchronous RAM".

In the remainder of this section we will discuss some low-level details of the implementation of RAM. We will keep the level of detail as low as possible. To that end, we will discuss the signals at a "logic level" and not at a level a hardware designer would have to use. That level of detail is unnecessary for our purpose here.

2.1.1 Static RAM

[Figure 2.4: 6-T Static RAM. Six transistors M1 to M6; the cross-coupled inverters formed by M1 to M4 sit between Vdd and the two bit lines BL and its complement, gated by the word access line WL.]

Figure 2.4 shows the structure of a 6-transistor SRAM cell. The core of this cell is formed by the four transistors M1 to M4 which form two cross-coupled inverters. They have two stable states, representing 0 and 1 respectively. The state is stable as long as power on Vdd is available.

If access to the state of the cell is needed the word access line WL is raised. This makes the state of the cell immediately available for reading on the bit line BL and its complement. If the cell state must be overwritten the bit lines are first set to the desired values and then WL is raised. Since the outside drivers are stronger than the four transistors (M1 through M4) this allows the old state to be overwritten.

See [20] for a more detailed description of the way the cell works. For the following discussion it is important to note that

  • one cell requires six transistors. There are variants with four transistors but they have disadvantages.
  • maintaining the state of the cell requires constant power.
  • the cell state is available for reading almost immediately once the word access line WL is raised. The signal is as rectangular (changing quickly between the two binary states) as other transistor-controlled signals.
  • the cell state is stable, no refresh cycles are needed.

There are other, slower and less power-hungry, SRAM forms available, but those are not of interest here since we are looking at fast RAM. These slow variants are mainly interesting because they can be more easily used in a system than dynamic RAM because of their simpler interface.
2.1.2 Dynamic RAM

Dynamic RAM is, in its structure, much simpler than static RAM. Figure 2.5 shows the structure of a usual DRAM cell design. All it consists of is one transistor and one capacitor. This huge difference in complexity of course means that it functions very differently than static RAM.

[Figure 2.5: 1-T Dynamic RAM. A single transistor M, gated by the access line AL, connects the data line DL to the storage capacitor C.]

A dynamic RAM cell keeps its state in the capacitor C. The transistor M is used to guard the access to the state. To read the state of the cell the access line AL is raised; this either causes a current to flow on the data line DL or not, depending on the charge in the capacitor. To write to the cell the data line DL is appropriately set and then AL is raised for a time long enough to charge or drain the capacitor.

There are a number of complications with the design of dynamic RAM. The use of a capacitor means that reading the cell discharges the capacitor. The procedure cannot be repeated indefinitely; the capacitor must be recharged at some point. Even worse, to accommodate the huge number of cells (chips with 10^9 or more cells are now common) the capacitance of the capacitor must be low (in the femto-farad range or lower). A fully charged capacitor holds a few tens of thousands of electrons. Even though the resistance of the capacitor is high (a couple of tera-ohms) it only takes a short time for the charge to dissipate. This problem is called "leakage".

This leakage is why a DRAM cell must be constantly refreshed. For most DRAM chips these days this refresh must happen every 64ms. During the refresh cycle no access to the memory is possible since a refresh is simply a memory read operation where the result is discarded. For some workloads this overhead might stall up to 50% of the memory accesses (see [3]).

A second problem resulting from the tiny charge is that the information read from the cell is not directly usable. The data line must be connected to a sense amplifier which can distinguish between a stored 0 or 1 over the whole range of charges which still have to count as 1.

A third problem is that reading a cell causes the charge of the capacitor to be depleted. This means every read operation must be followed by an operation to recharge the capacitor. This is done automatically by feeding the output of the sense amplifier back into the capacitor. It does mean, though, that reading memory content requires additional energy and, more importantly, time.

A fourth problem is that charging and draining a capacitor is not instantaneous. The signals received by the sense amplifier are not rectangular, so a conservative estimate as to when the output of the cell is usable has to be used. The formulas for charging and discharging a capacitor are

    Q_Charge(t)    = Q_0 (1 - e^(-t/RC))
    Q_Discharge(t) = Q_0 e^(-t/RC)

This means it takes some time (determined by the capacitance C and the resistance R) for the capacitor to be charged and discharged. It also means that the current which can be detected by the sense amplifiers is not immediately available. Figure 2.6 shows the charge and discharge curves. The X-axis is measured in units of RC (resistance multiplied by capacitance), which is a unit of time.

[Figure 2.6: Capacitor Charge and Discharge Timing. Percentage charge (0 to 100%) on the Y-axis against time from 1RC to 9RC on the X-axis.]
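The formulas are easy to evaluate numerically. The following small program (an illustration only; link with -lm) prints the curves of Figure 2.6: after 5RC the capacitor is already more than 99% charged or discharged, which shows why waiting for a conservative "usable" threshold costs real time.

    /* Evaluate the charge/discharge formulas at t = 1RC ... 9RC. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
      for (int n = 1; n <= 9; ++n) {
        double charge    = 100.0 * (1.0 - exp(-n)); /* Q_Charge(nRC)/Q_0    */
        double discharge = 100.0 * exp(-n);         /* Q_Discharge(nRC)/Q_0 */
        printf("%dRC: charged %6.2f%%, discharged to %6.2f%%\n",
               n, charge, discharge);
      }
      return 0;
    }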
Unlike the static RAM case where the output is immediately available when the word access line is raised, it will always take a bit of time until the capacitor discharges sufficiently. This delay severely limits how fast DRAM can be.

The simple approach has its advantages, too. The main advantage is size. The chip real estate needed for one DRAM cell is many times smaller than that of an SRAM cell. The SRAM cells also need individual power for the transistors maintaining the state. The structure of the DRAM cell is also simpler and more regular, which means packing many of them close together on a die is simpler.

Overall, the (quite dramatic) difference in cost wins. Except in specialized hardware (network routers, for example) we have to live with main memory which is based on DRAM. This has huge implications for the programmer which we will discuss in the remainder of this paper. But first we need to look into a few more details of the actual use of DRAM cells.

2.1.3 DRAM Access

A program selects a memory location using a virtual address. The processor translates this into a physical address and finally the memory controller selects the RAM chip corresponding to that address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of a number of address lines.

It would be completely impractical to address memory locations individually from the memory controller: 4GB of RAM would require 2^32 address lines. Instead the address is passed encoded as a binary number using a smaller set of address lines. The address passed to the DRAM chip this way must be demultiplexed first. A demultiplexer with N address lines will have 2^N output lines. These output lines can be used to select the memory cell. Using this direct approach is no big problem for chips with small capacities.

But if the number of cells grows this approach is not suitable anymore. A chip with 1Gbit⁶ capacity would need 30 address lines and 2^30 select lines. The size of a demultiplexer increases exponentially with the number of input lines when speed is not to be sacrificed. A demultiplexer for 30 address lines needs a whole lot of chip real estate in addition to the complexity (size and time) of the demultiplexer. Even more importantly, transmitting 30 impulses on the address lines synchronously is much harder than transmitting "only" 15 impulses. Fewer lines have to be laid out at exactly the same length or timed appropriately.⁷

⁶ I hate those SI prefixes. For me a giga-bit will always be 2^30 and not 10^9 bits.
⁷ Modern DRAM types like DDR3 can automatically adjust the timing but there is a limit as to what can be tolerated.

[Figure 2.7: Dynamic RAM Schematic. A cell array addressed by a row address selection (RAS) demultiplexer on address lines a0 and a1 and a column address selection (CAS) multiplexer on address lines a2 and a3; the selected column is connected to the data pin.]

Figure 2.7 shows a DRAM chip at a very high level. The DRAM cells are organized in rows and columns. They could all be aligned in one row but then the DRAM chip would need a huge demultiplexer. With the array approach the design can get by with one demultiplexer and one multiplexer of half the size.⁸ This is a huge saving on all fronts. In the example the address lines a0 and a1 through the row address selection (RAS)⁹ demultiplexer select the address lines of a whole row of cells. When reading, the content of all cells is thusly made available to the column address selection (CAS)⁹ multiplexer. Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.

⁸ Multiplexers and demultiplexers are equivalent and the multiplexer here needs to work as a demultiplexer when writing. So we will drop the differentiation from now on.
⁹ In the original typesetting a line over the name indicates that the signal is negated.

For writing, the new cell value is put on the data bus and, when the cell is selected using the RAS and CAS, it is stored in the cell. A pretty straightforward design. There are in reality (obviously) many more complications. There need to be specifications for how much delay there is after the signal before the data will be available on the data bus for reading. The capacitors do not unload instantaneously, as described in the previous section. The signal from the cells is so weak that it needs to be amplified. For writing it must be specified how long the data must be available on the bus after the RAS and CAS is done to successfully store the new value in the cell (again, capacitors do not fill or drain instantaneously). These timing constants are crucial for the performance of the DRAM chip. We will talk about this in the next section.

A secondary scalability problem is that having 30 address lines connected to every RAM chip is not feasible either. Pins of a chip are precious resources. It is "bad" enough that the data must be transferred as much as possible in parallel (e.g., in 64 bit batches). The memory controller must be able to address each RAM module (collection of RAM chips). If parallel access to multiple RAM modules is required for performance reasons and each RAM module requires its own set of 30 or more address lines, then the memory controller needs to have, for 8 RAM modules, a whopping 240+ pins only for the address handling.

To counter these secondary scalability problems DRAM chips have, for a long time, multiplexed the address itself. That means the address is transferred in two parts. The first part, consisting of address bits (a0 and a1 in the example in Figure 2.7), selects the row. This selection remains active until revoked. Then the second part, address bits a2 and a3, selects the column. The crucial difference is that only two external address lines are needed. A few more lines are needed to indicate when the RAS and CAS signals are available but this is a small price to pay for cutting the number of address lines in half. This address multiplexing brings its own set of problems, though. We will discuss them in section 2.2.
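To illustrate the two-part addressing, assume a hypothetical chip with 2^30 cells arranged as 2^15 rows of 2^15 columns (real chips use other, asymmetric geometries, and the exact split differs between chips). The row and column numbers placed in turn on the 15 shared address lines are simple bit slices of the cell address:

    /* Split a 30-bit cell address into row and column for a hypothetical
       2^15 x 2^15 cell array sharing 15 address lines. */
    #include <stdint.h>
    #include <stdio.h>

    #define ADDR_BITS 30
    #define COL_BITS  15

    int main(void) {
      uint32_t cell = UINT32_C(0x2AB0F00F) & ((UINT32_C(1) << ADDR_BITS) - 1);
      uint32_t row  = cell >> COL_BITS;                       /* sent first, with RAS  */
      uint32_t col  = cell & ((UINT32_C(1) << COL_BITS) - 1); /* sent second, with CAS */
      printf("cell 0x%08x -> row 0x%04x, column 0x%04x\n",
             (unsigned)cell, (unsigned)row, (unsigned)col);
      return 0;
    }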
2.1.4 Conclusions

Do not worry if the details in this section are a bit overwhelming. The important things to take away from this section are:

  • there are reasons why not all memory is SRAM
  • memory cells need to be individually selected to be used
  • the number of address lines is directly responsible for the cost of the memory controller, motherboards, DRAM module, and DRAM chip
  • it takes a while before the results of the read or write operation are available

The following section will go into more details about the actual process of accessing DRAM memory. We are not going into more details of accessing SRAM, which is usually directly addressed. This happens for speed and because the SRAM memory is limited in size. SRAM is currently used in CPU caches and on-die where the connections are small and fully under control of the CPU designer. CPU caches are a topic which we discuss later; all we need to know here is that SRAM cells have a certain maximum speed which depends on the effort spent on the SRAM. The speed can vary from only slightly slower than the CPU core to one or two orders of magnitude slower.
2.2 DRAM Access Technical Details

In the section introducing DRAM we saw that DRAM chips multiplex the addresses in order to save resources in the form of address pins. We also saw that accessing DRAM cells takes time since the capacitors in those cells do not discharge instantaneously to produce a stable signal; we also saw that DRAM cells must be refreshed. Now it is time to put this all together and see how all these factors determine how the DRAM access has to happen.

We will concentrate on current technology; we will not discuss asynchronous DRAM and its variants as they are simply not relevant anymore. Readers interested in this topic are referred to [3] and [19]. We will also not talk about Rambus DRAM (RDRAM) even though the technology is not obsolete. It is just not widely used for system memory. We will concentrate exclusively on Synchronous DRAM (SDRAM) and its successors Double Data Rate DRAM (DDR).

Synchronous DRAM, as the name suggests, works relative to a time source. The memory controller provides a clock, the frequency of which determines the speed of the Front Side Bus (FSB), the memory controller interface used by the DRAM chips. As of this writing, frequencies of 800MHz, 1,066MHz, or even 1,333MHz are available, with higher frequencies (1,600MHz) being announced for the next generation. This does not mean the frequency used on the bus is actually this high. Instead, today's buses are double- or quad-pumped, meaning that data is transported two or four times per cycle. Higher numbers sell, so the manufacturers like to advertise a quad-pumped 200MHz bus as an "effective" 800MHz bus.

For SDRAM today each data transfer consists of 64 bits (8 bytes). The transfer rate of the FSB is therefore 8 bytes multiplied by the effective bus frequency (6.4GB/s for the quad-pumped 200MHz bus). That sounds like a lot but it is the burst speed, the maximum speed which will never be surpassed. As we will see now, the protocol for talking to the RAM modules has a lot of downtime when no data can be transmitted. It is exactly this downtime which we must understand and minimize to achieve the best performance.
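A quick check of the arithmetic behind that burst number:

    8 bytes/transfer × 4 transfers/cycle × 200,000,000 cycles/s = 6.4 GB/s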
2.2.1 Read Access Protocol

[Figure 2.8: SDRAM Read Access Timing. The clock CLK, the RAS and CAS signals, and the address and data (DQ) buses; the row address, the tRCD delay, the column address, the CL delay, and four data words out are shown in three differently colored phases.]

Figure 2.8 shows the activity on some of the connectors of a DRAM module, which happens in three differently colored phases. As usual, time flows from left to right. A lot of details are left out. Here we only talk about the bus clock, RAS and CAS signals, and the address and data buses. A read cycle begins with the memory controller making the row address available on the address bus and lowering the RAS signal. All signals are read on the rising edge of the clock (CLK), so it does not matter if the signal is not completely square as long as it is stable at the time it is read. Setting the row address causes the RAM chip to start latching the addressed row.

The CAS signal can be sent after tRCD (RAS-to-CAS Delay) clock cycles. The column address is then transmitted by making it available on the address bus and lowering the CAS line. Here we can see how the two parts of the address (more or less halves, nothing else makes sense) can be transmitted over the same address bus.

Now the addressing is complete and the data can be transmitted. The RAM chip needs some time to prepare for this. The delay is usually called CAS Latency (CL). In Figure 2.8 the CAS latency is 2. It can be higher or lower, depending on the quality of the memory controller, motherboard, and DRAM module. The latency can also have half values. With CL=2.5 the first data would be available at the first falling flank in the blue area.

With all this preparation to get to the data it would be wasteful to only transfer one data word. This is why DRAM modules allow the memory controller to specify how much data is to be transmitted. Often the choice is between 2, 4, or 8 words. This allows filling entire lines in the caches without a new RAS/CAS sequence. It is also possible for the memory controller to send a new CAS signal without resetting the row selection. In this way, consecutive memory addresses can be read from or written to significantly faster because the RAS signal does not have to be sent and the row does not have to be deactivated (see below). Keeping the row "open" is something the memory controller has to decide. Speculatively leaving it open all the time has disadvantages with real-world applications (see [3]). Sending new CAS signals is only subject to the Command Rate of the RAM module (usually specified as Tx, where x is a value like 1 or 2; it will be 1 for high-performance DRAM modules which accept new commands every cycle).

In this example the SDRAM spits out one word per cycle. This is what the first generation does. DDR is able to transmit two words per cycle. This cuts down on the transfer time but does not change the latency. In principle, DDR2 works the same although in practice it looks different. There is no need to go into the details here. It is sufficient to note that DDR2 can be made faster, cheaper, more reliable, and more energy efficient (see [6] for more information).

2.2.2 Precharge and Activation

Figure 2.8 does not cover the whole cycle. It only shows parts of the full cycle of accessing DRAM. Before a new RAS signal can be sent the currently latched row must be deactivated and the new row must be precharged. We can concentrate here on the case where this is done with an explicit command. There are improvements to the protocol which, in some situations, allow this extra step to be avoided. The delays introduced by precharging still affect the operation, though.

[Figure 2.9: SDRAM Precharge and Activation. The CLK, WE, RAS, and CAS signals together with the address and data (DQ) buses; after the CL delay two data words are transferred, the precharge time tRP overlaps the transfer, and a new row address with its tRCD delay follows.]

Figure 2.9 shows the activity starting from one CAS signal to the CAS signal for another row. The data requested with the first CAS signal is available as before, after CL cycles. In the example two words are requested which, on a simple SDRAM, takes two cycles to transmit. Alternatively, imagine four words on a DDR chip.

Even on DRAM modules with a command rate of one the precharge command cannot be issued right away. It is necessary to wait as long as it takes to transmit the data. In this case it takes two cycles. This happens to be the same as CL but that is just a coincidence. The precharge signal has no dedicated line; instead, some implementations issue it by lowering the Write Enable (WE) and RAS lines simultaneously. This combination has no useful meaning by itself (see [18] for encoding details).

Once the precharge command is issued it takes tRP (Row Precharge time) cycles until the row can be selected. In Figure 2.9 much of the time (indicated by the purplish color) overlaps with the memory transfer (light blue). This is good! But tRP is larger than the transfer time and so the next RAS signal is stalled for one cycle.

If we were to continue the timeline in the diagram we would find that the next data transfer happens 5 cycles after the previous one stops. This means the data bus is only in use two cycles out of seven. Multiply this with the FSB speed and the theoretical 6.4GB/s for an 800MHz bus becomes 1.8GB/s. That is bad and must be avoided. The techniques described in section 6 help to raise this number. But the programmer usually has to do her share.
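Again, the loss is easy to quantify: with the data bus busy for only two of every seven cycles the sustained rate is

    6.4 GB/s × 2/7 ≈ 1.8 GB/s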
There is one more timing value for an SDRAM module which we have not discussed. In Figure 2.9 the precharge command was only limited by the data transfer time. Another constraint is that an SDRAM module needs time after a RAS signal before it can precharge another row (denoted as tRAS). This number is usually pretty high, on the order of two or three times the tRP value. This is a problem if, after a RAS signal, only one CAS signal follows and the data transfer is finished in a few cycles. Assume that in Figure 2.9 the initial CAS signal was preceded directly by a RAS signal and that tRAS is 8 cycles. Then the precharge command would have to be delayed by one additional cycle since the sum of tRCD, CL, and tRP (tRP counts here because it is larger than the data transfer time) is only 7 cycles.

DDR modules are often described using a special notation: w-x-y-z-T. For instance: 2-3-2-8-T1. This means:

    w   2    CAS Latency (CL)
    x   3    RAS-to-CAS delay (tRCD)
    y   2    RAS Precharge (tRP)
    z   8    Active to Precharge delay (tRAS)
    T   T1   Command Rate

There are numerous other timing constants which affect the way commands can be issued and are handled. Those five constants are in practice sufficient to determine the performance of the module, though.
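As a worked example (the 200MHz command clock, i.e., 5ns per cycle, is an assumption for illustration), the 2-3-2-8-T1 module above needs

    tRCD + CL = 3 + 2 = 5 cycles = 25ns

from row activation to the first data word, and since

    tRCD + CL + tRP = 3 + 2 + 2 = 7 cycles < tRAS = 8 cycles

a precharge after a single short transfer must be delayed by one extra cycle, exactly the situation described above.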
Figure 2.9 shows the activity starting from one CAS sig-             performance of the module, though.
nal to the CAS signal for another row. The data requested
with the first CAS signal is available as before, after CL            It is sometimes useful to know this information for the
cycles. In the example two words are requested which,                computers in use to be able to interpret certain measure-
on a simple SDRAM, takes two cycles to transmit. Al-                 ments. It is definitely useful to know these details when
ternatively, imagine four words on a DDR chip.                       buying computers since they, along with the FSB and
It is sometimes useful to know this information for the computers in use to be able to interpret certain measurements. It is definitely useful to know these details when buying computers since they, along with the FSB and SDRAM module speed, are among the most important factors determining a computer's speed.

The very adventurous reader could also try to tweak a system. Sometimes the BIOS allows changing some or all of these values. SDRAM modules have programmable registers where these values can be set. Usually the BIOS picks the best default value. If the quality of the RAM module is high it might be possible to reduce one or the other latency without affecting the stability of the computer. Numerous overclocking websites all around the Internet provide ample documentation for doing this. Do it at your own risk, though, and do not say you have not been warned.
2.2.3 Recharging

A mostly-overlooked topic when it comes to DRAM access is recharging. As explained in section 2.1.2, DRAM cells must constantly be refreshed. This does not happen completely transparently for the rest of the system.
At times when a row10 is recharged no access is possible.

The study in [3] found that “[s]urprisingly, DRAM refresh organization can affect performance dramatically”. Each DRAM cell must be refreshed every 64ms according to the JEDEC (Joint Electron Device Engineering Council) specification. If a DRAM array has 8,192 rows this means the memory controller has to issue a refresh command on average every 7.8125µs (refresh commands can be queued so in practice the maximum interval between two requests can be higher). It is the memory controller's responsibility to schedule the refresh commands. The DRAM module keeps track of the address of the last refreshed row and automatically increases the address counter for each new request.
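The 7.8125µs interval follows directly from these two numbers; a tiny program (mine, purely illustrative) shows the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        const double refresh_period_us = 64000.0; /* 64ms, per JEDEC */
        const unsigned rows = 8192;               /* rows in the example array */

        /* One refresh command per row, spread evenly over the period. */
        printf("refresh command every %.4fus\n",
               refresh_period_us / rows);          /* prints 7.8125us */
        return 0;
    }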
                                                                       holds two bits per data line. This in turn requires that,
There is really not much the programmer can do about the refresh and the points in time when the commands are issued. But it is important to keep this part of the DRAM life cycle in mind when interpreting measurements. If a critical word has to be retrieved from a row which currently is being refreshed the processor could be stalled for quite a long time. How long each refresh takes depends on the DRAM module.

2.2.4 Memory Types

It is worth spending some time on the current and soon-to-be current memory types in use. We will start with SDR (Single Data Rate) SDRAMs since they are the basis of the DDR (Double Data Rate) SDRAMs. SDRs were pretty simple. The memory cells and the data transfer rate were identical.

Figure 2.10: SDR SDRAM Operation [schematic: the DRAM cell array and the memory bus both run at the same frequency f]

In Figure 2.10 the DRAM cell array can output the memory content at the same rate it can be transported over the memory bus. If the DRAM cell array can operate at 100MHz, the data transfer rate of the bus of a single cell is thus 100Mb/s. The frequency f for all components is the same. Increasing the throughput of the DRAM chip is expensive since the energy consumption rises with the frequency. With a huge number of array cells this is prohibitively expensive.11 In reality it is even more of a problem since increasing the frequency usually also requires increasing the voltage to maintain stability of the system. DDR SDRAM (called DDR1 retroactively) manages to improve the throughput without increasing any of the involved frequencies.

Figure 2.11: DDR1 SDRAM Operation [schematic: the cell array and the I/O buffer run at frequency f; the bus also runs at f but transfers data on both clock edges]

The difference between SDR and DDR1 is, as can be seen in Figure 2.11 and guessed from the name, that twice the amount of data is transported per cycle. I.e., the DDR1 chip transports data on the rising and falling edge. This is sometimes called a “double-pumped” bus. To make this possible without increasing the frequency of the cell array a buffer has to be introduced. This buffer holds two bits per data line. This in turn requires that, in the cell array in Figure 2.7, the data bus consists of two lines. Implementing this is trivial: one only has to use the same column address for two DRAM cells and access them in parallel. The changes to the cell array to implement this are also minimal.

The SDR DRAMs were known simply by their frequency (e.g., PC100 for 100MHz SDR). To make DDR1 DRAM sound better the marketers had to come up with a new scheme since the frequency did not change. They came up with a name which contains the transfer rate in bytes a DDR module (they have 64-bit busses) can sustain:

    100MHz × 64bit × 2 = 1,600MB/s

Hence a DDR module with 100MHz frequency is called PC1600. With 1600 > 100 all marketing requirements are fulfilled; it sounds much better although the improvement is really only a factor of two.12
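The DDR2 and DDR3 generations described next reuse this naming scheme with larger multipliers. A small sketch (illustrative, not part of the original text) of the computation behind the names:

    /* Sustained transfer rate in MB/s of a 64-bit module: the cell
       array frequency in MHz times 8 bytes of bus width times the
       "pump" factor (2 for DDR1, 4 for DDR2, 8 for DDR3).  The
       result is the number used in the PCxxxx module name. */
    unsigned module_rate_mbs(unsigned array_freq_mhz, unsigned pump)
    {
        return array_freq_mhz * 8 * pump;
    }

module_rate_mbs(100, 2) gives the 1,600 of PC1600; module_rate_mbs(133, 4) gives the 4,256 of PC2-4200; module_rate_mbs(200, 8) gives the 12,800 of PC3-12800 (compare Tables 2.1 and 2.2 below).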
Figure 2.12: DDR2 SDRAM Operation [schematic: the cell array runs at f while the I/O buffer and the bus run at 2f]

To get even more out of the memory technology DDR2 includes a bit more innovation. The most obvious change that can be seen in Figure 2.12 is the doubling of the frequency of the bus. Doubling the frequency means doubling the bandwidth. Since this doubling of the frequency is not economical for the cell array it is now required that the I/O buffer gets four bits in each clock cycle which it then can send on the bus. This means the changes to the DDR2 modules consist of making only the I/O buffer component of the DIMM capable of running at higher speeds. This is certainly possible, will not require measurably more energy, and is just one tiny component, not the whole module.

  10 Rows are the granularity this happens with despite what [3] and other literature says (see [18]).
  11 Power = Dynamic Capacity × Voltage² × Frequency.
  12 I will take the factor of two but I do not have to like the inflated numbers.
The names the marketers came up with for DDR2 are similar to the DDR1 names; only in the computation of the value is the factor of two replaced by four (we now have a quad-pumped bus). Table 2.1 shows the names of the modules in use today.

    Array     Bus       Data         Name        Name
    Freq.     Freq.     Rate         (Rate)      (FSB)
    133MHz    266MHz    4,256MB/s    PC2-4200    DDR2-533
    166MHz    333MHz    5,312MB/s    PC2-5300    DDR2-667
    200MHz    400MHz    6,400MB/s    PC2-6400    DDR2-800
    250MHz    500MHz    8,000MB/s    PC2-8000    DDR2-1000
    266MHz    533MHz    8,512MB/s    PC2-8500    DDR2-1066

            Table 2.1: DDR2 Module Names

There is one more twist to the naming. The FSB speed used by CPU, motherboard, and DRAM module is specified by using the effective frequency. I.e., it factors in the transmission on both flanks of the clock cycle and thereby inflates the number. So, a 133MHz module with a 266MHz bus has an FSB “frequency” of 533MHz.
The specification for DDR3 (the real one, not the fake GDDR3 used in graphics cards) calls for more changes along the lines of the transition to DDR2. The voltage will be reduced from 1.8V for DDR2 to 1.5V for DDR3. Since the power consumption equation is calculated using the square of the voltage this alone brings a 30% improvement. Add to this a reduction in die size plus other electrical advances and DDR3 can manage, at the same frequency, to get by with half the power consumption. Alternatively, with higher frequencies, the same power envelope can be hit. Or, with double the capacity, the same heat emission can be achieved.
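The 30% number can be checked with the power formula from footnote 11: holding capacity and frequency constant, the voltage change alone scales the power consumption by

    (1.5V / 1.8V)² ≈ 0.69,

i.e., roughly a 30% reduction.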
The cell array of DDR3 modules will run at a quarter of the speed of the external bus, which requires an 8-bit I/O buffer, up from 4 bits for DDR2. See Figure 2.13 for the schematics.

Figure 2.13: DDR3 SDRAM Operation [schematic: the cell array runs at f while the I/O buffer and the bus run at 4f]
Initially DDR3 modules will likely have slightly higher CAS latencies just because the DDR2 technology is more mature. This would cause DDR3 to be useful only at frequencies which are higher than those which can be achieved with DDR2, and, even then, mostly when bandwidth is more important than latency. There is already talk about 1.3V modules which can achieve the same CAS latency as DDR2. In any case, the possibility of achieving higher speeds because of faster buses will outweigh the increased latency.

One possible problem with DDR3 is that, for 1,600Mb/s transfer rate or higher, the number of modules per channel may be reduced to just one. In earlier versions this requirement held for all frequencies, so one can hope that the requirement will at some point be lifted for all frequencies. Otherwise the capacity of systems will be severely limited.

Table 2.2 shows the names of the DDR3 modules we are likely to see. JEDEC has so far agreed on the first four types. Given that Intel's 45nm processors have an FSB speed of 1,600Mb/s, the 1,866Mb/s speed is needed for the overclocking market. We will likely see more of this towards the end of the DDR3 lifecycle.

    Array     Bus       Data          Name         Name
    Freq.     Freq.     Rate          (Rate)       (FSB)
    100MHz    400MHz    6,400MB/s     PC3-6400     DDR3-800
    133MHz    533MHz    8,512MB/s     PC3-8500     DDR3-1066
    166MHz    667MHz    10,667MB/s    PC3-10667    DDR3-1333
    200MHz    800MHz    12,800MB/s    PC3-12800    DDR3-1600
    233MHz    933MHz    14,933MB/s    PC3-14900    DDR3-1866

            Table 2.2: DDR3 Module Names

All DDR memory has one problem: the increased bus frequency makes it hard to create parallel data busses. A DDR2 module has 240 pins. All connections to data and address pins must be routed so that they have approximately the same length. Even more of a problem is that, if more than one DDR module is to be daisy-chained on the same bus, the signals get more and more distorted for each additional module. The DDR2 specification allows only two modules per bus (aka channel), the DDR3 specification only one module for high frequencies. With 240 pins per channel a single Northbridge cannot reasonably drive more than two channels. The alternative is to have external memory controllers (as in Figure 2.2) but this is expensive.

What this means is that commodity motherboards are restricted to hold at most four DDR2 or DDR3 modules. This restriction severely limits the amount of memory a system can have. Even old 32-bit IA-32 processors can handle 64GB of RAM and memory demand even for home use is growing, so something has to be done.

One answer is to add memory controllers into each processor as explained in section 2. AMD does it with the Opteron line and Intel will do it with their CSI technology. This will help as long as the reasonable amount of memory a processor is able to use can be connected to a single processor. In some situations this is not the case and this setup will introduce a NUMA architecture and its negative effects. For some situations another solution is needed.

Intel's answer to this problem for big server machines, at least at the moment, is called Fully Buffered DRAM (FB-DRAM). The FB-DRAM modules use the same memory chips as today's DDR2 modules which makes them relatively cheap to produce. The difference is in the connection with the memory controller. Instead of a parallel data bus FB-DRAM utilizes a serial bus (Rambus DRAM had this back when, too, and SATA is the successor of PATA, as is PCI Express for PCI/AGP).
The serial bus can be driven at a much higher frequency, reverting the negative impact of the serialization and even increasing the bandwidth. The main effects of using a serial bus are:

    1. more modules per channel can be used.

    2. more channels per Northbridge/memory controller can be used.

    3. the serial bus is designed to be fully-duplex (two lines).

    4. it is cheap enough to implement a differential bus (two lines in each direction) and so increase the speed.

An FB-DRAM module has only 69 pins, compared with the 240 for DDR2. Daisy chaining FB-DRAM modules is much easier since the electrical effects of the bus can be handled much better. The FB-DRAM specification allows up to 8 DRAM modules per channel.

Compared with the connectivity requirements of a dual-channel Northbridge it is now possible to drive 6 channels of FB-DRAM with fewer pins: 2 × 240 pins versus 6 × 69 pins. The routing for each channel is much simpler, which could also help reduce the cost of the motherboards.

Fully duplex parallel busses are prohibitively expensive for traditional DRAM modules; duplicating all those lines is too costly. With serial lines (even if they are differential, as FB-DRAM requires) this is not the case and so the serial bus is designed to be fully duplexed, which means, in some situations, that the bandwidth is theoretically doubled by this alone. But this is not the only place where parallelism is used to increase bandwidth. Since an FB-DRAM controller can run up to six channels at the same time the bandwidth can be increased even for systems with smaller amounts of RAM by using FB-DRAM. Where a DDR2 system with four modules has two channels, the same capacity can be handled via four channels using an ordinary FB-DRAM controller. The actual bandwidth of the serial bus depends on the type of DDR2 (or DDR3) chips used on the FB-DRAM module.

We can summarize the advantages like this:

                     DDR2       FB-DRAM
    Pins              240            69
    Channels            2             6
    DIMMs/Channel       2             8
    Max Memory13      16GB14      192GB
    Throughput15     ~10GB/s     ~40GB/s

  13 Assuming 4GB modules.
  14 An Intel presentation, for some reason I do not see, says 8GB...
  15 Assuming DDR2-800 modules.

There are a few drawbacks to FB-DRAMs if multiple DIMMs on one channel are used. The signal is delayed–albeit minimally–at each DIMM in the chain, thereby increasing the latency. A second problem is that the chip driving the serial bus requires significant amounts of energy because of the very high frequency and the need to drive a bus. But for the same amount of memory with the same frequency FB-DRAM can always be faster than DDR2 and DDR3 since the up to four DIMMs can each get their own channel; for large memory systems DDR simply has no answer using commodity components.

2.2.5 Conclusions

This section should have shown that accessing DRAM is not an arbitrarily fast process. At least not fast compared with the speed the processor is running and with which it can access registers and cache. It is important to keep in mind the differences between CPU and memory frequencies. An Intel Core 2 processor running at 2.933GHz and a 1.066GHz FSB have a clock ratio of 11:1 (note: the 1.066GHz bus is quad-pumped). Each stall of one cycle on the memory bus means a stall of 11 cycles for the processor. For most machines the actual DRAMs used are slower, thus increasing the delay. Keep these numbers in mind when we are talking about stalls in the upcoming sections.

The timing charts for the read command have shown that DRAM modules are capable of high sustained data rates. Entire DRAM rows could be transported without a single stall. The data bus could be kept occupied 100%. For DDR modules this means two 64-bit words transferred each cycle. With DDR2-800 modules and two channels this means a rate of 12.8GB/s.
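Both of these numbers are easy to reproduce; the following sketch (purely illustrative) shows the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        /* Core 2 at 2.933GHz with a quad-pumped 1.066GHz FSB: the
           base bus clock is 1.066GHz / 4, so one bus cycle costs
           about 11 CPU cycles. */
        printf("CPU cycles per bus cycle: %.0f\n",
               2.933 / (1.066 / 4.0));

        /* DDR2-800, two channels: 800M transfers/s times 8 bytes
           per transfer times 2 channels. */
        printf("peak rate: %.1fGB/s\n",
               800e6 * 8 * 2 / 1e9);               /* 12.8GB/s */
        return 0;
    }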
But, unless designed this way, DRAM access is not always sequential. Non-continuous memory regions are used which means precharging and new RAS signals are needed. This is when things slow down and when the DRAM modules need help. The sooner the precharging can happen and the RAS signal sent the smaller the penalty when the row is actually used.

Hardware and software prefetching (see section 6.3) can be used to create more overlap in the timing and reduce the stall. Prefetching also helps shift memory operations in time so that there is less contention at later times, right before the data is actually needed. This is a frequent problem when the data produced in one round has to be stored and the data required for the next round has to be read. By shifting the read in time, the write and read operations do not have to be issued at basically the same time.

2.3 Other Main Memory Users

Besides CPUs there are other system components which can access the main memory. High-performance cards such as network and mass-storage controllers cannot afford to pipe all the data they need or provide through the CPU.
Instead, they read or write the data directly from/to the main memory (Direct Memory Access, DMA). In Figure 2.1 we can see that the cards can talk through the South- and Northbridge directly with the memory. Other buses, like USB, also require FSB bandwidth–even if they do not use DMA–since the Southbridge is connected via the Northbridge to the processor through the FSB, too.

While DMA is certainly beneficial, it means that there is more competition for the FSB bandwidth. In times with high DMA traffic the CPU might stall more than usual while waiting for data from the main memory. There are ways around this given the right hardware. With an architecture as in Figure 2.3 one can make sure the computation uses memory on nodes which are not affected by DMA. It is also possible to attach a Southbridge to each node, equally distributing the load on the FSB of all the nodes. There are a myriad of possibilities. In section 6 we will introduce techniques and programming interfaces which help achieve the improvements which are possible in software.

Finally it should be mentioned that some cheap systems have graphics systems without separate, dedicated video RAM. Those systems use parts of the main memory as video RAM. Since access to the video RAM is frequent (for a 1024x768 display with 16 bpp at 60Hz we are talking about 94MB/s) and system memory, unlike RAM on graphics cards, does not have two ports, this can substantially influence the system's performance and especially the latency. It is best to ignore such systems when performance is a priority. They are more trouble than they are worth. People buying those machines know they will not get the best performance.
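The 94MB/s figure is easy to verify, assuming the whole frame is scanned out 60 times per second:

    1024 × 768 pixels × 2 bytes (16 bpp) × 60Hz = 94,371,840 bytes/s ≈ 94MB/s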
3 CPU Caches

CPUs are today much more sophisticated than they were only 25 years ago. In those days, the frequency of the CPU core was at a level equivalent to that of the memory bus. Memory access was only a bit slower than register access. But this changed dramatically in the early 90s, when CPU designers increased the frequency of the CPU core but the frequency of the memory bus and the performance of RAM chips did not increase proportionally. This is not due to the fact that faster RAM could not be built, as explained in the previous section. It is possible but it is not economical. RAM as fast as current CPU cores is orders of magnitude more expensive than any dynamic RAM.

If the choice is between a machine with very little, very fast RAM and a machine with a lot of relatively fast RAM, the second will always win given a working set size which exceeds the small RAM size and the cost of accessing secondary storage media such as hard drives. The problem here is the speed of secondary storage, usually hard disks, which must be used to hold the swapped out part of the working set. Accessing those disks is orders of magnitude slower than even DRAM access.

Fortunately it does not have to be an all-or-nothing decision. A computer can have a small amount of high-speed SRAM in addition to the large amount of DRAM. One possible implementation would be to dedicate a certain area of the address space of the processor as containing the SRAM and the rest the DRAM. The task of the operating system would then be to optimally distribute data to make use of the SRAM. Basically, the SRAM serves in this situation as an extension of the register set of the processor.

While this is a possible implementation it is not viable. Ignoring the problem of mapping the physical resources of such SRAM-backed memory to the virtual address spaces of the processes (which by itself is terribly hard) this approach would require each process to administer in software the allocation of this memory region. The size of the memory region can vary from processor to processor (i.e., processors have different amounts of the expensive SRAM-backed memory). Each module which makes up part of a program will claim its share of the fast memory, which introduces additional costs through synchronization requirements. In short, the gains of having fast memory would be eaten up completely by the overhead of administering the resources.

So, instead of putting the SRAM under the control of the OS or user, it becomes a resource which is transparently used and administered by the processors. In this mode, SRAM is used to make temporary copies of (to cache, in other words) data in main memory which is likely to be used soon by the processor. This is possible because program code and data has temporal and spatial locality. This means that, over short periods of time, there is a good chance that the same code or data gets reused.
For code this means that there are most likely loops in the code so that the same code gets executed over and over again (the perfect case for spatial locality). Data accesses are also ideally limited to small regions. Even if the memory used over short time periods is not close together there is a high chance that the same data will be reused before long (temporal locality). For code this means, for instance, that in a loop a function call is made and that function is located elsewhere in the address space. The function may be distant in memory, but calls to that function will be close in time. For data it means that the total amount of memory used at one time (the working set size) is ideally limited but the memory used, as a result of the random access nature of RAM, is not close together. Realizing that locality exists is key to the concept of CPU caches as we use them today.
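A trivial C fragment (mine, purely for illustration) exhibits these forms of locality at once:

    #include <stddef.h>

    /* The few loop instructions execute over and over and sit next
       to each other in memory (locality for code); the array
       elements are read sequentially (spatial locality for data)
       and 'sum' is touched on every iteration (temporal locality
       for data). */
    long sum_array(const int *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += a[i];
        return sum;
    }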
A simple computation can show how effective caches can theoretically be. Assume access to main memory takes 200 cycles and access to the cache memory takes 15 cycles. Then code using 100 data elements 100 times each will spend 2,000,000 cycles on memory operations if there is no cache and only 168,500 cycles if all data can be cached. That is an improvement of 91.5%.
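The 168,500 figure assumes each element misses the cache once and hits on its remaining 99 uses; a sketch of the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        const long mem = 200, cache = 15;    /* cycles per access */
        const long elems = 100, uses = 100;

        long uncached = elems * uses * mem;          /* 2,000,000 */
        long cached   = elems * mem                  /* first use misses */
                      + elems * (uses - 1) * cache;  /* 99 more uses hit */
        /* prints 2000000 vs 168500, the ~91.5% improvement above */
        printf("%ld vs %ld cycles\n", uncached, cached);
        return 0;
    }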
The size of the SRAM used for caches is many times smaller than the main memory. In the author's experience with workstations with CPU caches the cache size has always been around 1/1000th of the size of the main memory (today: 4MB cache and 4GB main memory). This alone does not constitute a problem. If the size of the working set (the set of data currently worked on) is smaller than the cache size it does not matter. But computers do not have large main memories for no reason. The working set is bound to be larger than the cache. This is especially true for systems running multiple processes where the size of the working set is the sum of the sizes of all the individual processes and the kernel.

What is needed to deal with the limited size of the cache is a set of good strategies to determine what should be cached at any given time. Since not all data of the working set is used at exactly the same time we can use techniques to temporarily replace some data in the cache with other data. And maybe this can be done before the data is actually needed. This prefetching would remove some of the costs of accessing main memory since it happens asynchronously with respect to the execution of the program. All these techniques and more can be used to make the cache appear bigger than it actually is. We will discuss them in section 3.3. Once all these techniques are exploited it is up to the programmer to help the processor. How this can be done will be discussed in section 6.

3.1 CPU Caches in the Big Picture

Before diving into technical details of the implementation of CPU caches some readers might find it useful to first see in some more detail how caches fit into the “big picture” of a modern computer system.

Figure 3.1: Minimum Cache Configuration [schematic: the CPU core is connected to the cache; the cache and the main memory are attached to the system bus]

Figure 3.1 shows the minimum cache configuration. It corresponds to the architecture which could be found in early systems which deployed CPU caches. The CPU core is no longer directly connected to the main memory.16 All loads and stores have to go through the cache. The connection between the CPU core and the cache is a special, fast connection. In a simplified representation, the main memory and the cache are connected to the system bus which can also be used for communication with other components of the system. We introduced the system bus as “FSB” which is the name in use today; see section 2.2. In this section we ignore the Northbridge; it is assumed to be present to facilitate the communication of the CPU(s) with the main memory.

Even though most computers for the last several decades have used the von Neumann architecture, experience has shown that it is of advantage to separate the caches used for code and for data. Intel has used separate code and data caches since 1993 and never looked back. The memory regions needed for code and data are pretty much independent of each other, which is why independent caches work better. In recent years another advantage emerged: the instruction decoding step for the most common processors is slow; caching decoded instructions can speed up the execution, especially when the pipeline is empty due to incorrectly predicted or impossible-to-predict branches.

Soon after the introduction of the cache the system got more complicated. The speed difference between the cache and the main memory increased again, to a point that another level of cache was added, bigger and slower than the first-level cache. Only increasing the size of the first-level cache was not an option for economical reasons. Today, there are even machines with three levels of cache in regular use. A system with such a processor looks like Figure 3.2. With the increase in the number of cores in a single CPU the number of cache levels might increase even more in the future.

Figure 3.2 shows three levels of cache and introduces the nomenclature we will use in the remainder of the document. L1d is the level 1 data cache, L1i the level 1 instruction cache, etc. Note that this is a schematic; the data flow in reality need not pass through any of the higher-level caches on the way from the core to the main memory.

  16 In even earlier systems the cache was attached to the system bus just like the CPU and the main memory. This was more a hack than a real solution.
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance
What Every Programmer Should Know About Optimizing Memory Performance

More Related Content

Similar to What Every Programmer Should Know About Optimizing Memory Performance

What Every Programmer Should Know About Memory
What Every Programmer Should Know About MemoryWhat Every Programmer Should Know About Memory
What Every Programmer Should Know About MemoryYing wei (Joe) Chou
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfanil0878
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Subhajit Sahu
 
What Have We Lost - A look at some historical techniques
What Have We Lost - A look at some historical techniquesWhat Have We Lost - A look at some historical techniques
What Have We Lost - A look at some historical techniquesLloydMoore
 
Dedupe-Centric Storage for General Applications
Dedupe-Centric Storage for General Applications Dedupe-Centric Storage for General Applications
Dedupe-Centric Storage for General Applications EMC
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperDavid Walker
 
PeerToPeerComputing (1)
PeerToPeerComputing (1)PeerToPeerComputing (1)
PeerToPeerComputing (1)MurtazaB
 
scale_perf_best_practices
scale_perf_best_practicesscale_perf_best_practices
scale_perf_best_practiceswebuploader
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operationniallmilton
 
Operating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptxOperating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptxPrudhvi668506
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersRyousei Takano
 
Operating Systems and Memory Management
Operating Systems and Memory ManagementOperating Systems and Memory Management
Operating Systems and Memory Managementguest1415ae65
 
Symmetric multiprocessing and Microkernel
Symmetric multiprocessing and MicrokernelSymmetric multiprocessing and Microkernel
Symmetric multiprocessing and MicrokernelManoraj Pannerselum
 
[NetApp] Simplified HA:DR Using Storage Solutions
[NetApp] Simplified HA:DR Using Storage Solutions[NetApp] Simplified HA:DR Using Storage Solutions
[NetApp] Simplified HA:DR Using Storage SolutionsPerforce
 

Similar to What Every Programmer Should Know About Optimizing Memory Performance (20)

What Every Programmer Should Know About Memory
What Every Programmer Should Know About MemoryWhat Every Programmer Should Know About Memory
What Every Programmer Should Know About Memory
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)
 
What Have We Lost - A look at some historical techniques
What Have We Lost - A look at some historical techniquesWhat Have We Lost - A look at some historical techniques
What Have We Lost - A look at some historical techniques
 
Dedupe-Centric Storage for General Applications
Dedupe-Centric Storage for General Applications Dedupe-Centric Storage for General Applications
Dedupe-Centric Storage for General Applications
 
Os Lamothe
Os LamotheOs Lamothe
Os Lamothe
 
Ghel os
Ghel osGhel os
Ghel os
 
Ghel os
Ghel osGhel os
Ghel os
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
PeerToPeerComputing (1)
PeerToPeerComputing (1)PeerToPeerComputing (1)
PeerToPeerComputing (1)
 
Memory hierarchy
Memory hierarchyMemory hierarchy
Memory hierarchy
 
scale_perf_best_practices
scale_perf_best_practicesscale_perf_best_practices
scale_perf_best_practices
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operation
 
Wiki 2
Wiki 2Wiki 2
Wiki 2
 
Introducing Parallel Pixie Dust
Introducing Parallel Pixie DustIntroducing Parallel Pixie Dust
Introducing Parallel Pixie Dust
 
Operating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptxOperating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptx
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
 
Operating Systems and Memory Management
Operating Systems and Memory ManagementOperating Systems and Memory Management
Operating Systems and Memory Management
 
Symmetric multiprocessing and Microkernel
Symmetric multiprocessing and MicrokernelSymmetric multiprocessing and Microkernel
Symmetric multiprocessing and Microkernel
 
[NetApp] Simplified HA:DR Using Storage Solutions
[NetApp] Simplified HA:DR Using Storage Solutions[NetApp] Simplified HA:DR Using Storage Solutions
[NetApp] Simplified HA:DR Using Storage Solutions
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
faster than the hard disk. Cache storage was added to the storage devices themselves, which requires no changes in the operating system to increase performance.[1] For the purposes of this paper we will not go into more detail about software optimizations for mass storage access.

Unlike storage subsystems, removing the main memory as a bottleneck has proven much more difficult, and almost all solutions require changes to the hardware, mainly in the forms listed above.

subset of that hardware. Also, many topics will be discussed in just enough detail for the goals of this paper; for such topics, readers are recommended to find more detailed documentation.

When it comes to operating-system-specific details and solutions, the text exclusively describes Linux. At no time will it contain any information about other OSes. The author has no interest in discussing the implications for other OSes.

[1] Changes are needed, however, to guarantee data integrity when using storage device caches.
If the reader thinks s/he has to use a different OS they have to go to their vendors and demand they write documents similar to this one.

One last comment before the start. The text contains a number of occurrences of the term "usually" and other, similar qualifiers. The technology discussed here exists in many, many variations in the real world and this paper only addresses the most common, mainstream versions. It is rare that absolute statements can be made about this technology, thus the qualifiers.

Copyright © 2007 Ulrich Drepper. All rights reserved. No redistribution allowed.
Document Structure

This document is mostly for software developers. It does not go into enough technical details of the hardware to be useful for hardware-oriented readers. But before we can go into the practical information for developers a lot of groundwork must be laid.

To that end, the second section describes random-access memory (RAM) in technical detail. This section's content is nice to know but not absolutely critical for understanding the later sections. Appropriate back references to the section are added in places where the content is required, so that the anxious reader can skip most of this section at first.

The third section goes into a lot of detail about CPU cache behavior. Graphs have been used to keep the text from being as dry as it would otherwise be. This content is essential for an understanding of the rest of the document. Section 4 describes briefly how virtual memory is implemented. This is also required groundwork for the rest.

Section 5 goes into a lot of detail about Non Uniform Memory Access (NUMA) systems.

Section 6 is the central section of this paper. It brings together all the previous sections' information and gives programmers advice on how to write code which performs well in the various situations. The very impatient reader could start with this section and, if necessary, go back to the earlier sections to freshen up the knowledge of the underlying technology.

Section 7 introduces tools which can help the programmer do a better job. Even with a complete understanding of the technology it is far from obvious where the problems are in a non-trivial software project. Some tools are necessary.

In section 8 we finally give an outlook on technology which can be expected in the near future or which might simply be good to have.

Thanks

I would like to thank Johnray Fuller and the crew at LWN (especially Jonathan Corbet) for taking on the daunting task of transforming the author's form of English into something more traditional. Markus Armbruster provided a lot of valuable input on problems and omissions in the text.

About this Document

The title of this paper is an homage to David Goldberg's classic paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic" [12]. Goldberg's paper is still not widely known, although it should be a prerequisite for anybody daring to touch a keyboard for serious programming.

One word on the PDF: xpdf draws some of the diagrams rather poorly. It is recommended that the document be viewed with evince or, if really necessary, Adobe's programs. If you use evince be advised that hyperlinks are used extensively throughout the document even though the viewer does not indicate them like others do.

Reporting Problems

The author intends to update this document for some time. This includes updates made necessary by advances in technology but also corrections of mistakes. Readers willing to report problems are encouraged to send email to the author. They are asked to include exact version information in the report. The version information can be found on the last page of the document.
2 Commodity Hardware Today

It is important to understand commodity hardware because specialized hardware is in retreat. Scaling these days is most often achieved horizontally instead of vertically, meaning that today it is more cost-effective to use many smaller, connected commodity computers instead of a few really large and exceptionally fast (and expensive) systems. This is the case because fast and inexpensive network hardware is widely available. There are still situations where the large specialized systems have their place and these systems still provide a business opportunity, but the overall market is dwarfed by the commodity hardware market. Red Hat, as of 2007, expects that for future products, the "standard building blocks" for most data centers will be a computer with up to four sockets, each filled with a quad core CPU that, in the case of Intel CPUs, will be hyper-threaded.[2] This means the standard system in the data center will have up to 64 virtual processors. Bigger machines will be supported, but the quad socket, quad CPU core case is currently thought to be the sweet spot and most optimizations are targeted for such machines.

Large differences exist in the structure of computers built of commodity parts. That said, we will cover more than 90% of such hardware by concentrating on the most important differences. Note that these technical details tend to change rapidly, so the reader is advised to take the date of this writing into account.

Over the years personal computers and smaller servers standardized on a chipset with two parts: the Northbridge and Southbridge. Figure 2.1 shows this structure.

[Figure 2.1: Structure with Northbridge and Southbridge – CPUs on the FSB, RAM on the Northbridge, SATA/PCI-E/USB on the Southbridge]

All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer. Different types of RAM, such as DRAM, Rambus, and SDRAM, require different memory controllers.

To reach all other system devices, the Northbridge must communicate with the Southbridge. The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses. Today the PCI, PCI Express, SATA, and USB buses are of most importance, but PATA, IEEE 1394, serial, and parallel ports are also supported by the Southbridge. Older systems had AGP slots which were attached to the Northbridge. This was done for performance reasons related to insufficiently fast connections between the Northbridge and Southbridge. However, today the PCI-E slots are all connected to the Southbridge.

Such a system structure has a number of noteworthy consequences:

• All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
• All communication with RAM must pass through the Northbridge.
• The RAM has only a single port.[3]
• Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.

A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost). Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account.

A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels, as they are called for DDR2; see page 8) which doubles the available bandwidth. The Northbridge interleaves memory access across the channels. More recent memory technologies (FB-DRAM, for instance) add more channels.

With limited bandwidth available, it is important for performance to schedule memory access in ways that minimize delays. As we will see, processors are much faster and must wait to access memory, despite the use of CPU caches.

[2] Hyper-threading enables a single processor core to be used for two or more concurrent executions with just a little extra hardware.
[3] We will not discuss multi-port RAM in this document as this type of RAM is not found in commodity hardware, at least not in places where the programmer has access to it. It can be found in specialized hardware such as network routers which depend on utmost speed.
If multiple hyper-threads, cores, or processors access memory at the same time, the wait times for memory access are even longer. This is also true for DMA operations.

There is more to accessing memory than concurrency, however. Access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels. In section 2.2 we will cover more details of RAM access patterns.

On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead the Northbridge can be connected to a number of external memory controllers (in the following example, four of them).

[Figure 2.2: Northbridge with External Controllers – four memory controllers MC1 through MC4, each with its own RAM]

The advantage of this architecture is that more than one memory bus exists and therefore total available bandwidth increases. This design also supports more memory. Concurrent memory access patterns reduce delays by simultaneously accessing different memory banks. This is especially true when multiple processors are directly connected to the Northbridge, as in Figure 2.2. For such a design, the primary limitation is the internal bandwidth of the Northbridge, which is phenomenal for this architecture (from Intel).[4]

Using multiple external memory controllers is not the only way to increase memory bandwidth. One other increasingly popular way is to integrate memory controllers into the CPUs and attach memory to each CPU. This architecture was made popular by SMP systems based on AMD's Opteron processor. Figure 2.3 shows such a system.

[Figure 2.3: Integrated Memory Controller – RAM attached directly to each of CPU1 through CPU4; the Southbridge hangs off one of them]

Intel will have support for the Common System Interface (CSI) starting with the Nehalem processors; this is basically the same approach: an integrated memory controller with the possibility of local memory for each processor.

With an architecture like this there are as many memory banks available as there are processors. On a quad-CPU machine the memory bandwidth is quadrupled without the need for a complicated Northbridge with enormous bandwidth. Having a memory controller integrated into the CPU has some additional advantages; we will not dig deeper into this technology here.

There are disadvantages to this architecture, too. First of all, because the machine still has to make all the memory of the system accessible to all processors, the memory is not uniform anymore (hence the name NUMA – Non-Uniform Memory Architecture – for such an architecture). Local memory (memory attached to a processor) can be accessed at the usual speed. The situation is different when memory attached to another processor is accessed. In this case the interconnects between the processors have to be used. To access memory attached to CPU2 from CPU1 requires communication across one interconnect. When the same CPU accesses memory attached to CPU4, two interconnects have to be crossed. Each such communication has an associated cost. We talk about "NUMA factors" when we describe the extra time needed to access remote memory. The example architecture in Figure 2.3 has two levels for each CPU: immediately adjacent CPUs and one CPU which is two interconnects away. With more complicated machines the number of levels can grow significantly. There are also machine architectures (for instance IBM's x445 and SGI's Altix series) where there is more than one type of connection. CPUs are organized into nodes; within a node the time to access memory might be uniform or have only small NUMA factors. The connection between nodes can be very expensive, though, and the NUMA factor can be quite high.

Commodity NUMA machines exist today and will likely play an even greater role in the future. It is expected that, from late 2008 on, every SMP machine will use NUMA. The costs associated with NUMA make it important to recognize when a program is running on a NUMA machine. In section 5 we will discuss more machine architectures and some technologies the Linux kernel provides for these programs.

Beyond the technical details described in the remainder of this section, there are several additional factors which influence the performance of RAM. They are not controllable by software, which is why they are not covered in this section. The interested reader can learn about some of these factors in section 2.1. They are really only needed to get a more complete picture of RAM technology and possibly to make better decisions when purchasing computers.

[4] For completeness it should be mentioned that such a memory controller arrangement can be used for other purposes such as "memory RAID" which is useful in combination with hotplug memory.
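The "NUMA factors" described above can be inspected from user space on Linux. The following is a minimal sketch, not part of the paper, which assumes the libnuma library (from the numactl package); its numa_distance() call reports, by convention, 10 for local memory and proportionally larger values for remote nodes. Link with -lnuma.

    /* Print the NUMA distance matrix of the machine using libnuma.
     * Illustrative only; section 5 covers the real programming
     * interfaces.  Build: gcc numa-dist.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
      if (numa_available() < 0) {
        fputs("no NUMA support on this machine\n", stderr);
        return 1;
      }
      int max = numa_max_node();
      for (int from = 0; from <= max; ++from) {
        for (int to = 0; to <= max; ++to)
          /* 10 = local access; 20 means roughly twice the cost,
           * i.e., about one interconnect hop away. */
          printf("%3d ", numa_distance(from, to));
        putchar('\n');
      }
      return 0;
    }

On a two-level machine like the one in Figure 2.3 one would expect a matrix with 10 on the diagonal and two distinct off-diagonal values, one per interconnect distance.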
The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. Programmers will likely find this information enlightening since these details explain why RAM access works the way it does. It is optional knowledge, though, and the reader anxious to get to topics with more immediate relevance for everyday life can jump ahead to section 2.2.5.

2.1 RAM Types

There have been many types of RAM over the years and each type varies, sometimes significantly, from the others. The older types are today really only interesting to historians. We will not explore the details of those. Instead we will concentrate on modern RAM types; we will only scrape the surface, exploring some details which are visible to the kernel or application developer through their performance characteristics.

The first interesting details are centered around the question of why there are different types of RAM in the same machine. More specifically, why are there both static RAM (SRAM[5]) and dynamic RAM (DRAM)? The former is much faster and provides the same functionality. Why is not all RAM in a machine SRAM? The answer is, as one might expect, cost. SRAM is much more expensive to produce and to use than DRAM. Both these cost factors are important, the second one increasing in importance more and more. To understand these differences we look at the implementation of a bit of storage for both SRAM and DRAM.

In the remainder of this section we will discuss some low-level details of the implementation of RAM. We will keep the level of detail as low as possible. To that end, we will discuss the signals at a "logic level" and not at a level a hardware designer would have to use. That level of detail is unnecessary for our purpose here.

2.1.1 Static RAM

[Figure 2.4: 6-T Static RAM – six transistors M1 through M6 between the bit lines BL and BL̄, controlled by the word line WL and powered from Vdd]

Figure 2.4 shows the structure of a 6-transistor SRAM cell. The core of this cell is formed by the four transistors M1 to M4 which form two cross-coupled inverters. They have two stable states, representing 0 and 1 respectively. The state is stable as long as power on Vdd is available.

If access to the state of the cell is needed the word access line WL is raised. This makes the state of the cell immediately available for reading on BL and BL̄. If the cell state must be overwritten, the BL and BL̄ lines are first set to the desired values and then WL is raised. Since the outside drivers are stronger than the four transistors (M1 through M4) this allows the old state to be overwritten. See [20] for a more detailed description of the way the cell works.

For the following discussion it is important to note that

• one cell requires six transistors. There are variants with four transistors but they have disadvantages.
• maintaining the state of the cell requires constant power.
• the cell state is available for reading almost immediately once the word access line WL is raised. The signal is as rectangular (changing quickly between the two binary states) as other transistor-controlled signals.
• the cell state is stable; no refresh cycles are needed.

There are other, slower and less power-hungry, SRAM forms available, but those are not of interest here since we are looking at fast RAM. These slow variants are mainly interesting because they can be more easily used in a system than dynamic RAM because of their simpler interface.

2.1.2 Dynamic RAM

Dynamic RAM is, in its structure, much simpler than static RAM. Figure 2.5 shows the structure of a usual DRAM cell design. All it consists of is one transistor and one capacitor. This huge difference in complexity of course means that it functions very differently than static RAM.

[Figure 2.5: 1-T Dynamic RAM – one transistor M and one capacitor C between the access line AL and the data line DL]

A dynamic RAM cell keeps its state in the capacitor C. The transistor M is used to guard access to the state. To read the state of the cell the access line AL is raised; this either causes a current to flow on the data line DL or not, depending on the charge in the capacitor. To write to the cell the data line DL is appropriately set and then AL is raised for a time long enough to charge or drain the capacitor.

There are a number of complications with the design of dynamic RAM. The use of a capacitor means that reading the cell discharges the capacitor.

[5] In other contexts SRAM might mean "synchronous RAM".
The procedure cannot be repeated indefinitely; the capacitor must be recharged at some point. Even worse, to accommodate the huge number of cells (chips with 10^9 or more cells are now common) the capacitance of the capacitor must be low (in the femto-farad range or lower). A fully charged capacitor holds a few tens of thousands of electrons. Even though the resistance of the capacitor is high (a couple of tera-ohms) it only takes a short time for the charge to dissipate. This problem is called "leakage".

This leakage is why a DRAM cell must be constantly refreshed. For most DRAM chips these days this refresh must happen every 64ms. During the refresh cycle no access to the memory is possible since a refresh is simply a memory read operation where the result is discarded. For some workloads this overhead might stall up to 50% of the memory accesses (see [3]).

A second problem resulting from the tiny charge is that the information read from the cell is not directly usable. The data line must be connected to a sense amplifier which can distinguish between a stored 0 or 1 over the whole range of charges which still have to count as 1.

A third problem is that reading a cell causes the charge of the capacitor to be depleted. This means every read operation must be followed by an operation to recharge the capacitor. This is done automatically by feeding the output of the sense amplifier back into the capacitor. It does mean, though, that reading memory content requires additional energy and, more importantly, time.

A fourth problem is that charging and draining a capacitor is not instantaneous. The signals received by the sense amplifier are not rectangular, so a conservative estimate as to when the output of the cell is usable has to be used. The formulas for charging and discharging a capacitor are

    QCharge(t) = Q0 * (1 − e^(−t/RC))
    QDischarge(t) = Q0 * e^(−t/RC)

This means it takes some time (determined by the capacitance C and resistance R) for the capacitor to be charged and discharged. It also means that the current which can be detected by the sense amplifiers is not immediately available. Figure 2.6 shows the charge and discharge curves. The X-axis is measured in units of RC (resistance multiplied by capacitance), which is a unit of time.

[Figure 2.6: Capacitor Charge and Discharge Timing – percentage charge over time, from 1RC to 9RC]

Unlike the static RAM case, where the output is immediately available when the word access line is raised, it will always take a bit of time until the capacitor discharges sufficiently. This delay severely limits how fast DRAM can be.
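To make the curves of Figure 2.6 concrete, here is a small, purely illustrative program (not part of the paper) which evaluates the two formulas at whole multiples of the RC time constant, with Q0 normalized to 100% as on the figure's Y axis. Link with -lm.

    /* Evaluate the charge/discharge formulas from this section at
     * t = n*RC for n = 1..9, matching the X axis of Figure 2.6. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
      for (int n = 1; n <= 9; ++n) {
        double charge    = 100.0 * (1.0 - exp(-(double)n));
        double discharge = 100.0 * exp(-(double)n);
        printf("t=%dRC  charge=%5.1f%%  discharge=%5.1f%%\n",
               n, charge, discharge);
      }
      return 0;
    }

After one RC the cell has reached only about 63% of full charge; this slow, non-rectangular slope is exactly what the sense amplifiers have to wait out.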
The simple approach has its advantages, too. The main advantage is size. The chip real estate needed for one DRAM cell is many times smaller than that of an SRAM cell. The SRAM cells also need individual power for the transistors maintaining the state. The structure of the DRAM cell is also simpler and more regular, which means packing many of them close together on a die is simpler.

Overall, the (quite dramatic) difference in cost wins. Except in specialized hardware – network routers, for example – we have to live with main memory which is based on DRAM. This has huge implications for the programmer which we will discuss in the remainder of this paper. But first we need to look into a few more details of the actual use of DRAM cells.

2.1.3 DRAM Access

A program selects a memory location using a virtual address. The processor translates this into a physical address and finally the memory controller selects the RAM chip corresponding to that address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of a number of address lines.

It would be completely impractical to address memory locations individually from the memory controller: 4GB of RAM would require 2^32 address lines. Instead the address is passed encoded as a binary number using a smaller set of address lines. The address passed to the DRAM chip this way must be demultiplexed first. A demultiplexer with N address lines will have 2^N output lines. These output lines can be used to select the memory cell. Using this direct approach is no big problem for chips with small capacities.

But if the number of cells grows this approach is not suitable anymore. A chip with 1Gbit[6] capacity would need 30 address lines and 2^30 select lines. The size of a demultiplexer increases exponentially with the number of input lines when speed is not to be sacrificed. A demultiplexer for 30 address lines needs a whole lot of chip real estate in addition to the complexity (size and time) of the demultiplexer. Even more importantly, transmitting 30 impulses on the address lines synchronously is much harder than transmitting "only" 15 impulses. Fewer lines have to be laid out at exactly the same length or timed appropriately.[7]

[6] I hate those SI prefixes. For me a giga-bit will always be 2^30 and not 10^9 bits.
[7] Modern DRAM types like DDR3 can automatically adjust the timing but there is a limit as to what can be tolerated.
A secondary scalability problem is that having 30 address lines connected to every RAM chip is not feasible either. Pins of a chip are precious resources. It is "bad" enough that the data must be transferred as much as possible in parallel (e.g., in 64 bit batches). The memory controller must be able to address each RAM module (collection of RAM chips). If parallel access to multiple RAM modules is required for performance reasons and each RAM module requires its own set of 30 or more address lines, then the memory controller needs to have, for 8 RAM modules, a whopping 240+ pins only for the address handling.

To counter these secondary scalability problems DRAM chips have, for a long time, multiplexed the address itself. That means the address is transferred in two parts. The first part, consisting of address bits (a0 and a1 in the example in Figure 2.7), selects the row. This selection remains active until revoked. Then the second part, address bits a2 and a3, selects the column. The crucial difference is that only two external address lines are needed. A few more lines are needed to indicate when the RAS and CAS signals are available, but this is a small price to pay for cutting the number of address lines in half. This address multiplexing brings its own set of problems, though. We will discuss them in section 2.2.
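The two-part address transfer can be made concrete with a few lines of code. In the sketch below, ROW_BITS and COL_BITS are invented constants matching the 2+2-bit toy chip of Figure 2.7; a real memory controller uses far more bits and a bit layout of its own choosing.

    /* Illustration only: split a cell address into the two halves
     * that travel over the multiplexed address bus, row first. */
    #include <stdio.h>

    #define ROW_BITS 2   /* sent first, qualified by RAS */
    #define COL_BITS 2   /* sent second, qualified by CAS */

    int main(void) {
      for (unsigned addr = 0; addr < (1u << (ROW_BITS + COL_BITS)); ++addr) {
        unsigned row = addr >> COL_BITS;
        unsigned col = addr & ((1u << COL_BITS) - 1u);
        printf("cell %2u -> row %u, column %u\n", addr, row, col);
      }
      return 0;
    }

The point of the exercise: all 16 cells are reachable with only two external address lines, at the price of two transfers per access.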
[Figure 2.7: Dynamic RAM Schematic – address lines a0/a1 feed the row address selection (RAS) demultiplexer, a2/a3 the column address selection (CAS) multiplexer, which drives the data pin]

Figure 2.7 shows a DRAM chip at a very high level. The DRAM cells are organized in rows and columns. They could all be aligned in one row but then the DRAM chip would need a huge demultiplexer. With the array approach the design can get by with one demultiplexer and one multiplexer of half the size.[8] This is a huge saving on all fronts. In the example the address lines a0 and a1 through the row address selection (RAS)[9] demultiplexer select the address lines of a whole row of cells. When reading, the content of all cells is thus made available to the column address selection (CAS)[9] multiplexer. Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.

For writing, the new cell value is put on the data bus and, when the cell is selected using the RAS and CAS, it is stored in the cell. A pretty straightforward design. There are in reality – obviously – many more complications. There need to be specifications for how much delay there is after the signal before the data will be available on the data bus for reading. The capacitors do not unload instantaneously, as described in the previous section. The signal from the cells is so weak that it needs to be amplified. For writing it must be specified how long the data must be available on the bus after the RAS and CAS is done to successfully store the new value in the cell (again, capacitors do not fill or drain instantaneously). These timing constants are crucial for the performance of the DRAM chip. We will talk about this in the next section.

2.1.4 Conclusions

Do not worry if the details in this section are a bit overwhelming. The important things to take away from this section are:

• there are reasons why not all memory is SRAM
• memory cells need to be individually selected to be used
• the number of address lines is directly responsible for the cost of the memory controller, motherboards, DRAM module, and DRAM chip
• it takes a while before the results of the read or write operation are available

The following section will go into more details about the actual process of accessing DRAM memory. We are not going into more details of accessing SRAM, which is usually directly addressed. This happens for speed and because the SRAM memory is limited in size. SRAM is currently used in CPU caches and on-die where the connections are small and fully under control of the CPU designer. CPU caches are a topic which we discuss later, but all we need to know is that SRAM cells have a certain maximum speed which depends on the effort spent on the SRAM. The speed can vary from only slightly slower than the CPU core to one or two orders of magnitude slower.

[8] Multiplexers and demultiplexers are equivalent and the multiplexer here needs to work as a demultiplexer when writing. So we will drop the differentiation from now on.
[9] The line over the name in the figure indicates that the signal is negated.
2.2 DRAM Access Technical Details

In the section introducing DRAM we saw that DRAM chips multiplex the addresses in order to save resources in the form of address pins. We also saw that accessing DRAM cells takes time since the capacitors in those cells do not discharge instantaneously to produce a stable signal; we also saw that DRAM cells must be refreshed. Now it is time to put this all together and see how all these factors determine how the DRAM access has to happen.

We will concentrate on current technology; we will not discuss asynchronous DRAM and its variants as they are simply not relevant anymore. Readers interested in this topic are referred to [3] and [19]. We will also not talk about Rambus DRAM (RDRAM) even though the technology is not obsolete. It is just not widely used for system memory. We will concentrate exclusively on Synchronous DRAM (SDRAM) and its successors, Double Data Rate DRAM (DDR).

Synchronous DRAM, as the name suggests, works relative to a time source. The memory controller provides a clock, the frequency of which determines the speed of the Front Side Bus (FSB) – the memory controller interface used by the DRAM chips. As of this writing, frequencies of 800MHz, 1,066MHz, or even 1,333MHz are available, with higher frequencies (1,600MHz) being announced for the next generation. This does not mean the frequency used on the bus is actually this high. Instead, today's buses are double- or quad-pumped, meaning that data is transported two or four times per cycle. Higher numbers sell, so the manufacturers like to advertise a quad-pumped 200MHz bus as an "effective" 800MHz bus.

For SDRAM today each data transfer consists of 64 bits – 8 bytes. The transfer rate of the FSB is therefore 8 bytes multiplied by the effective bus frequency (6.4GB/s for the quad-pumped 200MHz bus). That sounds like a lot, but it is the burst speed, the maximum speed which will never be surpassed. As we will see now, the protocol for talking to the RAM modules has a lot of downtime when no data can be transmitted. It is exactly this downtime which we must understand and minimize to achieve the best performance.
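The burst-rate arithmetic from the previous paragraph is easy to reproduce; the helper function below is an illustration of the calculation, not any real API.

    /* Burst bandwidth of a pumped FSB: 64-bit bus width times
     * transfers per cycle times the base clock. */
    #include <stdio.h>

    static double burst_gb_per_s(double clock_mhz, int pump) {
      return 8.0 * pump * clock_mhz * 1e6 / 1e9;  /* 8 bytes wide */
    }

    int main(void) {
      /* The quad-pumped 200MHz bus from the text: 6.4GB/s. */
      printf("quad-pumped 200MHz:   %.1f GB/s\n", burst_gb_per_s(200, 4));
      /* A double-pumped 533MHz DDR2 bus lands near 8.5GB/s. */
      printf("double-pumped 533MHz: %.1f GB/s\n", burst_gb_per_s(533, 2));
      return 0;
    }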
2.2.1 Read Access Protocol

[Figure 2.8: SDRAM Read Access Timing – CLK, RAS, CAS, address and DQ lines; tRCD between the row and column addresses, CL before the data appears]

Figure 2.8 shows the activity on some of the connectors of a DRAM module, which happens in three differently colored phases. As usual, time flows from left to right. A lot of details are left out. Here we only talk about the bus clock, the RAS and CAS signals, and the address and data buses. A read cycle begins with the memory controller making the row address available on the address bus and lowering the RAS signal. All signals are read on the rising edge of the clock (CLK), so it does not matter if the signal is not completely square as long as it is stable at the time it is read. Setting the row address causes the RAM chip to start latching the addressed row.

The CAS signal can be sent after tRCD (RAS-to-CAS Delay) clock cycles. The column address is then transmitted by making it available on the address bus and lowering the CAS line. Here we can see how the two parts of the address (more or less halves, nothing else makes sense) can be transmitted over the same address bus.

Now the addressing is complete and the data can be transmitted. The RAM chip needs some time to prepare for this. The delay is usually called CAS Latency (CL). In Figure 2.8 the CAS latency is 2. It can be higher or lower, depending on the quality of the memory controller, motherboard, and DRAM module. The latency can also have half values. With CL=2.5 the first data would be available at the first falling flank in the blue area.

With all this preparation to get to the data it would be wasteful to only transfer one data word. This is why DRAM modules allow the memory controller to specify how much data is to be transmitted. Often the choice is between 2, 4, or 8 words. This allows filling entire lines in the caches without a new RAS/CAS sequence. It is also possible for the memory controller to send a new CAS signal without resetting the row selection. In this way, consecutive memory addresses can be read from or written to significantly faster because the RAS signal does not have to be sent and the row does not have to be deactivated (see below). Keeping the row "open" is something the memory controller has to decide. Speculatively leaving it open all the time has disadvantages with real-world applications (see [3]). Sending new CAS signals is only subject to the Command Rate of the RAM module (usually specified as Tx, where x is a value like 1 or 2; it will be 1 for high-performance DRAM modules which accept new commands every cycle).

In this example the SDRAM spits out one word per cycle. This is what the first generation does. DDR is able to transmit two words per cycle. This cuts down on the transfer time but does not change the latency. In principle, DDR2 works the same although in practice it looks different. There is no need to go into the details here. It is sufficient to note that DDR2 can be made faster, cheaper, more reliable, and more energy efficient (see [6] for more information).
2.2.2 Precharge and Activation

Figure 2.8 does not cover the whole cycle. It only shows parts of the full cycle of accessing DRAM. Before a new RAS signal can be sent the currently latched row must be deactivated and the new row must be precharged. We can concentrate here on the case where this is done with an explicit command. There are improvements to the protocol which, in some situations, allow this extra step to be avoided. The delays introduced by precharging still affect the operation, though.

[Figure 2.9: SDRAM Precharge and Activation – the precharge is issued via WE and RAS; tRP cycles pass before the next row can be selected]

Figure 2.9 shows the activity starting from one CAS signal to the CAS signal for another row. The data requested with the first CAS signal is available as before, after CL cycles. In the example two words are requested which, on a simple SDRAM, takes two cycles to transmit. Alternatively, imagine four words on a DDR chip.

Even on DRAM modules with a command rate of one, the precharge command cannot be issued right away. It is necessary to wait as long as it takes to transmit the data. In this case it takes two cycles. This happens to be the same as CL but that is just a coincidence. The precharge signal has no dedicated line; instead, some implementations issue it by lowering the Write Enable (WE) and RAS lines simultaneously. This combination has no useful meaning by itself (see [18] for encoding details).

Once the precharge command is issued it takes tRP (Row Precharge time) cycles until the row can be selected. In Figure 2.9 much of the time (indicated by the purplish color) overlaps with the memory transfer (light blue). This is good! But tRP is larger than the transfer time and so the next RAS signal is stalled for one cycle.

If we were to continue the timeline in the diagram we would find that the next data transfer happens 5 cycles after the previous one stops. This means the data bus is only in use two cycles out of seven. Multiply this with the FSB speed and the theoretical 6.4GB/s for an 800MHz bus becomes 1.8GB/s. That is bad and must be avoided. The techniques described in section 6 help to raise this number. But the programmer usually has to do her share.

There is one more timing value for an SDRAM module which we have not discussed. In Figure 2.9 the precharge command was only limited by the data transfer time. Another constraint is that an SDRAM module needs time after a RAS signal before it can precharge another row (denoted as tRAS). This number is usually pretty high, on the order of two or three times the tRP value. This is a problem if, after a RAS signal, only one CAS signal follows and the data transfer is finished in a few cycles. Assume that in Figure 2.9 the initial CAS signal was preceded directly by a RAS signal and that tRAS is 8 cycles. Then the precharge command would have to be delayed by one additional cycle since the sum of tRCD, CL, and tRP (since it is larger than the data transfer time) is only 7 cycles.

DDR modules are often described using a special notation: w-x-y-z-T. For instance: 2-3-2-8-T1. This means:

    w   2    CAS Latency (CL)
    x   3    RAS-to-CAS delay (tRCD)
    y   2    RAS Precharge (tRP)
    z   8    Active to Precharge delay (tRAS)
    T   T1   Command Rate

There are numerous other timing constants which affect the way commands can be issued and are handled. Those five constants are in practice sufficient to determine the performance of the module, though.
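To translate the five constants into wall-clock terms, consider the sketch below. The struct, the helper names, and the 200MHz memory clock are assumptions made for illustration, not part of any real API.

    /* Decode the "2-3-2-8-T1" notation into nanoseconds. */
    #include <stdio.h>

    struct ddr_timings { int cl, trcd, trp, tras, cmd_rate; };

    int main(void) {
      struct ddr_timings t = { 2, 3, 2, 8, 1 };  /* 2-3-2-8-T1 */
      double mem_clock_mhz = 200.0;              /* assumed clock */
      double ns_per_cycle = 1000.0 / mem_clock_mhz;

      /* Reading from a closed row costs precharge, then row
       * activation, then the CAS latency. */
      int cycles = t.trp + t.trcd + t.cl;
      printf("time to first word: %d cycles = %.1f ns\n",
             cycles, cycles * ns_per_cycle);
      return 0;
    }

For this module the seven cycles mentioned in the text come to 35ns at a 200MHz memory clock, before a single byte has crossed the bus.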
It is sometimes useful to know this information for the computers in use, to be able to interpret certain measurements. It is definitely useful to know these details when buying computers since they, along with the FSB and SDRAM module speed, are among the most important factors determining a computer's speed.

The very adventurous reader could also try to tweak a system. Sometimes the BIOS allows changing some or all of these values. SDRAM modules have programmable registers where these values can be set. Usually the BIOS picks the best default value. If the quality of the RAM module is high it might be possible to reduce one or the other latency without affecting the stability of the computer. Numerous overclocking websites all around the Internet provide ample documentation for doing this. Do it at your own risk, though, and do not say you have not been warned.

2.2.3 Recharging

A mostly-overlooked topic when it comes to DRAM access is recharging. As explained in section 2.1.2, DRAM cells must constantly be refreshed. This does not happen completely transparently for the rest of the system. At times when a row[10] is recharged no access is possible. The study in [3] found that "[s]urprisingly, DRAM refresh organization can affect performance dramatically".

[10] Rows are the granularity this happens with despite what [3] and other literature says (see [18]).
Each DRAM cell must be refreshed every 64ms according to the JEDEC (Joint Electron Device Engineering Council) specification. If a DRAM array has 8,192 rows this means the memory controller has to issue a refresh command on average every 7.8125µs (refresh commands can be queued, so in practice the maximum interval between two requests can be higher). It is the memory controller's responsibility to schedule the refresh commands. The DRAM module keeps track of the address of the last refreshed row and automatically increases the address counter for each new request.

There is really not much the programmer can do about the refresh and the points in time when the commands are issued. But it is important to keep this part of the DRAM life cycle in mind when interpreting measurements. If a critical word has to be retrieved from a row which is currently being refreshed, the processor could be stalled for quite a long time. How long each refresh takes depends on the DRAM module.

2.2.4 Memory Types

It is worth spending some time on the current and soon-to-be-current memory types in use. We will start with SDR (Single Data Rate) SDRAMs since they are the basis of the DDR (Double Data Rate) SDRAMs. SDRs were pretty simple. The memory cells and the data transfer rate were identical.

[Figure 2.10: SDR SDRAM Operation – cell array and bus both running at frequency f]

In Figure 2.10 the DRAM cell array can output the memory content at the same rate it can be transported over the memory bus. If the DRAM cell array can operate at 100MHz, the data transfer rate of the bus of a single cell is thus 100Mb/s. The frequency f for all components is the same. Increasing the throughput of the DRAM chip is expensive since the energy consumption rises with the frequency. With a huge number of array cells this is prohibitively expensive.[11] In reality it is even more of a problem since increasing the frequency usually also requires increasing the voltage to maintain stability of the system. DDR SDRAM (called DDR1 retroactively) manages to improve the throughput without increasing any of the involved frequencies.

[Figure 2.11: DDR1 SDRAM Operation – cell array at f, an I/O buffer holding two bits per data line, double-pumped bus at f]

The difference between SDR and DDR1 is, as can be seen in Figure 2.11 and guessed from the name, that twice the amount of data is transported per cycle. I.e., the DDR1 chip transports data on the rising and falling edge. This is sometimes called a "double-pumped" bus. To make this possible without increasing the frequency of the cell array a buffer has to be introduced. This buffer holds two bits per data line. This in turn requires that, in the cell array in Figure 2.7, the data bus consists of two lines. Implementing this is trivial: one only has to use the same column address for two DRAM cells and access them in parallel. The changes to the cell array to implement this are also minimal.

The SDR DRAMs were known simply by their frequency (e.g., PC100 for 100MHz SDR). To make DDR1 DRAM sound better the marketers had to come up with a new scheme since the frequency did not change. They came up with a name which contains the transfer rate in bytes a DDR module (they have 64-bit busses) can sustain:

    100MHz × 64bit × 2 = 1,600MB/s

Hence a DDR module with 100MHz frequency is called PC1600. With 1600 > 100 all marketing requirements are fulfilled; it sounds much better although the improvement is really only a factor of two.[12]

[11] Power = Dynamic Capacity × Voltage² × Frequency.
[12] I will take the factor of two but I do not have to like the inflated numbers.
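The marketing-name arithmetic above fits in a few lines; the following is a mere illustration of the formula, not any real tool.

    /* Sustained rate of a 64-bit module: clock * pump * 8 bytes. */
    #include <stdio.h>

    int main(void) {
      double clock_mhz = 100.0;
      int pump = 2;  /* DDR1: data on rising and falling edge */
      double mb_per_s = clock_mhz * pump * 8.0;
      printf("DDR-%.0f module: %.0f MB/s -> PC%.0f\n",
             clock_mhz * pump, mb_per_s, mb_per_s);
      return 0;
    }

Run as-is this prints "DDR-200 module: 1600 MB/s -> PC1600", reproducing the naming example in the text.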
[Figure 2.12: DDR2 SDRAM Operation – cell array at f, I/O buffer and bus at 2f]

To get even more out of the memory technology, DDR2 includes a bit more innovation. The most obvious change that can be seen in Figure 2.12 is the doubling of the frequency of the bus. Doubling the frequency means doubling the bandwidth. Since this doubling of the frequency is not economical for the cell array, it is now required that the I/O buffer gets four bits in each clock cycle which it can then send on the bus. This means the changes to the DDR2 modules consist of making only the I/O buffer component of the DIMM capable of running at higher speeds. This is certainly possible and does not require measurably more energy; it is just one tiny component and not the whole module. The names the
marketers came up with for DDR2 are similar to the DDR1 names, only in the computation of the value the factor of two is replaced by four (we now have a quad-pumped bus). Table 2.1 shows the names of the modules in use today.

    Array Freq.  Bus Freq.  Data Rate   Name (Rate)  Name (FSB)
    133MHz       266MHz     4,256MB/s   PC2-4200     DDR2-533
    166MHz       333MHz     5,312MB/s   PC2-5300     DDR2-667
    200MHz       400MHz     6,400MB/s   PC2-6400     DDR2-800
    250MHz       500MHz     8,000MB/s   PC2-8000     DDR2-1000
    266MHz       533MHz     8,512MB/s   PC2-8500     DDR2-1066

    Table 2.1: DDR2 Module Names

There is one more twist to the naming. The FSB speed used by CPU, motherboard, and DRAM module is specified by using the effective frequency. I.e., it factors in the transmission on both flanks of the clock cycle and thereby inflates the number. So, a 133MHz module with a 266MHz bus has an FSB "frequency" of 533MHz.

The specification for DDR3 (the real one, not the fake GDDR3 used in graphics cards) calls for more changes along the lines of the transition to DDR2. The voltage will be reduced from 1.8V for DDR2 to 1.5V for DDR3. Since the power consumption equation is calculated using the square of the voltage, this alone brings a 30% improvement. Add to this a reduction in die size plus other electrical advances and DDR3 can manage, at the same frequency, to get by with half the power consumption. Alternatively, with higher frequencies, the same power envelope can be hit. Or, with double the capacity, the same heat emission can be achieved.
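The 30% figure follows directly from the power equation quoted in footnote [11]: holding the dynamic capacity and the frequency constant, only the voltage ratio matters:

    P ∝ Cdyn × V² × f    =>    P(DDR3) / P(DDR2) = (1.5V / 1.8V)² ≈ 0.69

which is roughly a 30% reduction in power at unchanged frequency, before any of the other electrical improvements are counted.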
The cell array of DDR3 modules will run at a quarter of the speed of the external bus, which requires an 8 bit I/O buffer, up from 4 bits for DDR2. See Figure 2.13 for the schematics.

[Figure 2.13: DDR3 SDRAM Operation – cell array at f, I/O buffer and bus at 4f]

Initially DDR3 modules will likely have slightly higher CAS latencies, just because the DDR2 technology is more mature. This would cause DDR3 to be useful only at frequencies which are higher than those which can be achieved with DDR2, and, even then, mostly when bandwidth is more important than latency. There is already talk about 1.3V modules which can achieve the same CAS latency as DDR2. In any case, the possibility of achieving higher speeds because of faster buses will outweigh the increased latency.

One possible problem with DDR3 is that, for 1,600Mb/s transfer rate or higher, the number of modules per channel may be reduced to just one. In earlier versions this requirement held for all frequencies, so one can hope that the requirement will at some point be lifted for all frequencies. Otherwise the capacity of systems will be severely limited.

Table 2.2 shows the names of the DDR3 modules we are likely to see. JEDEC has agreed so far on the first four types. Given that Intel's 45nm processors have an FSB speed of 1,600Mb/s, the 1,866Mb/s speed is needed for the overclocking market. We will likely see more of this towards the end of the DDR3 lifecycle.

    Array Freq.  Bus Freq.  Data Rate    Name (Rate)  Name (FSB)
    100MHz       400MHz     6,400MB/s    PC3-6400     DDR3-800
    133MHz       533MHz     8,512MB/s    PC3-8500     DDR3-1066
    166MHz       667MHz     10,667MB/s   PC3-10667    DDR3-1333
    200MHz       800MHz     12,800MB/s   PC3-12800    DDR3-1600
    233MHz       933MHz     14,933MB/s   PC3-14900    DDR3-1866

    Table 2.2: DDR3 Module Names

All DDR memory has one problem: the increased bus frequency makes it hard to create parallel data busses. A DDR2 module has 240 pins. All connections to data and address pins must be routed so that they have approximately the same length. Even more of a problem is that, if more than one DDR module is to be daisy-chained on the same bus, the signals get more and more distorted with each additional module. The DDR2 specification allows only two modules per bus (aka channel), the DDR3 specification only one module for high frequencies. With 240 pins per channel a single Northbridge cannot reasonably drive more than two channels. The alternative is to have external memory controllers (as in Figure 2.2), but this is expensive.

What this means is that commodity motherboards are restricted to holding at most four DDR2 or DDR3 modules. This restriction severely limits the amount of memory a system can have. Even old 32-bit IA-32 processors can handle 64GB of RAM, and memory demand even for home use is growing, so something has to be done.

One answer is to add memory controllers into each processor as explained in section 2. AMD does it with the Opteron line and Intel will do it with their CSI technology. This will help as long as the reasonable amount of memory a processor is able to use can be connected to a single processor. In some situations this is not the case and this setup will introduce a NUMA architecture and its negative effects. For some situations another solution is needed.

Intel's answer to this problem for big server machines, at least at the moment, is called Fully Buffered DRAM (FB-DRAM). The FB-DRAM modules use the same memory chips as today's DDR2 modules, which makes them relatively cheap to produce. The difference is in the connection with the memory controller. Instead of a parallel data bus, FB-DRAM utilizes a serial bus (Rambus DRAM had this back when, too, and SATA is the successor of PATA, as is PCI Express for PCI/AGP). The serial bus can be driven at a much higher frequency, reverting the negative impact of the serialization and even increasing the bandwidth.
The main effects of using a serial bus are:

1. more modules per channel can be used.
2. more channels per Northbridge/memory controller can be used.
3. the serial bus is designed to be fully-duplex (two lines).
4. it is cheap enough to implement a differential bus (two lines in each direction) and so increase the speed.

An FB-DRAM module has only 69 pins, compared with the 240 for DDR2. Daisy-chaining FB-DRAM modules is much easier since the electrical effects of the bus can be handled much better. The FB-DRAM specification allows up to 8 DRAM modules per channel.

Compared with the connectivity requirements of a dual-channel Northbridge, it is now possible to drive 6 channels of FB-DRAM with fewer pins: 2 × 240 pins versus 6 × 69 pins. The routing for each channel is much simpler, which could also help reduce the cost of the motherboards.

Fully duplex parallel busses are prohibitively expensive for the traditional DRAM modules; duplicating all those lines is too costly. With serial lines (even if they are differential, as FB-DRAM requires) this is not the case and so the serial bus is designed to be fully duplexed, which means, in some situations, that the bandwidth is theoretically doubled by this alone. But it is not the only place where parallelism is used for bandwidth increase. Since an FB-DRAM controller can run up to six channels at the same time, the bandwidth can be increased even for systems with smaller amounts of RAM by using FB-DRAM. Where a DDR2 system with four modules has two channels, the same capacity can be handled via four channels using an ordinary FB-DRAM controller. The actual bandwidth of the serial bus depends on the type of DDR2 (or DDR3) chips used on the FB-DRAM module.

We can summarize the advantages like this:

                      DDR2       FB-DRAM
    Pins              240        69
    Channels          2          6
    DIMMs/Channel     2          8
    Max Memory[13]    16GB[14]   192GB
    Throughput[15]    ~10GB/s    ~40GB/s

[13] Assuming 4GB modules.
[14] An Intel presentation, for some reason I do not see, says 8GB.
[15] Assuming DDR2-800 modules.

There are a few drawbacks to FB-DRAMs if multiple DIMMs on one channel are used. The signal is delayed – albeit minimally – at each DIMM in the chain, thereby increasing the latency. A second problem is that the chip driving the serial bus requires significant amounts of energy because of the very high frequency and the need to drive a bus. But for the same amount of memory with the same frequency FB-DRAM can always be faster than DDR2 and DDR3 since the up to four DIMMs can each get their own channel; for large memory systems DDR simply has no answer using commodity components.

2.2.5 Conclusions

This section should have shown that accessing DRAM is not an arbitrarily fast process. At least not fast compared with the speed the processor is running at and with which it can access registers and cache. It is important to keep in mind the differences between CPU and memory frequencies. An Intel Core 2 processor running at 2.933GHz and a 1.066GHz FSB have a clock ratio of 11:1 (note: the 1.066GHz bus is quad-pumped). Each stall of one cycle on the memory bus means a stall of 11 cycles for the processor. For most machines the actual DRAMs used are slower, thus increasing the delay. Keep these numbers in mind when we are talking about stalls in the upcoming sections.

The timing charts for the read command have shown that DRAM modules are capable of high sustained data rates. Entire DRAM rows could be transported without a single stall. The data bus could be kept occupied 100%. For DDR modules this means two 64-bit words transferred each cycle. With DDR2-800 modules and two channels this means a rate of 12.8GB/s.

But, unless designed this way, DRAM access is not always sequential. Non-continuous memory regions are used, which means precharging and new RAS signals are needed. This is when things slow down and when the DRAM modules need help.
The sooner the precharging can happen and the RAS signal be sent, the smaller the penalty when the row is actually used.

Hardware and software prefetching (see section 6.3) can be used to create more overlap in the timing and reduce the stall. Prefetching also helps shift memory operations in time so that there is less contention at later times, right before the data is actually needed. This is a frequent problem when the data produced in one round has to be stored and the data required for the next round has to be read. By shifting the read in time, the write and read operations do not have to be issued at basically the same time.
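As a small taste of what software prefetching looks like (section 6.3 is the authoritative treatment), the sketch below uses GCC's __builtin_prefetch built-in. The prefetch distance of eight elements is an arbitrary illustration, and for a purely sequential walk like this the hardware prefetchers may already do the job on their own.

    /* Ask for the data of a later iteration while working on the
     * current one, so the DRAM latency overlaps with computation. */
    #include <stddef.h>

    long sum(const long *a, size_t n) {
      long s = 0;
      for (size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
          /* args: address, rw (0 = read), temporal locality (0..3) */
          __builtin_prefetch(&a[i + 8], 0, 3);
        s += a[i];
      }
      return s;
    }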
2.3 Other Main Memory Users

Beside CPUs there are other system components which can access the main memory. High-performance cards such as network and mass-storage controllers cannot afford to pipe all the data they need or provide through the CPU. Instead, they read or write the data directly from/to the main memory (Direct Memory Access, DMA). In Figure 2.1 we can see that the cards can talk through the South- and Northbridge directly with the memory. Other buses, like USB, also require FSB bandwidth (even if they do not use DMA) since the Southbridge is connected to the processor via the Northbridge, and therefore through the FSB, too.

While DMA is certainly beneficial, it means that there is more competition for the FSB bandwidth. In times of high DMA traffic the CPU might stall more than usual while waiting for data from the main memory. There are ways around this given the right hardware. With an architecture as in Figure 2.3 one can make sure the computation uses memory on nodes which are not affected by DMA. It is also possible to attach a Southbridge to each node, equally distributing the load on the FSBs of all the nodes. There are a myriad of possibilities. In section 6 we will introduce techniques and programming interfaces which help achieve the improvements which are possible in software.

Finally it should be mentioned that some cheap systems have graphics systems without separate, dedicated video RAM. Those systems use parts of the main memory as video RAM. Since access to the video RAM is frequent (for a 1024x768 display with 16 bpp at 60Hz we are talking about 94MB/s: 1024 × 768 pixels × 2 bytes × 60 frames per second) and system memory, unlike RAM on graphics cards, does not have two ports, this can substantially influence the system’s performance and especially the latency. It is best to ignore such systems when performance is a priority. They are more trouble than they are worth. People buying those machines know they will not get the best performance.

3 CPU Caches

CPUs are today much more sophisticated than they were only 25 years ago. In those days, the frequency of the CPU core was at a level equivalent to that of the memory bus. Memory access was only a bit slower than register access. But this changed dramatically in the early 90s, when CPU designers increased the frequency of the CPU core while the frequency of the memory bus and the performance of RAM chips did not increase proportionally. This is not because faster RAM could not be built, as explained in the previous section. It is possible, but it is not economical. RAM as fast as current CPU cores is orders of magnitude more expensive than any dynamic RAM.

If the choice is between a machine with very little, very fast RAM and a machine with a lot of relatively fast RAM, the second will always win given a working set size which exceeds the small RAM size and the cost of accessing secondary storage media such as hard drives. The problem here is the speed of secondary storage, usually hard disks, which must be used to hold the swapped-out part of the working set. Accessing those disks is orders of magnitude slower than even DRAM access.

Fortunately it does not have to be an all-or-nothing decision. A computer can have a small amount of high-speed SRAM in addition to the large amount of DRAM. One possible implementation would be to dedicate a certain area of the address space of the processor to the SRAM and use the rest for the DRAM. The task of the operating system would then be to optimally distribute data to make use of the SRAM. Basically, the SRAM serves in this situation as an extension of the register set of the processor.

While this is a possible implementation it is not viable.
Ignoring the problem of mapping the physical resources of such SRAM-backed memory to the virtual address spaces of the processes (which by itself is terribly hard), this approach would require each process to administer the allocation of this memory region in software. The size of the memory region can vary from processor to processor (i.e., processors have different amounts of the expensive SRAM-backed memory). Each module which makes up part of a program would claim its share of the fast memory, which introduces additional costs through synchronization requirements. In short, the gains of having fast memory would be eaten up completely by the overhead of administering the resources.

So, instead of putting the SRAM under the control of the OS or user, it becomes a resource which is transparently used and administered by the processors. In this mode, SRAM is used to make temporary copies of (to cache, in other words) data in main memory which is likely to be used soon by the processor. This is possible because program code and data has temporal and spatial locality. This means that, over short periods of time, there is a good chance that the same code or data gets reused.
For code this means that there are most likely loops in the code so that the same code gets executed over and over again (the perfect case for spatial locality). Data accesses are also ideally limited to small regions. Even if the memory used over short time periods is not close together, there is a high chance that the same data will be reused before long (temporal locality). For code this means, for instance, that in a loop a function call is made and that function is located elsewhere in the address space. The function may be distant in memory, but calls to that function will be close in time. For data it means that the total amount of memory used at one time (the working set size) is ideally limited, but the memory used, as a result of the random access nature of RAM, is not close together. Realizing that locality exists is key to the concept of CPU caches as we use them today.

A simple computation can show how effective caches can theoretically be. Assume access to main memory takes 200 cycles and access to the cache memory takes 15 cycles. Then code using 100 data elements 100 times each will spend 2,000,000 cycles on memory operations if there is no cache and only 168,500 cycles if all data can be cached: each element is fetched from main memory once, at 200 cycles, and then served from the cache for the remaining 99 accesses, at 15 cycles each, giving 100 × (200 + 99 × 15) = 168,500. That is an improvement of 91.5%.
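This arithmetic is simple enough to check mechanically. The following minimal C sketch recomputes both totals; the 200- and 15-cycle figures are the assumptions from the text, not measurements:

  #include <stdio.h>

  /* Recompute the cache-effectiveness example.  The latencies are the
   * text's assumptions, not measured values. */
  int main(void)
  {
      const long mem_cycles = 200;    /* assumed main memory latency */
      const long cache_cycles = 15;   /* assumed cache latency */
      const long elements = 100;
      const long uses = 100;          /* accesses per element */

      long no_cache = elements * uses * mem_cycles;
      /* One miss per element, then (uses - 1) cache hits. */
      long with_cache = elements * (mem_cycles + (uses - 1) * cache_cycles);

      printf("no cache:   %ld cycles\n", no_cache);    /* 2,000,000 */
      printf("with cache: %ld cycles\n", with_cache);  /* 168,500 */
      printf("saved:      %.3f%%\n",
             100.0 * (no_cache - with_cache) / no_cache);  /* 91.575 */
      return 0;
  }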
The size of the SRAM used for caches is many times smaller than the main memory. In the author’s experience with workstations with CPU caches the cache size has always been around 1/1000th of the size of the main memory (today: 4MB cache and 4GB main memory). This alone does not constitute a problem. If the size of the working set (the set of data currently worked on) is smaller than the cache size it does not matter. But computers do not have large main memories for no reason. The working set is bound to be larger than the cache. This is especially true for systems running multiple processes, where the size of the working set is the sum of the sizes of all the individual processes and the kernel.

What is needed to deal with the limited size of the cache is a set of good strategies to determine what should be cached at any given time. Since not all data of the working set is used at exactly the same time we can use techniques to temporarily replace some data in the cache with other data. And maybe this can be done before the data is actually needed. This prefetching would remove some of the costs of accessing main memory since it happens asynchronously with respect to the execution of the program. All these techniques and more can be used to make the cache appear bigger than it actually is. We will discuss them in section 3.3. Once all these techniques are exploited it is up to the programmer to help the processor. How this can be done will be discussed in section 6.

3.1 CPU Caches in the Big Picture

Before diving into technical details of the implementation of CPU caches some readers might find it useful to first see in some more detail how caches fit into the “big picture” of a modern computer system.

[Figure 3.1: Minimum Cache Configuration. The CPU core is connected to the cache over a dedicated link; the cache and the main memory are both attached to the system bus.]

Figure 3.1 shows the minimum cache configuration. It corresponds to the architecture which could be found in early systems which deployed CPU caches. The CPU core is no longer directly connected to the main memory.[16] All loads and stores have to go through the cache. The connection between the CPU core and the cache is a special, fast connection. In a simplified representation, the main memory and the cache are connected to the system bus which can also be used for communication with other components of the system. We introduced the system bus as the “FSB”, which is the name in use today; see section 2.2. In this section we ignore the Northbridge; it is assumed to be present to facilitate the communication of the CPU(s) with the main memory.

Even though most computers for the last several decades have used the von Neumann architecture, experience has shown that it is advantageous to separate the caches used for code and for data. Intel has used separate code and data caches since 1993 and never looked back. The memory regions needed for code and data are pretty much independent of each other, which is why independent caches work better. In recent years another advantage emerged: the instruction decoding step for the most common processors is slow; caching decoded instructions can speed up the execution, especially when the pipeline is empty due to incorrectly predicted or impossible-to-predict branches.

Soon after the introduction of the cache the system got more complicated. The speed difference between the cache and the main memory increased again, to the point that another level of cache was added, bigger and slower than the first-level cache. Only increasing the size of the first-level cache was not an option for economic reasons. Today, there are even machines with three levels of cache in regular use. A system with such a processor looks like Figure 3.2. With the increase in the number of cores in a single CPU, the number of cache levels might increase even further in the future.

Figure 3.2 shows three levels of cache and introduces the nomenclature we will use in the remainder of the document. L1d is the level 1 data cache, L1i the level 1 instruction cache, etc. Note that this is a schematic; the data flow in reality need not pass through any of the higher-level caches on the way from the core to the main memory.

[16] In even earlier systems the cache was attached to the system bus just like the CPU and the main memory. This was more a hack than a real solution.
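On Linux the cache hierarchy and this L1d/L1i nomenclature can be inspected directly through sysfs. The following C sketch prints the level, type, and size of each cache of CPU 0; it is Linux-specific, the paths assume a reasonably recent kernel, and the output of course depends on the machine:

  #include <stdio.h>

  /* List the caches of CPU 0 from Linux sysfs (Linux-specific sketch). */
  int main(void)
  {
      char path[128], level[16] = "", type[32] = "", size[16] = "";

      for (int idx = 0; ; idx++) {
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
          if ((f = fopen(path, "r")) == NULL)
              break;   /* no more cache entries */
          if (fscanf(f, "%15s", level) != 1)
              level[0] = '\0';
          fclose(f);

          snprintf(path, sizeof(path),
                   "/sys/devices/system/cpu/cpu0/cache/index%d/type", idx);
          if ((f = fopen(path, "r")) != NULL) {
              if (fscanf(f, "%31s", type) != 1)
                  type[0] = '\0';
              fclose(f);
          }

          snprintf(path, sizeof(path),
                   "/sys/devices/system/cpu/cpu0/cache/index%d/size", idx);
          if ((f = fopen(path, "r")) != NULL) {
              if (fscanf(f, "%15s", size) != 1)
                  size[0] = '\0';
              fclose(f);
          }

          printf("L%s %s %s\n", level, type, size);  /* e.g. "L1 Data 32K" */
      }
      return 0;
  }

A line reading “L1 Data 32K” corresponds to the L1d cache in the text, “L1 Instruction” to L1i, and “Unified” entries to the shared higher-level caches.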