SlideShare a Scribd company logo
1 of 78
Download to read offline
1
Processing in Memory
Oscar Plata
oplata@uma.es
Dept. Arquitectura de Computadores
Universidad de Málaga
HPACuma Research Group
ALDEBARAN Research Group
HPACuma
Research Group
UCM, May 26, 2021
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
2
• Important workloads are all data intensive
 Artificial Intelligence, Machine Learning, Genomics, Time Series Analysis,
In-Memory Databases, Graph/Tree Processing, Data Analytics,
UHD Video Analysis…
• Such workloads require fast and efficient processing for large amounts
of data
• Data is increasing…
Data is Key in Modern and Future Workloads
Data overwhelms modern computers
Data ⟹ Performance and energy bottleneck
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
3
• Frequent memory access patterns in modern workloads
 Streaming access (lack of data reuse)
 Strided access (with large strides)
 Non regular access (even random), specially when operating on sparse data
structures
• Usually, little amount of computation (low operational intensity)
Computing Behaviour of Modern Workloads
A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing, ISCA 2015
for (v:graph.vertices){
for (w:v.successors){
w.next_rank += weight*v.rank
}
}
sp = C[Q[m]]
ep = C[Q[m]+1]
for i from m-1 to 1 step -1{
sp = LF(Q[i],sp)
ep = LF(Q[i],ep)
}
Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor,
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2020
Graph processing Genome sequence alignment
PageRank algorithm
Backward search exact matching
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
4
• Four key components
 Computation
 Memory
 Input/Output
 Communication
Conventional Computing System
John Louis von Newmann & UNIVAC
Master
Processor-Centric Design
• All data is processes in the CPU
 At great system cost
• Processors are heavily optimized
and considered the master
• Data storage units are dumb
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
5
A Processor-Centric Design (I)
L1D & L2 caches Subsystem
Frontend
Execution Engine
Memory Channel
Memory Bank
Memory Bank
Memory Bank
Memory Bank
Memory Bank
Memory Bank
A great part of the processor is dedicated to storing and moving data
Shared interconnect
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
6
Main Memory Trends (I)
• Difficulty in scaling memory: capacity, bandwidth, latency (DRAM)
 Multi-core: increasing number of cores asking for data
 Data-intensive applications: increasing demand for data
Disaggregated Memory for Expansion and Sharing in Blade Servers, ISCA 2009
Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,
SIGMETRICS 2016
Doubles every 2 years
Doubles every 3 years
Memory capacity per core drops by 30% every 2 years
Trends worse for memory bandwidth per core
Scaling for DRAM capacity, bandwidth, latency
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
7
Main Memory Trends (II)
• Data movement dominates computation in term of energy/power
 Scientific apps: ~40% of the total energy is spent on data movement
 Mobile apps: ~35% of the total energy is spent on data movement
 Consumer apps: ~62% of the total energy is spent on data movement
28nm nVIDIA (Source: W. Dally)
Quantifying the Energy Cost of Data Movement in Scientific Applications, IISWC 2013
Quantifying the Energy Cost of Data Movement for Emerging Smart Phone Workloads on Mobile Platforms, IISWC 2014
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks, ASPLOS 2018
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
8
Main Memory Trends (III)
• DRAM scaling has become increasingly difficult
 Cell leakage current, cell reliability, manufacturing difficulties…
 Difficult to continue the improvements in capacity, energy and latency
• Emerging memory technologies are promising
 Reduced-latency DRAM: RLDRAM, TL-DRAM…
 Low-power DRAM: LPDDR4…
 3D-stacked RAM: HBM, HMC…
 Non-volatile memory (NVM): PCM, MRAM, STT-MRAM, ReRAM…
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
9
• The system is grossly imbalanced
 Processing is done only in one place (execution engines)
 The rest of the system just stores and moves data
 Modern workloads cause huge data movement
• The processor becomes overly complex
 Required to tolerate data accesses from memory
» Complex multi-level cache hierarchies
» Multiple complex hardware prefetching techniques
» High amounts of multithreading
» Complex out-of-order (OoO) execution mechanisms
A Processor-Centric Design (II)
Data movement causes energy waste and latency
Processor complex mechanisms to tolerate latency,
that causes additional energy waste … and latency
Data-intensive apps make
many techniques innefficient
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
10
• Enable computation with minimal data movement
• Compute where data resides
• Make computing architectures more data-centric
A Data-Centric Design
Processor
core
Cache
hierarchy
Memory
Interconnect
Query
Results
Data-intensive
kernels
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
11
Data-Centric Computing
PIM
Processing-in-Memory
CIM
Computation-in-Memory
• Memory technologies for PIM
• Approaches to implementing PIM
A Modern Primer on Processing in Memory, Springer 2021
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
12
• 3D-stacked memory architectures
• Non-volatile random-access memories (NVRAM)
Memory Technologies for PIM
Segment DRAM Standards & Architectures
Commodity DDR3, DDR4
Low-Power LPDDR3, LPDDR4
Graphics GDDR5, GDDR6
Performance eDRAM, RLDRAM
3D-Stacked MCDRAM, HBM, HMC
Technology NVRAM Architectures
Resistive-Phase PCM (Phase-Change Memory)
Resistive-Magnetic MRAM (magnetoresistive RAM)
STT-MRAM (Spin-Transfer Torque MRAM)
Resistive-Memristor ReRAM (Resistive RAM)
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
13
• Multiple layers of DRAM memory are stacked
 The layers are partitioned into banks
 Same banks in different layers are connected using TSVs (Through-Silicon Vias)
 Thousands of TSVs provides internal ultra high memory bandwidth
 Examples: HMC (Hybrid Memory Cube), HBM (High Bandwidth Memory),
MCDRAM (Multi-Channel DRAM)
3D-Stacked Memory Architectures
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
14
• 3D-stacked DRAM used in Intel Xeon Phi processor (Knights Landing)
 Variant of HMC developed with Micron
 Ultra high bandwidth: ~400 GB/s (DDR4 bandwidth: ~90 GB/s)
 Similar latency than DRAM DIMMs
Multi-Channel DRAM (MCDRAM) - 2013/18
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
15
• 3D-stacked DRAM developed by Micron and Samsung
 HMC uses standard DRAM cells
 HMC interface is non-compatible with DDRx
 Serial link: 8/16 SerDes @ 10-15 Gbps (half- or full-duplex)
» 4-link@15Gbps fd HMC ⤇ 240 GB/s bandwidth
» 8-link@10Gbps fd HMC ⤇ 320 GB/s bandwidth
Hybrid Memory Cube (HMC) - 2011/18
Serial
Links
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
16
• 3D-stacked DRAM developed by Samsung, AMD and SK Hynix
 HBM uses standard DRAM cells (up to 8)
 HMC interface is compatible with DDR4 or GDDR5
 Ultra wide memory bus
» A 4-DRAM HBM stack has 128-bit 8 channels (=1024-bit bus)
» Overall package bandwidth: 128 GB/s
 HBM2 doubles the package bandwidth of HBM (256 GB/s)
High Bandwidth Memory (HBM) - 2013
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
17
• 3D-stacked memory architectures
• Non-volatile random-access memories (NVRAM)
Memory Technologies for PIM
Segment DRAM Standards & Architectures
Commodity DDR3, DDR4
Low-Power LPDDR3, LPDDR4
Graphics GDDR5, GDDR6
Performance eDRAM, RLDRAM
3D-Stacked MCDRAM, HBM, HMC
Technology NVRAM Architectures
Resistive-Phase PCM (Phase-Change Memory)
Resistive-Magnetic MRAM (magnetoresistive RAM)
STT-MRAM (Spin-Transfer Torque MRAM)
Resistive-Memristor ReRAM (Resistive RAM)
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
18
• Byte-addressable resistive non-volatile random-access memory
 Emerging technologies
» Phase-change memory (PCM)
» Magnetoresistive RAM (MRAM) and Spin-Transfer Torque MRAM (STT-MRAM)
» Metal-oxide resistive RAM (RRAM or ReRAM) or memristors
Non-Volatile RAM (NVRAM)
PCM STT-MRAM RRAM DRAM SRAM
Density (F²) 4-30 6-50 <4 6-10 >100
W energy /bit (pJ) ~10 ~0.1 ~0.1 ~0.01 ~0.001
Read time (ns) <10 <10 <10 ~10 0.1-0.3
Write time (ns) ~50 <10 <10 ~10 0.1-0.3
Retention Years Years Years 10-20 ms As V applied
Endurance (cyc.) 108
-1015
>1015 108
-1012
>1016
>1016
An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications, arXiv 2019
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
19
• Promising resistive memory technologies
 Random access, resistive, non volatile, no erase before write
Non-Volatile RAM (NVRAM)
BottomElectrode
Top Electrode
oxide
isolation
switching
region
phasechange material
Phase Change
Memory
Spin Transfer Torque
Magnetic Random Access
Memory
Soft Magnet
Pinned Magnet
tunnel barrier (oxide)
current
filament
oxygen ion
Top Electrode
BottomElectrode
metal
oxide
oxygen
vacancy
Resistive Switching
Random Access
Memory
PCM STT-MRAM RRAM, ReRAM
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
20
• Inject current to change phase of chalcogenide glass by heat
 Amorphous: Low optical reflexivity and high electrical resistivity
 Crystalline: High optical reflexivity and low electrical resistivity
• PCM cell can be switched between states reliably and quickly
PCM: Phase Change Memory (I)
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
21
• Write operation
 SET: sustained current to heat cell above crystalline temperature (𝑇𝑐𝑟𝑦𝑠)
 RESET: cell heated above melt temperature (𝑇𝑚𝑒𝑙𝑡) and quenched
• Read operation
 Detect phase (amorphous/crystalline) via material resistance (high/low)
PCM: Phase Change Memory (II)
Phase Change Memory: From Devices to Systems, Morgan & Claypool Pub. 2011
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
22
• Inject current to change magnetic polarity
• Based on a MTJ (Magnetic Tunnel Junction) device
 Reference (fixed) layer (RL): with a fixed magnetic orientation
 Free layer (FL): with a variable magnetic orientation
• Magnetic orientation of the free layer determines logical state
 FL parallel to RL: MTJ with low resistance
 FL anti-parallel to RL: MTJ with high resistance
STT-MRAM: Spin-Transfer Torque MRAM(I)
MTJ
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
23
• Write operation
 Push large current through MTJ to change orientation of free layer
• Read operation
 Sense current flow
STT-MRAM: Spin-Transfer Torque MRAM(II)
MTJ
Low
resistance
High
resistance
Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative, ISPASS 2013
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
24
• Inject current to change atomic structure
• Based on a memristor device
 Memristor: solid-state device with two resistive states (low/high resistance)
RRAM/ReRAM: Resistive RAM(I)
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
25
• Passive electronic element that change resistance depending on the
electrical field applied
 Unipolar memristor: the polarity of the operating voltage IS NOT relevant
 Bipolar memristor: the polarity of the operating voltage IS relevant
RRAM/ReRAM: Memristor (I)
Unipolar Memristor Bipolar Memristor
HRS: High Resistance State
LRS: Low Resistance State
SET
RESET
SET RESET
SET
RESET
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
26
• Passive electronic element that change resistance depending on the
electrical field applied
 Digital memristor: exhibits two neat resistance states: LRS and HRS
» Codifies 1 bit per memristor
 Analog memristor: exhibits multiple resistance windows between LRS and HRS
» Codifies several bits (small number) per memristor
RRAM/ReRAM: Memristor (II)
Digital Memristor Analog Memristor
HRS: High Resistance State
LRS: Low Resistance State
SET
RESET
LRS
HRS
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
27
• Memristor cells in a RRAM are usually arranged as a crossbar
RRAM/ReRAM: Memristor Crossbar
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
28
• Read/Write operations in a RRAM
RRAM/ReRAM: Memristor Read/Write
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
29
Data-Centric Computing
PIM
Processing-in-Memory
CIM
Computation-in-Memory
• Memory technologies for PIM
• Approaches to implementing PIM
A Modern Primer on Processing in Memory, Springer 2021
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
30
• PNM: Processing Near Memory
 Logic layer in 3D-stacked memory
 Function-specific logic connected to 3D-stacked memory (via logic interface)
 Silicon interposers (connected directly to TSVs)
 Logic in memory controllers
• PIM/PUM: Processing In/Using Memory
 Logic inside SRAM
 Logic inside DRAM
 PCM with logic (in cells and/or periphery)
 STT-MRAM with logic (in cells and/or periphery)
 RRAM with logic (in cells and/or periphery)
Approaches to Implementing PIM
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
31
• Logic layer in 3D-stacked memory
Processing Near Memory (PNM)
Accelerator,
Reconfigurable logic,
Simple processing cores
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
32
• Silicon interposer
 SOC connected directly to the TSVs in HBM
Processing Near Memory (PNM)
Accelerator,
Reconfigurable logic,
Simple processing cores
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
33
• Function-specific logic connected to 3D-stacked memory
 Processing logic connected HBM channel interface via an HBM controller
Processing Near Memory (PNM)
Accelerator,
Reconfigurable logic,
Simple processing cores
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
34
• PNM: Processing Near Memory
 Logic layer in 3D-stacked memory
 Function-specific logic connected to 3D-stacked memory (via logic interface)
 Silicon interposers (connected directly to TSVs)
 Logic in memory controllers
• PIM/PUM: Processing In/Using Memory
 Logic inside DRAM
 PCM with logic (in cells and/or periphery)
 STT-MRAM with logic (in cells and/or periphery)
 RRAM with logic (in cells and/or periphery)
Approaches to Implementing PIM
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
35
• Exploits the memory architecture to enable operations with minimal
changes
 Makes use of memory cells or cell arrays to perform useful computation
• A wide range of different functions are enabled
 Fully parallel bulk or fine-grained operations
» Data copy/initialization
» Bulk bitwise operations
» Simple arithmetic operations (addition, multiplication, comparison, majority…)
Processing In/Using Memory (PIM/PUM)
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
36
• Single or multiple functions
PIM/PUM on RRAM: Computation
Low Power Memristor-based Computing for Edge-AI Applications, ISCAS 2021
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
37
• Analog memristor operations
 The analog nature of memristors is used to codify data as resistance value
• Example: vector-matrix operations
PIM/PUM on RRAM: Analog Computation
Scalar product
of vectors
V=(V1,V2) and
G=(G1,G2)
Vector Matrix
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
38
• Digital memristor operations
 Data is codified as resistance values: LRS → 1, HRS → 0
• Example: boolean NOR using unipolar memristors
PIM/PUM on RRAM: Digital Computation (I)
𝑉𝑜𝑢𝑡 initializes to 0 (HRS)
UPIM : Unipolar Switching Logic for High Density Processing-in-Memory Applications, GLSVLSI 2019
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
39
• Digital memristor operations
 Data is codified as resistance values: LRS → 1, HRS → 0
• Example: boolean NOR using bipolar memristors
 Step 1: set ‘out’ to 1 (LRS)
 Step 2: apply 𝑉0 to ‘in’ and GND to ‘out’ (being 𝑉0/2 > 𝑉𝑅𝐸𝑆𝐸𝑇)
PIM/PUM on RRAM: Digital Computation (II)
Logic Design Within Memristive Memories Using Memristor-Aided loGIC (MAGIC), IEEE Transactions on Nanotechnology 2016
~0
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
40
• Processing in/using memory: SIMD parallelism
PIM/PUM on RRAM: Digital Computation (III)
~0
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
41
• Example: N-bit fixed-point multiplication
 Possible high latency but massive parallelism
PIM/PUM on RRAM: Digital Computation (IV)
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
42
Data-Centric Computing
Own Research
PNM Solutions
• NATSA: PNM accelerator for time series analysis
• FM-Index: PNM accelerator for genome sequence alignment
A Modern Primer on Processing in Memory, Springer 2021
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
43
NATSA: PNM Accelerator for Time Series Analysis
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
44
NATSA: Motivation
• Time series analysis has many applications
Climate change Medicine Signal processing
Economics Astronomy
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
45
NATSA: Motifs and Discords
• Given a sliced time series into subsequences
 Motif discovery focuses on finding similarities
 Discord discovery focuses on finding anomalies
• Naive example of anomaly detection
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
46
NATSA: Matrix Profile (MP)
• MP: Algorithm and open source tool for motif and anomaly discovery
• A single input parameter: subsequence length
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
47
NATSA: Matrix Profile (MP) - SCRIMP
• SCRIMP: state-of-the-art CPU matrix profile implementation
• SCRIMP on Intel Xeon Phi KNL
 SCRIMP is heavily bottlenecked by data movement
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
48
NATSA: PNM Accelerator for SCRIMP
• NATSA: fully exploit HBM
memory bandwidth
 NATSA consists of multiple
processing units (PUs)
 PUs compute batches of
diagonals of the distance
matrix in a vectorized way
• Hardware components
 DPU: dot product unit
 DCU: distance compute unit
 DPUU: dot product update
unit
 PUU: profile update unit
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
49
NATSA: Evaluation
• Simulated platforms
 ZSIM + Ramulator, McPAT, Aladdin + gem5, Micron power calculator
• Real hardware platforms
 Intel Xeon Phi KNL, NVIDIA Tesla K40c, NVIDIA GTX 1050
Hardware Cores / PUs Caches (L1 / L2 / L3) Memory
DDR4-OoO 8 OoO @ 3.75 GHz 32KB / 256KB / 8MB 16 GB DDR4-2400
DDR4-inOrder 64 in-order @ 2.5 GHz 32KB / - / - 16 GB DDR4-2400
HBM-OoO 8 OoO @ 3.75 GHz 32KB / 256KB / 8MB 4 GB HBM2
HBM-inOrder 64 in-order @ 2.5 GHz 32KB / - / - 4 GB HBM2
NATSA 48 PUs @ 1 GHz 48KB (Scratchpad) 4 GB HBM2
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
50
NATSA: Evaluation: Performance and Area
• Performance
• Area
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
51
NATSA: Evaluation: Performance and Area
• Power consumption
• Energy consumption
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
52
TraTSA: Transprecise Accelerator for Time Series Analysis
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
53
FM-Index: PNM Accelerator for Genome Sequence Alignment
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
54
FM-Index: Motivation
• Genome sequence alignment is very relevant in bioinformatics
• There is a variety of sequence alignment tools
 Bowtie, Bowtie2, BWA, HiSAT2, SOAP, CUS…
• Modern sequencing tools generate a lot of data
• Many sequence aligners are based on the FM-index data structure
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
55
FM-Index: Data Structure
• FM-index data structure
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
56
FM-Index: Algorithm
• FM-index exact matching algorithm (backward search)
Most time-consuming part
Random memory access pattern
Memory bound
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
57
FM-Index: Algorithm Optimizations
• FM-index exact matching algorithm (backward search)
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
58
FM-Index: Split Bit-Vector Sampled
• Minimizing memory bandwidth consumption
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
59
FM-Index: Evaluation
• Experimental setup
 20M 200-symbol queries generated by Mason simulation tool
 GRCh38 (3 Gbases) human genome reference
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
60
FM-Index: Evaluation
• KNL roofline
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
61
FM-Index: PNM Accelerator
• Logic layer in 3D-stacked memory (HMC)
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
62
FM-Index: PNM Accelerator
• Performance
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
63
FM-Index: PNM Accelerator
• Power consumption
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
64
FM-Index: Architecture Exploration
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
65
Data-Centric Computing
Commercial
PNM Solutions
• UPMEM
• Samsung FIMDRAM
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
66
• Standard DIMM modules
• Key features:
 Relies on mature 2D DRAM
fabrication process
 DPUs support a wide variety of
computations and data types
 Threads in DPUs are independent
 Complete software stack
(C language)
UPMEM: DRAM PIM Accelerator
Processing in DRAM Engine
https://www.upmem.com
• DDR4-2400 R-DIMM modules:
 PIM chip: 64MB + 8 DPUs @400-500MHz
 PIM DIMM: 16 PIM chips
(8GB + 128 DPUs @ 400-500MHz)
 Server: up to 20 PIM DIMMS
(160GB + 2560 DPUs)
PIM chip
PIM DIMM
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
67
• PIM DIMM: 8GB DRAM DDR4 2400 + 256 DPUs @ 400-500MHz
UPMEM PIM System Organization
UPMEM PIM White Paper, 2018
DPUs
Main RAM
Instruction
RAM
Working RAM
(scratchpad)
Copying data from Main Mem to DPU MRAM
Retrieving results from DPU MRAM to Main Mem
NO direct communication DPU – DPU!
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
68
• PIM DIMM: 8GB DRAM DDR4 2400 + 256 DPUs @ 400-500MHz
 DPU: Multithreaded in-order 32-bit RISC core with NO cache hierarchy
 24 hardware contexts (threads), each with 24 32-bit GP registers ❻
 Hardware threads share IRAM and WRAM (for operands)
 14-stage pipeline ❼
UPMEM PIM DPU Architecture
UPMEM PIM White Paper, 2018
D F1 F2 F3 R1 R2 R3 F A1 A2 A3 A4 M1 M2
D F1 F2 F3 R1 R2 R3 F A1 A2 A3 A4 M1 M2
11 cycles
Instruction
overlapping
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
69
• PIM DIMM: 8GB DRAM DDR4 2400 + 256 DPUs @ 400-500MHz
 DPU max. frequency: 400 MHz
 MRAM-WRAM bandwidth per DPU: 800 MB/s
 MRAM-WRAM bandwidth per DIMM: 102 GB/s
 Max. aggregated MRAM-WRAM bandwidth (20 DIMMS): ~2 TB/s
UPMEM PIM Data Throughput
UPMEM PIM White Paper, 2018
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
70
• SPMD programming model
 A software thread is called tasklet
» (1) All tasklets execute the same code on different data pieces
» (2) Tasklets can execute different control-flow paths at runtime
 Up to 24 tasklets can run in parallel in a single DPU
 Tasklets are statically assigned to each DPU
 Intra-DPU tasklets can share data in MRAM and WRAM, and can synchronize
 Inter-DPU tasklets do not share memory or any direct communication channel
• Programs are written in C with library calls (UPMEM SDK)
• UPMEM runtime library
 Calls to move instructions from MRAM to IRAM
 Calls to move data between MRAM and WRAM
 Calls to move data between Main Memory and DPU MRAM
 Calls for locks, barriers, handshaking, semaphores…
UPMEM PIM Programming
UPMEM PIM White Paper, 2018
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
71
UPMEM PIM Evaluation (I)
ACM SIGMETRICS (Fall Edition) 2021
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
72
UPMEM PIM Evaluation (II)
• PrIM Benchmarks
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
73
UPMEM PIM Evaluation (III)
• Key takeaways
 Takeaway 1
» The UPMEM PIM architecture is fundamentally compute bound
» As a result, the most suitable workloads are memory-bound
 Takeaway 2
» The most well-suited workloads for the UPMEM PIM architecture use no arithmetic
operations or use only simple operations (e.g., bitwise operations and integer
addition/subtraction)
 Takeaway 3
» The most well-suited workloads for the UPMEM PIM architecture require little or no
communication across DRAM Processing Units (inter-DPU communication).
 Takeaway 4
» UPMEM PIM systems outperform state-of-the-art CPUs in terms of performance and
energy efficiency on most of PrIM benchmarks
» UPMEM PIM systems outperform state-of-the-art GPUs on a majority of PrIM
benchmarks, and the outlook is even more positive for future PIM systems
» UPMEM PIM systems are more energy-efficient than state-of-the-art CPUs and GPUs on
workloads that they provide performance improvements over the CPUs and the GPUs
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
74
• Function-in-Memory DRAM based on HBM2 (HBM-PIM)
 HBM2-based memory with integrated AI processors
Samsung FIMDRAM: PIM AI Accelerator
Processing in DRAM Engine
ISSCC 2021
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
75
• Programmable Computing Unit (PCU)
 Interface unit to control data flow
 Execution unit to perform operations
Samsung FIMDRAM: PCU
ISSCC 2021
Instruction memory
Constants for MAC operations
Weight and accumulation
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
76
• Mixed design methodology
 Full-custom + digital RTL for PCU block
Samsung FIMDRAM: Chip Implementation
ISSCC 2021
Dept. Arquitectura de Computadores, Univ. de Málaga
Oscar Plata
May 2021
UCM, Madrid
77
• Performance and energy efficiency
Samsung FIMDRAM: Experimental Results
ISSCC 2021
HBM2 FIMDRAM
Type of DRAM HBM2 HBM2
Process 20 nm 20 nm
Memory density 8GB /
cube
6GB / cube
Data rate 2.4Gbps 2.4Gbps
Bandwidth 307GB/s
per cube
307GB/s per
cube
# channels 8 per cube 8 per cube
# processing
units
No 128 per cube
Processing
operation speed
- 300MHz
Peak
throughput
- 1.2 TFLOPS
per cube
Operation
precision
- FP16
78
Processing in Memory
Oscar Plata
oplata@uma.es
Dept. Arquitectura de Computadores
Universidad de Málaga
HPACuma Research Group
ALDEBARAN Research Group
HPACuma
Research Group
UCM, May 27, 2021

More Related Content

What's hot

301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog
Srinivas Naidu
 

What's hot (20)

Semiconductor memories
Semiconductor memoriesSemiconductor memories
Semiconductor memories
 
SRAM Design
SRAM DesignSRAM Design
SRAM Design
 
Designing memory and array structures.pptx
Designing memory and array structures.pptxDesigning memory and array structures.pptx
Designing memory and array structures.pptx
 
Advanced Low Power Techniques in Chip Design
Advanced Low Power Techniques in Chip DesignAdvanced Low Power Techniques in Chip Design
Advanced Low Power Techniques in Chip Design
 
EC6601 VLSI Design Memory Circuits
EC6601 VLSI Design   Memory CircuitsEC6601 VLSI Design   Memory Circuits
EC6601 VLSI Design Memory Circuits
 
Unit II Arm 7 Introduction
Unit II Arm 7 IntroductionUnit II Arm 7 Introduction
Unit II Arm 7 Introduction
 
Ec8791 unit 5 processes and operating systems
Ec8791 unit 5 processes and operating systemsEc8791 unit 5 processes and operating systems
Ec8791 unit 5 processes and operating systems
 
Mram (magneticRAM)
Mram (magneticRAM)Mram (magneticRAM)
Mram (magneticRAM)
 
301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog
 
MicroC/OS-II
MicroC/OS-IIMicroC/OS-II
MicroC/OS-II
 
System On Chip
System On ChipSystem On Chip
System On Chip
 
01 nand flash_reliability_notes
01 nand flash_reliability_notes01 nand flash_reliability_notes
01 nand flash_reliability_notes
 
Esd mod 3
Esd mod 3Esd mod 3
Esd mod 3
 
3D INTEGRATION
3D INTEGRATION3D INTEGRATION
3D INTEGRATION
 
Arm processor architecture awareness session pi technologies
Arm processor architecture awareness session pi technologiesArm processor architecture awareness session pi technologies
Arm processor architecture awareness session pi technologies
 
Module 4 advanced microprocessors
Module 4 advanced microprocessorsModule 4 advanced microprocessors
Module 4 advanced microprocessors
 
Trends and challenges in vlsi
Trends and challenges in vlsiTrends and challenges in vlsi
Trends and challenges in vlsi
 
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APUHot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APU
 
Introduction to arm architecture
Introduction to arm architectureIntroduction to arm architecture
Introduction to arm architecture
 
Power Gating
Power GatingPower Gating
Power Gating
 

Similar to Processing-in-Memory

lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDKlecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
officeaiotfab
 
emerging memory technologies Document
emerging memory technologies Documentemerging memory technologies Document
emerging memory technologies Document
Vidya Baddam
 
The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503
Linaro
 

Similar to Processing-in-Memory (20)

lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDKlecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
 
Virtualization for Emerging Memory Devices
Virtualization for Emerging Memory DevicesVirtualization for Emerging Memory Devices
Virtualization for Emerging Memory Devices
 
Architectural tricks to maximize memory bandwidth
Architectural tricks to maximize memory bandwidthArchitectural tricks to maximize memory bandwidth
Architectural tricks to maximize memory bandwidth
 
emerging memory technologies Document
emerging memory technologies Documentemerging memory technologies Document
emerging memory technologies Document
 
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptxonur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
 
Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...
 
NVM & Implications on Data Infratsructure
NVM & Implications on Data InfratsructureNVM & Implications on Data Infratsructure
NVM & Implications on Data Infratsructure
 
Introduction to Digital Signal processors
Introduction to Digital Signal processorsIntroduction to Digital Signal processors
Introduction to Digital Signal processors
 
RAMinate Invited Talk at NII
RAMinate Invited Talk at NIIRAMinate Invited Talk at NII
RAMinate Invited Talk at NII
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
The von Neumann Memory Barrier and Computer Architectures for the 21st Century
The von Neumann Memory Barrier and Computer Architectures for the 21st CenturyThe von Neumann Memory Barrier and Computer Architectures for the 21st Century
The von Neumann Memory Barrier and Computer Architectures for the 21st Century
 
MT48 A Flash into the future of storage….  Flash meets Persistent Memory: The...
MT48 A Flash into the future of storage….  Flash meets Persistent Memory: The...MT48 A Flash into the future of storage….  Flash meets Persistent Memory: The...
MT48 A Flash into the future of storage….  Flash meets Persistent Memory: The...
 
RAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 TalkRAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 Talk
 
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
 
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
 
Intelligent ram
Intelligent ramIntelligent ram
Intelligent ram
 
The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503
 
End of Moore's Law?
End of Moore's Law? End of Moore's Law?
End of Moore's Law?
 
Sc19 ibm hms final
Sc19 ibm hms finalSc19 ibm hms final
Sc19 ibm hms final
 
Expectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchExpectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software research
 

More from Facultad de Informática UCM

More from Facultad de Informática UCM (20)

¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?
 
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
 
DRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersDRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation Computers
 
uElectronics ongoing activities at ESA
uElectronics ongoing activities at ESAuElectronics ongoing activities at ESA
uElectronics ongoing activities at ESA
 
Tendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmTendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura Arm
 
Formalizing Mathematics in Lean
Formalizing Mathematics in LeanFormalizing Mathematics in Lean
Formalizing Mathematics in Lean
 
Introduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingIntroduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented Computing
 
Computer Design Concepts for Machine Learning
Computer Design Concepts for Machine LearningComputer Design Concepts for Machine Learning
Computer Design Concepts for Machine Learning
 
Inteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroInteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuro
 
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 Design Automation Approaches for Real-Time Edge Computing for Science Applic... Design Automation Approaches for Real-Time Edge Computing for Science Applic...
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
 
Fault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error Correction
 
Cómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoCómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intento
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPC
 
Type and proof structures for concurrency
Type and proof structures for concurrencyType and proof structures for concurrency
Type and proof structures for concurrency
 
Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...
 
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
 
Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?
 
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
 
Challenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windChallenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore wind
 

Recently uploaded

Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Dr.Costas Sachpazis
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 

Recently uploaded (20)

Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 

Processing-in-Memory

  • 1. 1 Processing in Memory Oscar Plata oplata@uma.es Dept. Arquitectura de Computadores Universidad de Málaga HPACuma Research Group ALDEBARAN Research Group HPACuma Research Group UCM, May 26, 2021
  • 2. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 2 • Important workloads are all data intensive  Artificial Intelligence, Machine Learning, Genomics, Time Series Analysis, In-Memory Databases, Graph/Tree Processing, Data Analytics, UHD Video Analysis… • Such workloads require fast and efficient processing for large amounts of data • Data is increasing… Data is Key in Modern and Future Workloads Data overwhelms modern computers Data ⟹ Performance and energy bottleneck
  • 3. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 3 • Frequent memory access patterns in modern workloads  Streaming access (lack of data reuse)  Strided access (with large strides)  Non regular access (even random), specially when operating on sparse data structures • Usually, little amount of computation (low operational intensity) Computing Behaviour of Modern Workloads A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing, ISCA 2015 for (v:graph.vertices){ for (w:v.successors){ w.next_rank += weight*v.rank } } sp = C[Q[m]] ep = C[Q[m]+1] for i from m-1 to 1 step -1{ sp = LF(Q[i],sp) ep = LF(Q[i],ep) } Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2020 Graph processing Genome sequence alignment PageRank algorithm Backward search exact matching
  • 4. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 4 • Four key components  Computation  Memory  Input/Output  Communication Conventional Computing System John Louis von Newmann & UNIVAC Master Processor-Centric Design • All data is processes in the CPU  At great system cost • Processors are heavily optimized and considered the master • Data storage units are dumb
  • 5. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 5 A Processor-Centric Design (I) L1D & L2 caches Subsystem Frontend Execution Engine Memory Channel Memory Bank Memory Bank Memory Bank Memory Bank Memory Bank Memory Bank A great part of the processor is dedicated to storing and moving data Shared interconnect
  • 6. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 6 Main Memory Trends (I) • Difficulty in scaling memory: capacity, bandwidth, latency (DRAM)  Multi-core: increasing number of cores asking for data  Data-intensive applications: increasing demand for data Disaggregated Memory for Expansion and Sharing in Blade Servers, ISCA 2009 Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization, SIGMETRICS 2016 Doubles every 2 years Doubles every 3 years Memory capacity per core drops by 30% every 2 years Trends worse for memory bandwidth per core Scaling for DRAM capacity, bandwidth, latency
  • 7. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 7 Main Memory Trends (II) • Data movement dominates computation in term of energy/power  Scientific apps: ~40% of the total energy is spent on data movement  Mobile apps: ~35% of the total energy is spent on data movement  Consumer apps: ~62% of the total energy is spent on data movement 28nm nVIDIA (Source: W. Dally) Quantifying the Energy Cost of Data Movement in Scientific Applications, IISWC 2013 Quantifying the Energy Cost of Data Movement for Emerging Smart Phone Workloads on Mobile Platforms, IISWC 2014 Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks, ASPLOS 2018
  • 8. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 8 Main Memory Trends (III) • DRAM scaling has become increasingly difficult  Cell leakage current, cell reliability, manufacturing difficulties…  Difficult to continue the improvements in capacity, energy and latency • Emerging memory technologies are promising  Reduced-latency DRAM: RLDRAM, TL-DRAM…  Low-power DRAM: LPDDR4…  3D-stacked RAM: HBM, HMC…  Non-volatile memory (NVM): PCM, MRAM, STT-MRAM, ReRAM…
  • 9. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 9 • The system is grossly imbalanced  Processing is done only in one place (execution engines)  The rest of the system just stores and moves data  Modern workloads cause huge data movement • The processor becomes overly complex  Required to tolerate data accesses from memory » Complex multi-level cache hierarchies » Multiple complex hardware prefetching techniques » High amounts of multithreading » Complex out-of-order (OoO) execution mechanisms A Processor-Centric Design (II) Data movement causes energy waste and latency Processor complex mechanisms to tolerate latency, that causes additional energy waste … and latency Data-intensive apps make many techniques innefficient
  • 10. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 10 • Enable computation with minimal data movement • Compute where data resides • Make computing architectures more data-centric A Data-Centric Design Processor core Cache hierarchy Memory Interconnect Query Results Data-intensive kernels
  • 11. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 11 Data-Centric Computing PIM Processing-in-Memory CIM Computation-in-Memory • Memory technologies for PIM • Approaches to implementing PIM A Modern Primer on Processing in Memory, Springer 2021
  • 12. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 12 • 3D-stacked memory architectures • Non-volatile random-access memories (NVRAM) Memory Technologies for PIM Segment DRAM Standards & Architectures Commodity DDR3, DDR4 Low-Power LPDDR3, LPDDR4 Graphics GDDR5, GDDR6 Performance eDRAM, RLDRAM 3D-Stacked MCDRAM, HBM, HMC Technology NVRAM Architectures Resistive-Phase PCM (Phase-Change Memory) Resistive-Magnetic MRAM (magnetoresistive RAM) STT-MRAM (Spin-Transfer Torque MRAM) Resistive-Memristor ReRAM (Resistive RAM)
  • 13. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 13 • Multiple layers of DRAM memory are stacked  The layers are partitioned into banks  Same banks in different layers are connected using TSVs (Through-Silicon Vias)  Thousands of TSVs provides internal ultra high memory bandwidth  Examples: HMC (Hybrid Memory Cube), HBM (High Bandwidth Memory), MCDRAM (Multi-Channel DRAM) 3D-Stacked Memory Architectures
  • 14. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 14 • 3D-stacked DRAM used in Intel Xeon Phi processor (Knights Landing)  Variant of HMC developed with Micron  Ultra high bandwidth: ~400 GB/s (DDR4 bandwidth: ~90 GB/s)  Similar latency than DRAM DIMMs Multi-Channel DRAM (MCDRAM) - 2013/18
  • 15. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 15 • 3D-stacked DRAM developed by Micron and Samsung  HMC uses standard DRAM cells  HMC interface is non-compatible with DDRx  Serial link: 8/16 SerDes @ 10-15 Gbps (half- or full-duplex) » 4-link@15Gbps fd HMC ⤇ 240 GB/s bandwidth » 8-link@10Gbps fd HMC ⤇ 320 GB/s bandwidth Hybrid Memory Cube (HMC) - 2011/18 Serial Links
  • 16. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 16 • 3D-stacked DRAM developed by Samsung, AMD and SK Hynix  HBM uses standard DRAM cells (up to 8)  HMC interface is compatible with DDR4 or GDDR5  Ultra wide memory bus » A 4-DRAM HBM stack has 128-bit 8 channels (=1024-bit bus) » Overall package bandwidth: 128 GB/s  HBM2 doubles the package bandwidth of HBM (256 GB/s) High Bandwidth Memory (HBM) - 2013
  • 17. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 17 • 3D-stacked memory architectures • Non-volatile random-access memories (NVRAM) Memory Technologies for PIM Segment DRAM Standards & Architectures Commodity DDR3, DDR4 Low-Power LPDDR3, LPDDR4 Graphics GDDR5, GDDR6 Performance eDRAM, RLDRAM 3D-Stacked MCDRAM, HBM, HMC Technology NVRAM Architectures Resistive-Phase PCM (Phase-Change Memory) Resistive-Magnetic MRAM (magnetoresistive RAM) STT-MRAM (Spin-Transfer Torque MRAM) Resistive-Memristor ReRAM (Resistive RAM)
  • 18. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 18 • Byte-addressable resistive non-volatile random-access memory  Emerging technologies » Phase-change memory (PCM) » Magnetoresistive RAM (MRAM) and Spin-Transfer Torque MRAM (STT-MRAM) » Metal-oxide resistive RAM (RRAM or ReRAM) or memristors Non-Volatile RAM (NVRAM) PCM STT-MRAM RRAM DRAM SRAM Density (F²) 4-30 6-50 <4 6-10 >100 W energy /bit (pJ) ~10 ~0.1 ~0.1 ~0.01 ~0.001 Read time (ns) <10 <10 <10 ~10 0.1-0.3 Write time (ns) ~50 <10 <10 ~10 0.1-0.3 Retention Years Years Years 10-20 ms As V applied Endurance (cyc.) 108 -1015 >1015 108 -1012 >1016 >1016 An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications, arXiv 2019
  • 19. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 19 • Promising resistive memory technologies  Random access, resistive, non volatile, no erase before write Non-Volatile RAM (NVRAM) BottomElectrode Top Electrode oxide isolation switching region phasechange material Phase Change Memory Spin Transfer Torque Magnetic Random Access Memory Soft Magnet Pinned Magnet tunnel barrier (oxide) current filament oxygen ion Top Electrode BottomElectrode metal oxide oxygen vacancy Resistive Switching Random Access Memory PCM STT-MRAM RRAM, ReRAM
  • 20. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 20 • Inject current to change phase of chalcogenide glass by heat  Amorphous: Low optical reflexivity and high electrical resistivity  Crystalline: High optical reflexivity and low electrical resistivity • PCM cell can be switched between states reliably and quickly PCM: Phase Change Memory (I)
  • 21. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 21 • Write operation  SET: sustained current to heat cell above crystalline temperature (𝑇𝑐𝑟𝑦𝑠)  RESET: cell heated above melt temperature (𝑇𝑚𝑒𝑙𝑡) and quenched • Read operation  Detect phase (amorphous/crystalline) via material resistance (high/low) PCM: Phase Change Memory (II) Phase Change Memory: From Devices to Systems, Morgan & Claypool Pub. 2011
  • 22. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 22 • Inject current to change magnetic polarity • Based on a MTJ (Magnetic Tunnel Junction) device  Reference (fixed) layer (RL): with a fixed magnetic orientation  Free layer (FL): with a variable magnetic orientation • Magnetic orientation of the free layer determines logical state  FL parallel to RL: MTJ with low resistance  FL anti-parallel to RL: MTJ with high resistance STT-MRAM: Spin-Transfer Torque MRAM(I) MTJ
  • 23. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 23 • Write operation  Push large current through MTJ to change orientation of free layer • Read operation  Sense current flow STT-MRAM: Spin-Transfer Torque MRAM(II) MTJ Low resistance High resistance Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative, ISPASS 2013
  • 24. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 24 • Inject current to change atomic structure • Based on a memristor device  Memristor: solid-state device with two resistive states (low/high resistance) RRAM/ReRAM: Resistive RAM(I)
  • 25. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 25 • Passive electronic element that change resistance depending on the electrical field applied  Unipolar memristor: the polarity of the operating voltage IS NOT relevant  Bipolar memristor: the polarity of the operating voltage IS relevant RRAM/ReRAM: Memristor (I) Unipolar Memristor Bipolar Memristor HRS: High Resistance State LRS: Low Resistance State SET RESET SET RESET SET RESET
  • 26. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 26 • Passive electronic element that change resistance depending on the electrical field applied  Digital memristor: exhibits two neat resistance states: LRS and HRS » Codifies 1 bit per memristor  Analog memristor: exhibits multiple resistance windows between LRS and HRS » Codifies several bits (small number) per memristor RRAM/ReRAM: Memristor (II) Digital Memristor Analog Memristor HRS: High Resistance State LRS: Low Resistance State SET RESET LRS HRS
  • 27. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 27 • Memristor cells in a RRAM are usually arranged as a crossbar RRAM/ReRAM: Memristor Crossbar
  • 28. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 28 • Read/Write operations in a RRAM RRAM/ReRAM: Memristor Read/Write
  • 29. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 29 Data-Centric Computing PIM Processing-in-Memory CIM Computation-in-Memory • Memory technologies for PIM • Approaches to implementing PIM A Modern Primer on Processing in Memory, Springer 2021
  • 30. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 30 • PNM: Processing Near Memory  Logic layer in 3D-stacked memory  Function-specific logic connected to 3D-stacked memory (via logic interface)  Silicon interposers (connected directly to TSVs)  Logic in memory controllers • PIM/PUM: Processing In/Using Memory  Logic inside SRAM  Logic inside DRAM  PCM with logic (in cells and/or periphery)  STT-MRAM with logic (in cells and/or periphery)  RRAM with logic (in cells and/or periphery) Approaches to Implementing PIM
  • 31. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 31 • Logic layer in 3D-stacked memory Processing Near Memory (PNM) Accelerator, Reconfigurable logic, Simple processing cores
  • 32. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 32 • Silicon interposer  SOC connected directly to the TSVs in HBM Processing Near Memory (PNM) Accelerator, Reconfigurable logic, Simple processing cores
  • 33. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 33 • Function-specific logic connected to 3D-stacked memory  Processing logic connected HBM channel interface via an HBM controller Processing Near Memory (PNM) Accelerator, Reconfigurable logic, Simple processing cores
  • 34. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 34 • PNM: Processing Near Memory  Logic layer in 3D-stacked memory  Function-specific logic connected to 3D-stacked memory (via logic interface)  Silicon interposers (connected directly to TSVs)  Logic in memory controllers • PIM/PUM: Processing In/Using Memory  Logic inside DRAM  PCM with logic (in cells and/or periphery)  STT-MRAM with logic (in cells and/or periphery)  RRAM with logic (in cells and/or periphery) Approaches to Implementing PIM
  • 35. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 35 • Exploits the memory architecture to enable operations with minimal changes  Makes use of memory cells or cell arrays to perform useful computation • A wide range of different functions are enabled  Fully parallel bulk or fine-grained operations » Data copy/initialization » Bulk bitwise operations » Simple arithmetic operations (addition, multiplication, comparison, majority…) Processing In/Using Memory (PIM/PUM)
  • 36. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 36 • Single or multiple functions PIM/PUM on RRAM: Computation Low Power Memristor-based Computing for Edge-AI Applications, ISCAS 2021
  • 37. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 37 • Analog memristor operations  The analog nature of memristors is used to codify data as resistance value • Example: vector-matrix operations PIM/PUM on RRAM: Analog Computation Scalar product of vectors V=(V1,V2) and G=(G1,G2) Vector Matrix
  • 38. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 38 • Digital memristor operations  Data is codified as resistance values: LRS → 1, HRS → 0 • Example: boolean NOR using unipolar memristors PIM/PUM on RRAM: Digital Computation (I) 𝑉𝑜𝑢𝑡 initializes to 0 (HRS) UPIM : Unipolar Switching Logic for High Density Processing-in-Memory Applications, GLSVLSI 2019
  • 39. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 39 • Digital memristor operations  Data is codified as resistance values: LRS → 1, HRS → 0 • Example: boolean NOR using bipolar memristors  Step 1: set ‘out’ to 1 (LRS)  Step 2: apply 𝑉0 to ‘in’ and GND to ‘out’ (being 𝑉0/2 > 𝑉𝑅𝐸𝑆𝐸𝑇) PIM/PUM on RRAM: Digital Computation (II) Logic Design Within Memristive Memories Using Memristor-Aided loGIC (MAGIC), IEEE Transactions on Nanotechnology 2016 ~0
  • 40. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 40 • Processing in/using memory: SIMD parallelism PIM/PUM on RRAM: Digital Computation (III) ~0
  • 41. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 41 • Example: N-bit fixed-point multiplication  Possible high latency but massive parallelism PIM/PUM on RRAM: Digital Computation (IV)
  • 42. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 42 Data-Centric Computing Own Research PNM Solutions • NATSA: PNM accelerator for time series analysis • FM-Index: PNM accelerator for genome sequence alignment A Modern Primer on Processing in Memory, Springer 2021
  • 43. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 43 NATSA: PNM Accelerator for Time Series Analysis
  • 44. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 44 NATSA: Motivation • Time series analysis has many applications Climate change Medicine Signal processing Economics Astronomy
  • 45. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 45 NATSA: Motifs and Discords • Given a sliced time series into subsequences  Motif discovery focuses on finding similarities  Discord discovery focuses on finding anomalies • Naive example of anomaly detection
  • 46. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 46 NATSA: Matrix Profile (MP) • MP: Algorithm and open source tool for motif and anomaly discovery • A single input parameter: subsequence length
  • 47. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 47 NATSA: Matrix Profile (MP) - SCRIMP • SCRIMP: state-of-the-art CPU matrix profile implementation • SCRIMP on Intel Xeon Phi KNL  SCRIMP is heavily bottlenecked by data movement
  • 48. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 48 NATSA: PNM Accelerator for SCRIMP • NATSA: fully exploit HBM memory bandwidth  NATSA consists of multiple processing units (PUs)  PUs compute batches of diagonals of the distance matrix in a vectorized way • Hardware components  DPU: dot product unit  DCU: distance compute unit  DPUU: dot product update unit  PUU: profile update unit
  • 49. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 49 NATSA: Evaluation • Simulated platforms  ZSIM + Ramulator, McPAT, Aladdin + gem5, Micron power calculator • Real hardware platforms  Intel Xeon Phi KNL, NVIDIA Tesla K40c, NVIDIA GTX 1050 Hardware Cores / PUs Caches (L1 / L2 / L3) Memory DDR4-OoO 8 OoO @ 3.75 GHz 32KB / 256KB / 8MB 16 GB DDR4-2400 DDR4-inOrder 64 in-order @ 2.5 GHz 32KB / - / - 16 GB DDR4-2400 HBM-OoO 8 OoO @ 3.75 GHz 32KB / 256KB / 8MB 4 GB HBM2 HBM-inOrder 64 in-order @ 2.5 GHz 32KB / - / - 4 GB HBM2 NATSA 48 PUs @ 1 GHz 48KB (Scratchpad) 4 GB HBM2
  • 50. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 50 NATSA: Evaluation: Performance and Area • Performance • Area
  • 51. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 51 NATSA: Evaluation: Performance and Area • Power consumption • Energy consumption
  • 52. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 52 TraTSA: Transprecise Accelerator for Time Series Analysis
  • 53. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 53 FM-Index: PNM Accelerator for Genome Sequence Alignment
  • 54. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 54 FM-Index: Motivation • Genome sequence alignment is very relevant in bioinformatics • There is a variety of sequence alignment tools  Bowtie, Bowtie2, BWA, HiSAT2, SOAP, CUS… • Modern sequencing tools generate a lot of data • Many sequence aligners are based on the FM-index data structure
  • 55. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 55 FM-Index: Data Structure • FM-index data structure
  • 56. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 56 FM-Index: Algorithm • FM-index exact matching algorithm (backward search) Most time-consuming part Random memory access pattern Memory bound
  • 57. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 57 FM-Index: Algorithm Optimizations • FM-index exact matching algorithm (backward search)
  • 58. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 58 FM-Index: Split Bit-Vector Sampled • Minimizing memory bandwidth consumption
  • 59. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 59 FM-Index: Evaluation • Experimental setup  20M 200-symbol queries generated by Mason simulation tool  GRCh38 (3 Gbases) human genome reference
  • 60. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 60 FM-Index: Evaluation • KNL roofline
  • 61. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 61 FM-Index: PNM Accelerator • Logic layer in 3D-stacked memory (HMC)
  • 62. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 62 FM-Index: PNM Accelerator • Performance
  • 63. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 63 FM-Index: PNM Accelerator • Power consumption
  • 64. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 64 FM-Index: Architecture Exploration
  • 65. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 65 Data-Centric Computing Commercial PNM Solutions • UPMEM • Samsung FIMDRAM
  • 66. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 66 • Standard DIMM modules • Key features:  Relies on mature 2D DRAM fabrication process  DPUs support a wide variety of computations and data types  Threads in DPUs are independent  Complete software stack (C language) UPMEM: DRAM PIM Accelerator Processing in DRAM Engine https://www.upmem.com • DDR4-2400 R-DIMM modules:  PIM chip: 64MB + 8 DPUs @400-500MHz  PIM DIMM: 16 PIM chips (8GB + 128 DPUs @ 400-500MHz)  Server: up to 20 PIM DIMMS (160GB + 2560 DPUs) PIM chip PIM DIMM
  • 67. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 67 • PIM DIMM: 8GB DRAM DDR4 2400 + 256 DPUs @ 400-500MHz UPMEM PIM System Organization UPMEM PIM White Paper, 2018 DPUs Main RAM Instruction RAM Working RAM (scratchpad) Copying data from Main Mem to DPU MRAM Retrieving results from DPU MRAM to Main Mem NO direct communication DPU – DPU!
  • 68. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 68 • PIM DIMM: 8GB DRAM DDR4 2400 + 256 DPUs @ 400-500MHz  DPU: Multithreaded in-order 32-bit RISC core with NO cache hierarchy  24 hardware contexts (threads), each with 24 32-bit GP registers ❻  Hardware threads share IRAM and WRAM (for operands)  14-stage pipeline ❼ UPMEM PIM DPU Architecture UPMEM PIM White Paper, 2018 D F1 F2 F3 R1 R2 R3 F A1 A2 A3 A4 M1 M2 D F1 F2 F3 R1 R2 R3 F A1 A2 A3 A4 M1 M2 11 cycles Instruction overlapping
  • 69. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 69 • PIM DIMM: 8GB DRAM DDR4 2400 + 256 DPUs @ 400-500MHz  DPU max. frequency: 400 MHz  MRAM-WRAM bandwidth per DPU: 800 MB/s  MRAM-WRAM bandwidth per DIMM: 102 GB/s  Max. aggregated MRAM-WRAM bandwidth (20 DIMMS): ~2 TB/s UPMEM PIM Data Throughput UPMEM PIM White Paper, 2018
  • 70. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 70 • SPMD programming model  A software thread is called tasklet » (1) All tasklets execute the same code on different data pieces » (2) Tasklets can execute different control-flow paths at runtime  Up to 24 tasklets can run in parallel in a single DPU  Tasklets are statically assigned to each DPU  Intra-DPU tasklets can share data in MRAM and WRAM, and can synchronize  Inter-DPU tasklets do not share memory or any direct communication channel • Programs are written in C with library calls (UPMEM SDK) • UPMEM runtime library  Calls to move instructions from MRAM to IRAM  Calls to move data between MRAM and WRAM  Calls to move data between Main Memory and DPU MRAM  Calls for locks, barriers, handshaking, semaphores… UPMEM PIM Programming UPMEM PIM White Paper, 2018
  • 71. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 71 UPMEM PIM Evaluation (I) ACM SIGMETRICS (Fall Edition) 2021
  • 72. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 72 UPMEM PIM Evaluation (II) • PrIM Benchmarks
  • 73. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 73 UPMEM PIM Evaluation (III) • Key takeaways  Takeaway 1 » The UPMEM PIM architecture is fundamentally compute bound » As a result, the most suitable workloads are memory-bound  Takeaway 2 » The most well-suited workloads for the UPMEM PIM architecture use no arithmetic operations or use only simple operations (e.g., bitwise operations and integer addition/subtraction)  Takeaway 3 » The most well-suited workloads for the UPMEM PIM architecture require little or no communication across DRAM Processing Units (inter-DPU communication).  Takeaway 4 » UPMEM PIM systems outperform state-of-the-art CPUs in terms of performance and energy efficiency on most of PrIM benchmarks » UPMEM PIM systems outperform state-of-the-art GPUs on a majority of PrIM benchmarks, and the outlook is even more positive for future PIM systems » UPMEM PIM systems are more energy-efficient than state-of-the-art CPUs and GPUs on workloads that they provide performance improvements over the CPUs and the GPUs
  • 74. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 74 • Function-in-Memory DRAM based on HBM2 (HBM-PIM)  HBM2-based memory with integrated AI processors Samsung FIMDRAM: PIM AI Accelerator Processing in DRAM Engine ISSCC 2021
  • 75. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 75 • Programmable Computing Unit (PCU)  Interface unit to control data flow  Execution unit to perform operations Samsung FIMDRAM: PCU ISSCC 2021 Instruction memory Constants for MAC operations Weight and accumulation
  • 76. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 76 • Mixed design methodology  Full-custom + digital RTL for PCU block Samsung FIMDRAM: Chip Implementation ISSCC 2021
  • 77. Dept. Arquitectura de Computadores, Univ. de Málaga Oscar Plata May 2021 UCM, Madrid 77 • Performance and energy efficiency Samsung FIMDRAM: Experimental Results ISSCC 2021 HBM2 FIMDRAM Type of DRAM HBM2 HBM2 Process 20 nm 20 nm Memory density 8GB / cube 6GB / cube Data rate 2.4Gbps 2.4Gbps Bandwidth 307GB/s per cube 307GB/s per cube # channels 8 per cube 8 per cube # processing units No 128 per cube Processing operation speed - 300MHz Peak throughput - 1.2 TFLOPS per cube Operation precision - FP16
  • 78. 78 Processing in Memory Oscar Plata oplata@uma.es Dept. Arquitectura de Computadores Universidad de Málaga HPACuma Research Group ALDEBARAN Research Group HPACuma Research Group UCM, May 27, 2021