An NSA Big Graph experiment
Paul Burkhardt, Chris Waring
U.S. National Security Agency
Research Directorate - R6
Technical Report NSA-RD-2013-056002v1
May 20, 2013
Graphs are everywhere!
A graph is a collection of binary relationships, i.e. networks of
pairwise interactions including social networks, digital networks. . .
road networks
utility grids
internet
protein interactomes
M. tuberculosis
Vashisht, PLoS ONE 7(7), 2012
Brain network of C. elegans
Watts, Strogatz, Nature 393(6684), 1998
Scale of the first graph
Nearly 300 years ago, the first graph problem consisted of 4 vertices and 7 edges: the Seven Bridges of Königsberg problem.
Crossing the River Pregel
Is it possible to cross each of the Seven Bridges of Königsberg exactly once?
Not too hard to fit in “memory”.
Scale of real-world graphs
Graph scale in current Computer Science literature
On the order of billions of edges and tens of gigabytes.
AS by Skitter (AS-Skitter) — internet topology in 2005 (n = router, m = traceroute)
LiveJournal (LJ) — social network (n = members, m = friendship)
U.S. Road Network (USRD) — road network (n = intersections, m = roads)
Billion Triple Challenge (BTC) — RDF dataset 2009 (n = object, m = relationship)
WWW of UK (WebUK) — Yahoo Web spam dataset (n = pages, m = hyperlinks)
Twitter graph (Twitter) — Twitter network [1] (n = users, m = tweets)
Yahoo! Web Graph (YahooWeb) — WWW pages in 2002 (n = pages, m = hyperlinks)
Popular graph datasets in current literature
Dataset      n (vertices, millions)   m (edges, millions)   Size
AS-Skitter           1.7                      11            142 MB
LJ                   4.8                      69            337.2 MB
USRD                  24                      58            586.7 MB
BTC                  165                     773            5.3 GB
WebUK                106                    1877            8.6 GB
Twitter               42                    1470            24 GB
YahooWeb            1413                    6636            120 GB
[1] http://an.kaist.ac.kr/traces/WWW2010.html
Big Data begets Big Graphs
The increasing volume, velocity, and variety of Big Data pose significant challenges to scalable algorithms.
Big Data Graphs
How will graph applications adapt to Big Data at petabyte scale?
Ability to store and process Big Graphs impacts typical data
structures.
Orders of magnitude
kilobyte (KB) = 2^10
megabyte (MB) = 2^20
gigabyte (GB) = 2^30
terabyte (TB) = 2^40
petabyte (PB) = 2^50
exabyte (EB) = 2^60
Undirected graph data structure space complexity
Θ(bytes) × Θ(n²)       adjacency matrix
Θ(bytes) × Θ(n + 4m)   adjacency list
Θ(bytes) × Θ(4m)       edge list
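The storage figures quoted on the next three slides (social, web, and brain scale) follow from this cost model. As a rough check, here is a minimal Python sketch; the constants are assumptions inferred from the quoted numbers, namely 1 bit per adjacency-matrix cell and 8-byte words for the adjacency-list and edge-list terms, rather than values stated in the deck.

```python
# Rough storage estimates for an undirected graph with n vertices and m edges,
# using the deck's cost model: Theta(n^2) matrix cells at 1 bit each, and
# 8-byte words for the Theta(n + 4m) adjacency-list and Theta(4m) edge-list terms.
WORD = 8  # bytes per stored word (assumption)

def sizes(n, m):
    matrix    = n * n / 8           # 1 bit per cell, converted to bytes
    adj_list  = WORD * (n + 4 * m)
    edge_list = WORD * 4 * m
    return matrix, adj_list, edge_list

def human(nbytes):
    for unit in ("B", "KB", "MB", "GB", "TB", "PB", "EB"):
        if nbytes < 1024:
            return f"{nbytes:.2f} {unit}"
        nbytes /= 1024
    return f"{nbytes:.2f} ZB"

# Social, web, and brain scales from the following slides; the ~1.06 ZB brain
# matrix is the same quantity the deck expresses as 2.08 mNA bytes.
for label, n, m in [("social", 1e9, 100e9),
                    ("web", 50e9, 1e12),
                    ("brain", 100e9, 100e12)]:
    print(label, *(human(x) for x in sizes(n, m)))
```

Running it reproduces roughly 111 PB, 2.92 TB, and 2.91 TB for the social-scale graph, and correspondingly for the web and brain scales below.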
Social scale. . .
1 billion vertices, 100 billion edges
111 PB adjacency matrix
2.92 TB adjacency list
2.92 TB edge list
Twitter graph from Gephi dataset
(http://www.gephi.org)
Web scale. . .
50 billion vertices, 1 trillion edges
271 EB adjacency matrix
29.5 TB adjacency list
29.1 TB edge list
Internet graph from the Opte Project
(http://www.opte.org/maps)
Web graph from the SNAP database
(http://snap.stanford.edu/data)
Brain scale. . .
100 billion vertices, 100 trillion edges
2.08 mN_A · bytes [2] (molar bytes) adjacency matrix
2.84 PB adjacency list
2.84 PB edge list
Human connectome.
Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011
[2] N_A = 6.022 × 10^23 mol^−1
Benchmarking scalability on Big Graphs
Big Graphs challenge our conventional thinking on both algorithms
and computer architecture.
New Graph500.org benchmark provides a foundation for
conducting experiments on graph datasets.
Graph500 benchmark
Problem classes from 17 GB to 1 PB — many times larger than
common datasets in literature.
Graph algorithms are challenging
Difficult to parallelize. . .
irregular data access increases latency
skewed data distribution creates bottlenecks
giant component
high degree vertices
Increased size imposes greater. . .
latency
resource contention (i.e. hot-spotting)
Algorithm complexity really matters!
A run-time of O(n²) on a trillion-node graph is not practical!
Problem: How do we store and process Big Graphs?
Conventional approach is to store and compute in-memory.
SHARED-MEMORY
Parallel Random Access Machine (PRAM)
data in globally-shared memory
implicit communication by updating memory
fast random access
DISTRIBUTED-MEMORY
Bulk Synchronous Parallel (BSP)
data distributed to local, private memory
explicit communication by sending messages
easier to scale by adding more machines
Memory is fast but. . .
Algorithms must exploit computer memory hierarchy.
designed for spatial and temporal locality
registers, L1, L2, L3 caches, TLB, pages, disk. . .
great for unit-stride access common in many scientific codes,
e.g. linear algebra
But common graph algorithm implementations have. . .
lots of random access to memory causing. . .
many cache and TLB misses
Poor locality increases latency. . .
QUESTION: What is the memory throughput if 90% TLB hit
and 0.01% page fault on miss?
Effective memory access time
T_n = p_n·l_n + (1 − p_n)·T_{n−1}
Example
TLB = 20 ns, RAM = 100 ns, DISK = 10 ms (10 × 10^6 ns)
T_2 = p_2·l_2 + (1 − p_2)(p_1·l_1 + (1 − p_1)·T_0)
    = .9(TLB + RAM) + .1(.9999(TLB + 2·RAM) + .0001(DISK))
    = .9(120 ns) + .1(.9999(220 ns) + 1000 ns) ≈ 230 ns
ANSWER: 33 MB/s
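A short Python check of the arithmetic above. The 8-byte transfer per access used to turn the 230 ns effective access time into a throughput is an assumption (it is what reproduces roughly 33 MB/s); the slide does not state the transfer size.

```python
# Effective memory access time with a 90% TLB hit rate and a 0.01% page-fault
# rate on a TLB miss, using the latencies from the example above.
TLB, RAM, DISK = 20, 100, 10e6                      # nanoseconds

t_hit  = TLB + RAM                                  # TLB hit: translation + access
t_miss = 0.9999 * (TLB + 2 * RAM) + 0.0001 * DISK   # page-table walk, rare fault to disk
t_eff  = 0.9 * t_hit + 0.1 * t_miss
print(f"effective access time: {t_eff:.0f} ns")     # ~230 ns

BYTES_PER_ACCESS = 8                                # assumption, not stated on the slide
mb_per_s = BYTES_PER_ACCESS / t_eff * 1e9 / 2**20
print(f"throughput: {mb_per_s:.0f} MB/s")           # ~33 MB/s
```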
If it fits. . .
Graph problems that fit in memory can leverage excellent advances
in architecture and libraries. . .
Cray XMT2 designed for latency-hiding
SGI UV2 designed for large, cache-coherent shared-memory
body of literature and libraries
Parallel Boost Graph Library (PBGL) — Indiana University
Multithreaded Graph Library (MTGL) — Sandia National Labs
GraphCT/STINGER — Georgia Tech
GraphLab — Carnegie Mellon University
Giraph — Apache Software Foundation
But some graphs do not fit in memory. . .
We can add more memory but. . .
Memory capacity is limited by. . .
number of CPU pins, memory controller channels, DIMMs per
channel
memory bus width — parallel traces of same length, one line
per bit
Globally-shared memory limited by. . .
CPU address space
cache-coherency
Larger systems, greater latency. . .
Increasing memory can increase latency.
traverse more memory addresses
larger system with greater physical distance between machines
Light is only so fast (but still faster than neutrinos!)
It takes approximately 1 nanosecond for light to travel 0.30 meters
Latency causes significant inefficiency in new CPU architectures.
IBM PowerPC A2
A single 1.6 GHz PowerPC A2 can perform 204.8
operations per nanosecond!
Easier to increase capacity using disks
Current Intel Xeon E5 architectures:
384 GB max. per CPU (4 channels × 3 DIMMs × 32 GB)
64 TB max. globally-shared memory (46-bit address space)
3881 dual Xeon E5 motherboards to store Brain Graph — 98
racks
Disk capacity not unlimited but higher than memory.
Largest capacity HDD on market
4 TB HDD — need 728 drives to store the Brain Graph, which fit in 5 racks (4 drives per chassis)
Disk is not enough—applications will still require memory for
processing
Top supercomputer installations
Largest supercomputer installations do not have enough memory
to process the Brain Graph (3 PB)!
Titan Cray XK7 at ORNL — #1 Top500 in 2012
0.5 million cores
710 TB memory
8.2 megawatts [3]
4300 sq.ft. (NBA basketball court is 4700 sq.ft.)
Sequoia IBM Blue Gene/Q at LLNL — #1 Graph500 in 2012
1.5 million cores
1 PB memory
7.9 megawatts [3]
3000 sq.ft.
Electrical power cost
At 10 cents per kilowatt-hour — $7 million per year to
keep the lights on!
[3] Power specs from http://www.top500.org/list/2012/11
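A back-of-the-envelope check of that figure, assuming a roughly 8.2 MW draw runs around the clock at $0.10 per kWh:

```python
# Annual electricity cost for a machine drawing ~8.2 MW continuously.
power_kw = 8.2e3
hours_per_year = 24 * 365
cost = power_kw * hours_per_year * 0.10        # dollars at $0.10 per kWh
print(f"${cost / 1e6:.1f} million per year")   # ~$7.2 million
```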
Cloud architectures can cope with graphs at Big Data scales
Cloud technologies designed for Big Data problems.
scalable to massive sizes
fault-tolerant; restarting algorithms on Big Graphs is expensive
simpler programming model; MapReduce
Big Graphs from real data are not static. . .
Distributed key, value repositories
Much better for storing a distributed, dynamic graph than a
distributed filesystem like HDFS.
create index on edges for fast random-access and . . .
sort edges for efficient block sequential access
updates can be fast but . . .
not transactional; eventually consistent
NSA BigTable — Apache Accumulo
The Google BigTable paper inspired a small group of NSA researchers to develop an implementation with cell-level security.
Apache Accumulo
Open-sourced in 2011 under the Apache Software Foundation.
Using Apache Accumulo for Big Graphs
Accumulo records are flexible and can be used to store graphs with
typed edges and weights.
store graph as distributed edge list in an Edge Table
edges are natural key, value pairs, i.e. vertex pairs
updates to edges happen in memory and are immediately
available to queries and . . .
updates are written to write-ahead logs for fault-tolerance
Apache Accumulo key, value record
KEY (ROW ID, COLUMN FAMILY, COLUMN QUALIFIER, COLUMN VISIBILITY, TIMESTAMP) → VALUE
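To make the record layout concrete, here is a plain-Python model of one edge stored under that key structure. The schema choices (source vertex as row id, edge type as column family, destination vertex as column qualifier, weight as the value) are illustrative assumptions, not the Edge Table's exact schema, and this models the record only rather than calling the Accumulo client API.

```python
# A plain-Python model of an Accumulo-style key/value record for one edge.
# Field roles here are illustrative assumptions, not the deck's exact schema.
from dataclasses import dataclass
import time

@dataclass(frozen=True)
class EdgeRecord:
    row_id: str          # source vertex: rows sort and shard by source
    family: str          # edge type, e.g. "follows"
    qualifier: str       # destination vertex
    visibility: str      # cell-level security label
    timestamp: int       # milliseconds since the epoch
    value: bytes         # edge weight or other payload

def edge(src, dst, etype="follows", weight=1, visibility="U"):
    return EdgeRecord(row_id=src, family=etype, qualifier=dst,
                      visibility=visibility,
                      timestamp=int(time.time() * 1000),
                      value=str(weight).encode())

print(edge("alice", "bob"))
```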
But for bulk-processing, use MapReduce
Big Graph processing is suitable for MapReduce. . .
large, bulk writes to HDFS are faster (no need to sort)
temporary scratch space (local writes)
greater control over parallelization of algorithms
Quick overview of MapReduce
MapReduce processes data as key, value pairs in three steps:
1 MAP
map tasks independently process blocks of data in parallel and
output key, value pairs
map tasks execute user-defined “map” function
2 SHUFFLE
sorts key, value pairs and distributes to reduce tasks
ensures value set for a key is processed by one reduce task
framework executes a user-defined “partitioner” function
3 REDUCE
reduce tasks process assigned (key, value set) in parallel
reduce tasks execute user-defined “reduce” function
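As a toy illustration of the three steps, the sketch below simulates map, shuffle, and reduce in a single Python process to compute vertex degrees from a small edge list. It mimics the programming model only; it does not use Hadoop.

```python
# Toy, in-process illustration of MAP, SHUFFLE, and REDUCE:
# computing vertex degrees from an undirected edge list.
from itertools import groupby
from operator import itemgetter

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

# 1. MAP: each map task independently emits key, value pairs.
def map_fn(edge):
    u, v = edge
    yield (u, 1)
    yield (v, 1)          # undirected: count the edge at both endpoints

mapped = [kv for e in edges for kv in map_fn(e)]

# 2. SHUFFLE: sort by key and group, so one reduce task sees all values for a key.
shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))

# 3. REDUCE: each reduce task processes one (key, value set).
def reduce_fn(key, values):
    return (key, sum(v for _, v in values))

degrees = [reduce_fn(k, vs) for k, vs in shuffled]
print(degrees)            # [('a', 2), ('b', 2), ('c', 3), ('d', 1)]
```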
MapReduce is not great for iterative algorithms
Iterative algorithms are implemented as a sequence of
MapReduce jobs, i.e. rounds.
But iteration in MapReduce is expensive. . .
temporary data written to disk
sort/shuffle may be unnecessary for each round
framework overhead for scheduling tasks
Good principles for MapReduce algorithms
Be prepared to Think in MapReduce. . .
avoid iteration and minimize number of rounds
limit child JVM memory
number of concurrent tasks is limited by per machine RAM
set IO buffers carefully to avoid spills (requires memory!)
pick a good partitioner
write raw comparators
leverage compound keys
minimizes hot-spots by distributing on key
secondary-sort on compound keys is almost free
Round-Memory tradeoff
Constant, O(1), in memory and rounds.
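A small simulation of the compound-key principle above: partition on the first key component only, so all records for a vertex land on one reducer, but sort on the full (vertex, distance) key so each reducer receives its records already ordered by distance, which is why the secondary sort is nearly free. The records are made up for illustration.

```python
# Partition on the vertex component of a compound (vertex, distance) key,
# then sort each partition on the full key (secondary sort for free).
records = [("v2", 3), ("v1", 0), ("v2", 1), ("v1", 2), ("v3", 1)]
num_reducers = 2

partitions = [[] for _ in range(num_reducers)]
for vertex, dist in records:
    p = hash(vertex) % num_reducers      # partition on vertex only
    partitions[p].append((vertex, dist))

for p, part in enumerate(partitions):
    part.sort()                          # sort on the full compound key
    print(f"reducer {p}: {part}")
```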
Best of both worlds – MapReduce and Accumulo
Store the Big Graph in an Accumulo Edge Table . . .
great for updating a big, typed, multi-graph
more timely (lower latency) access to graph
Then extract a sub-graph from the Edge Table and use
MapReduce for graph analysis.
How well does this really work?
Prove it on the industry benchmark, Graph500.org!
Breadth-First Search (BFS) on an undirected R-MAT Graph
count Traversed Edges per Second (TEPS)
n = 2^scale, m = 16 × n
[Chart: Graph500 problem-class storage in terabytes versus scale]
Graph500 Problem Sizes
Class Scale Storage
Toy 26 17 GB
Mini 29 140 GB
Small 32 1 TB
Medium 36 17 TB
Large 39 140 TB
Huge 42 1.1 PB
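The storage column can be reproduced from the generator parameters n = 2^scale and m = 16 × n. In the sketch below, 16 bytes per stored edge (two 8-byte vertex identifiers) is an assumption that matches the table's rounded figures; sizes are reported in decimal units.

```python
# Graph500 problem-class sizes from n = 2**scale vertices and m = 16*n edges,
# assuming 16 bytes per stored edge (two 8-byte vertex identifiers).
classes = {"Toy": 26, "Mini": 29, "Small": 32, "Medium": 36, "Large": 39, "Huge": 42}

for name, scale in classes.items():
    size = 16.0 * 16 * 2 ** scale          # bytes per edge * edges per vertex * vertices
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if size < 1000:
            break
        size /= 1000
    print(f"{name:6s} scale {scale}: {size:.3g} {unit}")
```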
Walking a ridiculously Big Graph. . .
Everything is harder at scale!
performance is dominated by ability to acquire adjacencies
requires specialized MapReduce partitioner to load-balance
queries from MapReduce to Accumulo Edge Table
Space-efficient Breadth-First Search is critical!
create vertex, distance records
sliding window over distance records to minimize input at each iteration k
reduce tasks get adjacencies from Edge Table but only if. . .
all distance values for a vertex, v, are equal to k
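The sketch below walks this scheme on a toy in-memory graph, one frontier expansion per call: vertices whose distance records all equal k fetch their adjacencies (a stand-in for the Accumulo Edge Table query) and emit new records at distance k + 1. It illustrates the bookkeeping only; the real computation runs as MapReduce rounds with the specialized partitioner mentioned above, and the graph used here is made up.

```python
# Toy, single-machine sketch of one round (iteration k) of the
# space-efficient BFS: only frontier vertices fetch adjacencies.
from collections import defaultdict

adjacency = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}

def bfs_round(distance_records, k):
    by_vertex = defaultdict(list)
    for v, d in distance_records:
        by_vertex[v].append(d)
    new_records = list(distance_records)
    for v, dists in by_vertex.items():
        if all(d == k for d in dists):        # v is on the frontier
            for w in adjacency.get(v, []):    # stand-in for the Edge Table lookup
                if w not in by_vertex:        # no distance record yet
                    new_records.append((w, k + 1))
    return new_records

records = [("a", 0)]                  # BFS from source vertex "a"
for k in range(3):
    records = bfs_round(records, k)
print(sorted(set(records)))           # [('a', 0), ('b', 1), ('c', 1), ('d', 2)]
```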
Graph500 Experiment
Graph500 Huge class — scale 42
2^42 (4.40 trillion) vertices
2^46 (70.4 trillion) edges
1 Petabyte
Cluster specs
1200 nodes
2 Intel quad-core per node
48 GB RAM per node
7200 RPM SATA drives
Huge problem is 19.5x
more than cluster
memory
linear performance from
1 trillion to 70 trillion
edges. . .
despite multiple
hardware failures!
[Chart: Graph500 scalability benchmark, traversed edges (trillions) versus time (hours) for problem classes at scales 36, 39, and 42 (17, 140, and 1126 TB); cluster memory = 57.6 TB]
Future work
What are the tradeoffs between disk- and memory-based Big
Graph solutions?
power–space–cooling efficiency
software development and maintenance
Can a hybrid approach be viable?
memory-based processing for subgraphs or incremental
updates
disk-based processing for global analysis
Advances in memory and disk technologies will blur the lines. Will
solid-state disks be another level of memory?
Paul Burkhardt, Chris Waring An NSA Big Graph experiment

More Related Content

What's hot

High Performance Computing - Challenges on the Road to Exascale Computing
High Performance Computing - Challenges on the Road to Exascale ComputingHigh Performance Computing - Challenges on the Road to Exascale Computing
High Performance Computing - Challenges on the Road to Exascale ComputingHeiko Joerg Schick
 
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...Heiko Joerg Schick
 
Data Capacitor II at Indiana University
Data Capacitor II at Indiana UniversityData Capacitor II at Indiana University
Data Capacitor II at Indiana Universityinside-BigData.com
 
On the diversity and availability of temporal information in linked open data
On the diversity and availability of temporal information in linked open dataOn the diversity and availability of temporal information in linked open data
On the diversity and availability of temporal information in linked open dataAnisa Rula
 
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)PhtRaveller
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascaleinside-BigData.com
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech ProjectsJody Garnett
 
AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...
AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...
AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...Amazon Web Services
 
Collaboration, Big Data and the search for the Higgs Boson
Collaboration, Big Data and the  search for the Higgs BosonCollaboration, Big Data and the  search for the Higgs Boson
Collaboration, Big Data and the search for the Higgs BosonSuma Pria Tunggal
 
[Japanese]Obake-GAN (Perturbative GAN): GAN with Perturbation Layers
[Japanese]Obake-GAN (Perturbative GAN): GAN with Perturbation Layers[Japanese]Obake-GAN (Perturbative GAN): GAN with Perturbation Layers
[Japanese]Obake-GAN (Perturbative GAN): GAN with Perturbation Layersyumakishi
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGaiKohei KaiGai
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemJayjeetChakraborty
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechGeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechRob Emanuele
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationRob Emanuele
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataZhong Wang
 
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic... NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...Igor Sfiligoi
 
Big Data for Big Discoveries
Big Data for Big DiscoveriesBig Data for Big Discoveries
Big Data for Big DiscoveriesGovnet Events
 
"Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner""Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner"Frank Wuerthwein
 

What's hot (20)

Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
High Performance Computing - Challenges on the Road to Exascale Computing
High Performance Computing - Challenges on the Road to Exascale ComputingHigh Performance Computing - Challenges on the Road to Exascale Computing
High Performance Computing - Challenges on the Road to Exascale Computing
 
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
 
Data Capacitor II at Indiana University
Data Capacitor II at Indiana UniversityData Capacitor II at Indiana University
Data Capacitor II at Indiana University
 
On the diversity and availability of temporal information in linked open data
On the diversity and availability of temporal information in linked open dataOn the diversity and availability of temporal information in linked open data
On the diversity and availability of temporal information in linked open data
 
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
 
AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...
AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...
AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...
 
Collaboration, Big Data and the search for the Higgs Boson
Collaboration, Big Data and the  search for the Higgs BosonCollaboration, Big Data and the  search for the Higgs Boson
Collaboration, Big Data and the search for the Higgs Boson
 
[Japanese]Obake-GAN (Perturbative GAN): GAN with Perturbation Layers
[Japanese]Obake-GAN (Perturbative GAN): GAN with Perturbation Layers[Japanese]Obake-GAN (Perturbative GAN): GAN with Perturbation Layers
[Japanese]Obake-GAN (Perturbative GAN): GAN with Perturbation Layers
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage System
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechGeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis Presentation
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic... NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 
Big Data for Big Discoveries
Big Data for Big DiscoveriesBig Data for Big Discoveries
Big Data for Big Discoveries
 
"Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner""Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner"
 

Similar to An NSA Big Graph experiment

High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...Larry Smarr
 
Agents In An Exponential World Foster
Agents In An Exponential World FosterAgents In An Exponential World Foster
Agents In An Exponential World FosterIan Foster
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Matej Misik
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
Clouds, Grids and Data
Clouds, Grids and DataClouds, Grids and Data
Clouds, Grids and DataGuy Coates
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Larry Smarr
 
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...Larry Smarr
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineG. Bruce Berriman
 
Auscert Finding needles in haystacks (the size of countries)
Auscert Finding needles in haystacks (the size of countries)Auscert Finding needles in haystacks (the size of countries)
Auscert Finding needles in haystacks (the size of countries)packetloop
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor DesignSri Prasanna
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesIan Foster
 
Making Sense of Information Through Planetary Scale Computing
Making Sense of Information Through Planetary Scale ComputingMaking Sense of Information Through Planetary Scale Computing
Making Sense of Information Through Planetary Scale ComputingLarry Smarr
 
Riding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New ScienceRiding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New ScienceLarry Smarr
 
Big Data presentation Tensing
Big Data presentation TensingBig Data presentation Tensing
Big Data presentation Tensingtensing-gis
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowskaguest43b4df3
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World LazowskaWCET
 
Machine Learning and Artificial Intelligence
Machine Learning and Artificial IntelligenceMachine Learning and Artificial Intelligence
Machine Learning and Artificial IntelligenceMarketingArrowECS_CZ
 

Similar to An NSA Big Graph experiment (20)

High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
High Performance Cyberinfrastructure is Needed to Enable Data-Intensive Scien...
 
Agents In An Exponential World Foster
Agents In An Exponential World FosterAgents In An Exponential World Foster
Agents In An Exponential World Foster
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Clouds, Grids and Data
Clouds, Grids and DataClouds, Grids and Data
Clouds, Grids and Data
 
Big Data and OSS at IBM
Big Data and OSS at IBMBig Data and OSS at IBM
Big Data and OSS at IBM
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
Addressing dm-cloud
Addressing dm-cloudAddressing dm-cloud
Addressing dm-cloud
 
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engine
 
Auscert Finding needles in haystacks (the size of countries)
Auscert Finding needles in haystacks (the size of countries)Auscert Finding needles in haystacks (the size of countries)
Auscert Finding needles in haystacks (the size of countries)
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 
Making Sense of Information Through Planetary Scale Computing
Making Sense of Information Through Planetary Scale ComputingMaking Sense of Information Through Planetary Scale Computing
Making Sense of Information Through Planetary Scale Computing
 
Riding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New ScienceRiding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New Science
 
Big Data presentation Tensing
Big Data presentation TensingBig Data presentation Tensing
Big Data presentation Tensing
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowska
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowska
 
Machine Learning and Artificial Intelligence
Machine Learning and Artificial IntelligenceMachine Learning and Artificial Intelligence
Machine Learning and Artificial Intelligence
 
Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013
 

More from Trieu Nguyen

Building Your Customer Data Platform with LEO CDP in Travel Industry.pdf
Building Your Customer Data Platform with LEO CDP in Travel Industry.pdfBuilding Your Customer Data Platform with LEO CDP in Travel Industry.pdf
Building Your Customer Data Platform with LEO CDP in Travel Industry.pdfTrieu Nguyen
 
Building Your Customer Data Platform with LEO CDP - Spa and Hotel Business
Building Your Customer Data Platform with LEO CDP - Spa and Hotel BusinessBuilding Your Customer Data Platform with LEO CDP - Spa and Hotel Business
Building Your Customer Data Platform with LEO CDP - Spa and Hotel BusinessTrieu Nguyen
 
Building Your Customer Data Platform with LEO CDP
Building Your Customer Data Platform with LEO CDP Building Your Customer Data Platform with LEO CDP
Building Your Customer Data Platform with LEO CDP Trieu Nguyen
 
How to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDPHow to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDPTrieu Nguyen
 
[Notes] Customer 360 Analytics with LEO CDP
[Notes] Customer 360 Analytics with LEO CDP[Notes] Customer 360 Analytics with LEO CDP
[Notes] Customer 360 Analytics with LEO CDPTrieu Nguyen
 
Leo CDP - Pitch Deck
Leo CDP - Pitch DeckLeo CDP - Pitch Deck
Leo CDP - Pitch DeckTrieu Nguyen
 
LEO CDP - What's new in 2022
LEO CDP  - What's new in 2022LEO CDP  - What's new in 2022
LEO CDP - What's new in 2022Trieu Nguyen
 
Lộ trình triển khai LEO CDP cho ngành bất động sản
Lộ trình triển khai LEO CDP cho ngành bất động sảnLộ trình triển khai LEO CDP cho ngành bất động sản
Lộ trình triển khai LEO CDP cho ngành bất động sảnTrieu Nguyen
 
Why is LEO CDP important for digital business ?
Why is LEO CDP important for digital business ?Why is LEO CDP important for digital business ?
Why is LEO CDP important for digital business ?Trieu Nguyen
 
From Dataism to Customer Data Platform
From Dataism to Customer Data PlatformFrom Dataism to Customer Data Platform
From Dataism to Customer Data PlatformTrieu Nguyen
 
Data collection, processing & organization with USPA framework
Data collection, processing & organization with USPA frameworkData collection, processing & organization with USPA framework
Data collection, processing & organization with USPA frameworkTrieu Nguyen
 
Part 1: Introduction to digital marketing technology
Part 1: Introduction to digital marketing technologyPart 1: Introduction to digital marketing technology
Part 1: Introduction to digital marketing technologyTrieu Nguyen
 
Why is Customer Data Platform (CDP) ?
Why is Customer Data Platform (CDP) ?Why is Customer Data Platform (CDP) ?
Why is Customer Data Platform (CDP) ?Trieu Nguyen
 
How to build a Personalized News Recommendation Platform
How to build a Personalized News Recommendation PlatformHow to build a Personalized News Recommendation Platform
How to build a Personalized News Recommendation PlatformTrieu Nguyen
 
How to grow your business in the age of digital marketing 4.0
How to grow your business  in the age of digital marketing 4.0How to grow your business  in the age of digital marketing 4.0
How to grow your business in the age of digital marketing 4.0Trieu Nguyen
 
Video Ecosystem and some ideas about video big data
Video Ecosystem and some ideas about video big dataVideo Ecosystem and some ideas about video big data
Video Ecosystem and some ideas about video big dataTrieu Nguyen
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Trieu Nguyen
 
Open OTT - Video Content Platform
Open OTT - Video Content PlatformOpen OTT - Video Content Platform
Open OTT - Video Content PlatformTrieu Nguyen
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisTrieu Nguyen
 
Introduction to Recommendation Systems (Vietnam Web Submit)
Introduction to Recommendation Systems (Vietnam Web Submit)Introduction to Recommendation Systems (Vietnam Web Submit)
Introduction to Recommendation Systems (Vietnam Web Submit)Trieu Nguyen
 

More from Trieu Nguyen (20)

Building Your Customer Data Platform with LEO CDP in Travel Industry.pdf
Building Your Customer Data Platform with LEO CDP in Travel Industry.pdfBuilding Your Customer Data Platform with LEO CDP in Travel Industry.pdf
Building Your Customer Data Platform with LEO CDP in Travel Industry.pdf
 
Building Your Customer Data Platform with LEO CDP - Spa and Hotel Business
Building Your Customer Data Platform with LEO CDP - Spa and Hotel BusinessBuilding Your Customer Data Platform with LEO CDP - Spa and Hotel Business
Building Your Customer Data Platform with LEO CDP - Spa and Hotel Business
 
Building Your Customer Data Platform with LEO CDP
Building Your Customer Data Platform with LEO CDP Building Your Customer Data Platform with LEO CDP
Building Your Customer Data Platform with LEO CDP
 
How to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDPHow to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDP
 
[Notes] Customer 360 Analytics with LEO CDP
[Notes] Customer 360 Analytics with LEO CDP[Notes] Customer 360 Analytics with LEO CDP
[Notes] Customer 360 Analytics with LEO CDP
 
Leo CDP - Pitch Deck
Leo CDP - Pitch DeckLeo CDP - Pitch Deck
Leo CDP - Pitch Deck
 
LEO CDP - What's new in 2022
LEO CDP  - What's new in 2022LEO CDP  - What's new in 2022
LEO CDP - What's new in 2022
 
Lộ trình triển khai LEO CDP cho ngành bất động sản
Lộ trình triển khai LEO CDP cho ngành bất động sảnLộ trình triển khai LEO CDP cho ngành bất động sản
Lộ trình triển khai LEO CDP cho ngành bất động sản
 
Why is LEO CDP important for digital business ?
Why is LEO CDP important for digital business ?Why is LEO CDP important for digital business ?
Why is LEO CDP important for digital business ?
 
From Dataism to Customer Data Platform
From Dataism to Customer Data PlatformFrom Dataism to Customer Data Platform
From Dataism to Customer Data Platform
 
Data collection, processing & organization with USPA framework
Data collection, processing & organization with USPA frameworkData collection, processing & organization with USPA framework
Data collection, processing & organization with USPA framework
 
Part 1: Introduction to digital marketing technology
Part 1: Introduction to digital marketing technologyPart 1: Introduction to digital marketing technology
Part 1: Introduction to digital marketing technology
 
Why is Customer Data Platform (CDP) ?
Why is Customer Data Platform (CDP) ?Why is Customer Data Platform (CDP) ?
Why is Customer Data Platform (CDP) ?
 
How to build a Personalized News Recommendation Platform
How to build a Personalized News Recommendation PlatformHow to build a Personalized News Recommendation Platform
How to build a Personalized News Recommendation Platform
 
How to grow your business in the age of digital marketing 4.0
How to grow your business  in the age of digital marketing 4.0How to grow your business  in the age of digital marketing 4.0
How to grow your business in the age of digital marketing 4.0
 
Video Ecosystem and some ideas about video big data
Video Ecosystem and some ideas about video big dataVideo Ecosystem and some ideas about video big data
Video Ecosystem and some ideas about video big data
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Open OTT - Video Content Platform
Open OTT - Video Content PlatformOpen OTT - Video Content Platform
Open OTT - Video Content Platform
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
 
Introduction to Recommendation Systems (Vietnam Web Submit)
Introduction to Recommendation Systems (Vietnam Web Submit)Introduction to Recommendation Systems (Vietnam Web Submit)
Introduction to Recommendation Systems (Vietnam Web Submit)
 

Recently uploaded

Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 

Recently uploaded (20)

Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 

An NSA Big Graph experiment

  • 1. An NSA Big Graph experiment Paul Burkhardt, Chris Waring U.S. National Security Agency Research Directorate - R6 Technical Report NSA-RD-2013-056002v1 May 20, 2013 Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 2. NSA-RD-2013-056001v1 Graphs are everywhere! A graph is a collection of binary relationships, i.e. networks of pairwise interactions including social networks, digital networks. . . road networks utility grids internet protein interactomes M. tuberculosis Vashisht, PLoS ONE 7(7), 2012 Brain network of C. elegans Watts, Strogatz, Nature 393(6684), 1998 Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 3. NSA-RD-2013-056001v1 Scale of the first graph Nearly 300 years ago the first graph problem consisted of 4 vertices and 7 edges—Seven Bridges of K¨onigsberg problem. Crossing the River Pregel Is it possible to cross each of the Seven Bridges of K¨onigsberg exactly once? Not too hard to fit in “memory”. Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 4. NSA-RD-2013-056001v1 Scale of real-world graphs Graph scale in current Computer Science literature On order of billions of edges, tens of gigabytes. AS by Skitter (AS-Skitter) — internet topology in 2005 (n = router, m = traceroute) LiveJournal (LJ) — social network (n = members, m = friendship) U.S. Road Network (USRD) — road network (n = intersections, m = roads) Billion Triple Challenge (BTC) — RDF dataset 2009 (n = object, m = relationship) WWW of UK (WebUK) — Yahoo Web spam dataset (n = pages, m = hyperlinks) Twitter graph (Twitter) — Twitter network1 (n = users, m = tweets) Yahoo! Web Graph (YahooWeb) — WWW pages in 2002 (n = pages, m = hyperlinks) Popular graph datasets in current literature n (vertices in millions) m (edges in millions) size AS-Skitter 1.7 11 142 MB LJ 4.8 69 337.2 MB USRD 24 58 586.7 MB BTC 165 773 5.3 GB WebUK 106 1877 8.6 GB Twitter 42 1470 24 GB YahooWeb 1413 6636 120 GB 1 http://an.kaist.ac.kr/traces/WWW2010.html Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 5. NSA-RD-2013-056001v1 Big Data begets Big Graphs Increasing volume,velocity,variety of Big Data are significant challenges to scalable algorithms. Big Data Graphs How will graph applications adapt to Big Data at petabyte scale? Ability to store and process Big Graphs impacts typical data structures. Orders of magnitude kilobyte (KB) = 210 terabyte (TB) = 240 megabyte (MB) = 220 petabyte (PB) = 250 gigabyte (GB) = 230 exabyte (EB) = 260 Undirected graph data structure space complexity Θ(bytes) × 8 >< >: Θ(n2 ) adjacency matrix Θ(n + 4m) adjacency list Θ(4m) edge list Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 6. NSA-RD-2013-056001v1 Social scale. . . 1 billion vertices, 100 billion edges 111 PB adjacency matrix 2.92 TB adjacency list 2.92 TB edge list Twitter graph from Gephi dataset (http://www.gephi.org) Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 7. NSA-RD-2013-056001v1 Web scale. . . 50 billion vertices, 1 trillion edges 271 EB adjacency matrix 29.5 TB adjacency list 29.1 TB edge list Internet graph from the Opte Project (http://www.opte.org/maps) Web graph from the SNAP database (http://snap.stanford.edu/data) Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 8. NSA-RD-2013-056001v1 Brain scale. . . 100 billion vertices, 100 trillion edges 2.08 mNA · bytes2 (molar bytes) adjacency matrix 2.84 PB adjacency list 2.84 PB edge list Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 2 NA = 6.022 × 1023 mol−1 Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 9. NSA-RD-2013-056001v1 Benchmarking scalability on Big Graphs Big Graphs challenge our conventional thinking on both algorithms and computer architecture. New Graph500.org benchmark provides a foundation for conducting experiments on graph datasets. Graph500 benchmark Problem classes from 17 GB to 1 PB — many times larger than common datasets in literature. Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 10. NSA-RD-2013-056001v1 Graph algorithms are challenging Difficult to parallelize. . . irregular data access increases latency skewed data distribution creates bottlenecks giant component high degree vertices Increased size imposes greater. . . latency resource contention (i.e. hot-spotting) Algorithm complexity really matters! Run-time of O(n2) on a trillion node graph is not practical! Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 11. NSA-RD-2013-056001v1 Problem: How do we store and process Big Graphs? Conventional approach is to store and compute in-memory. SHARED-MEMORY Parallel Random Access Machine (PRAM) data in globally-shared memory implicit communication by updating memory fast-random access DISTRIBUTED-MEMORY Bulk Synchronous Parallel (BSP) data distributed to local, private memory explicit communication by sending messages easier to scale by adding more machines Paul Burkhardt, Chris Waring An NSA Big Graph experiment
  • 12. NSA-RD-2013-056001v1 Memory is fast but. . . Algorithms must exploit computer memory hierarchy. designed for spatial and temporal locality registers, L1,L2,L3 cache, TLB, pages, disk. . . great for unit-stride access common in many scientific codes, e.g. linear algebra But common graph algorithm implementations have. . . lots of random access to memory causing. . . many cache and TLB misses Paul Burkhardt, Chris Waring An NSA Big Graph experiment
• 13–14. Poor locality increases latency. . .
  QUESTION: What is the memory throughput with a 90% TLB hit rate and a 0.01% page-fault rate on a TLB miss?
  Effective memory access time: T_n = p_n l_n + (1 - p_n) T_{n-1}
  Example: TLB = 20 ns, RAM = 100 ns, DISK = 10 ms (10 × 10^6 ns)
    T_2 = p_2 l_2 + (1 - p_2)(p_1 l_1 + (1 - p_1) T_0)
        = 0.9 (TLB + RAM) + 0.1 (0.9999 (TLB + 2 RAM) + 0.0001 DISK)
        = 0.9 (120 ns) + 0.1 (0.9999 (220 ns) + 1000 ns)
        ≈ 230 ns
  ANSWER: 33 MB/s (a quick arithmetic check follows below)
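A quick check of the arithmetic above, written as a small Java program. The 8-byte word size used to turn 230 ns per access into roughly 33 MB/s is an assumption; the slide does not state it.

```java
// Sketch: verify the effective-access-time example from the slide.
// Assumption: 8 bytes transferred per memory access when computing throughput.
public class EffectiveAccessTime {
    public static void main(String[] args) {
        double TLB = 20, RAM = 100, DISK = 10e6;   // latencies in nanoseconds
        double pTlbHit = 0.90;                     // TLB hit rate
        double pPageHit = 0.9999;                  // page present on a TLB miss

        // T1: access time on a TLB miss (page-table walk, possibly a page fault)
        double t1 = pPageHit * (TLB + 2 * RAM) + (1 - pPageHit) * DISK;
        // T2: overall effective access time
        double t2 = pTlbHit * (TLB + RAM) + (1 - pTlbHit) * t1;

        double bytesPerAccess = 8;                 // assumed word size
        double mbPerSec = bytesPerAccess / (t2 * 1e-9) / (1 << 20);
        System.out.printf("T2 = %.0f ns, throughput ~ %.0f MB/s%n", t2, mbPerSec);
    }
}
```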
• 15. If it fits. . .
  Graph problems that fit in memory can leverage excellent advances in architecture and libraries:
  - Cray XMT2, designed for latency hiding
  - SGI UV2, designed for large, cache-coherent shared memory
  - a body of literature and libraries:
    - Parallel Boost Graph Library (PBGL), Indiana University
    - Multithreaded Graph Library (MTGL), Sandia National Labs
    - GraphCT/STINGER, Georgia Tech
    - GraphLab, Carnegie Mellon University
    - Giraph, Apache Software Foundation
  But some graphs do not fit in memory. . .
• 16. We can add more memory but. . .
  Memory capacity is limited by:
  - the number of CPU pins, memory controller channels, and DIMMs per channel
  - memory bus width (parallel traces of the same length, one line per bit)
  Globally-shared memory is limited by:
  - CPU address space
  - cache coherency
• 17. Larger systems, greater latency. . .
  Increasing memory can increase latency:
  - traverse more memory addresses
  - a larger system with greater physical distance between machines
  Light is only so fast (but still faster than neutrinos!): it takes approximately 1 nanosecond for light to travel 0.30 meters.
  Latency causes significant inefficiency in new CPU architectures.
  IBM PowerPC A2: a single 1.6 GHz PowerPC A2 can perform 204.8 operations per nanosecond!
• 18. Easier to increase capacity using disks
  Current Intel Xeon E5 architectures:
  - 384 GB max. per CPU (4 channels × 3 DIMMs × 32 GB)
  - 64 TB max. globally-shared memory (46-bit address space)
  - 3881 dual Xeon E5 motherboards to store the Brain Graph: 98 racks
  Disk capacity is not unlimited, but it is higher than memory:
  - the largest-capacity HDD on the market is 4 TB
  - 728 of them store the Brain Graph, which fits in 5 racks (4 drives per chassis)
  Disk alone is not enough: applications will still require memory for processing. (A sizing check follows below.)
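A back-of-the-envelope check of the board and drive counts above. The 32 bytes per edge (an edge list of 4m entries at 8 bytes each, following the sizing convention used earlier in the deck) is an assumption; with it, the counts match the slide.

```java
// Sketch: reproduce the Brain Graph sizing (100 trillion edges).
// Assumption: edge list of 4m entries at 8 bytes each (32 bytes per edge);
// board memory and HDD capacity are counted in binary units (GiB/TiB).
public class BrainGraphSizing {
    public static void main(String[] args) {
        double edges = 100e12;
        double edgeListBytes = edges * 4 * 8;           // ~3.2e15 bytes (~2.84 PiB)

        double boardBytes = 2 * 384.0 * (1L << 30);     // dual Xeon E5: 2 x 384 GiB
        double hddBytes = 4.0 * (1L << 40);             // one 4 TiB drive

        System.out.printf("edge list: %.2f PiB%n", edgeListBytes / Math.pow(2, 50));
        System.out.printf("dual-E5 boards needed: %.0f%n", Math.ceil(edgeListBytes / boardBytes));
        System.out.printf("4 TB drives needed: %.0f%n", Math.ceil(edgeListBytes / hddBytes));
    }
}
```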
• 19. Top supercomputer installations
  The largest supercomputer installations do not have enough memory to process the Brain Graph (3 PB)!
  Titan, Cray XK7 at ORNL (#1 Top500 in 2012):
  - 0.5 million cores
  - 710 TB memory
  - 8.2 megawatts³
  - 4300 sq. ft. (an NBA basketball court is 4700 sq. ft.)
  Sequoia, IBM Blue Gene/Q at LLNL (#1 Graph500 in 2012):
  - 1.5 million cores
  - 1 PB memory
  - 7.9 megawatts³
  - 3000 sq. ft.
  Electrical power cost: at 10 cents per kilowatt-hour, about $7 million per year to keep the lights on!
  ³ Power specs from http://www.top500.org/list/2012/11
• 20. Cloud architectures can cope with graphs at Big Data scales
  Cloud technologies are designed for Big Data problems:
  - scalable to massive sizes
  - fault-tolerant; restarting algorithms on Big Graphs is expensive
  - simpler programming model: MapReduce
  Big Graphs from real data are not static. . .
• 21. Distributed key,value repositories
  Much better for storing a distributed, dynamic graph than a distributed filesystem like HDFS:
  - create an index on edges for fast random access, and. . .
  - sort edges for efficient block-sequential access
  - updates can be fast, but. . .
  - not transactional; eventually consistent
• 22. NSA BigTable: Apache Accumulo
  The Google BigTable paper inspired a small group of NSA researchers to develop an implementation with cell-level security.
  Apache Accumulo: open-sourced in 2011 under the Apache Software Foundation.
• 23. Using Apache Accumulo for Big Graphs
  Accumulo records are flexible and can be used to store graphs with typed edges and weights:
  - store the graph as a distributed edge list in an Edge Table
  - edges are natural key,value pairs, i.e. vertex pairs
  - updates to edges happen in memory and are immediately available to queries, and. . .
  - updates are written to write-ahead logs for fault tolerance
  Apache Accumulo key,value record:
    KEY = Row ID | Column (Family, Qualifier, Visibility) | Timestamp
    VALUE
  (A write sketch follows below.)
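To make the record layout concrete, here is a minimal sketch of writing one typed, weighted edge with the Accumulo Java client. The instance name, ZooKeeper host, table name, schema (row = source vertex, family = edge type, qualifier = destination vertex), visibility label, and weight encoding are all illustrative assumptions, not the authors' actual Edge Table layout.

```java
// Sketch: write one edge u -> v into a hypothetical Accumulo "edges" table.
// All names and the schema are assumptions made for illustration.
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class EdgeTableWriter {
    public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
                .getConnector("user", new PasswordToken("secret"));

        BatchWriter writer = conn.createBatchWriter("edges", new BatchWriterConfig());
        Mutation m = new Mutation("vertex_u");                  // row = source vertex
        m.put("follows", "vertex_v",                            // family = edge type, qualifier = destination
              new ColumnVisibility("public"),                   // cell-level visibility label
              new Value("3".getBytes()));                       // value = edge weight
        writer.addMutation(m);
        writer.close();                                         // flush to the tablet servers
    }
}
```

The update is buffered in memory on the server side and recorded in the write-ahead log, which is what makes it immediately visible to queries yet recoverable after a failure, as the slide describes.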
• 24. But for bulk processing, use MapReduce
  Big Graph processing is suitable for MapReduce. . .
  - large, bulk writes to HDFS are faster (no need to sort)
  - temporary scratch space (local writes)
  - greater control over parallelization of algorithms
• 25. Quick overview of MapReduce
  MapReduce processes data as key,value pairs in three steps (a minimal job sketch follows below):
  1. MAP: map tasks independently process blocks of data in parallel and output key,value pairs; map tasks execute a user-defined "map" function.
  2. SHUFFLE: sorts key,value pairs and distributes them to reduce tasks; ensures the value set for a key is processed by one reduce task; the framework executes a user-defined "partitioner" function.
  3. REDUCE: reduce tasks process their assigned (key, value set) in parallel; reduce tasks execute a user-defined "reduce" function.
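The minimal Hadoop job below is one way to see these three steps in code: it computes vertex degrees from a plain edge list. The input layout (one whitespace-separated "u v" pair per line) and all names are illustrative assumptions; this is not the benchmark code described later in the deck.

```java
// Sketch: degree count over an edge list with Hadoop MapReduce (new API).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DegreeCount {

    // MAP: emit (vertex, 1) for each endpoint of every edge.
    public static class EdgeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] endpoints = line.toString().trim().split("\\s+");
            if (endpoints.length == 2) {
                ctx.write(new Text(endpoints[0]), ONE);
                ctx.write(new Text(endpoints[1]), ONE);
            }
        }
    }

    // REDUCE: sum the counts for each vertex to obtain its degree.
    public static class DegreeReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text vertex, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long degree = 0;
            for (LongWritable c : counts) degree += c.get();
            ctx.write(vertex, new LongWritable(degree));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "degree-count");
        job.setJarByClass(DegreeCount.class);
        job.setMapperClass(EdgeMapper.class);
        job.setCombinerClass(DegreeReducer.class);   // summing is associative, so a combiner is safe
        job.setReducerClass(DegreeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The shuffle between EdgeMapper and DegreeReducer is exactly the sort-and-distribute step described above; the default hash partitioner decides which reduce task sees each vertex.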
• 26. MapReduce is not great for iterative algorithms
  Iterative algorithms are implemented as a sequence of MapReduce jobs, i.e. rounds. But iteration in MapReduce is expensive:
  - temporary data is written to disk
  - the sort/shuffle may be unnecessary for each round
  - framework overhead for scheduling tasks
• 27. Good principles for MapReduce algorithms
  Be prepared to Think in MapReduce. . .
  - avoid iteration and minimize the number of rounds
  - limit child JVM memory; the number of concurrent tasks is limited by per-machine RAM
  - set IO buffers carefully to avoid spills (requires memory!)
  - pick a good partitioner
  - write raw comparators
  - leverage compound keys: minimizes hot-spots by distributing on key; secondary sort on compound keys is almost free
  (A partitioner sketch for compound keys follows below.)
  Round-memory tradeoff: constant, O(1), in memory and rounds.
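As a small illustration of the compound-key advice, the partitioner below distributes records on only the vertex part of a "vertex|tag" key, so one reducer sees all records for a vertex while the tag remains available for a secondary sort. The key format is an assumption made for this sketch.

```java
// Sketch: partition on the vertex prefix of a compound "vertex|tag" key.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

public class VertexPartitioner extends Partitioner<Text, Writable> {
    @Override
    public int getPartition(Text compoundKey, Writable value, int numPartitions) {
        // Hash only the vertex component so all tags for a vertex go to one reducer.
        String vertex = compoundKey.toString().split("\\|", 2)[0];
        return (vertex.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Registering it with job.setPartitionerClass(VertexPartitioner.class), plus a grouping comparator that also ignores the tag, is what makes the secondary sort on compound keys "almost free": the framework's existing sort does the ordering work.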
• 28. Best of both worlds: MapReduce and Accumulo
  Store the Big Graph in an Accumulo Edge Table. . .
  - great for updating a big, typed multi-graph
  - more timely (lower-latency) access to the graph
  Then extract a sub-graph from the Edge Table and use MapReduce for graph analysis (a scan sketch follows below).
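Continuing the hypothetical schema from the earlier write sketch, extracting a vertex's adjacencies before handing them to MapReduce might look like the scan below; again, table name, schema, and authorizations are assumptions, not the authors' code.

```java
// Sketch: scan one vertex's adjacency list out of the hypothetical "edges" table.
import java.util.Map;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class AdjacencyScan {
    public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
                .getConnector("user", new PasswordToken("secret"));

        Scanner scan = conn.createScanner("edges", new Authorizations("public"));
        scan.setRange(Range.exact("vertex_u"));      // every edge whose source is vertex_u
        for (Map.Entry<Key, Value> entry : scan) {
            String edgeType = entry.getKey().getColumnFamily().toString();
            String neighbor = entry.getKey().getColumnQualifier().toString();
            System.out.println(edgeType + " -> " + neighbor + " (weight " + entry.getValue() + ")");
        }
        scan.close();
    }
}
```

Because the table keeps edges sorted by row, this adjacency read is a sequential scan of one key range rather than scattered random reads, which is the access pattern the key,value design above is meant to enable.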
• 29. How well does this really work?
  Prove it on the industry benchmark, Graph500.org!
  - Breadth-First Search (BFS) on an undirected R-MAT graph
  - count Traversed Edges per Second (TEPS)
  - n = 2^scale, m = 16 × n
  [Figure: Graph500 problem sizes, storage (terabytes) vs. scale]
  Graph500 problem sizes:
    Class    Scale   Storage
    Toy      26      17 GB
    Mini     29      140 GB
    Small    32      1 TB
    Medium   36      17 TB
    Large    39      140 TB
    Huge     42      1.1 PB
  (A storage-sizing sketch follows below.)
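A quick sizing sketch for the table above. The 16 bytes per edge (two 8-byte vertex IDs in a raw edge list) is an assumption, but it reproduces the published storage figures; the benchmark's own accounting may differ.

```java
// Sketch: Graph500 problem sizes assuming m = 16 * 2^scale edges at 16 bytes each,
// i.e. bytes = 2^(scale + 8).
public class Graph500Sizes {
    public static void main(String[] args) {
        String[] classes = {"Toy", "Mini", "Small", "Medium", "Large", "Huge"};
        int[] scales = {26, 29, 32, 36, 39, 42};
        for (int i = 0; i < scales.length; i++) {
            double bytes = Math.pow(2, scales[i] + 8);   // 16 edges per vertex * 16 bytes per edge
            System.out.printf("%-6s scale %2d: %10.1f GB (%8.2f TB)%n",
                    classes[i], scales[i], bytes / 1e9, bytes / 1e12);
        }
    }
}
```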
• 30. Walking a ridiculously Big Graph. . .
  Everything is harder at scale!
  - performance is dominated by the ability to acquire adjacencies
  - requires a specialized MapReduce partitioner to load-balance queries from MapReduce to the Accumulo Edge Table
  Space-efficient Breadth-First Search is critical!
  - create vertex,distance records
  - slide a window over the distance records to minimize input at each iteration k
  - reduce tasks get adjacencies from the Edge Table but only if. . .
  - all distance values for a vertex, v, are equal to k
• 31–34. Graph500 Experiment
  Graph500 Huge class (scale 42):
  - 2^42 (4.40 trillion) vertices
  - 2^46 (70.4 trillion) edges
  - 1 petabyte
  Cluster specs:
  - 1200 nodes
  - 2 Intel quad-core CPUs per node
  - 48 GB RAM per node
  - 7200 RPM SATA drives
  The Huge problem is 19.5× larger than cluster memory:
  - linear performance from 1 trillion to 70 trillion edges. . .
  - despite multiple hardware failures!
  [Figure: "Graph500 Scalability Benchmark": traversed edges (trillions) vs. time (hours) for problem classes at scales 36, 39, and 42 (17 TB, 140 TB, and 1126 TB); cluster memory = 57.6 TB]
  (A quick check of the memory ratio appears below.)
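A one-line check of the memory ratio quoted above, under the same storage figure as the problem-size table (about 2^50 bytes for the Huge class).

```java
// Sketch: cluster memory vs. the Huge problem size.
public class MemoryRatio {
    public static void main(String[] args) {
        double clusterMemory = 1200 * 48e9;        // 1200 nodes x 48 GB
        double hugeProblem = Math.pow(2, 50);      // ~1.13e15 bytes (~1.1 PB)
        System.out.printf("cluster memory = %.1f TB%n", clusterMemory / 1e12);
        System.out.printf("problem / memory = %.1fx%n", hugeProblem / clusterMemory);
    }
}
```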
• 35. Future work
  What are the tradeoffs between disk- and memory-based Big Graph solutions?
  - power, space, and cooling efficiency
  - software development and maintenance
  Can a hybrid approach be viable?
  - memory-based processing for subgraphs or incremental updates
  - disk-based processing for global analysis
  Advances in memory and disk technologies will blur the lines. Will solid-state disks be another level of memory?