
GPUs in Big Data - StampedeCon 2014


At StampedeCon 2014, John Tran of NVIDIA presented "GPUs in Big Data." Modern graphics processing units (GPUs) are massively parallel general-purpose processors that are taking Big Data by storm. In terms of power efficiency, compute density, and scalability, it is clear now that commodity GPUs are the future of parallel computing. In this talk, we will cover diverse examples of how GPUs are revolutionizing Big Data in fields such as machine learning, databases, genomics, and other computational sciences.


  1. BIG DATA IN GPUS | John Tran | StampedeCon 2014, May 29, 2014, St. Louis, MO
  2. “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” —Seymour Cray
  3. Example CPU: Xeon E5-2687W: 2.27 B transistors; 8 cores, 16 threads @ 3.1 GHz; 0.35 SP TFLOPS; 0.17 DP TFLOPS; up to 256 GB DDR3 @ 1600 MHz; 51.2 GB/s memory BW; 150 W; 20 MB L3 cache; strong single-thread performance (branch prediction, out-of-order execution)
  4. Example GPU: Tesla K40: 7.1 B transistors; 2880 cores, 30,720 threads @ 745 MHz; 4.29 SP TFLOPS; 1.43 DP TFLOPS; 12 GB GDDR5 @ 3 GHz; 288 GB/s memory BW; 235 W; PCIe Gen3 x16 (12 GB/s)
  5. Math and memory peak throughput (chart): SP TFLOPS: Xeon E5-2687W 0.35 vs. Tesla K40 4.29; DP TFLOPS: 0.17 vs. 1.43; memory BW: 51.2 GB/s vs. 288 GB/s
  6. The Chickens Are Winning: parallel computing is no longer “the future”; if you are not parallel, you are already behind. GPUs win in performance, power, and cost, and each of those translates to $$.
  7. Where did these GPUs come from?
  8. OK, but what about computing?
  9. All Computing is Parallel Computing
  10. Parallel Computing: CPU vs. GPU
  11. The Basic Idea – Accelerated Computing: the compute-intensive functions of the application code run on the GPU via CUDA, while the rest of the sequential code stays on the CPU.
  12. Quick CUDA C example

      Standard C code:

      void saxpy(int n, float a, float *x, float *y)
      {
          for (int i = 0; i < n; ++i)
              y[i] = a*x[i] + y[i];
      }

      int N = 1<<20;
      // Perform SAXPY on 1M elements
      saxpy(N, 2.0, x, y);

      Parallel CUDA C code:

      __global__ void saxpy(int n, float a, float *x, float *y)
      {
          int i = blockIdx.x*blockDim.x + threadIdx.x;
          if (i < n)
              y[i] = a*x[i] + y[i];
      }

      int N = 1<<20;
      cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
      // Perform SAXPY on 1M elements
      saxpy<<<4096,256>>>(N, 2.0, d_x, d_y);
      cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
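The slide's snippet omits the setup code. A complete, compilable version might look like the following; this is a sketch rather than part of the talk, and the host/device pointer names and initial values are illustrative. The key points are that device buffers must be allocated with cudaMalloc, copy sizes are given in bytes, and the kernel is launched on the device pointers:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                        // guard: more threads may be launched than elements
        y[i] = a * x[i] + y[i];
}

int main() {
    int N = 1 << 20;                  // 1M elements
    size_t bytes = N * sizeof(float);

    // Host buffers with known initial values
    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Device buffers
    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    // 4096 blocks of 256 threads cover all 2^20 elements exactly
    saxpy<<<4096, 256>>>(N, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);      // 2*1 + 2 = 4 for these inputs

    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}
```

For an N that is not a multiple of the block size, the usual idiom is to launch (N + 255) / 256 blocks and rely on the in-kernel bounds check, which is why the guard is there.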
  13. How else can you program it? Libraries: Thrust, cuBLAS, cuSPARSE, cuFFT, NPP, cuRAND. Directives: OpenACC. Languages: CUDA C, CUDA C++, Thrust, Python, Fortran, a C++ standard proposal, MATLAB. Learn: search “get cuda”; Udacity, Coursera.
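As a concrete taste of the library route, the same SAXPY can be written with Thrust in a few lines, with no explicit kernel or memory copies. This is a sketch, not from the slides; it uses Thrust's placeholder expressions, and the initial values are arbitrary:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    using namespace thrust::placeholders;
    const int N = 1 << 20;
    thrust::device_vector<float> x(N, 1.0f);  // storage lives on the GPU
    thrust::device_vector<float> y(N, 2.0f);

    // y = 2.0f * x + y; Thrust generates and launches the kernel
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                      2.0f * _1 + _2);
    return 0;
}
```

The device_vector constructor and destructor handle allocation, initialization, and cleanup, which is most of the boilerplate in the hand-written CUDA version.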
  14. How does this matter to Big Data?
  15. Shazam: 90 M monthly active users; 17 M tracks tagged per day; 27 M tracks in the database. “GPUs enable us to handle our tremendous processing needs at a substantial cost savings, delivering twice the performance per dollar compared to a CPU-based system.” –Jason Titus, CTO, Shazam
  16. Deep Neural Networks for image classification
  17. Google Datacenter: 1,000 CPU servers, 600 kW, $5,000,000. Stanford AI Lab: 3 GPU-accelerated servers, 3.6 kW, $21,000. (A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, “Deep learning with COTS HPC systems,” ICML 2013)
  18. Speech Recognition
  19. The Data-Scope at JHU: 5 PB of science data (in 2010). “The Data-Scope will allow us to mine out relationships among data that already exist but that we can’t yet handle and to sift discoveries from what seems like an overwhelming flow of information. New discoveries will definitely emerge this way. There are relationships and patterns that we just cannot fathom buried in that onslaught of data. Data-Scope will tease these out.” –Alex Szalay, JHU
  20. HIV Capsid
  21. Beating-Heart Surgery: a patient stands to lose one point of IQ for every 10 minutes the heart is stopped, yet only ~2% of heart surgeons will operate on a beating heart. The GPU enables real-time motion compensation that virtually stops the beating heart for the surgeon. (Courtesy Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier)
  22. NVBIO
  23. Final Thoughts: Parallel computing is here; re-think parallel or get left behind. Scale up before scaling out: a single GPU increases parallelism by several orders of magnitude, so ask whether you really need a cluster. GPUs are the most efficient solution for parallel problems, in both perf/$ and perf/Watt.
  24. All Computing is Parallel Computing