FPGA Accelerated Genomics Using AWS F1 Instances

FPGA Accelerated Computing Using
Amazon EC2 F1 Instances
David Pellerin
Head of WW Business Development, Infotech, AWS
Pieter van Rooyen
CEO and Founder, Edico Genome
Rami Mehio
VP of Engineering, Edico Genome
CMP308
November 30, 2017
AWS re:INVENT

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
WHY USE ACCELERATED COMPUTING?
PARALLELISM INCREASES THROUGHOUT …
CPU: high speed, highly flexible
GPU/FPGA: high throughput, high efficiency
GPUs and FPGAs can provide massive parallelism and higher efficiency than
CPUs for many categories of applications

ACCELERATED COMPUTING ON AWS
P3: GPU-accelerated computing (NVIDIA Tesla V100 GPU)
§ Enables a high degree of parallelism: each GPU has thousands of cores
§ Consistent, well-documented set of APIs (CUDA, OpenACC, OpenCL)
§ Supported by a wide variety of ISVs and open source frameworks
F1: FPGA-accelerated computing (Xilinx UltraScale+ FPGA)
§ Massively parallel: each FPGA includes millions of parallel system logic cells
§ Flexible: no fixed instruction set, can implement wide or narrow datapaths
§ Programmable using available, cloud-based FPGA development tools

PARALLEL PROCESSING IN GPU AND FPGA
A GPU is effective at processing the same instruction in
parallel, for example, calculating pixel values in parallel
for graphics shading, or running many parallel financial
computations. A GPU has a well-defined instruction set
and fixed word sizes.
An FPGA is effective at processing the same or
different instructions in parallel, for example, creating a
complex pipeline of parallel, multistage operations on a
video stream, or performing a sequence of dependent
calculations and data manipulations for genomics
processing. An FPGA does not have a predefined
instruction set or a fixed data width.
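The contrast above can be sketched in plain Python. This is only a software analogy, not GPU or FPGA code: `data_parallel` applies one operation to every element (the GPU style), while `pipeline` chains different stages so that each element flows through a sequence of dependent steps (the FPGA style). All function names here are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def data_parallel(op: Callable[[int], int], values: Iterable[int]) -> List[int]:
    """GPU-style analogy: the same operation applied to every element."""
    return [op(v) for v in values]


def pipeline(stages: List[Callable[[int], int]], values: Iterable[int]) -> Iterator[int]:
    """FPGA-style analogy: each element flows through a chain of different stages."""
    stream: Iterable[int] = values
    for stage in stages:
        # Each stage lazily consumes the output of the previous stage.
        stream = map(stage, stream)
    return iter(stream)


pixels = [10, 20, 30]

# One operation, many data elements.
brightened = data_parallel(lambda p: p + 5, pixels)

# Three different dependent stages applied in sequence to each element.
staged = list(pipeline([lambda x: x * 2, lambda x: x - 1, lambda x: x % 7], pixels))
```

On real hardware the pipeline stages would run concurrently on different elements; the sketch only shows the dataflow shape.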

PARALLEL PROCESSING IN GPU AND FPGA
CPU
• Tens to hundreds of processing cores
• Pre-defined instruction set and datapath widths
• Optimized for general purpose computing
GPU
• Thousands of processing cores
• Pre-defined instruction set and datapath widths
• Highly effective at parallel execution
FPGA
• Millions of programmable digital logic cells
• No predefined instruction set or datapath widths
• Hardware-timed execution, massively parallel
[Diagram: block layouts showing DRAM, Control, ALU, and Cache elements]

FPGA ACCELERATION USING F1: GOALS
§ Make FPGAs available as standard AWS instances to a large community of developers and to millions of potential customers
§ Simplify the development process by providing cloud-based FPGA and C/C++ software development flows
§ Allow developers to focus on algorithm design by abstracting FPGA I/O using well-defined interfaces
§ Provide a Marketplace for FPGA applications, providing more choice and easy access for all AWS customers

FPGA ACCELERATION USING F1
Launch Instance and Load AFI
[Diagram labels: Amazon Machine Image (AMI); Amazon FPGA Image (AFI); EC2 F1 Instance; CPU Application on F1; PCIe; DDR Controllers; DDR-4 Attached Memory]
§ An F1 instance can have any number of AFIs
§ An AFI can be loaded into the FPGA in seconds

F1 FPGA INSTANCE TYPES
§ Up to eight Xilinx UltraScale+ 16nm VU9P FPGA devices in a single instance
§ The f1.16xlarge size provides:
  § Eight FPGAs, each with over two million customer-accessible FPGA programmable logic cells and over 5000 programmable DSP blocks
  § Each of the eight FPGAs has four DDR-4 interfaces, with each interface accessing a 16 GiB, 72-bit wide, ECC-protected memory

Instance Size | FPGAs | DDR-4 (GiB) | vCPUs | Instance Memory (GiB) | NVMe Instance Storage (GB) | Network Bandwidth
f1.2xlarge | 1 | 4 x 16 | 8 | 122 | 1 x 470 | Up to 10 Gbps
f1.16xlarge | 8 | 32 x 16 | 64 | 976 | 4 x 940 | 25 Gbps
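The DDR-4 column follows directly from the per-FPGA figures on this slide. A small Python check, using only the numbers stated above (the function and constant names are illustrative):

```python
GIB_PER_DDR4_INTERFACE = 16   # from the slide: each interface accesses 16 GiB
DDR4_INTERFACES_PER_FPGA = 4  # from the slide: four DDR-4 interfaces per FPGA

# FPGA counts per instance size, taken from the table.
FPGAS_PER_INSTANCE = {"f1.2xlarge": 1, "f1.16xlarge": 8}


def fpga_memory_gib(instance_size: str) -> int:
    """Total FPGA-attached DDR-4 memory (GiB) for an instance size."""
    fpgas = FPGAS_PER_INSTANCE[instance_size]
    return fpgas * DDR4_INTERFACES_PER_FPGA * GIB_PER_DDR4_INTERFACE


totals = {size: fpga_memory_gib(size) for size in FPGAS_PER_INSTANCE}
```

This gives 64 GiB of FPGA-attached memory for f1.2xlarge (4 x 16) and 512 GiB for f1.16xlarge (32 x 16), separate from the instance memory column.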

CUSTOM LOGIC AND THE FPGA SHELL
AWS FPGA Shell: provides standard, pre-tested, and secure I/O components, allowing FPGA developers to focus on their differentiating value. The FPGA Shell removes the need to develop I/O related FPGA hardware.
Software Development Kit (SDK): provides required software interfaces for FPGA management and communication
Hardware Development Kit (HDK): provides required FPGA Shell components
[Diagram labels: SDK; HDK; Software application; FPGA; Custom Logic]
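As an illustration of the "FPGA management" role of the SDK, the sketch below builds command lines for the FPGA image management tools that ship with the AWS FPGA SDK (`fpga-load-local-image`, `fpga-describe-local-image`). It only constructs the commands and does not run them; the helper names are hypothetical, the AGFI ID is a placeholder, and the flags should be checked against the SDK version installed on the instance.

```python
import shlex
from typing import List


def load_afi_command(slot: int, agfi_id: str) -> List[str]:
    """Command to load an AFI (by its global AGFI ID) into an FPGA slot."""
    return ["sudo", "fpga-load-local-image", "-S", str(slot), "-I", agfi_id]


def describe_slot_command(slot: int) -> List[str]:
    """Command to report which image is loaded in an FPGA slot (-H adds headers)."""
    return ["sudo", "fpga-describe-local-image", "-S", str(slot), "-H"]


# Placeholder ID for illustration only; a real AGFI ID comes from AFI creation.
PLACEHOLDER_AGFI = "agfi-0123456789abcdef0"

load_cmd = shlex.join(load_afi_command(0, PLACEHOLDER_AGFI))
describe_cmd = shlex.join(describe_slot_command(0))
```

On an F1 instance these strings could be passed to a shell or to `subprocess.run`; slot numbers run from 0 up to one less than the number of FPGAs in the instance.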

CREATE THE AMAZON FPGA IMAGE (AFI)
GENERATE AN ENCRYPTED AFI USING THE GENERATED DCP
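A minimal sketch of this step using the EC2 `create_fpga_image` API through boto3, assuming the design checkpoint (DCP) archive has already been uploaded to S3. The bucket, key, and image names are hypothetical, and `boto3` is imported lazily so the request-building part runs without AWS credentials.

```python
from typing import Any, Dict


def build_create_afi_request(name: str, bucket: str, dcp_key: str, logs_prefix: str) -> Dict[str, Any]:
    """Build the keyword arguments for the EC2 CreateFpgaImage call."""
    return {
        "Name": name,
        "Description": f"AFI built from s3://{bucket}/{dcp_key}",
        "InputStorageLocation": {"Bucket": bucket, "Key": dcp_key},
        "LogsStorageLocation": {"Bucket": bucket, "Key": logs_prefix},
    }


def create_afi(request: Dict[str, Any], region: str = "us-east-1") -> Dict[str, str]:
    """Submit the request; returns the regional AFI ID and the global AGFI ID."""
    import boto3  # imported here so the sketch loads without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_fpga_image(**request)
    return {"afi": response["FpgaImageId"], "agfi": response["FpgaImageGlobalId"]}


# Hypothetical names for illustration only.
request = build_create_afi_request(
    name="my-accelerator",
    bucket="my-fpga-bucket",
    dcp_key="dcp/my-accelerator.tar",
    logs_prefix="logs/",
)
```

Calling `create_afi(request)` with valid credentials starts AFI generation, which runs asynchronously; the AGFI ID it returns is the identifier used when loading the image into an FPGA slot.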

OPENCL IS AVAILABLE FOR F1
§ Familiar development experience to accelerate C/C++ applications
§ 50+ F1 code examples available that span multiple domains: security, image processing, and accelerated algorithms
§ Already supported on the FPGA Developer AMI, no need to upgrade/install

CREATE THE AMAZON FPGA IMAGE (AFI)
XILINX SDACCEL PROVIDES AN ALTERNATIVE, C/C++/OPENCL BASED DESIGN FLOW

DELIVERING FPGA PARTNER SOLUTIONS VIA AWS MARKETPLACE
Amazon EC2 FPGA deployment via Marketplace
[Diagram labels: Amazon Machine Image (AMI); Amazon FPGA Image (AFI); AWS Marketplace; Customers]
The AFI is secured, encrypted, and dynamically loaded into the FPGA; it can’t be copied or downloaded

F1 USE CASES AND PARTNERS
§ Financial computing
§ Genomics sequencing
§ Image and video processing
§ Big data and machine learning
§ Test and measurement
§ Security, compression
§ Developer tools
§ …and more

AWS MARKETPLACE
DISCOVER, PROCURE, DEPLOY, AND MANAGE SOFTWARE IN THE CLOUD

GETTING STARTED WITH F1
https://github.com/awslabs/aws-fpga-app-notes/tree/master/reInvent17_Developer_Workshop
§ Gain hands-on experience with AWS F1
§ Learn how to develop FPGA-accelerated applications
§ Learn the OpenCL flow and the Xilinx SDAccel development environment

DRAGEN on AWS Marketplace
Pieter van Rooyen, CEO and Founder
Rami Mehio, VP of Engineering

Edico Genome overview
• 50 employees
• World record for fastest genetic diagnosis
• Founded in January 2013
• Located in San Diego, CA
• Patents: 11 issued, 20 pending
• 17 petabytes processed by customers to date
Lead Investors
Qualcomm
Dell EMC
Cloud
App
Major Tech
Partnerships

Genomic big data
By 2025, genomics could well represent the biggest of big data fields
Source: Challenges For Genomics In The Age Of Big Data, July 2015, Forbes
[Chart: data volumes for Twitter, YouTube, Astronomy, and Genomics, with a 1 Zettabyte marker]

Genomic data and Moore’s Law
[Chart: growth curves over 2016 to 2020]
Genomic Data: Doubles Every Seven Months
Moore’s Law: Doubles Every Two Years?
Alternative technologies are needed to address big data challenges
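To make the gap concrete, the two doubling periods can be compared over the 2016 to 2020 span shown on the chart. This is a back-of-the-envelope sketch that assumes steady exponential growth at the stated rates; the function name is illustrative.

```python
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Growth multiple after `months_elapsed` given a fixed doubling period."""
    return 2.0 ** (months_elapsed / doubling_months)


MONTHS = 48  # 2016 to 2020, the span of the chart's axis

genomic_growth = growth_factor(MONTHS, 7)   # data doubling every seven months
moore_growth = growth_factor(MONTHS, 24)    # capacity doubling every two years
```

Under these assumptions, genomic data grows by roughly two orders of magnitude over four years, while a two-year doubling period yields only a 4x gain.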

DRAGEN Complete Suite
Somatic V2: Tumor-Only and Tumor/Normal Analysis
RNA: Transcriptome Analysis with Splice Junction Alignment
Germline V2: Clinical Grade End-to-End BCL→VCF, Including Advanced PCR Error Correction
Available Today!
GATK Best Practices: 100% GATK Concordance
Population: Flexible Family Trio or Large Scale Joint Genotyping Cohort Analysis
VLRD: Virtual Long Read Detection on …
CNV: Copy Number Variant Analysis for Somatic Exome
Methylation: Methyl-Seq or BS-Seq
Available Soon
RNA V2: Transcriptome Analysis with Splice Junction Alignment. Coming Soon: Differential Expression
Acceleration: How do we do it?
The DRAGEN FPGA platform enables massively parallel processing, resulting in revolutionary data analysis capabilities

DRAGEN software/hardware stack
FPGA accelerator is the foundation and the key driver of revolutionary compute+storage platform applications
[Diagram labels, software side: APPLICATION; Application Host Memory; SW Stack with User Interface Layer, HAL, DMA Driver, IO Layer, Pipeline Layer]
[Diagram labels, hardware side: PCIe 3.0 x8 Interface; N channel DMA; Arbiter; CROSSBAR; 4x DDR4 Ctrlr; Accelerator Engines 1 to 4; 4x16 GB DDR4 Memory]
[Diagram axis labels: App specific; Generic]

DRAGEN architecture and hardware port to F1
[Diagram axis label: Specificity]
Architecture key points
• SW HAL to insulate application code from the platform
• Edico DMA SW driver and HW DMA channel to be independent of FPGA device vendor
• Separate HW infrastructure layer from acceleration layer
• Integrate DRAGEN HW infrastructure layer with F1 instance HDK
• Size acceleration clusters for VU9P device
• Trade off cluster size against clock speed
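The first key point, a software HAL that insulates application code from the platform, can be illustrated with a small Python sketch. This is not the DRAGEN HAL: the interface, class names, and loopback backend are all hypothetical, and are shown only to make the layering idea concrete.

```python
from abc import ABC, abstractmethod
from typing import Dict


class AcceleratorHAL(ABC):
    """Hypothetical hardware abstraction layer: the only API the application sees."""

    @abstractmethod
    def dma_write(self, channel: int, data: bytes) -> None:
        """Send data to the accelerator over a DMA channel."""

    @abstractmethod
    def dma_read(self, channel: int, size: int) -> bytes:
        """Read up to `size` bytes of results from a DMA channel."""


class LoopbackHAL(AcceleratorHAL):
    """Stand-in backend that echoes data back, so the sketch runs without hardware."""

    def __init__(self) -> None:
        self._buffers: Dict[int, bytes] = {}

    def dma_write(self, channel: int, data: bytes) -> None:
        self._buffers[channel] = self._buffers.get(channel, b"") + data

    def dma_read(self, channel: int, size: int) -> bytes:
        data = self._buffers.get(channel, b"")
        self._buffers[channel] = data[size:]
        return data[:size]


def run_job(hal: AcceleratorHAL, payload: bytes) -> bytes:
    """Application code: depends on the HAL interface, not on any platform."""
    hal.dma_write(0, payload)
    return hal.dma_read(0, len(payload))


result = run_job(LoopbackHAL(), b"ACGT")
```

Porting to a new platform would then mean adding another `AcceleratorHAL` subclass (for example one backed by an on-site card, another by an F1 instance) while `run_job` stays unchanged.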

DRAGEN run time acceleration over CPU-only solutions

Pipeline | Platform | 30X Whole Human Genome | Exome
Mapping/Aligning | Onsite | 8 min | 1 min
Mapping/Aligning | AWS | 4 min | 30 sec
MAP/A/Sort/Dedup/VC | Onsite | 20 min | 2 min
MAP/A/Sort/Dedup/VC | AWS F1.2X | 59 min | 3 min
MAP/A/Sort/Dedup/VC | AWS F1.16X | 17 min | 1.5 min

Acceleration over CPU Only Normalized by Number of Cores

Platform | Current Times | Acceleration over CPU-only solution | Projected Times | Projected acceleration over CPU-only solution
F1 – 2X | 59 min | 32x | 44 min | 43x
F1 – 16X | 17 min | 26x | 10-13 min | 40x
Onsite | 20 min | 29x | 14 min | 40x
DRAGEN Germline Pipeline: Analysis Time for Genomes
Whole Genome, Exome & Panels, Version 2
Reference download: Hash Table on S3 → Hash Table on Instance Disk
Input file download: FASTQ on S3 → FASTQ on Instance Disk
DRAGEN Execution Time: FASTQ → BAM → VCF/gVCF
Output file upload: BAM/VCF on Instance Disk → BAM/VCF on S3
DRAGEN Genome Pipeline execution: F1.2Xlarge
Whole Genome, Exome & Panels, Version 2
Reference download (10 min): Hash Table on S3 → Hash Table on Instance Disk
Input file download (20 min): FASTQ on S3 → FASTQ on Instance Disk
DRAGEN Execution Time (60 min): FASTQ → BAM → VCF/gVCF
Output file upload (15 min): BAM/VCF on Instance Disk → BAM/VCF on S3

Input streaming: F1.2Xlarge
Reference download (10 min): Hash Table on S3 → Hash Table on Instance Disk
S3 streaming (30 s)
DRAGEN execution time (60 min): FASTQ → BAM → VCF/gVCF
Output file upload (15 min): BAM/VCF on Instance Disk → BAM/VCF on S3

Output file streaming to Amazon S3
Reference download (2 min)
Input S3 streaming (30 s)
DRAGEN execution time (60 min): FASTQ → BAM → VCF/gVCF
Output file streaming (30 s)

Optimized solution on F1.16Xlarge
Reference download (1 min)
Input S3 streaming (30 s)
DRAGEN execution time (17 min): FASTQ → BAM → VCF/gVCF
Output file streaming (30 s)
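The effect of each optimization across the four timeline slides can be tallied as follows. This sketch rests on two assumptions: that each number on the slides belongs to the stage it is assigned to here (inferred from label order), and that the stages run back to back, so the end-to-end time is a simple sum. The dictionary keys are illustrative.

```python
from typing import Dict

# Stage times in minutes, read off the timeline slides. The assignment of each
# number to a stage is an assumption based on the order the labels appear.
STAGE_MINUTES: Dict[str, Dict[str, float]] = {
    "f1.2xlarge, download and upload": {"reference": 10, "input": 20, "dragen": 60, "output": 15},
    "f1.2xlarge, input streaming": {"reference": 10, "input": 0.5, "dragen": 60, "output": 15},
    "f1.2xlarge, input and output streaming": {"reference": 2, "input": 0.5, "dragen": 60, "output": 0.5},
    "f1.16xlarge, optimized": {"reference": 1, "input": 0.5, "dragen": 17, "output": 0.5},
}


def end_to_end_minutes(stages: Dict[str, float]) -> float:
    """Total wall-clock time, assuming the stages run sequentially."""
    return sum(stages.values())


totals = {name: end_to_end_minutes(stages) for name, stages in STAGE_MINUTES.items()}
```

Under these assumptions the end-to-end time drops from 105 minutes to 85.5 with input streaming, to 63 with output streaming as well, and to 19 on f1.16xlarge.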

Product release roadmap
V1: Previous; V2: Available Today!; V3: Q1 2018
• Map/Align
• Sort/Dedup
• Variant Calling
Complete Suite
• Alt-Aware Mapping
• Adv. Error Detection
• Next Generation Accuracy
• Discrete VLRD
• Integrated VLRD
• Integrated FRD
• CNV
For Genomes and Exomes: Somatic V2, RNA, Germline V2, GATK Best Practices, Population, VLRD

DRAGEN Germline V2 pipeline
Gain in SNP detection performance
Large gain in indel detection performance
Comparison against best-performing GATK-HC mode (BQSR)

DRAGEN Somatic V2 pipeline
[Charts: four panels comparing DRAGEN Somatic v. 2 with Mutect2]

PrecisionFDA Challenge
PrecisionFDA Hidden Treasures: Warm Up Challenge, Oct. 2017
Best Overall Performance
https://precision.fda.gov/challenges/1/view/results

DRAGEN Workflow on AWS
AWS Services Used:
• EC2 instances
• AWS Batch
• F1 instances
https://aws.amazon.com/blogs/compute/accelerating-precision-medicine-at-scale/

Network architecture
Control
• Web VPC + Database VPC
• No customer data
Compute (region specific)
• Auto-scaled DRAGEN instances
• DRAGEN receives job description from control channel
• DRAGEN streams data from Amazon S3, performs computation, and uploads the results back to S3
• All DRAGEN <=> S3 communication is over HTTPS
• No inter-DRAGEN instance communication
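The stream, compute, and upload pattern in the compute tier can be sketched with file-like objects. This is a generic illustration, not the DRAGEN implementation: in-memory buffers stand in for the S3 input and output, the uppercase transform stands in for the computation, and all names are hypothetical.

```python
import io
from typing import BinaryIO, Callable


def stream_process(
    source: BinaryIO,
    sink: BinaryIO,
    transform: Callable[[bytes], bytes],
    chunk_size: int = 4,
) -> int:
    """Read the source in chunks, transform each chunk, and write it to the sink.

    Returns the number of input bytes consumed. Nothing is staged on local disk.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(transform(chunk))
        total += len(chunk)
    return total


# In-memory stand-ins for the input object and the output object.
source = io.BytesIO(b"acgtacgtac")
sink = io.BytesIO()
bytes_in = stream_process(source, sink, bytes.upper)
output = sink.getvalue()
```

Because each instance only reads its own input and writes its own output, instances need no communication with one another, which matches the last bullet above.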

Fastest Analysis of 1000 Whole Human Genomes

Guinness World Record: Analysis Overview
DRAGEN Germline Pipeline V2
1000x f1.2xlarge instances
Download FASTQs from S3 to EBS
Upload VCF files to S3
Average: 111 min
1,020 Genomes Analyzed

Summary
§ FPGA acceleration results in up to 43X improvement for genomics applications
§ Streaming I/O using Amazon S3 greatly increases throughput
§ Parallelizing across multiple FPGAs using F1.16xlarge results in another 4X+ acceleration
§ Per-second billing and Spot instances provide opportunities for additional cost savings
§ Deployment to F1 FPGA instances via Marketplace makes accelerated genomics widely available
