Producing vaccines is a significant and complex effort that spans manufacturing, biological materials, streaming data, and demanding computational challenges. In this session, speakers from Merck and Booz Allen Hamilton discuss how they partnered to apply AWS and data science techniques to pioneer new approaches for analyzing vaccine production yields. The solution combines a shared data lake service built on AWS services (such as Amazon EC2 and Amazon VPC) with Hadoop MapReduce, HDFS, Hive, and R to implement the data science infrastructure and analysis used to model complex biological processes. As a result of this project, Merck has analyzed 12 years of vaccine manufacturing data from 16 data sources, conducted over 15 billion calculations, and been recognized with the InformationWeek Elite Business Innovation Award for its innovative application of data science to improving vaccine yield rates and saving lives.
(HLS201) Using AWS and Data Science to Analyze Vaccine Yield | AWS re:Invent 2014
1. 11.018.14
Brian Keller, Data Science Lead, Booz Allen Hamilton
Jerry Megaro, Director, Advanced Analytics and Innovation, Merck Manufacturing
Nic Perez, Cloud Architecture Lead, Booz Allen Hamilton
Making a difference with data
6. Parametric models
Let the data tell the story
Input/Output modeling
Data experiments to enable discovery
Avoid failure
Failure is powerful… learn fast and adjust
Narrow scope of analysis
Ask bigger questions using atypical data
7. Human Insight + Actions
Data Management
Infrastructure
Machine Learning Free-Computation Alerting
Geographic
Language
Translation
Entity
Relationship
Event Grab
Dense/Sparse
Structured Unstructured Streaming
Provisioning Deployment Monitoring Workflow
Streaming Analytics
Streaming
indexes
Services (SOA)
Analytics and Discovery
Views and Indexes
HDFS/Data Lake
Metadata Tagging
Data Sources
Infrastructure/Management
Visualization, Reporting, Dashboards, and Query Interface
14. Clustering in this region indicates parameter similarity is associated with high yield
Clustering in this region indicates parameter similarity is associated with low yield
[Figure: Similarity Matrix. Batches 1 through 5 appear on both axes in order of increasing yield; the scale shows the similarity score from low to high.]
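The similarity matrix on this slide can be illustrated with a minimal sketch. It assumes each batch is described by a vector of process parameters and that similarity is measured with cosine similarity; the talk does not state which similarity measure was used, and every batch value below is hypothetical. Batches are sorted by increasing yield, so a block of high scores among the highest-yield batches would correspond to the clustering the slide describes.

```python
import math

# Hypothetical batches: yields and parameter vectors are made-up
# illustration values, not data from the talk.
batches = {
    "Batch 1": {"yield": 0.55, "params": [1.0, 0.1, 0.2]},
    "Batch 2": {"yield": 0.50, "params": [0.9, 0.2, 0.1]},
    "Batch 3": {"yield": 0.60, "params": [0.5, 0.5, 0.5]},
    "Batch 4": {"yield": 0.90, "params": [0.1, 0.9, 1.0]},
    "Batch 5": {"yield": 0.85, "params": [0.2, 1.0, 0.9]},
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length parameter vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(batches):
    """Return batch names ordered by increasing yield and their pairwise scores."""
    ordered = sorted(batches, key=lambda name: batches[name]["yield"])
    matrix = [
        [
            cosine_similarity(batches[row]["params"], batches[col]["params"])
            for col in ordered
        ]
        for row in ordered
    ]
    return ordered, matrix


if __name__ == "__main__":
    names, scores = similarity_matrix(batches)
    for name, row in zip(names, scores):
        print(name, " ".join(f"{score:.2f}" for score in row))
```

Cosine similarity is only one possible choice here; a distance-based score over normalized parameters would fit the same layout.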
17. Redshift-Based Data Marts
Amazon EC2
Elastic MapReduce
Hadoop, Solr Search Solution
Legacy
Enterprise RDS
AES Encrypted S3 Data Lake
VPC
Enterprise
Active Directory
JAX-RS/Tomcat-Based REST Services on Elastic Beanstalk
Insights Angular, D3.js Web UI
Accelerated Reasoning
Security
Cell-Level Visibility,
Life Science Informatics via
Custom Solr Plug-ins
Flexible Data Processing Pipelines
Business Users
Data Scientists
22. –Monitor, identify, and alert on abnormal user activity
–Govern administrative rights; policy based enforcement
–Hardened virtual appliance; do not allow direct RDP/SSH access to management/security appliances
–IA has purview into every log (firewall/router logs, crypto logs, application logs, systemd logs, OS logs, SCCM, etc.)
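As a rough illustration of the first point above (monitor, identify, and alert on abnormal user activity), here is a minimal sketch that compares each user's event count for the current day against that user's own history. The talk does not describe the actual detection logic: the function name, the z-score rule, and the threshold are all assumptions made for this example.

```python
from statistics import mean, pstdev


def flag_abnormal_users(baseline_counts, today_counts, z_threshold=3.0):
    """Return users whose activity today deviates strongly from their baseline.

    baseline_counts maps user -> list of past daily event counts.
    today_counts maps user -> event count for the current day.
    Both structures are hypothetical; real input would come from parsed logs.
    """
    flagged = []
    for user, today in today_counts.items():
        history = baseline_counts.get(user, [])
        if len(history) < 2:
            # No usable baseline: surface the user for manual review.
            flagged.append(user)
            continue
        average = mean(history)
        spread = pstdev(history)
        if spread == 0:
            # Perfectly constant history: any change counts as abnormal.
            if today != average:
                flagged.append(user)
            continue
        if abs(today - average) / spread > z_threshold:
            flagged.append(user)
    return flagged


if __name__ == "__main__":
    baseline = {"alice": [10, 12, 11, 9, 10], "bob": [5, 5, 5, 5]}
    today = {"alice": 11, "bob": 40, "carol": 3}
    print(flag_abnormal_users(baseline, today))
```

A per-user baseline keeps naturally busy accounts from drowning out quiet ones; a production system would more likely use rolling windows and several signals rather than a single count.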