This document discusses how big data and machine learning can be combined using Amazon Web Services (AWS). It covers common big data challenges around which tools to use, what data is available, and how to get started. It then demonstrates how to populate and query a data catalog on AWS to understand available data. Finally, it shows how machine learning can be driven by big data to generate better insights and products using agile AWS services.
10. Data Lake Components
A data lake on AWS is built from six components, which the diagram groups across metadata, user access, security/governance, data movement, and analytics and machine learning:
Data Ingestion – get your data into S3 quickly and securely (Kinesis Firehose, Snowball, Database Migration Service, Direct Connect)
Central Storage – secure, cost-effective storage in Amazon S3
Catalog & Search – access and search metadata (DynamoDB, Elasticsearch)
Access & User Interface – give your users easy and secure access (API Gateway, Identity & Access Management, Cognito, QuickSight)
Processing & Analytics – use predictive and prescriptive analytics to gain better understanding (Amazon AI, EMR, Redshift, Athena, Kinesis, RDS)
Protect and Secure – use entitlements to ensure data is secure and users' identities are verified (Security Token Service, CloudWatch, CloudTrail, Key Management Service)
11. Data Lake Components
The same architecture as the previous slide, now with AWS Glue added to provide ETL.
35. Machine Learning requires new tools and interfaces
Machine learning/deep learning and business reporting both draw on the same data catalog and central storage, while data scientists and data engineers work through purpose-built interfaces such as an IDE and Amazon SageMaker.
We are at a big data and BI summit, so I think most folks are familiar with big data and some form of the Vs (3Vs, 5Vs, 7Vs), each representing some definition of a big data system.
You don’t have to take my word for it… reports on the growth of data are readily available most everywhere you look.
Top-Left – growth of unstructured data is vastly outpacing structured data
Top-Right – the amount of data will grow 50x between 2010 and 2020
Bottom-Left – We already have PB/day customers. We’re trending towards EB and ZB data sets
Bottom-Right – Data from sensors/connected-devices and social media are now described in multiples of the global population
Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. These leaders were able to do new types of analytics, like machine learning, over new sources such as log files, click-stream data, social media, and internet-connected devices stored in the data lake. This helped them identify and act upon opportunities for business growth faster: attracting and retaining customers, boosting productivity, proactively maintaining devices, and making informed decisions.
Lock, Michael (Aberdeen), Angling for Insight in Today's Data Lake, October 2017, p. 7.
Amazon S3 provides object storage built to store and retrieve any amount of data.
S3 has unmatched durability and availability, built from the ground up to deliver 99.999999999% durability at exabyte scale. Only S3 automatically replicates your data across three Availability Zones within a single region, giving you unmatched resilience to single-data-center issues like power failures. Only S3 lets you do cross-region replication seamlessly, without a separate storage class, to any number of specified destination regions.
Amazon S3 has the best security, compliance, and audit capabilities of any storage service. It can automatically encrypt your data and gives you three choices for key management: S3-managed keys, customer-provided keys, and AWS Key Management Service (KMS). Only S3 gives you encryption when replicating data across regions, and lets you use separate accounts for the source and destination regions, protecting against malicious insider deletion of data. Only S3 integrates with an AI-powered security service, Amazon Macie, to monitor for, detect, and alert on anomalies that might indicate the early stages of an attack. To meet compliance regulations, you can log and audit all account activity, including how, when, and by whom objects in S3 are accessed, through AWS CloudTrail. These features allow AWS to support security standards and compliance certifications for virtually every regulatory agency around the globe.
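To make the encryption options concrete, here is a minimal boto3 sketch of an SSE-KMS upload; the bucket name, object key, and KMS key alias are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload an object encrypted with an AWS KMS-managed key (SSE-KMS).
# "my-data-lake" and the key alias are hypothetical.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2017/10/events.json",
    Body=b'{"event": "page_view"}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-data-lake-key",
)
```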
Amazon S3 is the only storage service that lets you operate at the object level rather than the bucket level. This allows you to set fine-grained access controls and security policies to restrict access to specific objects, and to create lifecycle policies that automatically delete groups of objects or tier them into lower-cost storage.
Amazon S3 is the only storage system that can retrieve just the subset of data within an object that you need, with S3 Select, improving query performance by as much as 400 percent at lower cost.
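As an illustration of S3 Select, a minimal boto3 sketch; the bucket, key, and column names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to scan a CSV object server-side and return only matching rows.
resp = s3.select_object_content(
    Bucket="my-data-lake",
    Key="raw/trades/2017-10-01.csv",
    ExpressionType="SQL",
    Expression="SELECT s.symbol, s.price FROM S3Object s WHERE s.symbol = 'AMZN'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; Records events carry the result payload.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```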
AWS provides more ways to bring data into your data lake than anywhere else. These include importing real-time streaming data with Amazon Kinesis, establishing a dedicated network connection between your premises and AWS with AWS Direct Connect, using secure appliances to transfer large amounts of data with AWS Snowball, using a ruggedized shipping container to transfer data at exabyte scale with AWS Snowmobile, and migrating your databases with AWS Database Migration Service.
The Amazon S3 ecosystem has twice as many partner integrations as anyone else, with tens of thousands of consulting, systems integrator, and independent software vendor partners. This makes it easier to use S3 for primary storage, backup, archive, and disaster recovery with applications you already own, from vendors like NetApp, EMC, Veritas, and others.
Once you have started to build your data lake, AWS provides the broadest and most diverse set of options to analyze and extract value from your data, whether for analytics, machine learning, or IoT use cases. You get the tools and frameworks of your choice, with the broadest set of purpose-built services available, all running directly on the data lake without the need to move data into a separate analytics system.
Now that I know what data I have…
In the old world, you knew your schema, you got a BI tool, and you asked it questions based on the structure.
You knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model.
When you think about data in the context of the 3 Vs, you need different tooling, and you're going to want to ask questions of data that isn't structured.
In the new world of data analysis your questions are going to evolve and change over time.
You need to be able to collect, store and analyze data without being constrained by resources, whether compute, storage, or even the tool being used.
You want a purpose-built tool to derive the type of analysis – the type of insight – that you’re looking for.
With the rise of Big Data, the ecosystem is quite active, and the tools are rapidly changing…
You need the ability to evolve with the tools and your own needs.
Many customers spend time and effort in analysis to find the perfect tool for their needs.
At the rate the ecosystem is evolving, that tool might no longer be the best if you’ve spent so much time in research.
Our recommendation: find a tool that meets the need, then iterate on the tooling as you learn more about your actual needs.
In order to do that, you need to have a good metadata management process, portable data formats, and easy access to the data.
Otherwise, your data is in jail.
Many customers tell us about the pain they experience when their data is locked behind a vendor-specific format or a vendor-controlled interface.
Problem #1 – Many organizations don’t know what they have.
When you accumulate such a diversity of data, you need mechanisms to understand what data you have, where it is located, and in what format.
This is metadata management. And if not managed properly (or at all), the data is essentially lost. It is taking up space, but you have no means to put it to use.
A common issue, regardless of whether it is on-prem or in the cloud, is the lack of a metadata management approach from the onset.
The Financial Industry Regulatory Authority (FINRA) oversees more than 3,900 securities firms with approximately 640,000 brokers.
FINRA processes approximately 6 terabytes of data and 37 billion records on an average day to build a complete, holistic picture of market trading in the U.S. On busy days, the stock markets can generate 75 billion+ records.
The way they’re able to make all this data useful, whether to data scientists or business users or others, is through a metadata system they developed and open sourced, called HERD.
This is the same platform that is used by LinkedIn, for example.
But most organizations don't actually go off and build their own tooling.
Ivy Tech is a community college with 60,000 online and in-person course sections, 8,300 staff, 170,000 students, and 130 locations.
Ivy Tech uses metadata capabilities provided by AWS to manage their information.
These are the main components of Glue.
Glue comprises a data catalog, which is a central metadata repository; an ETL engine that can auto-generate Python code; and a flexible scheduler that handles dependency resolution, job monitoring, and retries. Together, these automate much of the heavy lifting involved in discovering, categorizing, cleaning, enriching, and moving data, so you can spend more time analyzing your data.
Glue automatically discovers your data, determines the schema, and builds your data catalog. The Glue data catalog provides out-of-the-box integration with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
The ETL code Glue generates is just Python code that is entirely customizable, reusable, and portable. You can edit this code using your favorite IDE or notebook and share it with others using GitHub.
And finally, Glue is serverless. There are no resources to manage and you only pay for the resources your jobs consume while they run.
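Since the generated ETL code is ordinary Python, a Glue job script looks roughly like the sketch below; it uses the awsglue library, and the database, table, and S3 path names are hypothetical.

```python
# A minimal sketch of a Glue ETL job in the style of the scripts Glue generates.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Glue Data Catalog.
events = glue_context.create_dynamic_frame.from_catalog(
    database="my_data_lake", table_name="raw_events"
)

# Write it back to S3 as Parquet for cheaper, faster analytics.
glue_context.write_dynamic_frame.from_options(
    frame=events,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/events/"},
    format="parquet",
)

job.commit()
```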
Glue includes a feature called crawlers, which discover metadata for the catalog automatically. Crawlers can operate over your relational databases and data warehouses as well as your data lakes on S3. When crawling a source such as S3, a crawler first identifies the format of the data (for example, CSV, JSON, Parquet, or Avro) and then determines the fields and the type of each field within the data.
It really does a great job, but you can also go in and modify the outputs. It can identify both Hive-compliant and non-compliant partitioning of data.
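Setting a crawler up is itself a small API call; a hedged boto3 sketch, where the crawler name, IAM role ARN, database, and S3 path are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Point a crawler at an S3 prefix; it infers format, schema, and partitions.
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_data_lake",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/events/"}]},
)
glue.start_crawler(Name="raw-events-crawler")
```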
OK – I’m jazzed… I know the pitfalls. Now… What do I do?
Netflix data pipeline
~500 billion events and ~1.3 PB per day
~8 million events and ~24 GB per second during peak demand
There are several hundred event streams flowing through the pipeline. For example:
Video viewing activities
UI activities
Error logs
Performance events
Troubleshooting & diagnostic events
We see Netflix started with batch analytics, collecting their data using Apache Chukwa and saving it in an S3-backed data lake.
After they built this, they needed to start doing real-time analytics on the data. They easily pushed the new version out, branching off and creating a Kafka-based backend.
To improve reliability and scale, they shifted from the Chukwa front end pushing to Kafka to having Kafka publish and route specific messages to the consumer Kafka topics.
They then shifted and built their log analytics on Kinesis Data Streams.
The pipeline built on Amazon Kinesis enables Netflix to identify ways to increase efficiency, reduce costs, and improve resiliency for the best customer experience.
Zillow Group increases machine-learning calculation performance and scalability and delivers near-real-time home-valuation data to customers using AWS. The company houses a portfolio of the largest online real-estate and home-related brands. Zillow Group runs the Zestimate, its machine learning–based home-valuation tool, on Amazon Kinesis and Apache Spark on Amazon EMR.
Zillow uses Kinesis Streams to collect public-record data and MLS listings, and then updates home-value estimates in near real time so home buyers and sellers get the most up-to-date figures. Zillow also sends the same data, both structured and unstructured (such as images), to its S3 data lake using Kinesis Firehose, so that all of its applications can work with the most recent information.
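A producer in that style is only a few lines; a hedged boto3 sketch with a hypothetical stream name and payload.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Push one record into a stream; consumers (and a Firehose delivery stream
# landing the data in S3) pick it up downstream.
kinesis.put_record(
    StreamName="listing-updates",
    Data=json.dumps({"listing_id": "12345", "price": 450000}).encode("utf-8"),
    PartitionKey="12345",
)
```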
DigitalGlobe went all in on AWS to meet the growing demand for commercial geo-intelligence, migrating its entire 17-year imagery archive to the cloud. DigitalGlobe is one of the world’s leading providers of high-resolution earth imagery, data, and analysis. The company used AWS Snowmobile to move 100 petabytes of data to the cloud, allowing it to move away from large file-transfer protocols and delivery workflows. DigitalGlobe also uses Amazon SageMaker to handle machine learning at scale. Dr. Walter Scott, CTO and founder at DigitalGlobe, spoke at re:Invent 2017.
Cache hit rate improved by more than a factor of two, going up to 83% and sometimes trending toward 90%.
Stripe uses Athena
Amazon.com uses DynamoDB and a suite of other serverless services in Herd.
Processing delays decreased from 1 second to 100 milliseconds.
Herd controls the business logic for processing all Amazon.com customer orders worldwide, orchestrating more than 1,300 workflows for everything from order processing to fulfillment-center operations to coordinating parts of the Amazon Alexa backend. A mission-critical system used by more than 300 Amazon engineering teams, Herd executes more than 4 billion workflows on peak days.
Requests from Alexa, the Amazon.com sites, and the Amazon fulfillment centers totaled 3.34 trillion, peaking at 12.9 million per second. According to the team, the extreme scale, consistent performance, and high availability of DynamoDB let them meet the needs of Prime Day without breaking a sweat.
DynamoDB is used by Lyft to store GPS locations for all their rides, Tinder to store millions of user profiles and make billions of matches, Redfin to scale to millions of users and manage data for hundreds of millions of properties, Comcast to power their XFINITY X1 video service running on more than 20 million devices, BMW to run its car-as-a-sensor service that can scale up and down by two orders of magnitude within 24 hours, Nordstrom for their recommendations engine reducing processing time from 20 minutes to a few seconds, Under Armour to support its connected fitness community of 200 million users, Toyota Racing to make real time decisions on pit-stops, tire changes, and race strategy, and another 100,000+ AWS customers for a wide variety of high-scale, high-performance use cases.
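For readers new to DynamoDB, the core access pattern behind all of these workloads is a fast put and get by primary key; a minimal boto3 sketch with a hypothetical table and item.

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# "rides" and its attributes are hypothetical placeholders.
table = dynamodb.Table("rides")

# Write and read a single item by its primary key.
table.put_item(Item={"ride_id": "r-1001", "lat": "47.6062", "lon": "-122.3321"})
item = table.get_item(Key={"ride_id": "r-1001"})["Item"]
print(item)
```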
With Athena, you simply put your data in S3 and submit SQL against it.
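For example, a hedged boto3 sketch of submitting a query and reading back the results; the database, table, and results bucket are hypothetical.

```python
import time

import boto3

athena = boto3.client("athena")

# Submit SQL directly against files in S3.
query = athena.start_query_execution(
    QueryString="SELECT symbol, avg(price) AS avg_price FROM trades GROUP BY symbol",
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then print each row of the result set.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

results = athena.get_query_results(QueryExecutionId=query_id)
for row in results["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```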
Why is AI/ML so often talked about side by side with data at conferences?
Data really fuels AI/ML. AI/ML is all about finding patterns in data and using those patterns to make predictions, recognize images, create speech, and provide other intelligent capabilities.
This in turn creates a flywheel effect: new intelligent capabilities increase the user base and customer usage, which creates more data, which allows organizations to better understand their users and drive analytics and new intelligent systems.
“By using Amazon SageMaker, DigitalGlobe's cache rate improved by more than a factor of two, often being around 83% and sometimes trending to 90% cache hit. This allowed them to also cut their cloud storage cost in half by better utilizing their S3-optimized cache and retrieving less from their 100+ PB archive.”
Purpose: showcase the power of ML to identify data utility
The blue dots represent what humans decided to cache (almost the whole world) and the orange dots represent what our customers requested access to over a three month period. We were missing the mark by a long shot.
http://blog.digitalglobe.com/industry/using-machine-learning-to-save-money-on-cloud-data-storage/
DigitalGlobe: two different use cases –
As the world’s leading provider of high-resolution Earth imagery, data and analysis, DigitalGlobe works with enormous amounts of data every day.
Use Case 1:
As more and more imagery is collected from their growing constellation of satellites, it is critical for DigitalGlobe to predict and cache only the most relevant imagery at any given point in time, allowing them to take advantage of AWS's tiered storage products to optimize their costs. They are relying on machine learning as the business grows. By analyzing 17 years of changing access patterns to this imagery data, they can predict how long to keep the data readily available in Amazon S3 before moving it to cold archive in Amazon Glacier, for example. With Amazon SageMaker's machine learning algorithms, they can identify and predict exactly what imagery is going to be used and requested, in real time, to drive down the cost of managing petabytes of data at scale. And the engineers using SageMaker to do this knew nothing about machine learning when they started!
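The mechanical half of that pattern, tiering objects to Glacier by lifecycle rule once the model says they have gone cold, can be sketched as below; the bucket, prefix, and 90-day window are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Transition imagery under a prefix to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="imagery-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-imagery",
                "Filter": {"Prefix": "imagery/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```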
Use Case 2:
DigitalGlobe is making it easier for people to find, access, and run compute against its entire data archive in the cloud in order to apply deep learning to satellite imagery. They plan to use Amazon SageMaker to train models against petabytes of earth-observation imagery datasets using hosted Jupyter notebooks, so DigitalGlobe's Geospatial Big Data Platform (GBDX) users can just push a button, create a model, and deploy it all within one scalable, distributed environment.
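In SageMaker Python SDK terms, that push-button train-and-deploy flow looks roughly like the sketch below; the container image, IAM role, instance types, and S3 path are all hypothetical placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Train a model from a custom container against imagery in S3.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/imagery-model:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)
estimator.fit({"train": "s3://imagery-archive/training/"})

# Deploy the trained model behind a real-time inference endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```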