Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
1. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Adir Sharabi
Solutions Architect, Amazon Web Services
Yair Weinberger
CTO and Co-Founder, Alooma
Big Data on AWS: To infinity and beyond!
2. Big Data Track Agenda
# | Time | Title | Speaker
1 | 13:15 – 14:00 | Big Data on AWS: To infinity and beyond! | Adir Sharabi – Solutions Architect
2 | 14:10 – 14:55 | Amazon Kinesis – Building Serverless real-time solution | Roy Ben Alta – Business Development Manager
3 | 15:05 – 15:50 | Data preparation and transformation: Spin your straw into gold | Daniel Haviv – Specialist Solutions Architect, Analytics
4 | 16:00 – 16:45 | Success has many query engines | Eden Perry – Partner Solutions Architect
5 | 16:50 – 17:30 | Connecting the dots: How Amazon Neptune and Graph Databases can transform your business | Andi Gutmans - GM, Neptune & Elasticsearch; Brad Bebee - Principal Prod Mgmt - Tech, AWS Neptune
3. Your Data Sources
Multiple sources and formats… and growing every day
Categories: Records, Documents and files, Streams
Sources shown: Amazon RDS, Amazon DynamoDB, Amazon Redshift, AWS IoT, On Premises databases, Spreadsheets, Infrastructure logs, Clickstream data, Mobile app data, Social media data, Device data, Sensor data, ERP, Web Clickstream, Mobile Apps
4. Data Challenges
• Data Visibility (chart with a time axis running 1990, 2000, 2010, 2020)
• Multiple consumers and requirements: Analysts, Applications, Data Scientists, Business Users
• Multiple Access Mechanisms: API Access, BI Tools, Notebooks
5. Traditionally, Analytics Used to Look Like This
[Diagram: OLTP, ERP, CRM, LOB; Data Warehouse; Business Intelligence]
• Relational data
• Schema defined prior to data load
• TBs-PBs Scale
• Operational reporting and ad hoc
• Large initial capex + $10K–$50K/TB/Year
6. Data Lakes Extend the Traditional Approach
• Relational and non-relational data
• Schema defined during analysis
• Scale storage and compute independently
• Diverse analytical engines to gain insights
• Designed for low-cost storage and analytics
[Diagram: OLTP, ERP, CRM, LOB; Data Warehouse; Business Intelligence; Data Lake; Devices, Web, Sensors, Social; Catalog; Machine Learning; DW Queries; Big data processing; Interactive; Real-time]
7. Amazon S3 as Data Lakes Storage Layer
• Many ways to bring all kinds of data
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Integration with Big Data Tools
• Run any analytics on the same data without movement
• Cost effective - Store data at $0.023 / GB / Month
Ingest: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams
Analyze: Redshift, EMR, Athena, Kinesis, Elasticsearch Service, AI Services
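As a rough worked example of the storage price quoted on this slide, the sketch below converts $0.023 / GB / Month into a yearly cost per terabyte. It assumes 1 TB = 1,024 GB and a flat rate covering storage only (no request, transfer, or storage-class differences); the function name is illustrative.

```python
S3_PRICE_PER_GB_MONTH = 0.023  # USD, the figure quoted on the slide


def s3_yearly_cost_usd(terabytes: float, gb_per_tb: int = 1024) -> float:
    """Yearly storage cost at a flat per-GB monthly rate (storage only)."""
    return terabytes * gb_per_tb * S3_PRICE_PER_GB_MONTH * 12


if __name__ == "__main__":
    print(f"1 TB for a year: ${s3_yearly_cost_usd(1):,.2f}")
```

Under these assumptions one terabyte comes to roughly $283 per year, against the $10K–$50K/TB/Year quoted earlier for the traditional data warehouse.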
8. Simplified Big Data Pipeline
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Stages: Ingest, Store (Amazon S3), Process & Analyze, Consume
9. Lots of ingestion tools
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
Store: Amazon S3
Process & Analyze
Consume
10. Variety of data processing tools
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
Store: Amazon S3
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume
11. Multiple ways to consume the data
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
Store: Amazon S3
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume: Amazon QuickSight, Jupyter, Zeppelin, HUE, Amazon API Gateway, BI Tools
12. Because data is NEVER perfect
• Clean
• Transform
• Concatenate
• Convert to better formats
• Schedule transformations
• Event-driven transformations
• Transformations expressed as code
Amazon EMR: Spark and Hive running on EMR
AWS Glue: Event-based serverless ETL engine
AWS Lambda: Trigger-based code execution
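To make "event-driven transformations" and "trigger-based code execution" concrete, here is a minimal sketch of a Lambda-style handler reacting to an S3 object-created notification. The event shape follows the S3 notification format; the handler name, the `clean_line` rule, and the choice to return the list of objects instead of calling S3 are assumptions made to keep the example self-contained.

```python
from urllib.parse import unquote_plus


def clean_line(line: str) -> str:
    """Hypothetical cleaning rule: trim whitespace and lowercase the line."""
    return line.strip().lower()


def handler(event, context=None):
    """List the objects named in an S3 notification event.

    A real function would read each object, apply clean_line to its
    contents, and write the result to a target location.
    """
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append({"bucket": bucket, "key": key})
    return objects
```

Keeping the transformation in a small pure function such as `clean_line` is one way to keep "transformations expressed as code" testable without any AWS access.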
13. ETL when you need it
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
Store: Amazon S3
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume: Amazon QuickSight, Jupyter, Zeppelin, HUE, Amazon API Gateway, BI Tools
14. Realtime - in-stream processing
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
In stream process: Spark Streaming & Flink, Amazon Kinesis Analytics
Store: Amazon S3
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume: Amazon QuickSight, Jupyter, Zeppelin, HUE, Amazon API Gateway, BI Tools
15. AWS Glue Data Catalog
Central Metadata Catalog for the data lake
• One per account
• Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources
• Serverless
We added a few extensions:
• Search over metadata for data discovery
• Manage Connections – JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
16. AWS Glue Data Catalog Crawlers
Catalogs Your Data
Crawlers automatically build your Data Catalog and keep it in sync
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when crawler runs
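"Hive style partitions" refers to the `name=value` folder convention in object keys. The sketch below only illustrates that naming convention; it is not how the Glue crawler is implemented, and the function name and sample key are hypothetical.

```python
def parse_hive_partitions(key: str) -> dict:
    """Extract name=value path segments from an object key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    print(parse_hive_partitions("tweets/year=2018/month=03/day=14/part-0000.parquet"))
```

Laying keys out this way lets query engines that understand the convention skip partitions a query does not touch.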
17. Write once, catalog once, read multiple, ETL Anywhere
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
In stream process: Spark Streaming & Flink, Amazon Kinesis Analytics
Store: Amazon S3
Data Catalog
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume: Amazon QuickSight, Jupyter, Zeppelin, HUE, Amazon API Gateway, BI Tools
18. Yair Weinberger
CTO and Co-Founder, Alooma
21. Data Lake - The Goal
Emilykil (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons
22. Data Lake - What Sometimes Happens
NatalieMaynor from Jackson, Mississippi, USA - Winter Ugliness, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=5503067
23. Data Lake vs. Data Mart
Emilykil (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons; Abras2010 - Walmart, uploaded by SchuminWeb, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=14571617
25. Alooma Usage of S3 as a Data Lake
● Separate the data of different tenants
○ IAM Role based access ensures data isolation
● Allow Alooma tenants to replay their data from any data source or time
● Staging area before loading into Data Warehouse
● Storage for things that need infinite retention (e.g. audit logs)
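One common way to get the role-based isolation described above is to scope each tenant's IAM role to that tenant's key prefix. The sketch below builds such a policy document as a plain dictionary. It is an illustration only, not Alooma's actual configuration: the `tenants/<id>/` key layout, the function name, and the chosen actions are all assumptions.

```python
import json


def tenant_policy(bucket: str, tenant_id: str) -> dict:
    """Build an IAM policy document limiting a role to one tenant's prefix."""
    prefix = f"tenants/{tenant_id}/"  # hypothetical key layout
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object reads and writes only under the tenant's prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                # Listing is a bucket-level action, so restrict it by prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(tenant_policy("example-lake", "acme"), indent=2))
```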
27. S3 as Data Lake - Tips and Tricks
● Use Server-Side encryption to provide automatic encryption at rest - but it does impact performance
● Loading data in high volume
○ Keys in S3 are partitioned by prefix
○ Use randomly prefixed or at least sharded filenames
● Use Object Expiration to avoid storing unnecessary data
Important resource:
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
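The advice on sharded filenames can be sketched as follows. This version derives the prefix from a hash of the filename, so it is deterministic rather than truly random, which keeps keys reproducible while still spreading them across prefixes. The function name, the shard count, and the sample filename are assumptions.

```python
import hashlib


def sharded_key(filename: str, shards: int = 16) -> str:
    """Prepend a hash-derived shard prefix to spread keys across prefixes."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02x}/{filename}"


if __name__ == "__main__":
    print(sharded_key("events/2018-03-14/part-0001.json"))
```

Because the prefix is a function of the filename, a reader can recompute the full key from the filename alone.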
29. Let’s take an Example
30. AWS Tweets Example
Record-level data
Business Questions
• What’s the overall sentiment today?
• What’s the sentiment trend now?
• What’s the most popular language?
• What’s the effect of temperature on tweet sentiment?
Technical Requirements
• Scale
• Resilience
• Minimal Operational overhead
• Agile
• Cost Effective
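Two of the business questions on this slide, overall sentiment and most popular language, reduce to simple aggregations over record-level data. The sketch below assumes each tweet is a dictionary with a numeric `sentiment` score and a `lang` code; both field names and the sample values are hypothetical.

```python
from collections import Counter
from statistics import mean


def overall_sentiment(tweets):
    """Mean sentiment score across all tweets, or None when there are none."""
    scores = [t["sentiment"] for t in tweets]
    return mean(scores) if scores else None


def most_popular_language(tweets):
    """Most frequent language code, or None when there are no tweets."""
    counts = Counter(t["lang"] for t in tweets)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    sample = [
        {"lang": "en", "sentiment": 0.5},
        {"lang": "en", "sentiment": -0.5},
        {"lang": "he", "sentiment": 0.9},
    ]
    print(overall_sentiment(sample), most_popular_language(sample))
```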
31. [Architecture diagram]
Stages: Ingest, Store, Process & Analyze, Consume
Layers: Speed Layer, Batch Layer
Components: Kinesis Data Streams, Kinesis Firehose Delivery Streams, DynamoDB, AWS Lambda, Kinesis Analytics, Raw Bucket, Parquet Bucket, Glue Data Catalog, Spark/EMR, Glue ETL, Athena, Redshift Spectrum, QuickSight, Real time Web UI
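Assuming the AWS Lambda function in this architecture is fed from a Kinesis stream, its first step would be to decode the incoming records. The sketch below follows the Lambda event format for Kinesis, where each record carries a base64-encoded payload. That the payloads are JSON-encoded tweets is an assumption, as is the function name.

```python
import base64
import json


def decode_kinesis_records(event):
    """Decode the base64 JSON payloads in a Lambda Kinesis event."""
    tweets = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        tweets.append(json.loads(payload))
    return tweets
```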
32. Thank You!