With hundreds of new and sometimes disparate tools, it’s hard to keep pace. Amazon Web Services provides a broad and fully integrated portfolio of cloud computing services to help you build, secure, and deploy your big data applications.
Attend this webinar for an overview of the big data options available in the AWS Cloud – including popular frameworks such as Hadoop and Spark, NoSQL databases, and more. Learn about ideal use cases, anti-patterns, performance, and interfaces. Finally, see how you can build valuable applications with a real-life example.
Learning Objectives:
Learn about big data tools available at AWS
Understand ideal use cases
Learn some of the key considerations, such as performance, scalability, elasticity, and availability, when selecting big data tools
Who Should Attend:
Data Architects, Data Scientists, Developers
2. Table of Contents
• Big Data Introduction for AWS
• Big Data Analytics Options on AWS
• Usage Patterns & Anti-Patterns
• Performance & Cost
• Durability & Scalability
• Interfaces
• Building Big Data Analytic Solutions – The AWS Approach
• Example Scenarios
3. Big Data on AWS
Immediate Availability. Deploy instantly; no hardware to procure, no infrastructure to maintain and scale.
Trusted & Secure. Designed to meet the strictest requirements. Continuously audited, including certifications such as ISO 27001, FedRAMP, DoD CSM, and PCI DSS.
Broad & Deep Capabilities. Over 50 services and hundreds of features to support virtually any big data application and workload.
Hundreds of Partners & Solutions. Get help from a consulting partner or choose from hundreds of tools and applications across the entire data management stack.
6. Amazon Redshift
• Ideal Usage Patterns - Analyze
• Sales data
• Historical data
• Gaming data
• Social trends
• Ad data
• Performance
• Massively Parallel Processing
• Columnar Storage
• Data Compression
• Zone Maps
• Direct-attached Storage
• Cost model
• No upfront costs or long-term commitments
• Free backup storage equivalent to 100% of provisioned storage
With columnar storage, you only read the data you need
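The “read only the data you need” claim can be made concrete with a back-of-the-envelope sketch; the table layout and per-column widths below are hypothetical, chosen purely for illustration:

```python
def bytes_scanned(row_count, column_widths, selected_columns, columnar):
    """Estimate bytes read for a query touching `selected_columns`.
    A row store must read whole rows; a column store reads only the
    selected columns (compression is ignored for simplicity)."""
    if columnar:
        width = sum(column_widths[c] for c in selected_columns)
    else:
        width = sum(column_widths.values())
    return row_count * width

# Hypothetical table: 1M rows, four columns with these widths in bytes.
widths = {"order_id": 8, "customer": 32, "amount": 8, "notes": 200}
full = bytes_scanned(1_000_000, widths, ["amount"], columnar=False)
col = bytes_scanned(1_000_000, widths, ["amount"], columnar=True)
# A SUM(amount) scan reads 8 MB column-wise instead of 248 MB row-wise.
```

Compression and zone maps (skipping blocks whose min/max range excludes the predicate) shrink the columnar number further.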
7. Amazon Redshift
• Scalability & Elasticity
• Resize or scale: the number or type of nodes can be changed with a few clicks
• Durability and Availability
• Replication
• Backup
• Automated recovery from failed drives & nodes
• Interfaces
• JDBC/ODBC interface with BI/ETL tools
• Amazon S3 or DynamoDB
• Anti-patterns
• Small datasets
• OLTP
• Unstructured Data
• Blob Data
(Diagram: Redshift cluster architecture: 10 GigE (HPC) networking; ingestion, backup, and restore via Amazon S3/DynamoDB; client access over JDBC/ODBC.)
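Loading from Amazon S3 over the JDBC/ODBC interface is typically done with a COPY statement, which ingests files in parallel across the cluster’s slices. A minimal sketch follows; the table, bucket, prefix, and IAM role names are hypothetical:

```python
def build_copy_statement(table, bucket, prefix, iam_role):
    """Build a Redshift COPY statement that loads gzipped, pipe-delimited
    files from S3 in parallel across all cluster slices."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER '|' GZIP;"
    )

sql = build_copy_statement("sales", "my-bucket", "sales/2016/",
                           "arn:aws:iam::123456789012:role/RedshiftCopy")

# Executing it requires a live cluster, e.g. via psycopg2:
# import psycopg2
# conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
#                         port=5439, dbname="dev", user="admin", password="...")
# conn.cursor().execute(sql)
```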
9. Amazon Kinesis Streams
• Ideal Usage Patterns – streaming data ingestion and processing
• Real-time data analytics
• Data feed intake and processing, e.g. logs
• Real-time metrics and reporting
• Performance
• Throughput capacity in terms of shards
• Cost model
• No upfront costs or long-term commitments
• Pay-as-you-go pricing
• Hourly charge per shard
• Charge per 1 million PUT transactions
10. Amazon Kinesis Streams
• Scalability & Elasticity
• Scale – increase number of shards
• Durability and Availability
• Replication
• Cursor preservation
• Interfaces
• Input – data coming in
• Output – data going out
• Kinesis Firehose
• Anti-patterns
• Small scale consistent throughput
• Long term data storage and analytics
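Getting data onto a stream comes down to the Kinesis `PutRecord` call, where the partition key determines which shard receives the record. A minimal sketch, with a hypothetical stream name and payload:

```python
import json

def build_put_record(stream_name, payload, partition_key):
    """Build the request for Kinesis `PutRecord`; records with the same
    partition key hash to the same shard, preserving their order."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": partition_key,
    }

req = build_put_record("clickstream", {"page": "/home", "user": 42}, "user-42")

# With AWS credentials configured:
# import boto3
# boto3.client("kinesis").put_record(**req)
```

Because throughput is provisioned per shard, a high-cardinality partition key (like a user ID) spreads load evenly across shards.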
11. Launch a cluster in minutes
Pay by the hour and save with Spot Instances
MapReduce, Apache Spark, Presto
Amazon EMR
12. Amazon EMR
• Ideal Usage Patterns
• Log processing and analytics
• Large ETL and data movement
• Risk modeling and threat analytics
• Ad targeting and click stream analytics
• Genomics
• Predictive analytics
• Ad-hoc data mining and analytics
• Performance – driven by
• Type of instance
• Number of instances
• Cost model
• Only pay for hours the cluster is up
• EC2 instance and EMR price
13. Amazon EMR
• Scalability & Elasticity
• Resize a running cluster
• Add more core or task nodes
• Durability and Availability
• Fault tolerant for slave (core) node failures via HDFS replication
• Backup to S3 for resilience against master node failures
• Interfaces
• Hive, Pig, Spark, HBase, Impala, Hunk, Presto, and other popular tools
• Anti-patterns
• Small data sets
• ACID (Atomicity, Consistency, Isolation, Durability) transaction requirements
Amazon EMR Cluster
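Launching a cluster comes down to one `RunJobFlow` API call. A minimal sketch via boto3 follows; the cluster name, instance types, node count, log bucket, and release label are all illustrative assumptions:

```python
def build_emr_cluster_config(name, master_type, core_type, core_count, log_uri):
    """Build the request for EMR's RunJobFlow API: one master instance
    plus a resizable group of core nodes running Spark and Hive."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.2.0",  # illustrative release
        "LogUri": log_uri,
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": master_type,
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": core_type,
                 "InstanceCount": core_count},
            ],
        },
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
    }

cfg = build_emr_cluster_config("log-processing", "m3.xlarge", "m3.xlarge",
                               4, "s3://my-bucket/emr-logs/")

# With AWS credentials configured:
# import boto3
# boto3.client("emr").run_job_flow(**cfg)
```

Resizing later is the same idea: modify the core or task instance-group count on the running cluster.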
14. Fully managed NoSQL database
Single-digit millisecond latency at scale
Supports document and key-value models
Amazon DynamoDB
15. Amazon DynamoDB
• Ideal Usage Patterns
• Mobile apps, gaming, digital ad serving, live
voting, sensor networks, log ingestion
• Access control for web-based content, e-commerce shopping carts
• Web session management
• Performance
• SSD
• Provision throughput by table
• Scalability & Elasticity
• No limit to the amount of data stored
• Dial-up or dial-down the read and write capacity
of a table
• Cost Model
• Pay as you go
• Provisioned throughput capacity (per hour)
• Indexed data storage (per GB per month)
• Data transfer in or out (per GB per month)
Provisioned read/write performance per table.
Predictable high performance scaled via console or API
16. Amazon DynamoDB
• Durability and Availability
• Three Availability Zones (AZ)
• Interfaces
• AWS Management Console
• APIs
• SDKs
• Anti-patterns
• Application tied to traditional relational
database
• Joins and/or complex transactions
• BLOB data
• Large data with low I/O rate
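A web-session write, one of the ideal usage patterns above, can be sketched with the low-level `PutItem` API; the table name, key, and attribute names are hypothetical:

```python
def build_put_item(table_name, session_id, user_id, ttl_epoch):
    """Build a DynamoDB `PutItem` request for a web-session record,
    using the low-level attribute-value format (S = string, N = number)."""
    return {
        "TableName": table_name,
        "Item": {
            "session_id": {"S": session_id},   # partition key
            "user_id": {"S": user_id},
            "expires_at": {"N": str(ttl_epoch)},
        },
    }

req = build_put_item("sessions", "sess-123", "user-42", 1700000000)

# With AWS credentials configured:
# import boto3
# boto3.client("dynamodb").put_item(**req)
```

Reads and writes against the table consume the provisioned capacity you dial up or down per table.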
17. Managed service designed to make it easy for developers of all levels to use machine learning
Based on the same ML technology used for years by Amazon’s internal data scientists
Amazon Machine Learning uses scalable and robust implementations of industry-standard ML algorithms
Amazon Machine Learning
18. Amazon Machine Learning
• Ideal Usage Patterns
• Enable applications that flag suspicious
transactions
• Personalize application content
• Predict user activity
• Listen to social media
• Cost Model
• Pay for what you use
• No need to manage instances, only pay for
the service
• Performance
• Real-time predictions designed to return within 100 ms
• 200 transactions per second by default (limit can be raised)
19. Amazon Machine Learning
• Durability and Availability
• No maintenance windows or scheduled
downtimes
• Designed across multiple availability
zones
• Scalability & Elasticity
• Model training up to 100GB
• Multiple ML jobs can run simultaneously
• Interfaces
• Create a data source from S3, RDS and
Redshift
• Interact with ML via console, SDKs, and
the ML API
• Anti-patterns
• Massive datasets for modeling (> 100 GB)
• Sequence prediction or unsupervised clustering tasks
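A real-time prediction can be sketched against the Amazon ML `Predict` API, which takes a model ID, the model’s real-time endpoint, and a record of string feature values. The model ID, endpoint URL, and feature names below are hypothetical:

```python
def build_predict_request(model_id, endpoint_url, record):
    """Build the parameters for the Amazon ML real-time `Predict` call;
    `record` maps feature names to values, which the API expects as strings."""
    return {
        "MLModelId": model_id,
        "PredictEndpoint": endpoint_url,
        "Record": {k: str(v) for k, v in record.items()},
    }

req = build_predict_request(
    "ml-XXXXXXXXXXXX",  # hypothetical model id
    "https://realtime.machinelearning.us-east-1.amazonaws.com",
    {"amount": 249.99, "country": "DE"},
)

# With AWS credentials and a deployed real-time endpoint:
# import boto3
# boto3.client("machinelearning").predict(**req)
```

The response carries the predicted label or value, e.g. whether the transaction looks suspicious.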
20. Event driven, fully managed
compute
No Infrastructure to Manage
Automatic Scaling
AWS Lambda
21. AWS Lambda
• Ideal Usage Patterns
• Real-time file processing
• Extract, Transform, Load
• Performance
• Process events within
milliseconds
• Cost Model
• Pay for what you use
• No need to manage instances,
only pay for the service
• Lambda free tier includes 1M free
requests
1. Serverless  2. Event-driven scale  3. Subsecond billing
22. AWS Lambda
• Durability and Availability
• No maintenance windows or
scheduled downtime
• Async functions are retried 3 times if
there is a failure
• Scalability & Elasticity
• Any number of concurrent functions can be run
• AWS Lambda will dynamically allocate
capacity to match the rate of incoming
events.
• Interfaces
• Lambda supports Java, Node.js, and
Python
• Trigger via event or schedule
• Anti-patterns
• Long running applications
• Stateful applications in Lambda
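A minimal real-time file-processing handler, the first ideal usage pattern above, might look like the sketch below. The event shape follows S3’s notification format; the bucket and key names are made up, and a real function would transform the object rather than just report it:

```python
def handler(event, context):
    """Minimal Lambda handler for an S3 put event: extract the bucket
    and key from each record (a real handler would fetch and process
    the object here)."""
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        results.append({"bucket": s3["bucket"]["name"],
                        "key": s3["object"]["key"]})
    return results

# Invoking locally with a fake S3 notification event:
event = {"Records": [{"s3": {"bucket": {"name": "my-bucket"},
                             "object": {"key": "logs/2016/01/28/app.log"}}}]}
result = handler(event, None)
```

Because Lambda scales by running more concurrent copies of the handler, the function must stay stateless, which is exactly why stateful applications appear in the anti-patterns.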
23. Set up an Elasticsearch cluster in minutes
Integrated with Logstash and Kibana
Scale an Elasticsearch cluster seamlessly
Amazon Elasticsearch Service
24. Amazon Elasticsearch
• Ideal Usage Patterns
• Analyze logs
• Analyze data stream updates from other AWS
services
• Provide customers a rich search and navigation
experience
• Usage monitoring for mobile applications
• Performance
• Depends on multiple factors, including instance type, workload, index, number of shards used, and read replicas
• Storage configurations: instance storage or EBS storage
• Cost Model
• Pay as you go
• Only pay for compute and storage
25. Amazon Elasticsearch
• Durability and Availability
• Zone Awareness
• Automatic and Manual snapshots
• Scalability & Elasticity
• Add or remove instances
• Modify EBS volumes for data growth
• Interfaces
• AWS Management Console
• API’s
• SDK’s
• Kibana and Logstash (ELK Stack)
• Anti-patterns
• OLTP
• Workloads requiring more than 5 TB of storage
Elasticsearch + Logstash + Kibana = real-time analytics & visualization
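A log-analytics query against the cluster can be sketched as an Elasticsearch search body using the standard query DSL; the index name, field names, and domain endpoint below are hypothetical:

```python
def build_log_query(level, minutes):
    """Build an Elasticsearch search body: match a log level and
    filter to entries from the last `minutes` minutes."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"level": level}}],
                "filter": [{"range": {"@timestamp": {"gte": f"now-{minutes}m"}}}],
            }
        },
        "size": 50,
    }

body = build_log_query("ERROR", 15)

# Against the domain's HTTPS endpoint (request signing/auth omitted):
# import requests
# requests.get("https://search-mydomain.us-east-1.es.amazonaws.com"
#              "/logs/_search", json=body)
```

Kibana issues the same kind of queries under the hood when rendering dashboards over the log indices.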
26. Build visualizations
Perform ad-hoc analysis
Share and collaborate via storyboards
Native access on major mobile platforms
Amazon QuickSight
27. Introducing Amazon QuickSight
Cloud-Powered Business Intelligence Service for 1/10th the Cost of Traditional BI Software
No IT effort. No dimensional modeling
Auto-discovery of all AWS data sources
Super-fast, Parallel, In-memory Calculation Engine (SPICE)
Fully Managed
Available in Preview
aws.amazon.com/quicksight
28. Scale up or down as needed
Pay for what you use
Multiple options
Do-it-yourself big data applications
Amazon EC2
29. The AWS Approach
• Flexible: use the best tool for the job, driven by data structure, latency, throughput, and access patterns
• Scalable: immutable (append-only) data with batch/speed/serving layers
• Minimal admin overhead: leverage AWS managed services (no or very low admin)
• Low cost: big data ≠ big cost
30. Scenario 1: Enterprise Data Warehouse
Scenario 2: Capturing and Analyzing Sensor Data
Scenario 3: Sentiment Analysis of Social Media
Big Data
Scenarios
31. Scenario 1: Enterprise Data Warehouse
Data Warehouse Architecture:
Data Sources → Amazon S3 → Amazon EMR → Amazon S3 → Amazon Redshift → Amazon QuickSight
33. Scenario 3: Sentiment Analysis of Social Media
(Diagram: sentiment-analysis pipeline combining social media data with Amazon EC2, Amazon Kinesis, AWS Lambda, Amazon ML, Amazon S3, and Amazon SNS.)
34. Next Steps
• Subscribe to the AWS Big Data Blog
blogs.aws.amazon.com/bigdata
• Learn more, check the tutorials, guides, and self-paced labs
aws.amazon.com/big-data
• Register for the next Big Data Webinar
Building Smart Applications with Amazon Machine Learning
aws.amazon.com/about-aws/events/monthlywebinarseries
Thu, Jan 28 2016 | 9AM PST
Editor's Notes
Follow Up Email
Amazon
https://www.youtube.com/watch?v=P4KPPvEb_QI
Generates weblogs @ 2TB/day, growing 67% YoY
Oracle RAC legacy system
Scan rate: 1 week of data/hour
Hit RAC node limit of 32 nodes
More data => Slower queries
Migrated to Redshift
Scan rate: 15 months of data (2.25 trillion rows) in 14 min
Scaled to a 101 node DS1.8XL cluster – Petabytes
More than 10X performance
21B rows joined with 10B rows in under 2 hours, down from days
Security; HasOffers loads 60M rows per day in 2-minute intervals; Desk runs a high-concurrency user-facing portal (read/write cluster); Amazon.com/NTT operate at PB scale. Pinterest saw 50-100x speed-ups when it moved 300TB from Hadoop to Redshift. Nokia saw a 50% reduction in costs.
https://www.youtube.com/watch?v=O4wAH5FQjS8
30 Million Ad opportunities per month.
Yelp uses Amazon S3 to store daily logs and photos, generating around 1.2TB of logs per day. The company also uses Amazon EMR to power approximately 20 separate batch scripts, most of them processing the logs. Several Yelp features are powered by Amazon Elastic MapReduce.
Yelp developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows. Yelp runs approximately 250 Amazon Elastic MapReduce jobs per day, processing 30TB of data and is grateful for AWS Support that helped with their Hadoop application development.
Dropcam - Dropcam runs video streaming and storage servers on Amazon EC2 and Amazon S3, and uses Amazon DynamoDB to scale and maintain throughput. “DynamoDB grows with the number of cameras that are connected to the service,” says Nelson. “Throughput is very steady as cameras come online. By using DynamoDB, we reduced delivery time for video events to less than 50 milliseconds,” says Nelson.
Build Fax - Uses Amazon Machine Learning to provide roof-age and job-cost estimations for insurers and builders, with property-specific values that don’t need to rely on broad, ZIP code-level estimates. Models that previously took six months or longer to create are now complete in four weeks or less. Creates opportunities for new data analytics services that BuildFax can offer to customers, such as text analysis in Amazon ML to estimate job costs with 80 percent accuracy.
VidRoll - AWS Lambda enables NoOps, allowing us to start and stay at scale without having to worry about infrastructure. As an exponential organization, it is critical that our developers focus on innovation. Lambda frees us from ever having to code for issues like concurrency, distributed file systems and other ‘success problems’ that typically present themselves when systems need to scale. We save time and money with Lambda.
Amazon Elasticsearch service allows you to easily and securely deploy and scale an ELK stack in minutes. Integration with Logstash is tightly coupled and a Kibana instance is automatically configured for you. The service automatically detects and replaces failed Elasticsearch nodes, reducing the overhead associated with self-managed infrastructure and Elasticsearch software.
https://aws.amazon.com/solutions/case-studies/major-league-baseball-mlbam/
Major League Baseball Advanced Media, L.P., which operates MLB.com, uses Elasticsearch extensively on its advanced game day statistics application. “Elasticsearch allows us to easily and quickly build bleeding edge big data and analytics applications using the ELK stack,” said Sean Curtis, Architect at MLB.com. “By offering direct access to the Elasticsearch API while offloading administrative tasks, Amazon Elasticsearch Service gives us the manageability, flexibility and control we need.”
Before we go into the big data architecture, I want to introduce some tried-and-tested architecture principles.
Here at AWS we believe you should use the right tool for the job: instead of using a big Swiss Army knife to drive a screw, it is best to use a screwdriver. This is especially important for big data architectures. We’ll talk about this more.
Decoupled architecture http://whatis.techtarget.com/definition/decoupled-architecture - In general, a decoupled architecture is a framework for complex work that allows components to remain completely autonomous and unaware of each other. This approach has been tried and battle-tested.
Managed services – this is relatively new. Should I install Cassandra, MongoDB, or CouchDB on AWS? You obviously can, and sometimes there are good reasons for doing this; many customers still do. Netflix is a great example: they run a multi-region Cassandra deployment and are a poster child for how to do this. But for most customers, delegating this task to AWS makes more sense. You are better off spending your time building features for your customers rather than building highly scalable distributed systems.
Lambda Architecture -