AWS January 2016 Webinar Series - Getting Started with Big Data on AWS


With hundreds of new and sometimes disparate tools, it’s hard to keep pace. Amazon Web Services provides a broad and fully integrated portfolio of cloud computing services to help you build, secure and deploy your big data applications.

Attend this webinar to get an overview of the different big data options available in the AWS Cloud – including popular big data frameworks such as Hadoop, Spark, NoSQL databases, and more. Learn about ideal use cases, cases to avoid, performance, interfaces, and more. Finally, learn how you can build valuable applications with a real-life example.

Learning Objectives:
  • Learn about big data tools available at AWS
  • Understand ideal use cases
  • Learn some of the key considerations, such as performance, scalability, elasticity, and availability, when selecting big data tools

Who Should Attend:
  • Data Architects, Data Scientists, Developers

Published in: Technology

  1. 1. Getting Started with Big Data Analytic Options on AWS & Common Use Cases. Erik Swensson, Shree Kenghe & Erick Dame. January 26, 2016. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. Table of Contents • Big Data Introduction for AWS • Big Data Analytics Options on AWS • Usage Patterns & Anti-Patterns • Performance & Cost • Durability & Scalability • Interfaces • Building Big Data Analytic Solutions – The AWS Approach • Example Scenarios
  3. 3. Big Data on AWS • Immediate Availability: Deploy instantly. No hardware to procure, no infrastructure to maintain & scale. • Trusted & Secure: Designed to meet the strictest requirements. Continuously audited, including certifications such as ISO 27001, FedRAMP, DoD CSM, and PCI DSS. • Broad & Deep Capabilities: Over 50 services and 100s of features to support virtually any big data application & workload. • Hundreds of Partners & Solutions: Get help from a consulting partner or choose from hundreds of tools and applications across the entire data management stack.
  4. 4. Broad, Tightly Integrated Capabilities • Stages: Collect, Store, Process & Analyze, Visualize • Real-time: Amazon Kinesis Firehose, Amazon Kinesis Streams, Amazon Kinesis Analytics, AWS Lambda • Object Storage: Amazon S3 • RDBMS: Amazon RDS • NoSQL: DynamoDB • Hadoop Ecosystem: Amazon EMR • Data Warehousing: Amazon Redshift • Machine Learning: Amazon Machine Learning • Elasticsearch Analytics: Amazon Elasticsearch • Business Intelligence & Data Visualization: Amazon QuickSight • Data Import: Amazon Import/Export Snowball • IoT: Amazon IoT
  5. 5. Amazon Redshift • Petabyte scale • Massively parallel • Relational data warehouse • Fully managed, zero admin • As low as $1,000/TB/Year • A lot faster, a lot cheaper, a whole lot simpler
  6. 6. Amazon Redshift • Ideal Usage Patterns - Analyze • Sales data • Historical data • Gaming data • Social trends • Ad data • Performance • Massively Parallel Processing • Columnar Storage • Data Compression • Zone Maps • Direct-attached Storage • Cost model • No upfront costs or long term commitments • Free backup storage equivalent to 100% of provisioned storage. With columnar storage, you only read the data you need.
  7. 7. Amazon Redshift • Scalability & Elasticity • Resize or scale - Number or type of nodes can be changed with a few clicks • Durability and Availability • Replication • Backup • Automated recovery from failed drives & nodes • Interfaces • JDBC/ODBC interface with BI/ETL tools • Amazon S3 or DynamoDB • Anti-patterns • Small datasets • OLTP • Unstructured Data • Blob Data [Diagram labels: 10 GigE (HPC), Ingestion, Backup, Restore, JDBC/ODBC]
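
The Redshift slides note that with columnar storage you only read the data you need, and list zone maps among the performance features. A minimal Python sketch of both ideas follows. It is a toy in-memory model, not a description of Redshift internals; the table, column names, and block size are hypothetical.

```python
# Toy model of row vs. columnar layouts (illustrative only; not Redshift internals).
# The table and its column names are made up for this example.
rows = [
    {"order_id": i, "region": "EU" if i % 2 else "US", "amount": float(i)}
    for i in range(1, 9)
]

def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [r[name] for r in rows] for name in rows[0]}

def sum_row_store(rows, column):
    """Row layout: every field of every row is touched to sum one column."""
    values_read = sum(len(r) for r in rows)
    return sum(r[column] for r in rows), values_read

def sum_column_store(columns, column):
    """Columnar layout: only the requested column is touched."""
    data = columns[column]
    return sum(data), len(data)

def zone_map(values, block_size):
    """Keep (min, max) per block so a range filter can skip whole blocks."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b), b) for b in blocks]

def scan_greater_than(zmap, threshold):
    """Read only blocks whose max exceeds the threshold."""
    matches, blocks_read = [], 0
    for _low, high, block in zmap:
        if high <= threshold:
            continue  # the zone map proves no value in this block qualifies
        blocks_read += 1
        matches.extend(v for v in block if v > threshold)
    return matches, blocks_read

columns = to_columns(rows)
row_total, row_reads = sum_row_store(rows, "amount")
col_total, col_reads = sum_column_store(columns, "amount")
zmap = zone_map(columns["amount"], block_size=4)
big_orders, blocks_read = scan_greater_than(zmap, 6.0)
```

Both layouts return the same total, but the columnar version touches a third of the values for this three-column table, and the zone map lets the filter skip one of the two blocks entirely.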
  8. 8. Amazon Kinesis • Ingest streaming data • Process data in real-time • Store terabytes of data per hour
  9. 9. Amazon Kinesis Streams • Ideal Usage Patterns – Streaming data ingestion and processing • Real-time data analytics • Data feed intake and processing e.g. logs • Real-time metrics and reporting • Performance • Throughput capacity in terms of shards • Cost model • No upfront costs or long term commitments • Pay as you go pricing • Hourly charge per shard • Charge for 1 million PUT transactions
  10. 10. Amazon Kinesis Streams • Scalability & Elasticity • Scale – increase number of shards • Durability and Availability • Replication • Cursor preservation • Interfaces • Input – data coming in • Output – data going out • Kinesis Firehose • Anti-patterns • Small scale consistent throughput • Long term data storage and analytics
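
The Kinesis Streams slides describe throughput capacity in terms of shards and a cost model of an hourly charge per shard plus a charge per 1 million PUT transactions. The sketch below turns that into arithmetic. The per-shard limits used as defaults (1 MB/s and 1,000 records/s of ingest) and all prices are assumptions that do not come from the slides; check the current service limits and pricing before relying on them.

```python
import math

def shards_needed(ingest_mb_per_s, records_per_s,
                  shard_mb_per_s=1.0, shard_records_per_s=1000):
    """Shards required so neither the byte limit nor the record limit is exceeded.

    The per-shard limits are assumptions; pass the real values for your account.
    """
    by_bytes = math.ceil(ingest_mb_per_s / shard_mb_per_s)
    by_records = math.ceil(records_per_s / shard_records_per_s)
    return max(1, by_bytes, by_records)

def monthly_cost(shards, puts_per_month,
                 price_per_shard_hour, price_per_million_puts, hours=730):
    """Hourly charge per shard plus a charge per million PUT transactions."""
    shard_cost = shards * hours * price_per_shard_hour
    put_cost = (puts_per_month / 1_000_000) * price_per_million_puts
    return shard_cost + put_cost

# Hypothetical workload: 3.5 MB/s and 2,500 records/s of log data.
shards = shards_needed(3.5, 2500)

# Hypothetical prices, chosen only to make the arithmetic easy to follow.
cost = monthly_cost(shards, puts_per_month=100_000_000,
                    price_per_shard_hour=0.02, price_per_million_puts=0.01)
```

Scaling, as the slide puts it, means increasing the number of shards, so rerunning `shards_needed` with the new ingest rate gives the new target under the same assumptions.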
  11. 11. Amazon EMR • Launch a cluster in minutes • Pay by the hour and save with spot • MapReduce, Apache Spark, Presto
  12. 12. Amazon EMR • Ideal Usage Patterns • Log processing and analytics • Large ETL and data movement • Risk modeling and threat analytics • Ad targeting and click stream analytics • Genomics • Predictive analytics • Ad-hoc data mining and analytics • Performance – driven by • Type of instance • Number of instances • Cost model • Only pay for hours the cluster is up • EC2 instance and EMR price
  13. 13. Amazon EMR • Scalability & Elasticity • Resize a running cluster • Add more core or task nodes • Durability and Availability • Fault tolerant for slave node (HDFS) • Backup to S3 for resilience against master node failures • Interfaces • Hive, Pig, Spark, HBase, Impala, Hunk, Presto, other popular tools • Anti-patterns • Small data sets • ACID (Atomicity, Consistency, Isolation and Durability) [Diagram: Amazon EMR clusters]
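
Log processing is the first usage pattern listed for Amazon EMR, and MapReduce is one of the frameworks it runs. As a rough illustration of the map, shuffle, and reduce phases, here is a single-process Python sketch that counts HTTP status codes. It only models the programming pattern; it does not use Hadoop or any EMR API, and the log format is hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (status_code, 1) for one access-log line; malformed lines emit nothing."""
    parts = line.split()
    if len(parts) >= 3:
        yield parts[2], 1

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

# Hypothetical log lines in "METHOD PATH STATUS" form.
logs = ["GET /index.html 200", "GET /missing 404", "POST /login 200", "malformed"]
status_counts = run_job(logs)
```

On a real cluster the map and reduce functions run in parallel across nodes, which is why the slides tie performance to the type and number of instances.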
  14. 14. Amazon DynamoDB • Fully managed NoSQL database • Single-digit millisecond latency at scale • Supports document and key-value
  15. 15. Amazon DynamoDB • Ideal Usage Patterns • Mobile apps, gaming, digital ad serving, live voting, sensor networks, log ingestion • Access control for web-based content, e-commerce shopping carts • Web session management • Performance • SSD • Provision throughput by table • Scalability & Elasticity • No limit to the amount of data stored • Dial-up or dial-down the read and write capacity of a table • Cost Model • Pay as you go • Provisioned throughput capacity (per hour) • Indexed data storage (per GB per month) • Data transfer in or out (per GB per month) • Provisioned read/write performance per table. • Predictable high performance scaled via console or API
  16. 16. Amazon DynamoDB • Durability and Availability • Three Availability Zones (AZ) • Interfaces • AWS Management Console • APIs • SDKs • Anti-patterns • Application tied to traditional relational database • Joins and/or complex transactions • BLOB data • Large data with low I/O rate [Diagram labels: AZ-A, AZ-B, AZ-C]
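
The DynamoDB slides say throughput is provisioned per table and can be dialed up or down. A small sketch of how one might size that throughput is shown below. The unit sizes (one read unit per 4 KB strongly consistent read per second, one write unit per 1 KB write per second, eventually consistent reads at half cost) are assumptions not stated in the slides, so they are exposed as parameters; verify them against the current documentation.

```python
import math

def read_capacity_units(reads_per_s, item_kb, strongly_consistent=True, unit_kb=4):
    """Read units for a table, rounding each item up to whole units.

    unit_kb and the half-cost rule for eventually consistent reads are assumptions.
    """
    units_per_read = math.ceil(item_kb / unit_kb)
    total = reads_per_s * units_per_read
    if not strongly_consistent:
        total = total / 2
    return math.ceil(total)

def write_capacity_units(writes_per_s, item_kb, unit_kb=1):
    """Write units for a table, rounding each item up to whole units."""
    return writes_per_s * math.ceil(item_kb / unit_kb)

# Hypothetical workload: 500 reads/s of 6 KB items, 100 writes/s of 2.5 KB items.
rcu_strong = read_capacity_units(500, 6)
rcu_eventual = read_capacity_units(500, 6, strongly_consistent=False)
wcu = write_capacity_units(100, 2.5)
```

Because items are rounded up to whole units, large items are costly per request, which is consistent with the slide listing BLOB data as an anti-pattern.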
  17. 17. Amazon Machine Learning • Managed service designed to make it easy for developers of all levels to use machine learning • Based on the same ML technology used for years by Amazon’s internal data scientists • Amazon Machine Learning uses scalable and robust implementations of industry-standard ML algorithms
  18. 18. Amazon Machine Learning • Ideal Usage Patterns • Enable applications that flag suspicious transactions • Personalize application content • Predict user activity • Listen to social media • Cost Model • Pay for what you use • No need to manage instances, only pay for the service • Performance • Real-time predictions designed to return within 100ms • 200 transactions can be handled per second by default (can be raised)
  19. 19. Amazon Machine Learning • Durability and Availability • No maintenance windows or scheduled downtimes • Designed across multiple availability zones • Scalability & Elasticity • Model training up to 100GB • Multiple ML jobs can run simultaneously • Interfaces • Create a data source from S3, RDS and Redshift • Interact with ML via console, SDKs, and the ML API • Anti-patterns • Massive Data Sets for modeling > 100GB • Sequence prediction or unsupervised clustering task
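
The Amazon Machine Learning slides state that 200 real-time prediction transactions per second are handled by default. One way a client could stay under such a limit is a token bucket, sketched below. This is a generic client-side technique, not part of the Amazon ML API; the class name and the explicit `now` argument (used instead of a wall clock so the behavior is deterministic) are choices made for this example.

```python
class TokenBucket:
    """Client-side rate limiter: at most `rate_per_s` requests per second on average."""

    def __init__(self, rate_per_s, capacity, now=0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def try_acquire(self, now):
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Limit taken from the slide's default of 200 transactions per second.
bucket = TokenBucket(rate_per_s=200, capacity=200)

# 250 requests arriving at the same instant: only the first 200 pass.
burst_allowed = sum(bucket.try_acquire(now=0.0) for _ in range(250))

# Half a second later, half of the bucket has refilled.
later_allowed = sum(bucket.try_acquire(now=0.5) for _ in range(250))
```

In a real client, `now` would come from a monotonic clock, and rejected requests would be queued or retried rather than dropped.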
  20. 20. AWS Lambda • Event driven, fully managed compute • No infrastructure to manage • Automatic scaling
  21. 21. AWS Lambda • Ideal Usage Patterns • Real-time file processing • Extract, Transform, Load • Performance • Process events within milliseconds • Cost Model • Pay for what you use • No need to manage instances, only pay for the service • Lambda free tier includes 1M free requests [Diagram labels: Serverless, Event-Driven Scale, Subsecond Billing]
  22. 22. AWS Lambda • Durability and Availability • No maintenance windows or scheduled downtime • Async functions are retried 3 times if there is a failure • Scalability & Elasticity • Any number of concurrent functions can be run • AWS Lambda will dynamically allocate capacity to match the rate of incoming events. • Interfaces • Lambda supports Java, Node.js, and Python • Trigger via event or schedule • Anti-patterns • Long running applications • Stateful applications in Lambda
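
The Lambda slides list real-time file processing as an ideal usage pattern, Python as a supported language, and stateful applications as an anti-pattern. A minimal sketch of a stateless Python handler for an S3 object-created event is shown below. The handler name, bucket name, and object key are hypothetical; the event is a hand-built stand-in that includes only the fields the handler reads.

```python
import urllib.parse

def handler(event, context):
    """Function invoked once per event; keeps no state between invocations."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Real processing (parse, transform, load) would go here.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}

# Hand-built sample event with a hypothetical bucket and key.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "logs/2016-01-26+app.log"}}}
    ]
}
result = handler(sample_event, None)
```

Since the slides say asynchronous invocations are retried on failure, a handler like this should be safe to run more than once for the same object.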
  23. 23. Amazon Elasticsearch Service • Set up an Elasticsearch cluster in minutes • Integrated with Logstash and Kibana • Scale Elasticsearch cluster seamlessly
  24. 24. Amazon Elasticsearch • Ideal Usage Patterns • Analyze logs • Analyze data stream updates from other AWS services • Provide customers a rich search and navigation experience • Usage monitoring for mobile applications • Performance • Depends on multiple factors including instance type, workload, index, number of shards used, read replicas • Storage configurations – instance storage or EBS storage • Cost Model • Pay as you go • Only pay for compute and storage
  25. 25. Amazon Elasticsearch • Durability and Availability • Zone Awareness • Automatic and Manual snapshots • Scalability & Elasticity • Add or remove instances • Modify EBS volumes for data growth • Interfaces • AWS Management Console • APIs • SDKs • Kibana and Logstash (ELK Stack) • Anti-patterns • OLTP • Workloads requiring more than 5 TB of storage. Elasticsearch + Logstash + Kibana = real-time analytics & visualization
  26. 26. Amazon QuickSight • Build visualizations • Perform ad-hoc analysis • Share and collaborate via storyboards • Native access on major mobile platforms
  27. 27. Introducing Amazon QuickSight: Cloud-Powered Business Intelligence Service For 1/10th the Cost of Traditional BI Software • No IT effort. No dimensional modeling • Auto-discovery of all AWS data sources • Super-fast, Parallel, In-memory Calculation Engine (SPICE) • Fully Managed • Available in Preview
  28. 28. Amazon EC2 • Scale up or down as needed • Pay for what you use • Multiple options • Do-it-yourself big data applications
  29. 29. The AWS Approach • Flexible: Use the best tool for the job • Data structure, latency, throughput, access patterns • Scalable: Immutable (append-only) • Batch/speed/serving layer • Minimum Admin Overhead: Leverage AWS managed services • No or very low admin • Low Cost: Big data ≠ big cost
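
The AWS Approach slide pairs immutable, append-only data with batch, speed, and serving layers. A minimal single-process sketch of that arrangement follows. The class and function names and the sensor events are hypothetical; in practice each layer would be backed by one of the services described earlier rather than an in-memory list.

```python
from collections import Counter

class AppendOnlyLog:
    """Immutable event store: events can be appended and read, never changed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read(self, start=0, end=None):
        # Return a tuple so callers cannot modify stored events in place.
        return tuple(self._events[start:end])

    def __len__(self):
        return len(self._events)

def build_batch_view(log):
    """Batch layer: recompute counts over everything seen so far."""
    offset = len(log)
    return Counter(e["sensor"] for e in log.read(0, offset)), offset

def build_speed_view(log, batch_offset):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(e["sensor"] for e in log.read(batch_offset))

def serve(batch_counts, speed_counts, sensor):
    """Serving layer: merge the batch and speed views to answer a query."""
    return batch_counts[sensor] + speed_counts[sensor]

log = AppendOnlyLog()
for sensor in ["a", "b", "a"]:
    log.append({"sensor": sensor})

batch_counts, batch_offset = build_batch_view(log)

# A new event arrives after the batch view was built.
log.append({"sensor": "a"})
speed_counts = build_speed_view(log, batch_offset)

total_a = serve(batch_counts, speed_counts, "a")
total_b = serve(batch_counts, speed_counts, "b")
```

Because the log is never modified, the batch view can always be rebuilt from scratch, and the speed view only has to cover the gap since the last batch run.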
  30. 30. Big Data Scenarios • Scenario 1: Enterprise Data Warehouse • Scenario 2: Capturing and Analyzing Sensor Data • Scenario 3: Sentiment Analysis of Social Media
  31. 31. Scenario 1: Enterprise Data Warehouse [Data Warehouse Architecture diagram. Labels as listed: Data Sources, Amazon S3, Amazon EMR, Amazon S3, Amazon Redshift, Amazon QuickSight]
  32. 32. Scenario 2: Capturing and Analyzing Sensor Data [Architecture diagram with steps 1 to 9. Labels as listed: Data Sources, Amazon S3, Amazon Redshift, Amazon QuickSight, Amazon Kinesis Enabled App (twice), Amazon DynamoDB, Reporting Dashboard, Customer Access, Amazon Kinesis]
  33. 33. Scenario 3: Sentiment Analysis of Social Media [Architecture diagram with steps 1 to 7. Labels as listed: Social Media Data, Amazon EC2, AWS Lambda, Amazon ML, Amazon Kinesis, Amazon S3, Amazon SNS]
  34. 34. Next Steps • Subscribe to the AWS Big Data Blog • Learn more: check the tutorials, guides, and self-paced labs • Register for the next Big Data Webinar: Building Smart Applications with Amazon Machine Learning, Thu, Jan 28 2016 | 9AM PST