This document discusses how big data and machine learning can be combined using Amazon Web Services (AWS). It covers common big data challenges around which tools to use, what data is available, and how to get started. It then demonstrates how to populate and query a data catalog on AWS to understand available data. Finally, it shows how machine learning can be driven by big data to generate better insights and products using agile AWS services.
10. Data Lake Components
A data lake on AWS is built from six components, which the diagram groups across metadata, user access, security/governance, data movement, and analytics and machine learning:
Data Ingestion – get your data into S3 quickly and securely (Kinesis Firehose, Snowball, Database Migration Service, Direct Connect)
Central Storage – secure, cost-effective storage in Amazon S3
Catalog & Search – access and search metadata (DynamoDB, Elasticsearch)
Access & User Interface – give your users easy and secure access (API Gateway, Identity & Access Management, Cognito, QuickSight)
Processing & Analytics – use predictive and prescriptive analytics to gain better understanding (Amazon AI, EMR, Redshift, Athena, Kinesis, RDS)
Protect and Secure – use entitlements to ensure data is secure and users' identities are verified (Security Token Service, CloudWatch, CloudTrail, Key Management Service)
11. Data Lake Components
The same architecture as the previous slide, now with AWS Glue added to provide ETL.
35. Machine Learning requires new tools and interfaces
Machine learning/deep learning and business reporting both draw on the same data catalog and central storage, while data scientists and data engineers work through purpose-built interfaces such as an IDE and Amazon SageMaker.
We are at a big data and BI summit, so I think most folks are familiar with big data and some form of the Vs (3Vs, 5Vs, 7Vs), each representing some definition of a big data system.
You don’t have to take my word for it… reports on the growth of data are readily available most everywhere you look.
Top-Left – growth of unstructured data is vastly outpacing structured data
Top-Right – the amount of data will grow 50x between 2010 and 2020
Bottom-Left – We already have PB/day customers. We’re trending towards EB and ZB data sets
Bottom-Right – Data from sensors/connected-devices and social media are now described in multiples of the global population
Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. These leaders were able to do new types of analytics, like machine learning, over new sources such as log files, click-stream data, social media, and internet-connected devices stored in the data lake. This helped them identify and act upon opportunities for business growth faster: attracting and retaining customers, boosting productivity, proactively maintaining devices, and making informed decisions.
Lock, Michael (Aberdeen), Angling for Insight in Today's Data Lake, October 2017, p. 7.
Amazon S3 provides object storage built to store and retrieve any amount of data.
S3 has unmatched durability and availability, built from the ground up to deliver 99.999999999% durability at exabyte scale. Only S3 automatically replicates your data across three Availability Zones within a single region, giving you unmatched resilience to single-data-center issues like power failures. Only S3 lets you do cross-region replication seamlessly, without a separate storage class, to any number of specified destination regions.
Amazon S3 has the best security, compliance, and audit capabilities of any storage service. It can automatically encrypt your data and gives you three choices for key management: S3-managed keys, customer-provided keys, and AWS Key Management Service (KMS). Only S3 gives you encryption when replicating data across regions, and lets you use separate accounts for the source and destination regions, protecting against malicious insider deletion of data. Only S3 integrates with an AI-powered security service, Amazon Macie, to monitor for, detect, and alert on anomalies that might indicate the early stages of an attack. To meet compliance regulations, you can log and audit all account activity, including how, when, and by whom objects in S3 are accessed, through AWS CloudTrail. These features allow AWS to support security standards and compliance certifications for virtually every regulatory agency around the globe.
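To make the encryption options concrete, here is a minimal boto3 sketch of an SSE-KMS upload; the bucket name, object key, and KMS key alias are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload an object encrypted with an AWS KMS-managed key (SSE-KMS).
# "my-data-lake" and the key alias are hypothetical.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2017/10/events.json",
    Body=b'{"event": "page_view"}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-data-lake-key",
)
```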
Amazon S3 is the only storage service that lets you operate at the object level rather than the bucket level. This allows you to set fine-grained access controls and security policies to restrict access to specific objects, and to create lifecycle policies that automatically delete groups of objects or tier them into lower-cost storage.
Amazon S3 is the only storage system that can retrieve just the subset of data within an object that you need, with S3 Select, improving query performance by as much as 400 percent at lower cost.
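As an illustration of S3 Select, a minimal boto3 sketch; the bucket, key, and column names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to scan a CSV object server-side and return only matching rows.
resp = s3.select_object_content(
    Bucket="my-data-lake",
    Key="raw/trades/2017-10-01.csv",
    ExpressionType="SQL",
    Expression="SELECT s.symbol, s.price FROM S3Object s WHERE s.symbol = 'AMZN'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; Records events carry the result payload.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```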
AWS provides more ways to bring data into your data lake than anywhere else. These include importing real-time streaming data with Amazon Kinesis, establishing a dedicated network connection between your premises and AWS with AWS Direct Connect, using secure appliances to transfer large amounts of data with AWS Snowball, using a ruggedized shipping container to transfer data at exabyte scale with AWS Snowmobile, and migrating your databases with AWS Database Migration Service.
The Amazon S3 ecosystem has twice as many partner integrations as anyone else, with tens of thousands of consulting, systems integrator, and independent software vendor partners. This makes it easier to use S3 for primary storage, backup, archive, and disaster recovery with applications you already own, from vendors like NetApp, EMC, Veritas, and others.
Once you have started to build your data lake, AWS provides the broadest and most diverse set of options to analyze and extract value from your data, whether for analytics, machine learning, or IoT use cases. You get the tools and frameworks of your choice, with the broadest set of purpose-built services available, all running directly on the data lake without the need to move data into a separate analytics system.
Now that I know what data I have…
In the old world, you knew your schema, you got a BI tool, and you asked it questions based on the structure.
You knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model.
When you think about data in the context of the 3 Vs, you need different tooling, and you're going to want to ask questions of data that isn't structured.
In the new world of data analysis your questions are going to evolve and change over time.
You need to be able to collect, store and analyze data without being constrained by resources, whether compute, storage, or even the tool being used.
You want a purpose-built tool to derive the type of analysis – the type of insight – that you’re looking for.
With the rise of Big Data, the ecosystem is quite active, and the tools are rapidly changing…
You need the ability to evolve with the tools and your own needs.
Many customers spend time and effort in analysis to find the perfect tool for their needs.
At the rate the ecosystem is evolving, that tool might no longer be the best if you’ve spent so much time in research.
Our recommendation: find a tool that meets the need, then iterate on the tooling as you learn more about your actual needs.
In order to do that, you need to have a good metadata management process, portable data formats, and easy access to the data.
Otherwise, your data is in jail.
Many customers tell us about the pain they experience when their data is locked behind a vendor-specific format or a vendor-controlled interface.
Problem #1 – Many organizations don’t know what they have.
When you accumulate such a diversity of data, you need mechanisms to understand what data you have, where it is located, and in what format.
This is metadata management. And if not managed properly (or at all), the data is essentially lost. It is taking up space, but you have no means to put it to use.
A common issue, regardless of whether it is on-prem or in the cloud, is the lack of a metadata management approach from the onset.
The Financial Industry Regulatory Authority (FINRA) oversees more than 3,900 securities firms with approximately 640,000 brokers.
FINRA processes approximately 6 terabytes of data and 37 billion records on an average day to build a complete, holistic picture of market trading in the U.S. On busy days, the stock markets can generate 75 billion+ records.
The way they’re able to make all this data useful, whether to data scientists or business users or others, is through a metadata system they developed and open sourced, called HERD.
This is the same platform that is used by LinkedIn, for example.
But most organizations don't actually go off and build their own tooling.
Ivy Tech is a community college with 60,000 online and in-person course sections, 8,300 staff, 170,000 students, and 130 locations.
Ivy Tech uses metadata capabilities provided by AWS to manage their information.
These are the main components of Glue.
Glue comprises a data catalog, which is a central metadata repository; an ETL engine that can auto-generate Python code; and a flexible scheduler that handles dependency resolution, job monitoring, and retries. Together, these automate much of the heavy lifting involved in discovering, categorizing, cleaning, enriching, and moving data, so you can spend more time analyzing your data.
Glue automatically discovers your data, determines the schema, and builds your data catalog. The Glue data catalog provides out-of-the-box integration with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
The ETL code Glue generates is just Python code that is entirely customizable, reusable, and portable. You can edit this code using your favorite IDE or notebook and share it with others using GitHub.
And finally, Glue is serverless. There are no resources to manage and you only pay for the resources your jobs consume while they run.
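Since the generated ETL code is ordinary Python, a Glue job script looks roughly like the sketch below; it uses the awsglue library, and the database, table, and S3 path names are hypothetical.

```python
# A minimal sketch of a Glue ETL job in the style of the scripts Glue generates.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Glue Data Catalog.
events = glue_context.create_dynamic_frame.from_catalog(
    database="my_data_lake", table_name="raw_events"
)

# Write it back to S3 as Parquet for cheaper, faster analytics.
glue_context.write_dynamic_frame.from_options(
    frame=events,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/events/"},
    format="parquet",
)

job.commit()
```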
Glue includes a feature called crawlers, which discover metadata for the catalog automatically. Crawlers can operate over your relational databases and data warehouses as well as your data lakes on S3. When crawling a source such as S3, a crawler first identifies the format of the data (for example, CSV, JSON, Parquet, or Avro) and then determines the fields and the type of each field within the data.
It really does a great job, but you can also go in and modify the outputs. It can identify both Hive-compliant and non-compliant partitioning of data.
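Setting a crawler up is itself a small API call; a hedged boto3 sketch, where the crawler name, IAM role ARN, database, and S3 path are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Point a crawler at an S3 prefix; it infers format, schema, and partitions.
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_data_lake",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/events/"}]},
)
glue.start_crawler(Name="raw-events-crawler")
```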
OK – I’m jazzed… I know the pitfalls. Now… What do I do?
Netflix data pipeline
~500 billion events and ~1.3 PB per day
~8 million events and ~24 GB per second during peak demand
There are several hundred event streams flowing through the pipeline. For example:
Video viewing activities
UI activities
Error logs
Performance events
Troubleshooting & diagnostic events
We see Netflix started with batch analytics, collecting their data using Apache Chukwa and saving it in an S3-backed data lake.
After they built this, they needed to start doing real-time analytics on the data. They easily pushed the new version out, branching off and creating a Kafka-based backend.
To improve reliability and scale, they shifted from the Chukwa front end pushing to Kafka to having Kafka publish and route specific messages to the consumer Kafka topics.
They then shifted and built their log analytics on Kinesis Data Streams.
The pipeline built on Amazon Kinesis enables Netflix to identify ways to increase efficiency, reduce costs, and improve resiliency for the best customer experience.
Zillow Group increases machine-learning calculation performance and scalability and delivers near-real-time home-valuation data to customers using AWS. The company houses a portfolio of the largest online real-estate and home-related brands. Zillow Group runs the Zestimate, its machine learning–based home-valuation tool, on Amazon Kinesis and Apache Spark on Amazon EMR.
Zillow uses Kinesis Streams to collect public-record data and MLS listings, and then updates home-value estimates in near real time so home buyers and sellers get the most up-to-date figures. Zillow also sends the same data, both structured and unstructured (such as images), to its S3 data lake using Kinesis Firehose, so that all of its applications can work with the most recent information.
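A producer in that style is only a few lines; a hedged boto3 sketch with a hypothetical stream name and payload.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Push one record into a stream; consumers (and a Firehose delivery stream
# landing the data in S3) pick it up downstream.
kinesis.put_record(
    StreamName="listing-updates",
    Data=json.dumps({"listing_id": "12345", "price": 450000}).encode("utf-8"),
    PartitionKey="12345",
)
```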
DigitalGlobe went all in on AWS to meet the growing demand for commercial geo-intelligence, migrating its entire 17-year imagery archive to the cloud. DigitalGlobe is one of the world’s leading providers of high-resolution earth imagery, data, and analysis. The company used AWS Snowmobile to move 100 petabytes of data to the cloud, allowing it to move away from large file-transfer protocols and delivery workflows. DigitalGlobe also uses Amazon SageMaker to handle machine learning at scale. Dr. Walter Scott, CTO and founder at DigitalGlobe, spoke at re:Invent 2017.
Cache hit rate improved by more than a factor of two, going up to 83% and sometimes trending toward 90%.
Stripe uses Athena
Amazon.com uses DynamoDB and a suite of other serverless services in Herd.
Processing delays decreased from 1 second to 100 milliseconds.
Herd controls the business logic for processing all Amazon.com customer orders worldwide, orchestrating more than 1,300 workflows for everything from order processing to fulfillment-center operations to coordinating parts of the Amazon Alexa backend. A mission-critical system used by more than 300 Amazon engineering teams, Herd executes more than 4 billion workflows on peak days.
Requests from Alexa, the Amazon.com sites, and the Amazon fulfillment centers totaled 3.34 trillion, peaking at 12.9 million per second. According to the team, the extreme scale, consistent performance, and high availability of DynamoDB let them meet the needs of Prime Day without breaking a sweat.
DynamoDB is used by Lyft to store GPS locations for all their rides, Tinder to store millions of user profiles and make billions of matches, Redfin to scale to millions of users and manage data for hundreds of millions of properties, Comcast to power their XFINITY X1 video service running on more than 20 million devices, BMW to run its car-as-a-sensor service that can scale up and down by two orders of magnitude within 24 hours, Nordstrom for their recommendations engine reducing processing time from 20 minutes to a few seconds, Under Armour to support its connected fitness community of 200 million users, Toyota Racing to make real time decisions on pit-stops, tire changes, and race strategy, and another 100,000+ AWS customers for a wide variety of high-scale, high-performance use cases.
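For readers new to DynamoDB, the core access pattern behind all of these workloads is a fast put and get by primary key; a minimal boto3 sketch with a hypothetical table and item.

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# "rides" and its attributes are hypothetical placeholders.
table = dynamodb.Table("rides")

# Write and read a single item by its primary key.
table.put_item(Item={"ride_id": "r-1001", "lat": "47.6062", "lon": "-122.3321"})
item = table.get_item(Key={"ride_id": "r-1001"})["Item"]
print(item)
```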
With Athena, you simply put your data in S3 and submit SQL against it.
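For example, a hedged boto3 sketch of submitting a query and reading back the results; the database, table, and results bucket are hypothetical.

```python
import time

import boto3

athena = boto3.client("athena")

# Submit SQL directly against files in S3.
query = athena.start_query_execution(
    QueryString="SELECT symbol, avg(price) AS avg_price FROM trades GROUP BY symbol",
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then print each row of the result set.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

results = athena.get_query_results(QueryExecutionId=query_id)
for row in results["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```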
Why is AI/ML so often talked about side by side with data at conferences?
Data really fuels AI/ML. AI/ML is all about finding patterns in data and using those patterns to make predictions, recognize images, create speech, and provide other intelligent capabilities.
This in turn creates a flywheel effect: new intelligent capabilities increase the user base and customer usage, which creates more data, which allows organizations to better understand their users and drive analytics and new intelligent systems.
“By using Amazon SageMaker, DigitalGlobe's cache rate improved by more than a factor of two, often being around 83% and sometimes trending to 90% cache hit. This allowed them to also cut their cloud storage cost in half by better utilizing their S3-optimized cache and retrieving less from their 100+ PB archive.”
Purpose: showcase the power of ML to identify data utility
The blue dots represent what humans decided to cache (almost the whole world) and the orange dots represent what our customers requested access to over a three month period. We were missing the mark by a long shot.
http://blog.digitalglobe.com/industry/using-machine-learning-to-save-money-on-cloud-data-storage/
DigitalGlobe: two different use cases –
As the world’s leading provider of high-resolution Earth imagery, data and analysis, DigitalGlobe works with enormous amounts of data every day.
Use Case 1:
As more and more imagery is collected from their growing constellation of satellites, it is critical for DigitalGlobe to predict and cache only the most relevant imagery at any given point in time, allowing them to take advantage of AWS's tiered storage products to optimize their costs. They are relying on machine learning as the business grows. By analyzing 17 years of changing access patterns to this imagery data, they can predict how long to keep the data readily available in Amazon S3 before moving it to cold archive in Amazon Glacier, for example. With Amazon SageMaker's machine learning algorithms, they can identify and predict exactly what imagery is going to be used and requested, in real time, to drive down the cost of managing petabytes of data at scale. And the engineers using SageMaker to do this knew nothing about machine learning when they started!
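The mechanical half of that pattern, tiering objects to Glacier by lifecycle rule once the model says they have gone cold, can be sketched as below; the bucket, prefix, and 90-day window are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Transition imagery under a prefix to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="imagery-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-imagery",
                "Filter": {"Prefix": "imagery/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```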
Use Case 2:
DigitalGlobe is making it easier for people to find, access, and run compute against its entire data archive in the cloud in order to apply deep learning to satellite imagery. They plan to use Amazon SageMaker to train models against petabytes of earth-observation imagery datasets using hosted Jupyter notebooks, so DigitalGlobe's Geospatial Big Data Platform (GBDX) users can just push a button, create a model, and deploy it all within one scalable, distributed environment.
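In SageMaker Python SDK terms, that push-button train-and-deploy flow looks roughly like the sketch below; the container image, IAM role, instance types, and S3 path are all hypothetical placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Train a model from a custom container against imagery in S3.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/imagery-model:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)
estimator.fit({"train": "s3://imagery-archive/training/"})

# Deploy the trained model behind a real-time inference endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```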