Understand the Big Data ecosystem on the Cloud and the building blocks that help you build applications for data mining and visualization. Also learn from LatentView Analytics how they built “PanelMiner,” a platform that efficiently transforms unstructured HTML data into structured data to gain insights about consumer behavior from large data sets.
Presenter:
Ganesh Raja, Solutions Architect, Amazon Internet Services
Ganesh Sankaralingam, Head of Delivery (US West Coast), LatentView Analytics
Shrirang Bapat, Vice President – Engineering, Pubmatic
2. Mining Information from Data on Cloud
Ganesh Raja, Solutions Architect
Amazon Internet Services
3. What is Big Data?
When your data sets become so large that you have to start innovating how to collect, store, organize, analyze, and share them.
It's tough because of velocity, volume, and variety.
9. Big Data and AWS Cloud Computing

Big Data | AWS Cloud Computing
Variety, volume, and velocity requiring new tools | Variety of compute, storage, and networking options
Massive datasets | Massive, virtually unlimited capacity
Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of infrastructure deployment/usage
Frequently not steady-state workloads; peaks and valleys | At its most efficient with highly variable workloads
10. Big Data Technology
Technologies and techniques for
working productively with data,
at any scale
12. Big Data & Analytics @ AWS
AWS Big Data Portfolio:
COLLECT: Direct Connect, Import/Export, Amazon Kinesis
STORE: S3, DynamoDB, Glacier
ANALYZE: EC2, EMR, Redshift, Data Pipeline
SHARE: S3, EC2
21. Panel Miner: Data Mining and
Visualization using AWS
Ganesh Sankaralingam
22. LatentView at a Glance
o Builds Analytics Centers of Excellence (CoEs)
o Analyzes business problems both qualitatively and quantitatively and provides actionable insights
o Onsite-offshore global delivery model that helps in-house teams do more with less
o Identified as a “Cool Vendor” in Analytics by Gartner (2014)
o Won the Deloitte Technology Fast 50 India award for 5 consecutive years (2009–13)
o Awarded ‘Top Innovator’ at DeveloperWeek (Conference & Festival 2013)
o Recognized as a global ‘Market Leader’ in the analytics space by SourcingLine
o Top finalist in the ‘We Love Our Workplace 2013’ category
23. Business Pain Points:
Required to combine different types of data to make business decisions.

Within the firewall:
o Internal structured data: ERP, legacy data; RDBMS or Excel format
o Internal unstructured data: email text, customer service notes, Yammer; web server logs

Outside the firewall:
o External structured data: surveys, market research, macroeconomics, promotions
o External unstructured data: social media, news articles, panel data; real-time visualization of machine logs (IoT)
24. Technical Pain Points:
Required to combine different types of data.
o Transform unstructured data into structured data that can be queried using SQL statements
o Automated, scalable framework to process >500K small files in constant time
o Achieve high efficiency when converting unstructured data to structured data
o Control security and access for different business users
o Minimize the cost and time of running distributed jobs
o Store and retrieve data for analysis in a cost- and time-efficient manner
o Track the various processes in the AWS platform
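As a hedged illustration of the first pain point above (making unstructured data queryable with SQL), the sketch below parses a tiny HTML fragment into rows and queries them. It is a stand-in using only Python's standard library, with sqlite3 in place of a real warehouse; it is not LatentView's actual PanelMiner parser, and the `<span class="price">` markup is an invented example.

```python
# Illustrative sketch only: stdlib HTML parsing + SQL querying.
import sqlite3
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

html_doc = '<div><span class="price">19.99</span><span class="price">5.49</span></div>'
parser = PriceParser()
parser.feed(html_doc)

# Stage the now-structured rows where SQL can reach them (sqlite3 here
# stands in for Redshift).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prices (value REAL)")
con.executemany("INSERT INTO prices VALUES (?)", [(float(p),) for p in parser.prices])
total = con.execute("SELECT SUM(value) FROM prices").fetchone()[0]
print(round(total, 2))  # 25.48
```

At scale, the same parse-then-load shape holds; only the execution engine (EMR) and the warehouse (Redshift) change.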
25. Why AWS?
o Cost of ownership
o Scalable and easy to use
o Easy to acquire additional machines based on need
o Petabyte-level scalability (1 PB = 1,000,000,000,000,000 bytes)
o Data security
o High availability
o Technology breadth and technical support
26. PANEL MINER
Converting unstructured to structured data using AWS infrastructure:
Unstructured Data
→ Data Collection (EC2, S3): download, extract, clean, and stage data for processing
→ Python: parser to convert unstructured data into structured data
→ EMR (Hadoop): optimized data processing
→ Redshift: data warehousing and reporting
→ Structured Data: analysis using Excel, Tableau, and other visualization tools
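The "Python parser on EMR" stage above could be sketched as a Hadoop Streaming mapper. This is an illustrative assumption, not PanelMiner's real code: the pipe-delimited input layout (`panelist_id|url|seconds`) and the field names are invented.

```python
# Hypothetical Hadoop Streaming mapper: one raw panel line in, one
# tab-separated structured record out (keyed by domain).
from urllib.parse import urlparse

def map_line(line):
    """'p42|http://shop.example.com/tv|130' -> 'shop.example.com\tp42\t130'"""
    parts = line.strip().split("|")
    if len(parts) != 3:
        return None  # skip malformed records
    panelist, url, seconds = parts
    domain = urlparse(url).netloc
    return f"{domain}\t{panelist}\t{seconds}"

# Under Hadoop Streaming, EMR feeds input splits on stdin and reads
# key<TAB>value pairs on stdout (sketch):
#   import sys
#   for raw in sys.stdin:
#       record = map_line(raw)
#       if record:
#           print(record)
```

Because each file maps independently, the same script scales from one EC2 box to a large EMR cluster without change, which is what makes the ">500K small files" requirement tractable.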
30. Analytics in the Cloud
Leverage AWS to scale Big Data Analytics
Shrirang Bapat, VP Engineering, PubMatic
31. Shrirang Bapat
Data Enthusiast
Innovation Agent
Agile Evangelist
VP Engineering at PubMatic
Your Speaker Today
32. One Platform: Multi-Format, Multi-Screen, Multi-Channel
Every Ad: IAB standard banners; IAB Rising Stars; native and custom units; rich media (MRAID 1 & 2, ORMMA, interstitial); video (VAST, VPAID)
Every Screen: mobile applications; tablet applications; mobile & tablet optimized web; desktop web
Every Sales Channel: direct sales integration; programmatic direct (private marketplace, automated guaranteed); open auction; spot buys
33. Premium at Scale, Across All Buying Channels

Channel | Definition | Value
Automated Guaranteed (Programmatic Direct) | Direct-bought guaranteed inventory access, non-RTB | Predictable and scalable high-value placements
Private Marketplace (Programmatic Direct) | Direct-bought RTB-based inventory access | Controlled buying with price agreements for bids
Open Market | RTB-based inventory access in an open marketplace | Efficient and targeted audience buying
34. PubMatic is the Only
Publisher-Focused Software Platform at Scale
94.5% U.S. Reach, Larger Than Google
(comScore March 2014)
Industry’s Best Results, Independent & Flexible
5 Data Centers, 4 Trillion RTB Requests Monthly
500+ People Doing Business in 30 Countries
39. If you only take away 3 things…
Ease of Use
Reduced Time to Market
DevOps
Editor's Notes
M2M and sensor data are the new king.
Stephen Forbes said that we live in a SPIME world (SPace and tIME) and that data will be collected and connected, which brings in the Internet of Things concept.
However, we don’t believe there is one tool that can do everything; rather, if you use the right tools, you can build a highly configurable big data architecture to meet your specific needs.
This is what the AWS Big Data portfolio looks like. We have tools like Direct Connect and Import/Export that can bring in a lot of data. We can push that data into a number of services, from S3 and DynamoDB to EMR and Redshift, for analysis.
Amazon Redshift provides a fast, fully managed, petabyte-scale data warehouse for less than $1000 per terabyte per year. Amazon Elastic MapReduce provides a managed, easy to use analytics platform built around the powerful Hadoop framework. Recently we announced Amazon Kinesis, a managed service for real-time processing of streaming big data. Amazon Kinesis supports data throughput from megabytes to gigabytes of data per second and can scale seamlessly to handle streams from hundreds of thousands of different data sources.
The tools to support big data collection, computation along with collaboration and sharing are all available in a couple of clicks, with AWS.
Fundamental storage at internet scale: it can store any number of objects, from 1 byte to 5 TB in size.
It is engineered for 11 9s of durability, replicating your data at least three times across three distinct physical data centers we call Availability Zones.
We have customers such as Dropbox, Spotify, and Pinterest storing billions of objects: photos, videos, songs, or any other type of file.
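A minimal sketch of staging an object in S3 with boto3 follows; the bucket name, key prefix, and helper function are illustrative assumptions, and the actual upload call is left commented out because it needs AWS credentials.

```python
# Hypothetical helper for building stable S3 keys before an upload.

def object_key(prefix, filename):
    """e.g. object_key('raw-html/2014-06-01', 'page.html')
    -> 'raw-html/2014-06-01/page.html'"""
    return f"{prefix.rstrip('/')}/{filename}"

# With credentials configured, the upload itself is one call (sketch):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.upload_file("page.html", "example-panel-bucket",
#                  object_key("raw-html/2014-06-01", "page.html"))
```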
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources.
For instance, instead of having to process log files in batch, you can have log events stream into Kinesis and then have workers with the Kinesis Client Library read from the stream, process the information, and drive a real-time dashboard.
Later on today, we will have the product manager, Adi Krishnan, for Amazon Kinesis give a deep dive into the service
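The producer side of the log-streaming example in the note above might look like the sketch below. The stream name `web-logs` and both helper functions are assumptions, not from the talk, and the boto3 call itself is commented out because it requires a live stream and credentials.

```python
# Hypothetical Kinesis producer sketch.
import hashlib
import json

def partition_key_for(source_id):
    """Stable hash key so log sources spread evenly across shards."""
    return hashlib.md5(source_id.encode("utf-8")).hexdigest()

def encode_log_event(source_id, event):
    """Build the kwargs for a boto3 kinesis put_record call."""
    return {
        "StreamName": "web-logs",  # assumed stream name
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": partition_key_for(source_id),
    }

# With credentials and a created stream (sketch):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(**encode_log_event("web-01", {"path": "/", "ms": 12}))
```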
DynamoDB is a fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data, and serve any level of request traffic. Its guaranteed throughput and single-digit millisecond latency make it a great fit for gaming, ad tech, mobile and many other applications.
Runs on solid state hard drives for high speed performance at scale and you can provision reads and writes to a table without having to worry about the admin of scaling or sharding, it is done all behind the scenes for you.
“We were able to start with only a few requests per minute and scale to over 40,000, all with just a few button presses on the way up” said Charles Ju, co-founder of PennyPop. “DynamoDB is easily the simplest and most scalable part of our application backend.”
“DynamoDB's biggest cost saving is not just the efficiencies and ease of use, but rather the opportunity cost of maintenance. Building, maintaining, and sharding large live data-centric real-time projects is incredibly hard and requires many people to both create and maintain these projects. DynamoDB has done an excellent job consolidating this within Amazon itself so the web community at large can focus on what it does best -- building great features and applications."
“We still only have 2 server engineers running a MMORPG where millions of players have come to enjoy the game…We are leaner than any other MMORPG at our size that I know of.”
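To make the DynamoDB notes concrete, here is a hedged sketch of an item in DynamoDB's low-level typed wire format; the table and attribute names are invented for illustration and are not from PennyPop's actual schema.

```python
# Hypothetical DynamoDB item builder (typed attribute values).

def build_player_item(player_id, level, gold):
    """Item dict in the low-level DynamoDB format: strings are {"S": ...},
    numbers travel as {"N": "<string>"}."""
    return {
        "player_id": {"S": player_id},  # partition key
        "level": {"N": str(level)},
        "gold": {"N": str(gold)},
    }

# With credentials and a "players" table created (sketch):
#   import boto3
#   dynamodb = boto3.client("dynamodb")
#   dynamodb.put_item(TableName="players",
#                     Item=build_player_item("p1", 7, 250))
```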
Provision a petabyte scale cluster to handle complex SQL queries in just a few minutes.
You can get either an HDD-based cluster or the recently introduced SSD-based cluster, which is smaller in total cluster size but offers higher performance per GB.
This data warehouse solution costs about a tenth of what traditional solutions of comparable size cost.
Redshift can drive business intelligence tools such as Jaspersoft or Microstrategy because it supports standard SQL and can connect using ODBC or JDBC drivers.
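Since Redshift speaks standard SQL over JDBC/ODBC (and the PostgreSQL wire protocol), loading staged S3 data is a single COPY statement. The sketch below builds one; the table, bucket, and IAM role names are assumptions, and the connection code is commented out because it needs a live cluster.

```python
# Hypothetical builder for a Redshift COPY statement.

def copy_statement(table, s3_path, iam_role):
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

# With a PostgreSQL-compatible driver such as psycopg2 (sketch):
#   import psycopg2
#   conn = psycopg2.connect(host="example.redshift.amazonaws.com",
#                           port=5439, dbname="dev",
#                           user="...", password="...")
#   conn.cursor().execute(copy_statement(
#       "panel_facts",
#       "s3://example-bucket/structured/",
#       "arn:aws:iam::123456789012:role/RedshiftCopy"))
```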
When you think of big data these days, Hadoop is always an integral part. When you take the benefits of what the cloud can do along with the computational paradigm of MapReduce, you get Amazon Elastic MapReduce. Customers have launched millions of clusters to run big data workloads.
EMR is a key tool in the toolbox for ‘Big Data’ challenges: it makes possible analytics processes that were previously not feasible, and it is cost-effective when leveraged with the EC2 spot market.
In summary, AWS provides you the tools so you can pick the right one at the scale that you need when you need it.