Atlanta Data Science Meetup | Qubole slides

September 16, 2015
Jason Huang
Senior Solutions Architect, Qubole Inc.

A little bit about Qubole
Ashish Thusoo
Founder & CEO
Joydeep Sen Sarma
Founder & CTO
Founded in 2011 by the pioneers of “big data” at
Facebook and the creators of the Apache Hive Project
Based in Mountain View, CA with offices in Bangalore,
India. Investments by Charles River, LightSpeed, Norwest
Ventures.
World class product and engineering team from:

Company Founding
Qubole founders built the Facebook data platform.
The Facebook model changed the role for data
in an enterprise.
• Needed to turn the data assets into a “utility” to make a viable
business.
– Collaborative: over 30% of employees use the
data directly.
– Accessible: developers, analysts, business analysts or
business users all running queries. Has made the
company more data driven and agile with data
use.
– Scalable: Exabytes of data moving fast
It took the founders a team of over 30 people to create
this infrastructure, and the team managing it today
has more than 100 people.
Work at Facebook inspired the founding of Qubole
[Diagram labels: Data Infrastructure; Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]

Qubole works in:
• Adtech
• Media & Entertainment
• Healthcare
• Retail
• eCommerce
• Manufacturing
Qubole works best when:
• Born in Cloud
• Commitment to Public Cloud
• Data Driven
• Large scale data
• Lack Hadoop Skills
• Analysts & scientists need access

Impediments for an Aspiring Data Driven Enterprise
Where Big Data falls short:
• 6-18 month implementation time
• Only 27% of Big Data initiatives are classified as “Successful” in 2014
• Only 13% of organizations achieve full-scale production
• 57% of organizations cite skills gap as a major inhibitor
[Diagram labels: Rigid and inflexible infrastructure; Non-adaptive software services; Highly specialized systems; Difficult to build and operate]

State of the Big Data Industry (n=417)
[Bar chart, vertical axis 0% to 80%. Categories: Hadoop, MapReduce, Pig, Spark, Storm, Presto, Cassandra, HBase, Hive]

Hive and Presto
• Hive translates SQL queries into multiple stages of MapReduce
– Allows for ad-hoc and batch data processing
– Provides fault tolerance: intermediate results are written to disk, and jobs are
retried automatically on failure (node, connectivity, etc.)
– Able to join tables with billions of rows
• Presto is an in-memory distributed SQL query engine
– Designed for interactive and near real-time SQL querying
– Multi-stage queries can run significantly faster than Hive
– Requires planning and optimization when joining two large tables (data
must reside in memory)
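As a rough illustration of the contrast between the two engines, the Python sketch below runs the same two-step query twice: once writing the intermediate result to disk between stages (the Hive-on-MapReduce pattern) and once streaming it through memory (the Presto pattern). It is a toy in plain Python, not Hive or Presto code; the sample rows and every function name are made up for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical sample data standing in for a large table.
ROWS = [
    {"user": "a", "country": "US", "clicks": 3},
    {"user": "b", "country": "US", "clicks": 5},
    {"user": "c", "country": "IN", "clicks": 2},
]


def clicks_by_country(rows):
    """Aggregate step: total clicks per country."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["country"]] += row["clicks"]
    return dict(totals)


def staged_on_disk(rows, workdir):
    """Staged execution: stage 1 output is written to disk before stage 2 reads it.

    The file survives a failure of stage 2, so stage 2 could be retried
    without redoing stage 1. The price is the extra write and read.
    """
    stage1_path = os.path.join(workdir, "stage1.json")
    with open(stage1_path, "w") as f:
        json.dump([r for r in rows if r["clicks"] > 2], f)  # stage 1: filter
    with open(stage1_path) as f:
        return clicks_by_country(json.load(f))              # stage 2: aggregate


def pipelined_in_memory(rows):
    """Pipelined execution: the filter feeds the aggregate directly in memory."""
    filtered = (r for r in rows if r["clicks"] > 2)
    return clicks_by_country(filtered)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(staged_on_disk(ROWS, workdir))
    print(pipelined_in_memory(ROWS))
```

Both functions return the same totals. The difference is where the intermediate rows live, which is the trade-off between fault tolerance and latency that the slide describes.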

Presto with Kinesis
Amazon Kinesis is a scalable and fully managed service for streaming large,
distributed data sets.
• Applications (mobile and wearable devices!) collect more and more data
– Kinesis is becoming the starting point for data ingestion into AWS
• Many solutions can consume Kinesis data streams for processing and
analysis in various ways to influence business decisions, but none
provides near real-time querying of Kinesis using SQL.
– Qubole provides a Presto connector for Kinesis!

Spark Libraries
• Spark Streaming (Streaming Data)
• MLlib (Machine Learning)
• Spark SQL (Data Processing)
• GraphX (Graph Processing)

Apache Spark
• Streaming Data
– Process streaming data with Spark built-in functions
– Applications such as fraud detection and log processing
– ETL via data ingestion
• Machine Learning
– Helps users run repeated queries and machine learning algorithms on
data sets
– MLlib covers areas such as clustering, classification, and
dimensionality reduction
– Used for very common big data functions: predictive intelligence,
customer segmentation, and sentiment analysis
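To make the streaming use case concrete, here is a minimal sketch of micro-batch processing applied to a fraud-style check: events are consumed in small batches, running state is kept per account, and an account is flagged once its total spend passes a limit. This is plain Python and does not use the Spark Streaming API; the event data, the limit, and all function names are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import islice


def micro_batches(events, batch_size):
    """Yield successive lists of at most `batch_size` events."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def flag_suspicious(events, batch_size=3, limit=100.0):
    """Return accounts whose running spend exceeds `limit`, in the order flagged."""
    spend = defaultdict(float)   # running state carried across batches
    flagged = []
    for batch in micro_batches(events, batch_size):
        # Update state with everything in the batch first...
        for account, amount in batch:
            spend[account] += amount
        # ...then evaluate the rule once per batch.
        for account, _ in batch:
            if spend[account] > limit and account not in flagged:
                flagged.append(account)
    return flagged


# Hypothetical (account, amount) events arriving in order.
EVENTS = [
    ("acct-1", 40.0), ("acct-2", 10.0), ("acct-1", 50.0),
    ("acct-1", 30.0), ("acct-3", 5.0), ("acct-2", 20.0),
]

if __name__ == "__main__":
    print(flag_suspicious(EVENTS))
```

The rule is evaluated once per batch rather than once per event, which is the basic trade a micro-batch design makes: slightly later detection in exchange for simpler, batch-style processing.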

Apache Spark
• Interactive Analysis
– MapReduce was built to handle batch processing
– Hadoop engines such as Hive or Pig can be too slow for interactive
analysis
– Spark is fast enough to perform exploratory queries without sampling
– Provides multiple language-specific APIs including R, Python, Scala and Java.
• Fog Computing
– The Internet of Things: objects and devices with tiny embedded sensors that
communicate with each other and users, creating a fully interconnected world
– Decentralize data processing and storage and use Spark streaming analytics
and interactive real-time queries

Impediments for an Aspiring Data Driven Enterprise
What you need to work in the cloud:
Central Governance & Security
Internet Scale
Instant Deployment
Isolated Multitenancy
Elastic
Object Store
Underpinnings

[Architecture diagram labels]
User Access: Qubole UI via Browser, SDK, ODBC
REST API (HTTPS)
SSH
Qubole’s AWS Account
Customer’s AWS Account
Web Tier: Web Servers
Encrypted Result Cache
RDS – Qubole User, Account Configurations (Encrypted credentials)
Default Hive Metastore
Ephemeral Hadoop Clusters, Managed by Qubole: Master, Slaves (Encrypted HDFS on ephemeral drives)
Amazon S3: No HDFS Load, w/ S3 Server Side Encryption
Custom Hive Metastore (optional)
Other RDS, Redshift (optional)
Data Flow within Customer’s AWS
Encryption Options:
a) Qubole can encrypt the result cache
b) Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
Ephemeral Clusters:
• Auto-Scaling - both up and down
• Spot Instances - data management and back-fill
• VMs deployed with awareness of time
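The slide does not say how scaling decisions are made, so the following is only a hypothetical sketch of what "auto-scaling, both up and down" can look like: size the cluster to the pending work, clamped between a floor and a ceiling. The function name, its parameters, and the policy itself are assumptions for illustration and do not describe Qubole's actual algorithm.

```python
import math


def desired_nodes(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Hypothetical policy: enough nodes for the backlog, within [min, max]."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # Backlog grows, then drains: the target follows it up and back down.
    for backlog in (0, 45, 500, 12):
        print(backlog, desired_nodes(backlog, tasks_per_node=10,
                                     min_nodes=2, max_nodes=10))
```

A real policy would also have to account for the other two bullets, for example replacing lost spot instances and timing shutdowns against billing periods, which this sketch leaves out.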

Qubole Case Study
• 1 out of 3 employees leverages Big Data
• Stores 60PB+ of data
• Logs 20TB+ of new data per day
• Processes 3PB+ per day over 2,000+ jobs

Qubole Case Study
Why Hive?
“Qubole has enabled more users within Pinterest to get to the data and has made the data platform lot more scalable and stable”
Mohammad Shahangian, Lead, Data Science and Infrastructure
Hive’s metastore serves as the canonical source of truth for all Hadoop jobs
[Diagram labels: Hive Metastore (Metadata); Pig, Cascading, Hive; HDFS/S3 (Data)]

Qubole Case Study
Ease of use for analysts
• Dozens of Data Scientist and Analyst users
• Produces double-digit TBs of data per day
• Does not have dedicated staff to set up and manage clusters and Hadoop distributions
[Diagram labels: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]

Qubole Case Study
Why Spark?
[Pipeline diagram. Stages: Producers, Continuous Processing, Storage, Analytics. Labels: CDN, Real Time Bidding, Retargeting Platform, ETL, Kinesis, S3, Redshift, Machine Learning, Streaming, Customer Data]
“Qubole put our cluster management, auto-scaling and ad-hoc queries on autopilot. Its higher performance for Big Data queries translates directly into faster and more actionable marketing intelligence for our customers.”
Yekesa Kosuru, VP, Technology

Qubole Case Study
• Designed for scientists & clinicians
• Leveraging massive datasets from institutes, public sources and more…
• Cloud-based product delivered via web

Qubole Case Study
Why Presto?
“Our customers have varying needs: clinical researchers might use GenePool to examine genomic data from a single patient, while a major research institution might use the platform to perform analyses over 10,000 patients at once”
Anish Kejariwal, Senior Director of Engineering
• Unified Metadata
• Auto-Scaling
• Spot Optimized
• Policy Keeper
• Cloud Tuned
• Cluster Lifecycle Management
[Diagram labels: Developer Center; Analyst Workbench UI; Policy, Governance & Security Center; QDS Unified Control Panel; QDS Data Engines]
