SlideShare a Scribd company logo
1 of 21
September 16, 2015
Jason Huang
Senior Solutions Architect, Qubole Inc.
A little bit about Qubole
Ashish Thusoo
Founder & CEO
Joydeep Sen Sarma
Founder & CTO
Founded in 2011 by the pioneers of “big data” @
Facebook and the creator’s of the Apache Hive Project
Based in Mountain View, CA with offices in Bangalore,
India. Investments by Charles River, LightSpeed, Norwest
Ventures.
World class product and engineering team from:
Company Founding
Qubole founders built the Facebook data platform.
The Facebook model changed the role for data
in an enterprise.
• Needed to turn the data assets into a “utility” to make a viable
business.
– Collaborative: over 30% of employees use the
data directly.
– Accessible: developers, analysts, business analysts or
business users all running queries. Has made the
company more data driven and agile with data
use.
– Scalable: Exabyte's of data moving fast
It took the founders a team of over 30 people to create
this infrastructure and currently the team managing this
infrastructure has more than 100 people.
Work at Facebook inspired the founding of Qubole
Operations
Analyst
Marketing Ops
Analyst
Data
Architect
Business
Users
Product
Support
Customer
Support
Developer
Sales Ops
Product
Managers
Data
Infrastructure
Qubole works in:
• Adtech
• Media & Entertainment
• Healthcare
• Retail
• eCommerce
• Manufacturing
Qubole works best when:
• Born in Cloud
• Commitment to Public Cloud
• Data Driven
• Large scale data
• Lack Hadoop Skills
• Analysts & scientist need access
Impediments for an Aspiring Data Driven Enterprise
Where Big
Data falls
short:
• 6-18 month implementation time
• Only 27% of Big Data initiatives are
classified as “Successful” in 2014
Rigid and
inflexible
infrastructure
Non adaptive
software
services
Highly
specialized
systems
Difficult to
build and
operate
• Only 13% of organizations achieve full-scale production
• 57% of organizations cite skills gap as a major inhibitor
State of the Big Data Industry (n=417)
0%
10%
20%
30%
40%
50%
60%
70%
80%
Hadoop MapReduce Pig Spark Storm Presto Cassandra HBase Hive
• Hive translates SQL queries into multiple stages of MapReduce
– Allows for ad-hoc and batch data processing
– Provides fault-tolerance, intermediate results are written to disk,
automatic job retries in the event of failures (node, connectivity, etc.)
– Able to join tables with billions of rows
• Presto is an in-memory distributed SQL query engine
– Designed for interactive and near real-time SQL querying
– Multi-stage queries can run significantly faster than Hive
– Requires planning and optimizations when joining two large tables (data
must reside in memory)
Hive and Presto
Amazon Kinesis = a scalable and fully managed service for streaming large,
distributed data sets.
• Applications (mobile and wearable devices!) collect more and more data
– Kinesis is becoming the starting point for data ingestion into AWS
• Many solutions can consume Kinesis data streams for processing and
analyzing in various ways to influence business decisions, but none
provides near real-time querying of Kinesis using SQL.
– Qubole provides a Presto connector for Kinesis!
Presto with Kinesis
• Spark Streaming (Streaming Data)
• MLlib (Machine Learning)
• Spark SQL (Data Processing)
• GraphX (Graph Processing)
Spark Libraries
• Streaming Data
– Process streaming data with Spark built-in functions
– Applications such as fraud detection and log processing
– ETL via data ingestion
• Machine Learning
– Helps users run repeated queries and machine learning algorithms on
data sets
– MLlib can work in areas such as clustering, classification, and
dimensionality reduction
– Used for very common big data functions - predictive intelligence,
customer segmentation, and sentiment analysis
Apache Spark
• Interactive Analysis
– MapReduce was built to handle batch processing
– SQL-on-Hadoop engines such as Hive or Pig can be too slow for interactive
analysis
– Spark is fast enough to perform exploratory queries without sampling
– Provides multiple language-specific APIs including R, Python, Scala and Java.
• Fog Computing
– The Internet of Things - objects and devices with tiny embedded sensors that
communicate with each other and users, creating a fully interconnected world
– Decentralize data processing and storage and use Spark streaming analytics
and interactive real time queries
Apache Spark
Impediments for an Aspiring Data Driven Enterprise
What you need to work in the cloud:
Central
Governance &
Security
Internet
Scale
Instant
Deployment
Isolated
Multitenancy
Elastic
Object Store
Underpinnings
Qubole UI via
Browser
SDK
ODBC
User Access
Qubole’s
AWS Account
Customer’s AWS Account
REST API
(HTTPS)
SSH
Ephemeral Hadoop Clusters,
Managed by Qubole
Slave
Master
Data Flow within
Customer’s AWS
(optional)
Other RDS,
Redshift
Ephemeral
Web Tier
Web Servers
Encrypted
Result
Cache
Encrypted
HDFS
Slave
Encrypted
HDFS
RDS – Qubole
User, Account
Configurations
(Encrypted
credentials
Amazon S3
No HDFS Load
w/S3 Server Side
Encryption
Default Hive
Metastore
Encryption Options:
a)Qubole can encrypt the result cache
b)Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
(c)
(b)
(a)
(optional)
Custom Hive
Metastore
SSH
Ephemeral Clusters:
• Auto-Scaling - both up and down
• Spot Instances - data management and back-fill
• VMs deployed with awareness of time
Demo
Qubole Case Study
Qubole Case Study
• 1 out of 3 employees
leverages Big Data
• Stores 60PB+ of data
• Logs 20TB+ of new data
per day
• Processes 3PB+ per day
over 2,000+ jobs
Qubole Case Study
Qubole Case Study
Why Hive?
“Qubole has enabled more
users within Pinterest to
get to the data and has
made the data platform lot
more scalable and stable”
Mohammad Shahangian
Lead, Data Science
and Infrastructure
Hive
Metastore
Pig
Cascading
Hive
HDFS/S3
Hive’s metastore serves as the canonical source of truth for all Hadoop jobs
Metadata Data
Qubole Case Study
Qubole Case Study
Operations
Analyst
Marketing
Ops
Analyst
Data
Architect
Busines
s
Users
Product
Support
Customer
Support
Developer
Sales Ops
Product
Managers
Ease of use
for analysts
• Dozens of Data
Scientist and
Analyst users
• Produces double-
digit TBs of data
per day
• Does not have
dedicated staff
to setup and
manage clusters
and Hadoop
Distributions
0101
1010
1010
Qubole Case Study
Qubole Case Study
Producers Continuous Processing Storage Analytics
CDN
Real Time
Bidding
Retargeting
Platform
ETL
Kinesis S3 Redshift
Machine LearningStreaming
Customer Data
Why Spark?
0101
1010
1010
0101
1010
1010
0101
1010
1010
“Qubole put our cluster
management, auto-scaling
and ad-hoc queries on
autopilot. Its higher
performance for Big Data
queries translates directly
into faster and more
actionable marketing
intelligence for our
customers.”
Yekesa Kosuru
VP, Technology
Qubole Case Study
Qubole Case Study
• Designed for
scientists &
clinicians
• Leveraging
massive
datasets from
institutes,
public sources
and more…
• Cloud-based
product
delivered via
web
Qubole Case Study
Qubole Case Study
"Our customers have varying
needs: clinical researchers
might use GenePool to
examine genomic data from a
single patient, while a major
research institution might use
the platform to perform
analyses over 10,000 patients
at once”
Anish Kejariwal - Senior Director of
Engineering• Unified Metadata
• Auto-Scaling
• Spot Optimized
• Policy Keeper
• Cloud Tuned
• Cluster Lifecycle Management
Developer
Center
Analyst Workbench UI Policy, Governance &
Security Center
QDS Unified Control Panel
QDS Data Engines
Why Presto?

More Related Content

What's hot

A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data FactoryBizTalk360
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringAnant Corporation
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data AnalyticsAttunity
 
Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeJoydeep Sen Sarma
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?Attunity
 
How Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftHow Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftAttunity
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData StoryLynn Langit
 
Big Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesJeff Kelly
 
Optimize Data for the Logical Data Warehouse
Optimize Data for the Logical Data WarehouseOptimize Data for the Logical Data Warehouse
Optimize Data for the Logical Data WarehouseAttunity
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannDatabricks
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Holden Ackerman
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Michael Rys
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...DataStax
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...✔ Eric David Benari, PMP
 

What's hot (20)

A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
 
Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europe
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?
 
How Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftHow Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon Redshift
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData Story
 
Big Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use cases
 
Optimize Data for the Logical Data Warehouse
Optimize Data for the Logical Data WarehouseOptimize Data for the Logical Data Warehouse
Optimize Data for the Logical Data Warehouse
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
 
Interactive query using hadoop
Interactive query using hadoopInteractive query using hadoop
Interactive query using hadoop
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
 

Viewers also liked

Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataQubole
 
5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoptionQubole
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Joydeep Sen Sarma
 
Nw qubole overview_033015
Nw qubole overview_033015Nw qubole overview_033015
Nw qubole overview_033015Michael Mersch
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSAmazon Web Services
 
Cortana Analytics Suite
Cortana Analytics SuiteCortana Analytics Suite
Cortana Analytics SuiteJames Serra
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsITProceed
 
BIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleBIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleQubole
 
Fortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure WorkloadsFortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure WorkloadsAmazon Web Services
 
Azure ARM’d and Ready
Azure ARM’d and ReadyAzure ARM’d and Ready
Azure ARM’d and Readymscug
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...NoSQLmatters
 
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...Amazon Web Services
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
DataXu: Programmatic Premium Webinar - June 7, 2012
DataXu: Programmatic Premium Webinar - June 7, 2012DataXu: Programmatic Premium Webinar - June 7, 2012
DataXu: Programmatic Premium Webinar - June 7, 2012dataxu
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream AnalyticsMarco Parenzan
 
15 Years of Web Security: The Rebellious Teenage Years
15 Years of Web Security: The Rebellious Teenage Years15 Years of Web Security: The Rebellious Teenage Years
15 Years of Web Security: The Rebellious Teenage YearsJeremiah Grossman
 

Viewers also liked (20)

Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big Data
 
5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Nw qubole overview_033015
Nw qubole overview_033015Nw qubole overview_033015
Nw qubole overview_033015
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWS
 
RDO-Packstack Workshop
RDO-Packstack Workshop RDO-Packstack Workshop
RDO-Packstack Workshop
 
Cortana Analytics Suite
Cortana Analytics SuiteCortana Analytics Suite
Cortana Analytics Suite
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico Jacobs
 
BIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleBIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - Qubole
 
Creating a fortigate vpn network & security blog
Creating a fortigate vpn   network & security blogCreating a fortigate vpn   network & security blog
Creating a fortigate vpn network & security blog
 
Fortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure WorkloadsFortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure Workloads
 
Azure ARM’d and Ready
Azure ARM’d and ReadyAzure ARM’d and Ready
Azure ARM’d and Ready
 
Azure Document Db
Azure Document DbAzure Document Db
Azure Document Db
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
 
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
DataXu: Programmatic Premium Webinar - June 7, 2012
DataXu: Programmatic Premium Webinar - June 7, 2012DataXu: Programmatic Premium Webinar - June 7, 2012
DataXu: Programmatic Premium Webinar - June 7, 2012
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream Analytics
 
15 Years of Web Security: The Rebellious Teenage Years
15 Years of Web Security: The Rebellious Teenage Years15 Years of Web Security: The Rebellious Teenage Years
15 Years of Web Security: The Rebellious Teenage Years
 

Similar to Atlanta Data Science Meetup | Qubole slides

Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15MLconf
 
Power BI - 2016 - Public
Power BI - 2016 - PublicPower BI - 2016 - Public
Power BI - 2016 - PublicJulian Payne
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Pentaho
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016StampedeCon
 
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopHortonworks
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopPOSSCON
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop siliconsudipt
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
Architecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsArchitecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsCaserta
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overviewRohit Jain
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudDataWorks Summit
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseRizaldy Ignacio
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusersBob Hardaway
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopCaserta
 
Initiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIInitiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIAmazon Web Services
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game ChangerCaserta
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 

Similar to Atlanta Data Science Meetup | Qubole slides (20)

Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
 
Power BI - 2016 - Public
Power BI - 2016 - PublicPower BI - 2016 - Public
Power BI - 2016 - Public
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
Architecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsArchitecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment Options
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overview
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
 
Initiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIInitiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AI
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 

More from Qubole

7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome ThemQubole
 
State of Big Data Adoption
State of Big Data AdoptionState of Big Data Adoption
State of Big Data AdoptionQubole
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance Qubole
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on YarnQubole
 
Running Spark on Cloud
Running Spark on CloudRunning Spark on Cloud
Running Spark on CloudQubole
 
Qubole State of the Big Data Industry
Qubole State of the Big Data IndustryQubole State of the Big Data Industry
Qubole State of the Big Data IndustryQubole
 
Big Data Platform at Pinterest
Big Data Platform at PinterestBig Data Platform at Pinterest
Big Data Platform at PinterestQubole
 
Expert Big Data Tips
Expert Big Data TipsExpert Big Data Tips
Expert Big Data TipsQubole
 
Big dataproposal
Big dataproposalBig dataproposal
Big dataproposalQubole
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloudQubole
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveQubole
 
Effective Hive Queries
Effective Hive QueriesEffective Hive Queries
Effective Hive QueriesQubole
 

More from Qubole (12)

7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them
 
State of Big Data Adoption
State of Big Data AdoptionState of Big Data Adoption
State of Big Data Adoption
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on Yarn
 
Running Spark on Cloud
Running Spark on CloudRunning Spark on Cloud
Running Spark on Cloud
 
Qubole State of the Big Data Industry
Qubole State of the Big Data IndustryQubole State of the Big Data Industry
Qubole State of the Big Data Industry
 
Big Data Platform at Pinterest
Big Data Platform at PinterestBig Data Platform at Pinterest
Big Data Platform at Pinterest
 
Expert Big Data Tips
Expert Big Data TipsExpert Big Data Tips
Expert Big Data Tips
 
Big dataproposal
Big dataproposalBig dataproposal
Big dataproposal
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloud
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using Hive
 
Effective Hive Queries
Effective Hive QueriesEffective Hive Queries
Effective Hive Queries
 

Recently uploaded

Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 

Recently uploaded (20)

Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 

Atlanta Data Science Meetup | Qubole slides

  • 1. September 16, 2015 Jason Huang Senior Solutions Architect, Qubole Inc.
  • 2. A little bit about Qubole Ashish Thusoo Founder & CEO Joydeep Sen Sarma Founder & CTO Founded in 2011 by the pioneers of “big data” @ Facebook and the creator’s of the Apache Hive Project Based in Mountain View, CA with offices in Bangalore, India. Investments by Charles River, LightSpeed, Norwest Ventures. World class product and engineering team from:
  • 3. Company Founding Qubole founders built the Facebook data platform. The Facebook model changed the role for data in an enterprise. • Needed to turn the data assets into a “utility” to make a viable business. – Collaborative: over 30% of employees use the data directly. – Accessible: developers, analysts, business analysts or business users all running queries. Has made the company more data driven and agile with data use. – Scalable: Exabyte's of data moving fast It took the founders a team of over 30 people to create this infrastructure and currently the team managing this infrastructure has more than 100 people. Work at Facebook inspired the founding of Qubole Operations Analyst Marketing Ops Analyst Data Architect Business Users Product Support Customer Support Developer Sales Ops Product Managers Data Infrastructure
  • 4. Qubole works in: • Adtech • Media & Entertainment • Healthcare • Retail • eCommerce • Manufacturing Qubole works best when: • Born in Cloud • Commitment to Public Cloud • Data Driven • Large scale data • Lack Hadoop Skills • Analysts & scientist need access
  • 5. Impediments for an Aspiring Data Driven Enterprise Where Big Data falls short: • 6-18 month implementation time • Only 27% of Big Data initiatives are classified as “Successful” in 2014 Rigid and inflexible infrastructure Non adaptive software services Highly specialized systems Difficult to build and operate • Only 13% of organizations achieve full-scale production • 57% of organizations cite skills gap as a major inhibitor
  • 6. State of the Big Data Industry (n=417) 0% 10% 20% 30% 40% 50% 60% 70% 80% Hadoop MapReduce Pig Spark Storm Presto Cassandra HBase Hive
  • 7. • Hive translates SQL queries into multiple stages of MapReduce – Allows for ad-hoc and batch data processing – Provides fault-tolerance, intermediate results are written to disk, automatic job retries in the event of failures (node, connectivity, etc.) – Able to join tables with billions of rows • Presto is an in-memory distributed SQL query engine – Designed for interactive and near real-time SQL querying – Multi-stage queries can run significantly faster than Hive – Requires planning and optimizations when joining two large tables (data must reside in memory) Hive and Presto
  • 8. Amazon Kinesis = a scalable and fully managed service for streaming large, distributed data sets. • Applications (mobile and wearable devices!) collect more and more data – Kinesis is becoming the starting point for data ingestion into AWS • Many solutions can consume Kinesis data streams for processing and analyzing in various ways to influence business decisions, but none provides near real-time querying of Kinesis using SQL. – Qubole provides a Presto connector for Kinesis! Presto with Kinesis
  • 9. • Spark Streaming (Streaming Data) • MLlib (Machine Learning) • Spark SQL (Data Processing) • GraphX (Graph Processing) Spark Libraries
  • 10. • Streaming Data – Process streaming data with Spark built-in functions – Applications such as fraud detection and log processing – ETL via data ingestion • Machine Learning – Helps users run repeated queries and machine learning algorithms on data sets – MLlib can work in areas such as clustering, classification, and dimensionality reduction – Used for very common big data functions - predictive intelligence, customer segmentation, and sentiment analysis Apache Spark
  • 11. • Interactive Analysis – MapReduce was built to handle batch processing – SQL-on-Hadoop engines such as Hive or Pig can be too slow for interactive analysis – Spark is fast enough to perform exploratory queries without sampling – Provides multiple language-specific APIs including R, Python, Scala and Java. • Fog Computing – The Internet of Things - objects and devices with tiny embedded sensors that communicate with each other and users, creating a fully interconnected world – Decentralize data processing and storage and use Spark streaming analytics and interactive real time queries Apache Spark
  • 12.
  • 13. Impediments for an Aspiring Data Driven Enterprise What you need to work in the cloud: Central Governance & Security Internet Scale Instant Deployment Isolated Multitenancy Elastic Object Store Underpinnings
  • 14. Qubole UI via Browser SDK ODBC User Access Qubole’s AWS Account Customer’s AWS Account REST API (HTTPS) SSH Ephemeral Hadoop Clusters, Managed by Qubole Slave Master Data Flow within Customer’s AWS (optional) Other RDS, Redshift Ephemeral Web Tier Web Servers Encrypted Result Cache Encrypted HDFS Slave Encrypted HDFS RDS – Qubole User, Account Configurations (Encrypted credentials Amazon S3 No HDFS Load w/S3 Server Side Encryption Default Hive Metastore Encryption Options: a)Qubole can encrypt the result cache b)Qubole supports encryption of the ephemeral drives used for HDFS c) Qubole supports S3 Server Side Encryption (c) (b) (a) (optional) Custom Hive Metastore SSH Ephemeral Clusters: • Auto-Scaling - both up and down • Spot Instances - data management and back-fill • VMs deployed with awareness of time
  • 15. Demo
  • 16. Qubole Case Study Qubole Case Study • 1 out of 3 employees leverages Big Data • Stores 60PB+ of data • Logs 20TB+ of new data per day • Processes 3PB+ per day over 2,000+ jobs
  • 17. Qubole Case Study Qubole Case Study Why Hive? “Qubole has enabled more users within Pinterest to get to the data and has made the data platform lot more scalable and stable” Mohammad Shahangian Lead, Data Science and Infrastructure Hive Metastore Pig Cascading Hive HDFS/S3 Hive’s metastore serves as the canonical source of truth for all Hadoop jobs Metadata Data
  • 18. Qubole Case Study Qubole Case Study Operations Analyst Marketing Ops Analyst Data Architect Busines s Users Product Support Customer Support Developer Sales Ops Product Managers Ease of use for analysts • Dozens of Data Scientist and Analyst users • Produces double- digit TBs of data per day • Does not have dedicated staff to setup and manage clusters and Hadoop Distributions
  • 19. 0101 1010 1010 Qubole Case Study Qubole Case Study Producers Continuous Processing Storage Analytics CDN Real Time Bidding Retargeting Platform ETL Kinesis S3 Redshift Machine LearningStreaming Customer Data Why Spark? 0101 1010 1010 0101 1010 1010 0101 1010 1010 “Qubole put our cluster management, auto-scaling and ad-hoc queries on autopilot. Its higher performance for Big Data queries translates directly into faster and more actionable marketing intelligence for our customers.” Yekesa Kosuru VP, Technology
  • 20. Qubole Case Study Qubole Case Study • Designed for scientists & clinicians • Leveraging massive datasets from institutes, public sources and more… • Cloud-based product delivered via web
  • 21. Qubole Case Study Qubole Case Study "Our customers have varying needs: clinical researchers might use GenePool to examine genomic data from a single patient, while a major research institution might use the platform to perform analyses over 10,000 patients at once” Anish Kejariwal - Senior Director of Engineering• Unified Metadata • Auto-Scaling • Spot Optimized • Policy Keeper • Cloud Tuned • Cluster Lifecycle Management Developer Center Analyst Workbench UI Policy, Governance & Security Center QDS Unified Control Panel QDS Data Engines Why Presto?