The Briefing Room with Dr. Robin Bloor and CASK
We all know the promise of big data, but who gets the value? There are plenty of success stories already, and most of them involve one key ingredient: facilitated access to important data sets. Most research studies suggest that the Pareto principle applies: 80 percent of the effort goes to data integration, and only 20 percent to analysis. Inverting that balance is the Holy Grail.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain why the time has finally come for turning the tables on the status quo in analytics. He'll be briefed by CASK CEO Jonathan Gray, who will showcase his company's big data integration platform, CDAP, which was specifically designed to expedite time-to-value for big data.
4. Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today's innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
5. Big Integration
• Old infrastructure lacking
• New pipes are needed
• Well begun is half done!
7. CASK
• CASK offers a unified integration platform for big data applications and data lakes
• Its CDAP architecture provides data containers, program containers and application containers for data and applications on Hadoop
• CASK also offers Hydrator for building and managing data pipelines and data lakes, and Tracker for data lake governance
8. Guest
Jonathan Gray
Jonathan Gray, Founder & CEO of Cask, is an entrepreneur
and software engineer with a background in startups, open
source and all things data. Prior to founding Cask, Jonathan
was a software engineer at Facebook where he drove HBase
engineering efforts, including Facebook Messages and
several other large-scale projects from inception to
production.
An open source evangelist, Jonathan was responsible for
helping build the Facebook engineering brand through
developer outreach and refocusing the open source strategy
of the company. Prior to Facebook, Jonathan founded
Streamy.com, where he became an early adopter of Hadoop
and HBase and is now a core contributor and active
committer in the community.
Jonathan holds a bachelor’s degree in Electrical and
Computer Engineering and Business Administration from
Carnegie Mellon University.
9. Big Data on Tap
cask.co November 1, 2016
The Briefing Room
Jonathan Gray
Founder & CEO
10. Hadoop Enables New Applications and Architectures
ENTERPRISE DATA LAKES
• Batch and Realtime Data Ingestion: Any type of data from any type of source in any volume
• Batch and Streaming ETL: Code-free self-service creation and management of pipelines
• SQL Exploration and Data Science: All data is automatically accessible via SQL and client SDKs
• Data as a Service: Easily expose generic or custom REST APIs on any data
BIG DATA ANALYTICS
• 360° Customer View: Integrate data from any source and expose through queries and APIs
• Realtime Dashboards: Perform realtime OLAP aggregations and serve them through REST APIs
• Time Series Analysis: Store, process and serve massive volumes of time-series data
• Realtime Log Analytics: Ingestion and processing of high-throughput streaming log events
PRODUCTION DATA APPS
• Recommendation Engines: Build models in batch using historical data and serve them in realtime
• Anomaly Detection Systems: Process streaming events and predictably compare them in realtime to historical data
• NRT Event Monitoring: Reliably monitor large streams of data and perform defined actions within a specified time
• Internet of Things: Ingestion, storage and processing of events that is highly-available, scalable and consistent
Data Applications Drive Meaningful Business Value
11. But Getting Value from Big Data is Hard
Too much focus on infrastructure and integration, rather than applications and analytics
• Divergence of distributions and technologies
• Integration silos created by narrow point solutions
• Proliferation of projects, services and APIs
• Complexity of technologies and new user learning curve
12. And There Are Many Faces of Hadoop
Without a consistent set of tools, IT will not be an effective data enabler for the business
• Developer: Architecture & Programming. Focused on Apps & Solutions
• Ops: Configuring & Monitoring. Focused on Infrastructure & SLAs
• LOB / Product: Driving Revenue & Decision Making. Focused on Products & Insights
• Data Scientist: Scripting & Machine Learning. Focused on Data & Algorithms
13. Enter Cask
• Why “Cask”? A Container Architecture that puts Big Data on Tap
• Founded in 2011: By early Hadoop engineers from Facebook and Yahoo!
• Raised $37+ Million: Andreessen Horowitz, Safeguard, Battery Ventures and Ignition Partners
• Strategic Investors: AT&T, Cloudera and Ericsson
• Key Customers & Partners: AT&T, Ericsson, Lotame, Salesforce, Cloudera, Hortonworks, MapR, Microsoft, IBM, Tableau…
• Latest Release: 3.5 Cask Data Application Platform, Cask Hydrator and Cask Tracker
• NEW: CDAP 4 Preview: Featuring Cask Market, the “big data app store”
14. The Evolution of the Cask Platform
Convergence of Big Data Apps and Data Integration
CDAP v2: Big Data App Server (“WebLogic for Hadoop”)
• Abstractions & integrations
• Metrics & logs
• Debugging environment
CDAP v3: Big Data Apps + Data Integration (“WebLogic Meets Informatica”)
• Data ingest
• Data pipelines
• Workflows and metadata
CDAP v4: Unified Integration for Big Data (“Unified Big Data Integration”)
• Security & governance
• Self-service environment
• Enterprise integrations
15. Introducing Cask Data Application Platform (CDAP)
First Unified Integration Platform for Big Data
Platform for distributed apps, bringing together application management with data integration
• 100% open source and built for extensibility
• Supports all major Hadoop distributions and clouds
• Integrates the latest open source big data technologies
Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360
Modern Data Integration, Distributed Application Framework, Self-Service User Experience, Enterprise-grade Security & Governance
16. Modern Data Integration
• Real-time and Batch
• Reliable and Scalable
• Simple and Self-Service
INGEST: any data from any source
PROCESS: for ETL and machine learning
EXPLORE: for analytics and data science
SERVE: any data to any destination
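The four stages named on this slide (INGEST, PROCESS, EXPLORE, SERVE) can be illustrated with a minimal, self-contained Python sketch. This is not CDAP code and uses no CDAP API; the function names, the sample records and the `user`/`bytes` fields are all hypothetical, chosen only to show how the stages hand data to one another.

```python
import json
from collections import defaultdict


def ingest(lines):
    """INGEST: parse raw JSON lines into event dicts, skipping malformed input."""
    events = []
    for line in lines:
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # a real pipeline would route bad records to an error sink
    return events


def process(events):
    """PROCESS: a tiny ETL step that totals bytes per user."""
    totals = defaultdict(int)
    for event in events:
        totals[event["user"]] += event["bytes"]
    return dict(totals)


def explore(totals, min_bytes):
    """EXPLORE: an ad hoc filter, standing in for a SQL query."""
    return sorted(user for user, b in totals.items() if b >= min_bytes)


def serve(totals, user):
    """SERVE: shape a response the way a REST endpoint might."""
    return {"user": user, "bytes": totals.get(user, 0)}


# Hypothetical sample input; the second record is deliberately malformed.
RAW = [
    '{"user": "ann", "bytes": 120}',
    'not json',
    '{"user": "bob", "bytes": 30}',
    '{"user": "ann", "bytes": 80}',
]

totals = process(ingest(RAW))
heavy_users = explore(totals, 100)
response = serve(totals, "bob")
```

Keeping each stage a plain function with no knowledge of the others is one simple way to mirror the slide's point that the same data can be ingested once and then processed, explored and served independently.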
17. Distributed Application Framework
• Real-time and Batch
• Memory, Local, Distributed
• Analytics and Applications
DEVELOP: rapidly build applications
TEST: powerful test and CI framework
DEPLOY: run any apps in any environment
SCALE: horizontally scale apps and data
18. Security and Governance
CAPTURE: store all metadata about your data
DISCOVER: easily locate any of your data
TRACK: every audit plus lineage graphs
ANALYZE: understand usage patterns of data
AUTHENTICATE, AUTHORIZE, ENCRYPT
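To make the CAPTURE and TRACK ideas concrete, here is a small Python sketch of an audit log that also answers lineage questions. It is an illustration under stated assumptions, not how CDAP or Cask Tracker is implemented: the `LineageLog` class, its methods and the dataset names are hypothetical.

```python
from collections import defaultdict


class LineageLog:
    """Hypothetical in-memory audit log with a dataset lineage graph."""

    def __init__(self):
        self._upstream = defaultdict(set)  # dataset -> its direct input datasets
        self.audit = []                    # (program, inputs, output) records

    def record(self, program, inputs, output):
        """CAPTURE: note that `program` read `inputs` and wrote `output`."""
        self.audit.append((program, tuple(inputs), output))
        self._upstream[output].update(inputs)

    def ancestors(self, dataset):
        """TRACK: return every dataset that `dataset` was derived from."""
        seen = set()
        stack = [dataset]
        while stack:
            for parent in self._upstream.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


log = LineageLog()
log.record("clean_logs", ["raw_logs"], "clean_logs_ds")
log.record("join_users", ["clean_logs_ds", "users"], "sessions")
lineage = log.ancestors("sessions")
```

Recording one entry per program run is enough to derive both the audit trail and the lineage graph, which is why the two appear together on the slide.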
19. Self-Service User Experience
A code-free framework to build and run data pipelines
• Drag & drop graphical interface
• Create, debug, deploy and manage
• Separation of logic and execution environment
• Native to Hadoop & Spark; scales out
A data discovery tool to explore metadata and usage
• Rich app-level metadata
• Track lineage and audits
• Analyze usage of datasets
• MDM integration framework
20. The CDAP Architecture
Applications (with Metadata)
Programs (with Metadata): MapReduce, Spark, Tigon, Workflow, Service, Worker
Datasets (with Metadata): Table, Avro, Parquet, Timeseries, OLAP Cube, Geospatial, ObjectStore
• Application Container Architecture
• Reusable Programming Abstractions
• Global User and Machine Metadata
• Highly Extensible Plugin Architecture
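The container idea on this slide, an application that groups programs and datasets with metadata attached at every level, can be sketched as a plain data model. The Python below is a hypothetical illustration only; the class names, fields and the `find_by_tag` helper are assumptions made for this example and are not the CDAP API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    kind: str  # e.g. "Table", "Timeseries", "OLAP Cube"
    metadata: dict = field(default_factory=dict)


@dataclass
class Program:
    name: str
    kind: str  # e.g. "MapReduce", "Spark", "Workflow"
    metadata: dict = field(default_factory=dict)


@dataclass
class Application:
    """Hypothetical application container holding programs and datasets."""
    name: str
    programs: list = field(default_factory=list)
    datasets: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

    def find_by_tag(self, tag):
        """Return names of programs and datasets whose metadata tags include `tag`."""
        members = self.programs + self.datasets
        return [m.name for m in members if tag in m.metadata.get("tags", ())]


app = Application(
    name="fraud_detection",
    programs=[Program("scorer", "Spark", {"tags": ["pii"]})],
    datasets=[
        Dataset("transactions", "Table", {"tags": ["pii", "finance"]}),
        Dataset("daily_cube", "OLAP Cube"),
    ],
)
pii_members = app.find_by_tag("pii")
```

Because programs and datasets carry metadata in the same shape, one query can span both, which is the practical benefit of metadata that is global rather than per-tool.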
21. CDAP Enables the Full Big Data Application Lifecycle
Single framework for building and running data apps and data lakes on Hadoop and Spark
Rapid Development
• Standardization, deep integrations, tools and docs
• Separation of app logic from data logic and integration logic
• Conceptual integrity within applications and consistency across environments
Production Operations & Governance
• Simplified packaging, deployment and monitoring of apps on Hadoop
• Enhanced security and governance with centralized metrics and logs
• Tracking and exploration of metadata, data provenance, audit trails and usage analytics
Benefits
• Reduces time to develop and deploy big data apps by 80%
• Reduces time to insights and accelerates business value
• Removes barriers to innovation and future-proofs your apps
22. Customer Success Stories
Health Insurance Provider offloading clinical / immunization reporting from Netezza
• Customer Situation: Lack of existing Hadoop expertise and frustration with hand-coding and scripting tools
• Cask Solution: Cask Hydrator for rapid creation of data pipelines and Cask Tracker for data discovery
• Result: POC in 2 days, Production in 2 months
Leading SaaS Platform taking new real-time, massive scale products to market
• Customer Situation: Small team and significant technical challenges limit pace of development and solution scale
• Cask Solution: CDAP for real-time ingestion and consistent processing with production operations support
• Result: Development in 1 month, Production in 3 months
Large Telco Enterprise building a centralized, secured, multi-tenant Data Lake
• Customer Situation: Multiple teams and technologies with widely varied skillsets and incompatible design choices
• Cask Solution: CDAP for data lake management and orchestration, tightly integrated into existing systems
• Result: Hundreds of Users, Thousands of Pipelines
23. Awards and Accolades
Cask was Named a Gartner Cool Vendor 2016
Cask was Certified a Great Place to Work 2016
“… for the rest of us who lack the technological chips or patience to make it all work, there’s good news: it will soon get easier, thanks to the work done by the big data pioneers, as well as vendors like Cask …”
(Alex Woodie, Managing Editor, Datanami)
“… Cask has tilted the playing field, earning a massive unfair advantage over proprietary point products for data integration and ingest …”
(Nik Rouda, Senior Analyst, Enterprise Strategy Group)
“… CDAP is a big win for us … the amount of code we needed to write was minimal with CDAP, and it was much easier and faster than we ever expected …”
(Jia-Long Wu, Data Architect, Lotame)
24. NEW: CDAP 4 - Big Data Apps on Tap!
• Release of CDAP 4 Preview: Available for download now!
• Cask Market: “Big Data App Store”
• Cask Wrangler: Interactive Data Preparation
• Resource Center: Interactive Wizards for Common Tasks
• Reimagined CDAP UI: Rewrite based on React
25. NEW: CDAP 4 - Big Data Apps on Tap!
Cask Market: The “App Store for Big Data”
• Goal: Time to value in minutes with no existing experience
• Application and Library Ecosystem with pre-built Hadoop solutions, reusable templates, and third-party plugins
• Available from anywhere inside the CDAP UI with a click
• Initially, everything in the Cask Market has been bootstrapped by Cask based on ongoing work across our customers, is 100% open source and available on GitHub
• Eventually, developers and ISVs will be able to showcase and market their own applications and libraries (e.g., Graylog)
Cask Market includes Interactive, Guided Wizards for Configuring Pre-Built Templates
26. Cask Resources
• Building Data Pipelines on Hadoop with Cask Hydrator
• Data Lake Webinar
• Introduction to Cask Hydrator
• CDAP - Containers on Hadoop
• CDAP Extensions - Cask Hydrator and Cask Tracker
• ESG Solution Spotlight
• CDAP Technical Concepts (video)
• Cask / Cloudera Solution Brief
27. Summary: Big Data on Tap
● CDAP provides the first unified integration platform for big data
● Cask Hydrator and Cask Tracker are visual extensions of CDAP for self-service access
● CDAP empowers enterprise IT to deliver faster time to value for Hadoop and Spark, from prototype to production
● Cask Market is a “big data app store” available in CDAP 4 with pre-built apps, pipelines, plugins
● CDAP is 100% open source, highly extensible, enterprise-ready, and commercially supported
31. Neither Hadoop Nor Spark Is a Solution
However, both are useful and
increasingly versatile components for
Big Data applications
32. The Evolution of the Little Elephant
• Hortonworks: Apache pure play. No apparent vision.
• Cloudera: Some proprietary components (Cloudera Manager, Impala, Cloudera Search). Vision is corporate data hub(?)
• MapR: Also some proprietary components (MapR-FS, MapR Streams, MapR-DB)
• And then there’s the cloud.
33. The Ship of Fools
Until Hadoop’s direction is controlled by
a single “captain” we may have to
tolerate the ship of fools
34. The “Big Data Hype Cycle” Is Misleading
• Big Data is an ecosystem, not a technology, which distorts this graph
• Some analytics applications have experienced “absurd acceleration”
• Hadoop is, in many instances, a laggard; Spark too
• Nevertheless, we seem to be exiting “the trough”
36. The System Management Issue
[Diagram. Labels: Mobile Devices, Desktops, Servers, IoT, The Cloud, Archive, Data Stores, Data Assaying, Data Capture, Real-Time Streaming?, Data Mgt, Data Serving, The Prospecting Domain, Apps, Data Life Cycle Mgt, Staging Area (Hadoop?), System Management]
37. The Fundamental Issue
Big Data does not really have a
foundation. Neither, imho, does the
Data Lake.
Luckily, there are third parties…
38.
• Regarding Hadoop, do you have any “preferred components?”
• How do you stay current with the various distros? Backward compatibility? Can a customer upgrade at will?
• How does your technology impact performance (if at all)?
• Do you provide a consultancy service?
39.
• Which companies/services do you regard as competitive?
• Do you have any specific partners?
• What does an implementation look like?