DREMIO
The Heterogeneous Data Lake
Tomer Shiran, Co-Founder & CEO at Dremio
tshiran@dremio.com | @tshiran
Hadoop Summit Europe 2016
April 13, 2016
DREMIO
Company Background
Jacques Nadeau
Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Arrow & Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran
Founder & CEO
• MapR (VP Product); Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem
Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source
DREMIO
The Rise of Heterogeneous Data Infrastructure
[Timeline: 1980 to 2016]
DREMIO
Can’t Simply Connect a BI Tool…
• Too slow for interactive analysis
• Manual process to map data to relational model
• NoSQL data often inconsistent & unclean (e.g., mixed types)
DREMIO
Can’t Simply ETL the Data Into One System…
The Relational World
[Diagram: many RDBMS sources feeding a DW]
• ETL between similar systems
• SQL -> SQL
• Flat -> flat
• Small & slowly evolving data
• Even then, ETL was hard!
Today
[Diagram: S3, HDFS, Solr, Oracle, MongoDB, SQL Server, HBase, and Elastic feeding a DW]
• ETL between very different systems
• Search -> SQL
• Complex -> flat
• Big & rapidly evolving data
• ETL is now much harder…
DREMIO
Towards a Heterogeneous Data Lake…
• A platform that enables data analysis across disparate data sources
• Storage-agnostic
– The data can live anywhere
– Join across disparate data sources
– Leverage the strengths of each data source
• There’s a reason it was chosen to store that data…
• Client-agnostic
– Tableau, Qlik, Power BI, Excel, R, …
• Scalability & performance
– It’s the era of Big Data…
• Simple & complex analysis
DREMIO
Apache Arrow: Columnar In-Memory Execution
Arrow is backed by the lead developers of the major open source Big Data technologies
• 10-100x speedup on modern CPUs
• High-performance sharing & interchange
• High-speed Python and R integration
Apache Arrow is the new standard for columnar in-memory execution technology
Data Sources: Parquet, HBase, Kudu, Phoenix, Hadoop, Cassandra
Execution: Drill, Spark, Impala, Storm
Data Science: Pandas (Python), R, Ibis
DREMIO
Arrow Enables High Performance Interchange
Pre-Arrow
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
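The contrast on this slide can be sketched in plain Python. This is a toy illustration only and does not use Arrow's API: `json` stands in for a system-specific wire format, and a `memoryview` over an `array` stands in for a shared in-memory buffer. All names are hypothetical.

```python
import json
from array import array

# A column of 64-bit integers owned by a hypothetical "producer" system.
column = array("q", range(1_000))

def exchange_by_serialization(values):
    """Pre-Arrow style: encode to a wire format, then decode into a new copy."""
    wire = json.dumps(list(values))      # producer serializes
    return array("q", json.loads(wire))  # consumer deserializes into its own buffer

def exchange_by_shared_buffer(values):
    """Shared-format style: hand over a view of the same memory, with no copy."""
    return memoryview(values)

copied = exchange_by_serialization(column)
shared = exchange_by_shared_buffer(column)

# The copy is detached from the producer's memory; the view is not.
column[0] = 42
print(copied[0], shared[0])
```

The serialization path does work proportional to the data size on both sides, while the shared view costs the same regardless of how many values the column holds.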
DREMIO
Arrow is Designed for CPU Efficiency
[Diagram: Traditional Memory Buffer vs. Arrow Memory Buffer]
• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
• Operate directly on columnar compressed data
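A rough sketch of the two layouts, using only the Python standard library. It shows the shape of the data rather than the CPU effects, which Python itself does not expose; the field names and values are made up for illustration.

```python
from array import array

# Row-oriented: each record is a separate object, so one field's values are scattered.
rows = [{"business_id": i, "stars": float(i % 5)} for i in range(10)]

# Column-oriented: each field is one contiguous, fixed-width buffer.
columns = {
    "business_id": array("q", (r["business_id"] for r in rows)),
    "stars": array("d", (r["stars"] for r in rows)),
}

def sum_stars_rows(records):
    # Touches every record object just to read a single field.
    return sum(r["stars"] for r in records)

def sum_stars_columns(cols):
    # Scans one contiguous buffer of doubles.
    return sum(cols["stars"])

def value_at(cols, name, i):
    # Constant-time access: element i sits at a fixed offset in the buffer.
    return cols[name][i]

print(sum_stars_rows(rows), sum_stars_columns(columns))
```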
DREMIO
Apache Drill: A Storage-Agnostic Query Engine
Clients: Tableau, Excel, Qlik, …; Custom Applications; CLI
Apache Drill
Data sources: MongoDB*, HBase, Elasticsearch*, MapR, HDFS, NAS, Local Files, Amazon S3, RDBMS*, Azure Blob Storage
* Currently being developed/enhanced
1. Query any data source as if it’s a relational database
2. Join data from multiple data sources in a single query
DREMIO
Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational
database environment on top of Hadoop, Drill instead enables a SQL
language interface to data in numerous formats, without requiring a formal
schema to be declared. This enables plug-and-play discovery over a huge
universe of data without prerequisites and preparation. So while Drill uses
SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses
the point. A better name might be SQL-on-Everything, with very low setup
requirements.”
DREMIO
ARCHITECTURE
DREMIO
Everything Starts With a Drillbit…
• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible
drillbit: single process (daemon or CLI)
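As a sketch of the REST option listed above, the snippet below builds (but does not send) an HTTP request for a drillbit. The `/query.json` path, the `queryType`/`query` body fields, and port 8047 are assumptions based on the Drill REST documentation and should be checked against the version in use; the helper name is hypothetical.

```python
import json
import urllib.request

def build_drill_query_request(sql, host="localhost", port=8047):
    """Build (but do not send) an HTTP POST carrying a SQL query for a drillbit."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",  # assumed endpoint, verify for your version
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_drill_query_request("SELECT name FROM mongo.yelp.business LIMIT 1")
print(request.get_method(), request.full_url)

# To run against a live drillbit (not executed here):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```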
DREMIO
Data Lake, More Like Data Maelstrom
Clustered Services: Hadoop Cluster (HDFS), HBase Cluster (HBase on HDFS), Elasticsearch Cluster (ES), MongoDB Cluster
Cloud Services: DynamoDB, Amazon S3
Desktops: Linux, Mac, Windows
DREMIO
Run Drill Co-Located with the Data, or Not
Clustered Services: HDFS, HBase, ES, and MongoDB nodes
Cloud Services: DynamoDB, Amazon S3
Desktops: Linux, Mac, Windows
[Diagram: the same environment with drillbit processes placed throughout]
DREMIO
Extensible Datastore Architecture
Execution Engine
Storage Plugin API: MongoDB Plugin, File Plugin, HBase Plugin, Hive Plugin, Kudu Plugin, Phoenix Plugin
File Plugin:
– FileSystem API: HDFS, S3, MapR-FS
– Format Plugin API: Parquet, JSON, CSV
Chapter 2: Connecting to Datastores
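The layering on this slide can be sketched as follows. Every class, function, and registry here is hypothetical and written for illustration; it is not Drill's actual plugin API. An in-memory dictionary stands in for both a file system and a database.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator

Record = Dict[str, object]

# Format plugins: turn file contents into records (hypothetical interface).
def read_json_lines(text: str) -> Iterator[Record]:
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)

def read_csv(text: str) -> Iterator[Record]:
    yield from csv.DictReader(io.StringIO(text))

FORMAT_PLUGINS: Dict[str, Callable[[str], Iterator[Record]]] = {
    ".json": read_json_lines,
    ".csv": read_csv,
}

class FilePlugin:
    """Storage plugin for file systems; delegates parsing to a format plugin."""

    def __init__(self, files: Dict[str, str]):
        self.files = files  # stands in for a FileSystem API (HDFS, S3, ...)

    def scan(self, table: str) -> Iterator[Record]:
        suffix = table[table.rfind("."):]
        return FORMAT_PLUGINS[suffix](self.files[table])

class InMemoryPlugin:
    """Stands in for a database plugin such as MongoDB or HBase."""

    def __init__(self, tables: Dict[str, list]):
        self.tables = tables

    def scan(self, table: str) -> Iterator[Record]:
        return iter(self.tables[table])

# The execution engine only sees the common scan() interface.
STORAGE_PLUGINS = {
    "dfs": FilePlugin({
        "business.json": '{"name": "A"}\n{"name": "B"}',
        "users.csv": "id,name\n1,Ann\n",
    }),
    "mongo": InMemoryPlugin({"business": [{"name": "C"}]}),
}

def scan(plugin: str, table: str) -> list:
    return list(STORAGE_PLUGINS[plugin].scan(table))

print(scan("dfs", "business.json"), scan("mongo", "business"))
```

Adding a new datastore in this sketch means registering one more object with a `scan` method; adding a new file format means registering one more reader function.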
DREMIO
QUERYING DATA
DREMIO
Referencing a Table
SELECT * FROM production.website.users;
Chapter 3: The Universal Namespace
Datastore: production · Workspace: website · Table: users
DREMIO
Run Your First Query
> SELECT name FROM mongo.yelp.business LIMIT 1;
+--------------------+
| name               |
+--------------------+
| Eric Goldberg, MD  |
+--------------------+
> SELECT name FROM dfs.root.`/opt/tutorial/yelp/business.json` LIMIT 1;
+--------------------+
| name               |
+--------------------+
| Eric Goldberg, MD  |
+--------------------+
DREMIO
Namespaces & Tables
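+----------------------------+-----------+-------------------+
| Storage Plugin Type        | Workspace | Table             |
+----------------------------+-----------+-------------------+
| mongo                      | Database  | Collection        |
| hive                       | Database  | Table             |
| hbase                      | Namespace | Table             |
| file (HDFS cluster, S3, …) | Directory | File or directory |
| …                          | …         | …                 |
+----------------------------+-----------+-------------------+
User defines these in the datastore configuration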
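A small sketch of how a three-part name such as `production.website.users` splits into datastore, workspace, and table, including backtick-quoted file paths. The parser and the `MEANING` lookup are hypothetical helpers written for illustration, not Drill code, and they handle only the three-part form.

```python
from typing import NamedTuple

class TableRef(NamedTuple):
    datastore: str
    workspace: str
    table: str

def parse_table_ref(name: str) -> TableRef:
    """Split 'datastore.workspace.table', honoring backtick-quoted parts."""
    parts, current, quoted = [], [], False
    for ch in name:
        if ch == "`":
            quoted = not quoted          # toggle quoting and drop the backtick
        elif ch == "." and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    if len(parts) != 3:
        raise ValueError(f"expected datastore.workspace.table, got {name!r}")
    return TableRef(*parts)

# What "workspace" and "table" mean for each plugin type (from the slide's table).
MEANING = {
    "mongo": ("Database", "Collection"),
    "hive": ("Database", "Table"),
    "hbase": ("Namespace", "Table"),
    "file": ("Directory", "File or directory"),
}

print(parse_table_ref("production.website.users"))
print(parse_table_ref("dfs.root.`/opt/tutorial/yelp/business.json`"))
```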
DREMIO
Joining Across Datastores is Easy!
> SELECT *
  FROM dfs.root.`yelp/review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id;
dfs: alias to a specific file system (S3, HDFS, local, NAS)
mongo: alias to a specific MongoDB cluster
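Conceptually, the query above pairs records from two unrelated sources on a shared key. A hash join is one common way to do that; the sketch below shows the idea on made-up sample records and makes no claim about the plan Drill actually chooses.

```python
from collections import defaultdict

# Hypothetical sample records standing in for the two sources.
reviews = [  # e.g. rows scanned from a JSON file
    {"business_id": "b1", "stars": 5},
    {"business_id": "b2", "stars": 3},
    {"business_id": "b1", "stars": 4},
]
businesses = [  # e.g. documents scanned from a MongoDB collection
    {"business_id": "b1", "name": "Mon Ami Gabi"},
    {"business_id": "b2", "name": "Wicked Spoon"},
]

def hash_join(left, right, key):
    """Inner equi-join: index the right side by key, then probe with the left."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    for row in left:
        for match in index.get(row[key], []):
            yield {**match, **row}

joined = list(hash_join(reviews, businesses, "business_id"))
for row in joined:
    print(row)
```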
DREMIO
What Business Has the Most Reviews on Yelp?
> SELECT b.name AS name, COUNT(*) AS reviews
  FROM dfs.yelp.`review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id
  GROUP BY b.business_id, b.name
  ORDER BY reviews DESC
  LIMIT 3;
+-------------------+----------+
| name              | reviews  |
+-------------------+----------+
| Mon Ami Gabi      | 3695     |
| Earl of Sandwich  | 3263     |
| Wicked Spoon      | 3011     |
+-------------------+----------+
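The GROUP BY, ORDER BY, and LIMIT steps of this query can be mimicked on a handful of in-memory rows. The rows and counts below are invented for illustration and are not the Yelp results shown on the slide.

```python
from collections import Counter

# Hypothetical joined rows: one (business_id, name) pair per review.
joined = [
    ("b1", "Mon Ami Gabi"), ("b2", "Earl of Sandwich"), ("b1", "Mon Ami Gabi"),
    ("b3", "Wicked Spoon"), ("b1", "Mon Ami Gabi"), ("b2", "Earl of Sandwich"),
]

def top_reviewed(rows, limit):
    counts = Counter(rows)               # GROUP BY business_id, name with COUNT(*)
    ranked = counts.most_common(limit)   # ORDER BY reviews DESC, then LIMIT
    return [(name, n) for (_, name), n in ranked]

print(top_reviewed(joined, 3))
```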
DREMIO
Native JSON Data Model
{
  "business_id": 123,
  "name": "McDonalds",
  "categories": ["restaurant", "fast food"],
  "attributes": {
    "family friendly": true,
    "fast": true,
    "romantic": false
  }
}
Access Arrays
  SELECT categories[0]
Access Maps
  WHERE t.attributes.romantic IS TRUE
Flatten Arrays
  SELECT name, FLATTEN(categories)
Extract Keys
  SELECT name, KVGEN(attributes)
Flatten Maps
  SELECT name, FLATTEN(KVGEN(attributes))
Access Embedded JSON Blobs
  SELECT d.address.state
  FROM (SELECT CONVERT_FROM(t.data, 'JSON') d FROM t)
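For readers more at home in a general-purpose language, the operations on this slide correspond roughly to the following Python. The `kvgen` helper is a hypothetical stand-in that mirrors the key/value shape KVGEN produces, and the embedded-blob row is made up.

```python
import json

record = {
    "business_id": 123,
    "name": "McDonalds",
    "categories": ["restaurant", "fast food"],
    "attributes": {"family friendly": True, "fast": True, "romantic": False},
}

def kvgen(mapping):
    """Turn a map into a list of {"key": ..., "value": ...} entries."""
    return [{"key": k, "value": v} for k, v in mapping.items()]

first_category = record["categories"][0]                 # categories[0]
is_romantic = record["attributes"]["romantic"] is True   # attributes.romantic IS TRUE
pairs = kvgen(record["attributes"])                      # KVGEN(attributes)

# An embedded JSON blob: a string column that itself holds JSON text.
row = {"data": '{"address": {"state": "NV"}}'}           # hypothetical row
state = json.loads(row["data"])["address"]["state"]      # CONVERT_FROM, then path access

print(first_category, is_romantic, pairs, state)
```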
DREMIO
Accessing Array Elements
> SELECT categories FROM business LIMIT 2;
+-------------------------------------------+
| categories                                |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"]  |
| ["Chinese","Restaurants"]                 |
+-------------------------------------------+
> SELECT categories[0] FROM business LIMIT 2;
+-------------------------+
| EXPR$0                  |
+-------------------------+
| American (Traditional)  |
| Chinese                 |
+-------------------------+
DREMIO
FLATTEN
• FLATTEN converts a single record with an array field into multiple records
– One output record for each array element
• Non-FLATTENed fields are repeated in each of the output records
> SELECT categories FROM business LIMIT 2;
+-------------------------------------------+
| categories                                |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"]  |
| ["Chinese","Restaurants"]                 |
+-------------------------------------------+
> SELECT FLATTEN(categories) FROM business LIMIT 4;
+-------------------------+
| EXPR$0                  |
+-------------------------+
| American (Traditional)  |
| Restaurants             |
| Chinese                 |
| Restaurants             |
+-------------------------+
DREMIO
Non-FLATTENed Fields are Repeated
> SELECT name, categories FROM business LIMIT 2;
+------------------------------+-------------------------------------------+
| name                         | categories                                |
+------------------------------+-------------------------------------------+
| Deforest Family Restaurant   | ["American (Traditional)","Restaurants"]  |
| Chang Jiang Chinese Kitchen  | ["Chinese","Restaurants"]                 |
+------------------------------+-------------------------------------------+
> SELECT name, FLATTEN(categories) FROM business LIMIT 4;
+------------------------------+-------------------------+
| name                         | EXPR$1                  |
+------------------------------+-------------------------+
| Deforest Family Restaurant   | American (Traditional)  |
| Deforest Family Restaurant   | Restaurants             |
| Chang Jiang Chinese Kitchen  | Chinese                 |
| Chang Jiang Chinese Kitchen  | Restaurants             |
+------------------------------+-------------------------+
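The behavior shown on the two FLATTEN slides can be modeled with a short generator. This is a sketch of the semantics on the slide's two sample records, not Drill's implementation; the function name is hypothetical.

```python
def flatten(records, field):
    """Emit one output record per element of each record's array field."""
    for record in records:
        for element in record[field]:
            # Copy every other field unchanged; replace the array with one element.
            yield {**record, field: element}

business = [
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "Chang Jiang Chinese Kitchen",
     "categories": ["Chinese", "Restaurants"]},
]

flattened = list(flatten(business, "categories"))
for row in flattened:
    print(row["name"], "|", row["categories"])
```

Two input records with two categories each produce four output records, and `name` is repeated on each one.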
DREMIO
ODBC and JDBC
• Drill includes standard ODBC/JDBC drivers
– ODBC for native apps
– JDBC for Java apps
• User installs the driver on the client
– The same machine as the BI tool
• Driver communicates with Drill cluster(s)
• Make sure driver and cluster are compatible versions
[Diagram: Tableau with the Drill ODBC Driver and TIBCO Spotfire with the Drill JDBC Driver, each on a client (e.g., a laptop), connecting to the Drill Cluster]
DREMIO
DEMO TIME!
DREMIO
Thank You
• Learn about Apache Arrow
• Jacques Nadeau’s blog post: www.dremio.com/blog/Apache-Arrow/
• Apache Arrow website: arrow.apache.org
• Download Apache Drill: drill.apache.org
• Reach out to learn more about the Dremio private beta
• Email me: tshiran@dremio.com
• Sign up on the site: www.dremio.com
DREMIO
APPENDIX
DREMIO
Questions
• User trends based on yelping_since (Mongo)
• Top business categories, with coloring by state
• Which businesses are gross? (Elastic<->Mongo)
• Which of those had the most website clicks?
– distinct(business_id) on elastic, mongo.business, hdfs.default.click

More Related Content

What's hot

HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSDataWorks Summit
 
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseApache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseJosh Elser
 
Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDataWorks Summit
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...DataWorks Summit/Hadoop Summit
 
Cloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World ConsiderationsCloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World ConsiderationsDataWorks Summit/Hadoop Summit
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleDataWorks Summit/Hadoop Summit
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0DataWorks Summit
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBaseCon
 
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonImproving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonDataWorks Summit/Hadoop Summit
 

What's hot (20)

Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Apache Phoenix + Apache HBase
Apache Phoenix + Apache HBaseApache Phoenix + Apache HBase
Apache Phoenix + Apache HBase
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Curb your insecurity with HDP
Curb your insecurity with HDPCurb your insecurity with HDP
Curb your insecurity with HDP
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseApache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
 
Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Cloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World ConsiderationsCloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World Considerations
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
 
Apache phoenix
Apache phoenixApache phoenix
Apache phoenix
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
Empower Data-Driven Organizations
Empower Data-Driven OrganizationsEmpower Data-Driven Organizations
Empower Data-Driven Organizations
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
 
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonImproving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
 

Viewers also liked

Using a Data Lake at the core of a Life Assurance business
Using a Data Lake at the core of a Life Assurance businessUsing a Data Lake at the core of a Life Assurance business
Using a Data Lake at the core of a Life Assurance businessDataWorks Summit/Hadoop Summit
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran John Mulhall
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!DataWorks Summit/Hadoop Summit
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksDataWorks Summit/Hadoop Summit
 
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...DataWorks Summit/Hadoop Summit
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersDataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Fast Distributed Online Classification
Fast Distributed Online Classification Fast Distributed Online Classification
Fast Distributed Online Classification
 
Using a Data Lake at the core of a Life Assurance business
Using a Data Lake at the core of a Life Assurance businessUsing a Data Lake at the core of a Life Assurance business
Using a Data Lake at the core of a Life Assurance business
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Rocking the World of Big Data at Centrica
Rocking the World of Big Data at CentricaRocking the World of Big Data at Centrica
Rocking the World of Big Data at Centrica
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
 
Securing Spark Applications
Securing Spark ApplicationsSecuring Spark Applications
Securing Spark Applications
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Data Process Systems, connecting everything
Data Process Systems, connecting everythingData Process Systems, connecting everything
Data Process Systems, connecting everything
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
 
Powering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big DataPowering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big Data
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache Hadoop
 
A Continuously Deployed Hadoop Analytics Platform?
A Continuously Deployed Hadoop Analytics Platform?A Continuously Deployed Hadoop Analytics Platform?
A Continuously Deployed Hadoop Analytics Platform?
 
Hadoop Everywhere
Hadoop EverywhereHadoop Everywhere
Hadoop Everywhere
 
Practical advice to build a data driven company
Practical advice to build a data driven companyPractical advice to build a data driven company
Practical advice to build a data driven company
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
NLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-TextNLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-Text
 
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
 

Similar to The Heterogeneous Data lake

Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillDataWorks Summit
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Precisely
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Hortonworks
 
Experience SQL Server 2017: The Modern Data Platform
Experience SQL Server 2017: The Modern Data PlatformExperience SQL Server 2017: The Modern Data Platform
Experience SQL Server 2017: The Modern Data PlatformBob Ward
 
Securing your Big Data Environments in the Cloud
Securing your Big Data Environments in the CloudSecuring your Big Data Environments in the Cloud
Securing your Big Data Environments in the CloudDataWorks Summit
 
Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Imply
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSStéphane Fréchette
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBDenny Lee
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsKognitio
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
OrientDB for real & Web App development
OrientDB for real & Web App developmentOrientDB for real & Web App development
OrientDB for real & Web App developmentLuca Garulli
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Lace Lofranco
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialRoxycodone Online
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataHostedbyConfluent
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 

Similar to The Heterogeneous Data lake (20)

Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014
 
Experience SQL Server 2017: The Modern Data Platform
Experience SQL Server 2017: The Modern Data PlatformExperience SQL Server 2017: The Modern Data Platform
Experience SQL Server 2017: The Modern Data Platform
 
Securing your Big Data Environments in the Cloud
Securing your Big Data Environments in the CloudSecuring your Big Data Environments in the Cloud
Securing your Big Data Environments in the Cloud
 
Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
OrientDB for real & Web App development
OrientDB for real & Web App developmentOrientDB for real & Web App development
OrientDB for real & Web App development
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study Material
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier Data
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 


The Heterogeneous Data Lake

  • 1. DREMIO: The Heterogeneous Data Lake
      Tomer Shiran, Co-Founder & CEO at Dremio
      tshiran@dremio.com | @tshiran
      Hadoop Summit Europe 2016, April 13, 2016
  • 2. Company Background
      • Stealth data analytics startup
      • Founded in 2015
      • Led by experts in Big Data and open source
      • Top Silicon Valley VCs
      Jacques Nadeau, Founder & CTO
      • Recognized SQL & NoSQL expert
      • Apache Arrow & Drill PMC Chair
      • Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
      Tomer Shiran, Founder & CEO
      • MapR (VP Product); Microsoft; IBM Research
      • Apache Drill Founder
      • Carnegie Mellon, Technion
      Julien Le Dem, Architect
      • Apache Parquet Founder
      • Apache Pig PMC Member
      • Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
  • 3. The Rise of Heterogeneous Data Infrastructure (figure: timeline, 1980 to 2016)
  • 4. Can’t Simply Connect a BI Tool…
      • Too slow for interactive analysis
      • Manual process to map data to relational model
      • NoSQL data often inconsistent & unclean (e.g., mixed types)
  • 5. Can’t Simply ETL the Data Into One System…
      The Relational World (diagram: RDBMS sources feeding a DW)
      • ETL between similar systems
      • SQL -> SQL
      • Flat -> flat
      • Small & slowly evolving data
      • Even then, ETL was hard!
      Today (diagram: S3, HDFS, Solr, Oracle, MongoDB, SQL Server, HBase, Elastic feeding a DW)
      • ETL between very different systems
      • Search -> SQL
      • Complex -> flat
      • Big & rapidly evolving data
      • ETL is now much harder…
  • 7. Towards a Heterogeneous Data Lake…
      • A platform that enables data analysis across disparate data sources
      • Storage-agnostic
        – The data can live anywhere
        – Join across disparate data sources
        – Leverage the strengths of each data source
          • There’s a reason it was chosen to store that data…
      • Client-agnostic
        – Tableau, Qlik, Power BI, Excel, R, …
      • Scalability & performance
        – It’s the era of Big Data…
      • Simple & complex analysis
  • 8. Apache Arrow: Columnar In-Memory Execution
      Apache Arrow is the new standard for columnar in-memory execution technology
      • 10-100x speedup on modern CPUs
      • High-performance sharing & interchange
      • High-speed Python and R integration
      Arrow is backed by the lead developers of the major open source Big Data technologies
      • Data Sources: Parquet, HBase, Kudu, Phoenix, Hadoop, Cassandra
      • Execution: Drill, Spark, Impala, Storm
      • Data Science: Pandas (Python), R, Ibis
  • 9. Arrow Enables High Performance Interchange
      Pre-Arrow
      • Each system has its own internal memory format
      • 70-80% CPU wasted on serialization and deserialization
      • Similar functionality implemented in multiple projects
      With Arrow
      • All systems utilize the same memory format
      • No overhead for cross-system communication
      • Projects can share functionality (e.g., Parquet-to-Arrow reader)
  • 10. Arrow is Designed for CPU Efficiency
      (figure: Traditional Memory Buffer vs. Arrow Memory Buffer)
      • Cache locality
      • Super-scalar & vectorized operation
      • Minimal structure overhead
      • Constant value access
      • Operate directly on columnar compressed data
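The row-versus-column distinction on this slide can be sketched in a few lines of Python. This is only an illustration: the standard-library `array` module stands in for a contiguous typed buffer, it is not Arrow's actual memory format, and the records and field names are made up.

```python
from array import array

# Hypothetical records; the field names and values are illustrative only.
rows = [
    {"business_id": 1, "stars": 4.5, "reviews": 120},
    {"business_id": 2, "stars": 3.0, "reviews": 45},
    {"business_id": 3, "stars": 5.0, "reviews": 300},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous typed buffer per field."""
    return {
        "business_id": array("q", (r["business_id"] for r in records)),
        "stars": array("d", (r["stars"] for r in records)),
        "reviews": array("q", (r["reviews"] for r in records)),
    }


columns = to_columns(rows)

# Row layout: summing one field has to visit every record object.
total_row_layout = sum(r["reviews"] for r in rows)

# Columnar layout: the same sum scans a single contiguous buffer, and the
# i-th value of any column is a constant-time index lookup.
total_columnar = sum(columns["reviews"])
third_stars = columns["stars"][2]
```

Keeping each field in its own buffer is what makes the cache-locality and vectorization points above possible: a scan over one column never drags the other fields through the cache.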
  • 11. Apache Drill: A Storage-Agnostic Query Engine
      1. Query any data source as if it’s a relational database
      2. Join data from multiple data sources in a single query
      (diagram: clients on top of Apache Drill, data sources below)
      Clients: Tableau, Excel, Qlik, …; Custom Applications; CLI
      Data sources: MongoDB*, HBase, Elasticsearch*, MapR, HDFS, NAS, Local Files, Amazon S3, RDBMS*, Azure Blob Storage
      * Currently being developed/enhanced
  • 12. Omni-SQL (“SQL-on-Everything”)
      Drill: Omni-SQL
      “Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements.”
  • 14. Everything Starts With a Drillbit…
      drillbit: single process (daemon or CLI)
      • High performance query executor
      • In-memory columnar execution
      • Directly interacts with data, acquiring knowledge as it reads
      • Built to leverage large amounts of memory
      • Networked or not
      • Exposes ODBC, JDBC, REST
      • Built-in Web UI and CLI
      • Extensible
  • 15. Data Lake, More Like Data Maelstrom
      (diagram of where the data lives)
      Clustered Services: Hadoop Cluster (HDFS nodes), HBase Cluster (HBase and HDFS nodes), Elasticsearch Cluster (ES nodes), MongoDB Cluster (MongoDB nodes)
      Cloud Services: DynamoDB, Amazon S3
      Desktops: Linux, Mac, Windows
  • 16. Run Drill Co-Located with the Data, or Not
      (diagram: the same layout of nodes, with drillbit processes drawn across it)
      Clustered Services: HDFS, HBase, ES, and MongoDB nodes
      Cloud Services: DynamoDB, Amazon S3
      Desktops: Linux, Mac, Windows
  • 17. Extensible Datastore Architecture
      (diagram; figure label: Chapter 2: Connecting to Datastores)
      Execution Engine
      Storage Plugin API: MongoDB Plugin, File Plugin, HBase Plugin, Hive Plugin, Kudu Plugin, Phoenix Plugin
      FileSystem API: HDFS, S3, MapR-FS
      Format Plugin API: Parquet, JSON, CSV
  • 19. Referencing a Table
      SELECT * FROM production.website.users;
      – production: Datastore
      – website: Workspace
      – users: Table
      (figure label: Chapter 3: The Universal Namespace)
  • 20. Run Your First Query
      > SELECT name FROM mongo.yelp.business LIMIT 1;
      +--------------------+
      | name               |
      +--------------------+
      | Eric Goldberg, MD  |
      +--------------------+
      > SELECT name FROM dfs.root.`/opt/tutorial/yelp/business.json` LIMIT 1;
      +--------------------+
      | name               |
      +--------------------+
      | Eric Goldberg, MD  |
      +--------------------+
  • 21. Namespaces & Tables
      Storage Plugin Type          Workspace    Table
      mongo                        Database     Collection
      hive                         Database     Table
      hbase                        Namespace    Table
      file (HDFS cluster, S3, …)   Directory    File or directory
      …                            …            …
      User defines these in the datastore configuration
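The three-part naming used in these slides (storage plugin, then workspace, then table, with backticks around parts that contain dots or slashes) can be illustrated with a small Python helper. `split_table_reference` is a hypothetical function written for this sketch; it is not part of Drill and handles only the three-part form shown in the slides.

```python
import re

# A part is either a backtick-quoted string or a run of characters
# containing neither a dot nor a backtick.
_PART = re.compile(r"`([^`]*)`|([^.`]+)")


def split_table_reference(ref):
    """Split 'plugin.workspace.table' into its three parts.

    Hypothetical helper for illustration only; it does not reproduce
    Drill's own SQL parser.
    """
    parts = [quoted or bare for quoted, bare in _PART.findall(ref)]
    if len(parts) != 3:
        raise ValueError(f"expected plugin.workspace.table, got {ref!r}")
    return tuple(parts)


mongo_ref = split_table_reference("mongo.yelp.business")
file_ref = split_table_reference("dfs.root.`/opt/tutorial/yelp/business.json`")
```

Here `mongo_ref` maps to a plugin, a database, and a collection, while `file_ref` maps to a plugin, a directory workspace, and a file path, matching the rows of the table above.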
  • 22. Joining Across Datastores is Easy!
      > SELECT * FROM dfs.root.`yelp/review.json` r, mongo.yelp.business b
        WHERE r.business_id = b.business_id;
      – dfs: alias to a specific file system (S3, HDFS, local, NAS)
      – mongo: alias to a specific MongoDB cluster
  • 23. What Business Has the Most Reviews on Yelp?
      > SELECT b.name AS name, COUNT(*) AS reviews
        FROM dfs.yelp.`review.json` r, mongo.yelp.business b
        WHERE r.business_id = b.business_id
        GROUP BY b.business_id, b.name
        ORDER BY reviews DESC
        LIMIT 3;
      +-------------------+----------+
      | name              | reviews  |
      +-------------------+----------+
      | Mon Ami Gabi      | 3695     |
      | Earl of Sandwich  | 3263     |
      | Wicked Spoon      | 3011     |
      +-------------------+----------+
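The join-then-aggregate shape of this query can be sketched in plain Python. The two lists below are made-up stand-ins for the review file and the MongoDB collection, and `top_reviewed` is a hypothetical helper; the sketch shows what the query computes, not how Drill executes it.

```python
from collections import Counter

# Made-up stand-ins for the two sources joined in the query.
reviews = [  # plays the role of dfs.yelp.`review.json`
    {"business_id": "b1"}, {"business_id": "b2"}, {"business_id": "b1"},
    {"business_id": "b3"}, {"business_id": "b1"}, {"business_id": "b2"},
]
businesses = [  # plays the role of mongo.yelp.business
    {"business_id": "b1", "name": "Cafe A"},
    {"business_id": "b2", "name": "Diner B"},
    {"business_id": "b3", "name": "Grill C"},
]


def top_reviewed(reviews, businesses, limit=3):
    """Join on business_id, count reviews per business, keep the top `limit`."""
    # Lookup table over the smaller side of the join.
    names = {b["business_id"]: b["name"] for b in businesses}
    # Inner join plus GROUP BY: count only reviews with a matching business.
    counts = Counter(
        r["business_id"] for r in reviews if r["business_id"] in names
    )
    # ORDER BY reviews DESC LIMIT n.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(names[business_id], n) for business_id, n in ranked[:limit]]


result = top_reviewed(reviews, businesses)
```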
  • 24. Native JSON Data Model
      Sample document:
        { "business_id": 123, "name": "McDonalds", "categories": ["restaurant", "fast food"], "attributes": { "family friendly": true, "fast": true, "romantic": false } }
      • Access Arrays: SELECT categories[0]
      • Access Maps: WHERE t.attributes.romantic IS TRUE
      • Flatten Arrays: SELECT name, FLATTEN(categories)
      • Extract Keys: SELECT name, KVGEN(attributes)
      • Flatten Maps: SELECT name, FLATTEN(KVGEN(attributes))
      • Access Embedded JSON Blobs: SELECT d.address.state FROM (SELECT CONVERT_FROM(t.data, 'JSON') d FROM t)
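As a rough Python analogue of the array access, map access, and KVGEN operations on this slide, using the slide's sample document: `kvgen` below is a hypothetical stand-in that mimics KVGEN's output shape (a list of key/value pairs) and is not Drill code.

```python
business = {
    "business_id": 123,
    "name": "McDonalds",
    "categories": ["restaurant", "fast food"],
    "attributes": {"family friendly": True, "fast": True, "romantic": False},
}

# Access Arrays: SELECT categories[0]
first_category = business["categories"][0]

# Access Maps: WHERE t.attributes.romantic IS TRUE
is_romantic = business["attributes"]["romantic"] is True


def kvgen(mapping):
    """Turn a map into a list of {"key": ..., "value": ...} pairs.

    Illustrative stand-in for KVGEN, which makes map keys queryable
    as data instead of as column names.
    """
    return [{"key": key, "value": value} for key, value in mapping.items()]


# Extract Keys: SELECT name, KVGEN(attributes)
pairs = kvgen(business["attributes"])
```

Once the map is a list of pairs, it can be flattened like any other array, which is what the FLATTEN(KVGEN(attributes)) form on the slide does.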
  • 25. Accessing Array Elements
      > SELECT categories FROM business LIMIT 2;
      +-------------------------------------------+
      | categories                                |
      +-------------------------------------------+
      | ["American (Traditional)","Restaurants"]  |
      | ["Chinese","Restaurants"]                 |
      +-------------------------------------------+
      > SELECT categories[0] FROM business LIMIT 2;
      +-------------------------+
      | EXPR$0                  |
      +-------------------------+
      | American (Traditional)  |
      | Chinese                 |
      +-------------------------+
  • 26. FLATTEN
      • FLATTEN converts a single record with an array field into multiple records
        – One output record for each array element
      • Non-FLATTENed fields are repeated in each of the output records
      > SELECT categories FROM business LIMIT 2;
      +-------------------------------------------+
      | categories                                |
      +-------------------------------------------+
      | ["American (Traditional)","Restaurants"]  |
      | ["Chinese","Restaurants"]                 |
      +-------------------------------------------+
      > SELECT FLATTEN(categories) FROM business LIMIT 4;
      +-------------------------+
      | EXPR$0                  |
      +-------------------------+
      | American (Traditional)  |
      | Restaurants             |
      | Chinese                 |
      | Restaurants             |
      +-------------------------+
  • 27. Non-FLATTENed Fields are Repeated
      > SELECT name, categories FROM business LIMIT 2;
      +------------------------------+-------------------------------------------+
      | name                         | categories                                |
      +------------------------------+-------------------------------------------+
      | Deforest Family Restaurant   | ["American (Traditional)","Restaurants"]  |
      | Chang Jiang Chinese Kitchen  | ["Chinese","Restaurants"]                 |
      +------------------------------+-------------------------------------------+
      > SELECT name, FLATTEN(categories) FROM business LIMIT 4;
      +------------------------------+-------------------------+
      | name                         | EXPR$1                  |
      +------------------------------+-------------------------+
      | Deforest Family Restaurant   | American (Traditional)  |
      | Deforest Family Restaurant   | Restaurants             |
      | Chang Jiang Chinese Kitchen  | Chinese                 |
      | Chang Jiang Chinese Kitchen  | Restaurants             |
      +------------------------------+-------------------------+
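The FLATTEN behaviour shown in the last two slides (one output record per array element, with the other fields repeated) can be sketched as a Python generator. The input records are the two businesses from the slide; `flatten` is a hypothetical helper illustrating the semantics, not Drill's implementation.

```python
from itertools import islice

businesses = [
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "Chang Jiang Chinese Kitchen",
     "categories": ["Chinese", "Restaurants"]},
]


def flatten(records, field):
    """Yield one copy of each record per element of record[field].

    Every other field is repeated unchanged in each output record.
    """
    for record in records:
        for element in record[field]:
            yield {**record, field: element}


# SELECT name, FLATTEN(categories) FROM business LIMIT 4
first_four = list(islice(flatten(businesses, "categories"), 4))
```

Because the helper is a generator, the LIMIT is applied lazily: `islice` stops pulling records as soon as four have been produced.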
  • 28. ODBC and JDBC
      • Drill includes standard ODBC/JDBC drivers
        – ODBC for native apps
        – JDBC for Java apps
      • User installs the driver on the client
        – The same machine as the BI tool
      • Driver communicates with Drill cluster(s)
      • Make sure driver and cluster are compatible versions
      (diagram: on the client (e.g., laptop), Tableau with the Drill ODBC Driver and TIBCO Spotfire with the Drill JDBC Driver, both connecting to the Drill Cluster)
  • 30. Thank You
      • Learn about Apache Arrow
        – Jacques Nadeau’s blog post: www.dremio.com/blog/Apache-Arrow/
        – Apache Arrow website: arrow.apache.org
      • Download Apache Drill: drill.apache.org
      • Reach out to learn more about the Dremio private beta
        – Email me: tshiran@dremio.com
        – Sign up on the site: www.dremio.com
  • 33. Questions
      • User trends based on yelping_since (Mongo)
      • Top business categories, with coloring by state
      • Which businesses are gross? (Elastic<->Mongo)
      • Which of those had the most website clicks?
        – distinct(business_id) on elastic, mongo.business, hdfs.default.click