Analyzing Real-World Data with Apache Drill 
© 2014 MapR Technologies
Data is doubling in size every two years

IDC estimates that in 2020, there will be 44 zettabytes of data in the world.

[Chart: 1.8 zettabytes (2011), 4.4 zettabytes (2013), 44 zettabytes (2020)]
Source: IDC Digital Universe
Unstructured data will account for more than 80% of the data collected by organizations

[Chart: total data stored, 1980 to 2020, structured data vs. unstructured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
NoSchema Datastores are Capturing this Data

              RELATIONAL DATABASES                     “NOSCHEMA” DATASTORES
Volume        MBs-GBs                                  TBs-PBs
Database      Fixed schema                             Dynamic schema (schema-free)
              DBA controls structure                   Application controls structure
Structure     Structured                               Structured, semi-structured and unstructured
Development   Planned (release cycle = months-years)   Iterative (release cycle = days-weeks)

[Timeline axis: 1980 1990 2000 2010 2020]
SQL in the Big Data World

WANT
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability

DON’T WANT
• Create and maintain schemas on:
– HDFS (Parquet, JSON, etc.)
– HBase
– MongoDB
• Transform or copy data

We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores
APACHE DRILL
• Schema-free scale-out query engine for Hadoop and NoSQL
• Point-and-query vs. schema-first
• Low latency
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
40+ contributors
150+ years of experience building databases and distributed systems
Evolution Towards Self-Service Data Exploration

                                Data Modeling and Transformation   Data Visualization
Traditional BI w/ RDBMS         IT-driven                          IT-driven
Self-Service BI w/ RDBMS        IT-driven                          Self-service
SQL-on-Hadoop                   IT-driven                          Self-service
Self-Service Data Exploration   Not needed                         Self-service

Zero-day analytics
Drill’s Data Model is Flexible

[Figure: data formats (HBase, JSON, BSON, CSV, TSV, Parquet, Avro) placed along two axes of flexibility: fixed schema to schema-less, and flat to complex]

RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table:
{ 
name: { 
first: Michael, 
last: Smith 
}, 
hobbies: [ski, soccer], 
district: Los Altos 
} 
{ 
name: { 
first: Jennifer, 
last: Gates 
}, 
hobbies: [sing], 
preschool: CCLC 
}
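To make the contrast concrete, the sketch below treats the two records above as plain Python dictionaries and derives the set of fields from the data itself instead of from a declared schema. It only illustrates the idea of self-describing data and says nothing about how Drill is implemented; the helper name `discover_fields` is made up for this example.

```python
# Two self-describing records, taken from the example above.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"],
     "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"],
     "preschool": "CCLC"},
]


def discover_fields(record, prefix=""):
    """Return the dotted paths of all leaf fields in one record (hypothetical helper)."""
    paths = set()
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Descend into nested maps such as "name".
            paths |= discover_fields(value, prefix=path + ".")
        else:
            paths.add(path)
    return paths


# The "schema" is whatever fields the records turn out to contain.
per_record = [discover_fields(r) for r in records]
all_fields = sorted(set().union(*per_record))
print(all_fields)

# Fields present in only some records; a fixed schema would have to anticipate these.
optional = sorted(set().union(*per_record) - set.intersection(*per_record))
print(optional)
```

Here `district` and `preschool` each appear in only one record, which is the situation a flat, fixed-schema table handles poorly.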
Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data

SCHEMA ON WRITE / SCHEMA BEFORE READ / SCHEMA ON THE FLY
Native JSON

JSON query with Drill:
SELECT po_document.AllowPartialShipment
FROM j_purchaseorder;

JSON query with Oracle:
SELECT json_value(po_document,
'$.AllowPartialShipment' RETURNING
NUMBER)
FROM j_purchaseorder;

Relational databases cannot provide true schema-free JSON support.
Architecture
High Level Architecture 
• Cluster of commodity servers 
– Daemon (drillbit) on each node 
• No dependency on other execution engines (MapReduce, Spark, Tez) 
– Better performance and manageability 
• ZooKeeper maintains ephemeral cluster membership information 
– drillbit uses ZooKeeper to find other drillbits in the cluster 
– Client uses ZooKeeper to find drillbits 
• Data processing unit is columnar record batches 
– Enables schema flexibility with negligible performance impact
Drill Maximizes Data Locality

[Figure: three ZooKeeper nodes, and cluster nodes each running a drillbit alongside a DataNode/RegionServer/mongod]

Data Source        Best Practice
HDFS or MapR-FS    drillbit on each DataNode
HBase or MapR-DB   drillbit on each RegionServer
MongoDB            drillbit on each mongod node (when using replicas, run it on the replica node)
SELECT* Query Execution

[Figure: client (JDBC, ODBC, REST), ZooKeeper nodes and drillbits]

1. Find drillbits (once per session)
2. Submit query to drillbit
3. Create logical and physical execution plans
4. Farm out execution of fragments to cluster (completely distributed execution)
5. Return results to client

* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
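As a rough illustration of the "submit query to drillbit" step from a REST client, the sketch below builds an HTTP request without sending it. It assumes a Drill release that exposes `POST /query.json` on the Web UI port with a JSON body holding `queryType` and `query`; the 0.7 release used in this deck may behave differently, and the function name `build_query_request` is made up for this example.

```python
import json
import urllib.request


def build_query_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) an HTTP request that submits one SQL query.

    Assumption: the drillbit accepts POST /query.json with this body shape.
    """
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("SELECT * FROM mongo.yelp.users LIMIT 1")
print(request.get_method(), request.full_url)

# To run it against a live drillbit (not executed here):
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Any drillbit can receive the request; per the steps above, that drillbit then plans the query and farms out fragments to the rest of the cluster.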
Core Modules within drillbit
• RPC Endpoint
• SQL Parser
• Optimizer
• Logical Plan
• Physical Plan
• Execution
• Distributed Cache
• Storage Plugins: DFS, Hive, HBase, MongoDB
Example: Analyzing Real-World Data 
Demo Plan 
1. Run Drill 
2. Configure DFS and MongoDB storage plugins 
3. Explore the data 
– Basics 
– Complex data 
– Views
Run Drill
Run Drill in Embedded Mode (sqlline)
$ tar xf apache-drill-0.7.0.tar.gz
$ cd apache-drill-0.7.0
$ bin/sqlline -u jdbc:drill:zk=local
> SELECT *
FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json`
LIMIT 1;
+---------------+------------+--------------+------------+------------+
| yelping_since | votes | review_count | name | user_id |
+---------------+------------+--------------+------------+------------+
| 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee | qtrmBGNqCvupHMHL_bKFgQ |
+---------------+------------+--------------+------------+------------+
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode (hence zk=local)
• Can’t use BI clients (JDBC/ODBC) in embedded mode
You can now access the Web UI: http://localhost:8047
Or Run Drill in Distributed Mode…
• Make sure ZooKeeper (zkServer) is running:
$ zkServer start
• Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
• Start drillbit:
$ bin/drillbit.sh start
• Access the Web UI: http://localhost:8047
• Connect a client to the cluster (e.g., sqlline):
$ bin/sqlline -u jdbc:drill:zk=localhost:2181
• Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
• If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string:
jdbc:drill:zk=localhost:2181/drill/<clustername>
• Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
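The connection strings above follow a small pattern that can be sketched in a few lines. The helper names `drill_jdbc_url` and `zookeeper_reachable` are made up for this example, the cluster name `mycluster` is a placeholder, and joining several ZooKeeper hosts with commas is an assumption based on the usual ZooKeeper connect-string syntax. The second helper is a scripted stand-in for the telnet check.

```python
import socket


def drill_jdbc_url(zk_hosts=None, cluster_name=None, zk_root="drill"):
    """Compose a Drill JDBC URL (hypothetical helper).

    zk_hosts: list of "host:port" strings, or None/empty for embedded mode.
    cluster_name: optional cluster id registered under the ZooKeeper root.
    """
    if not zk_hosts:
        return "jdbc:drill:zk=local"
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)
    if cluster_name:
        url += f"/{zk_root}/{cluster_name}"
    return url


def zookeeper_reachable(host="localhost", port=2181, timeout=2.0):
    """Return True if a TCP connection to the ZooKeeper port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(drill_jdbc_url())                                    # embedded mode
print(drill_jdbc_url(["localhost:2181"]))                  # single cluster
print(drill_jdbc_url(["localhost:2181"], "mycluster"))     # named cluster
```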
Configure Storage Plugins
Enable MongoDB Storage Plugin
Define Workspaces in the DFS Storage Plugin 
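The plugin configuration screens from these two slides are not preserved in this transcript. As a hedged sketch, the configurations might look roughly like the JSON generated below. The field names follow the general shape of Drill storage plugin configs as I understand them and should be checked against the release in use. The `demo` location is inferred from the paths used in the queries of this deck, while the MongoDB connection string and the `/tmp` location are assumptions.

```python
import json

# Assumed shape of the dfs plugin config; "demo" maps to the directory that
# makes dfs.demo.`yelp/review.json` equal to the full dfs.root path in the deck.
dfs_plugin = {
    "type": "file",
    "enabled": True,
    "connection": "file:///",
    "workspaces": {
        "root": {"location": "/", "writable": False},
        "demo": {"location": "/Users/tshiran/Development/demo/data",
                 "writable": False},
        # Views are created in dfs.tmp later on, so it has to be writable.
        "tmp": {"location": "/tmp", "writable": True},
    },
}

# Assumed shape of the mongo plugin config; host and port are placeholders.
mongo_plugin = {
    "type": "mongo",
    "enabled": True,
    "connection": "mongodb://localhost:27017/",
}

print(json.dumps(dfs_plugin, indent=2))
print(json.dumps(mongo_plugin, indent=2))
```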
Explore the Data: Basics
Inventory: DFS Files 
{ 
"votes": {"funny": 0, "useful": 2, "cool": 1}, 
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg", 
"review_id": "15SdjuK7DmYqUAj6rjGowg", 
"stars": 5, 
"date": "2007-05-17", 
"text": "dr. goldberg offers everything ...", 
"type": "review", 
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA" 
}
Inventory: MongoDB Collections 
$ mongo 
MongoDB shell version: 2.6.5 
> show databases; 
admin (empty) 
local 0.078GB 
yelp 0.453GB 
> use yelp 
> db.users.findOne() 
{ 
"_id" : ObjectId("54566cdf3237149de181a92a"), 
"yelping_since" : "2012-02", 
"votes" : { 
"funny" : 1, 
"useful" : 5, 
"cool" : 0 
}, 
"review_count" : 6, 
"name" : "Lee", 
"user_id" : "qtrmBGNqCvupHMHL_bKFgQ", 
"friends" : [ ] 
}
Let’s Go!
> SELECT *
FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`
WHERE stars = 1
LIMIT 1;
+------------+------------+------------+------------+------------+------------+------------+-------------+
| votes | user_id | review_id | stars | date | text | type | business_id |
+------------+------------+------------+------------+------------+------------+------------+-------------+
| {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review | vcNAWiLM4dR7D2nwwJ7nCA |
+------------+------------+------------+------------+------------+------------+------------+-------------+
Using Storage Plugins and Workspaces
> SELECT * FROM
dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`
LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;

A table reference such as dfs.demo.`yelp/review.json` has three parts: storage plugin (dfs), workspace (demo), and path relative to the workspace (yelp/review.json).

Storage Plugin   Workspace   Table
dfs              Path        Path relative to workspace
mongo            Database    Collection
hive             Database    Table
hbase            Namespace   Table
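The three-part naming can be sketched as a small parser that splits a reference into storage plugin, workspace and table. This is only an illustration of the naming convention, not Drill's SQL parser, and the function name `split_table_reference` is made up for this example.

```python
import re

# plugin.workspace.table, where each part is either a bare identifier or a
# backtick-quoted string (needed for paths that contain "/" or ".").
_PART = r"(`[^`]+`|[A-Za-z_][A-Za-z0-9_]*)"
_REFERENCE = re.compile(rf"^{_PART}\.{_PART}\.{_PART}$")


def split_table_reference(reference):
    """Split 'plugin.workspace.table' into its three parts (hypothetical helper)."""
    match = _REFERENCE.match(reference.strip())
    if match is None:
        raise ValueError(f"not a three-part reference: {reference!r}")
    # Drop the backticks that only serve as quoting.
    return tuple(part.strip("`") for part in match.groups())


print(split_table_reference("dfs.demo.`yelp/review.json`"))
print(split_table_reference("mongo.yelp.users"))
```

For the dfs plugin the last part is a file path, while for mongo it is a collection name, matching the table above.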
Most Common User Names (MongoDB) 
> SELECT name, count(*) AS users 
FROM mongo.yelp.users 
GROUP BY name 
ORDER BY users DESC LIMIT 10; 
+------------+------------+ 
| name | users | 
+------------+------------+ 
| David | 2453 | 
| John | 2378 | 
| Michael | 2322 | 
| Chris | 2202 | 
| Mike | 2037 | 
| Jennifer | 1867 | 
| Jessica | 1463 | 
| Jason | 1457 | 
| Michelle | 1439 | 
| Brian | 1436 | 
+------------+------------+
Cities with the Most Businesses 
> SELECT state, city, count(*) AS businesses 
FROM dfs.demo.`/yelp/business.json` 
GROUP BY state, city 
ORDER BY businesses DESC LIMIT 10; 
+------------+------------+-------------+ 
| state | city | businesses | 
+------------+------------+-------------+ 
| NV | Las Vegas | 12021 | 
| AZ | Phoenix | 7499 | 
| AZ | Scottsdale | 3605 | 
| EDH | Edinburgh | 2804 | 
| AZ | Mesa | 2041 | 
| AZ | Tempe | 2025 | 
| NV | Henderson | 1914 | 
| AZ | Chandler | 1637 | 
| WI | Madison | 1630 | 
| AZ | Glendale | 1196 | 
+------------+------------+-------------+
Explore the Data: Complex Data
business.json (1) 
{ 
"business_id": "4bEjOyTaDG24SY5TxsaUNQ", 
"full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
"hours": { 
"Monday": {"close": "23:00", "open": "07:00"}, 
"Tuesday": {"close": "23:00", "open": "07:00"}, 
"Friday": {"close": "00:00", "open": "07:00"}, 
"Wednesday": {"close": "23:00", "open": "07:00"}, 
"Thursday": {"close": "23:00", "open": "07:00"}, 
"Sunday": {"close": "23:00", "open": "07:00"}, 
"Saturday": {"close": "00:00", "open": "07:00"} 
}, 
"open": true, 
"categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"], 
"city": "Las Vegas", 
"review_count": 4084, 
"name": "Mon Ami Gabi", 
"neighborhoods": ["The Strip"], 
"longitude": -115.172588519464,
business.json (2) 
"state": "NV", 
"stars": 4.0, 
"attributes": { 
"Alcohol": "full_bar",
"Noise Level": "average", 
"Has TV": false, 
"Attire": "casual", 
"Ambience": { 
"romantic": true, 
"intimate": false, 
"touristy": false, 
"hipster": false, 
"classy": true, 
"trendy": false, 
"casual": false 
}, 
"Good For": {"dessert": false, "latenight": false, "lunch": false, 
"dinner": true, "breakfast": false, "brunch": false}, 
} 
}
Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Saturday.`open` < '22:00' AND
b.hours.Saturday.`close` > '22:00'
LIMIT 2;
+------------+------------+
| name | hours |
+------------+------------+
| Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
| Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
+------------+------------+
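The filter in this query compares zero-padded "HH:MM" strings, which sort the same way as the times they represent. The sketch below applies the same comparison to three Saturday entries copied from this deck; the function name `open_at` is made up for this example.

```python
# Saturday hours copied from the business.json sample and the query output.
businesses = [
    {"name": "Mon Ami Gabi",
     "hours": {"Saturday": {"open": "07:00", "close": "00:00"}}},
    {"name": "Chang Jiang Chinese Kitchen",
     "hours": {"Saturday": {"open": "11:00", "close": "22:30"}}},
    {"name": "Grand China Restaurant",
     "hours": {"Saturday": {"open": "11:00", "close": "23:00"}}},
]


def open_at(business, day, time):
    """Same predicate as the WHERE clause: open < time AND close > time."""
    hours = business.get("hours", {}).get(day)
    if hours is None:
        return False  # no hours recorded for that day
    return hours["open"] < time and hours["close"] > time


open_now = [b["name"] for b in businesses if open_at(b, "Saturday", "22:00")]
print(open_now)
```

One caveat of the plain string comparison: a closing time of "00:00" presumably means midnight at the end of the day, yet it compares as smaller than "22:00", so Mon Ami Gabi is filtered out here. Handling closing times after midnight would need extra logic.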
It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, stars, b.hours.Friday, categories
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Friday.`open` < '22:00' AND
b.hours.Friday.`close` > '22:00' AND
REPEATED_CONTAINS(categories, 'Mediterranean') AND
city = 'Las Vegas'
ORDER BY stars DESC
LIMIT 2;
+------------+------------+------------+------------+
| name | stars | EXPR$2 | categories |
+------------+------------+------------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------------+------------+------------+------------+
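In this query, REPEATED_CONTAINS is used as a membership test on the repeated `categories` field. A minimal Python stand-in, using rows copied from this deck, might look as follows. The function `repeated_contains` below is an illustration and not Drill's implementation.

```python
# Rows copied from the business.json sample and the query output.
businesses = [
    {"name": "Mon Ami Gabi", "stars": 4.0,
     "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"]},
    {"name": "Olives", "stars": 4.0,
     "categories": ["Mediterranean", "Restaurants"]},
    {"name": "Marrakech Moroccan Restaurant", "stars": 4.0,
     "categories": ["Mediterranean", "Middle Eastern", "Moroccan", "Restaurants"]},
]


def repeated_contains(values, wanted):
    """Exact-match membership test on a list (a stand-in, not Drill's function)."""
    return wanted in (values or [])


# WHERE REPEATED_CONTAINS(categories, 'Mediterranean') ORDER BY stars DESC
matches = sorted(
    (b for b in businesses if repeated_contains(b["categories"], "Mediterranean")),
    key=lambda b: b["stars"],
    reverse=True,
)
print([b["name"] for b in matches])
```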
Flatten Repeated Values 
> SELECT name, categories 
FROM dfs.demo.`yelp/business.json` LIMIT 3; 
+------------+------------+ 
| name | categories | 
+------------+------------+ 
| Eric Goldberg, MD | ["Doctors","Health & Medical"] | 
| Pine Cone Restaurant | ["Restaurants"] | 
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] | 
+------------+------------+ 
> SELECT name, FLATTEN(categories) AS categories 
FROM dfs.demo.`yelp/business.json` LIMIT 5; 
+------------+------------+ 
| name | categories | 
+------------+------------+ 
| Eric Goldberg, MD | Doctors | 
| Eric Goldberg, MD | Health & Medical | 
| Pine Cone Restaurant | Restaurants | 
| Deforest Family Restaurant | American (Traditional) | 
| Deforest Family Restaurant | Restaurants | 
+------------+------------+
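The effect of FLATTEN on the three rows above can be reproduced with a short generator: each element of the repeated column becomes its own output row, with the other columns copied. This is a sketch of the semantics shown in the output, not of Drill's implementation, and the function name `flatten` is chosen for this example.

```python
# The three input rows from the first query above.
rows = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]


def flatten(rows, column):
    """Yield one output row per element of the repeated column."""
    for row in rows:
        for element in row[column]:
            # Copy the row, replacing the list with a single element.
            yield {**row, column: element}


flat = list(flatten(rows, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])
```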
Most and Least Common Business Categories
> SELECT category, count(*) AS businesses
FROM (SELECT name, FLATTEN(categories) AS category
FROM dfs.demo.`yelp/business.json`) c
GROUP BY category ORDER BY businesses DESC;
+------------+------------+ 
| category | businesses | 
+------------+------------+ 
| Restaurants | 14303 | 
… 
| Australian | 1 | 
| Boat Dealers | 1 | 
| Firewood | 1 | 
+------------+------------+ 
715 rows selected (3.439 seconds) 
> SELECT name, categories FROM dfs.demo.`yelp/business.json` 
WHERE true and REPEATED_CONTAINS(categories, 'Australian'); 
+------------+------------+ 
| name | categories | 
+------------+------------+ 
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] | 
+------------+------------+
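The flatten-then-group pattern of the category query can be sketched with a counter. The input is limited to three businesses quoted earlier in this deck, so the counts below describe only that tiny sample and not the full data set.

```python
from collections import Counter

# Three businesses quoted earlier in the deck (not the full data set).
rows = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

# The inner generator plays the role of FLATTEN(categories);
# Counter plays the role of GROUP BY category with count(*).
counts = Counter(category for row in rows for category in row["categories"])

# most_common() orders by count, descending, like ORDER BY businesses DESC.
for category, businesses in counts.most_common():
    print(category, businesses)
```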
Explore the Data: Views
Create a View for Name-Gender Mapping
names.csv: columns[0] holds the name and columns[4] holds the gender
> CREATE VIEW dfs.tmp.`names` AS 
SELECT columns[0] AS name, columns[4] AS gender 
FROM dfs.demo.`names.csv`; 
> USE dfs.tmp; 
> CREATE VIEW names1 AS SELECT columns[0] AS name, columns[4] AS gender FROM
dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John'; 
+------------+------------+ 
| name | gender | 
+------------+------------+ 
| John | Male | 
+------------+------------+
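The view gives names to positional CSV columns. The sketch below mimics that with Python's csv module. The contents of names.csv are not shown in this transcript, so `NAMES_CSV` is a hypothetical stand-in: only the column positions and the three name/gender pairs come from the deck, and the middle columns are placeholders.

```python
import csv
import io

# Hypothetical stand-in for names.csv. Column 0 is the name and column 4 the
# gender, as in the view definition; the "x" values are placeholders.
NAMES_CSV = "John,x,x,x,Male\nJennifer,x,x,x,Female\nChris,x,x,x,Unknown\n"


def names_view(text):
    """Mimic the view: expose columns[0] as name and columns[4] as gender."""
    for columns in csv.reader(io.StringIO(text)):
        yield {"name": columns[0], "gender": columns[4]}


# SELECT * FROM names WHERE name = 'John'
print([row for row in names_view(NAMES_CSV) if row["name"] == "John"])
```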
Most Common Names (and their Genders) on Yelp
> SELECT u.name, n.gender, count(*) AS number
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY u.name, n.gender
ORDER BY number DESC LIMIT 10;
+------------+------------+------------+ 
| name | gender | number | 
+------------+------------+------------+ 
| David | Male | 2453 | 
| John | Male | 2378 | 
| Michael | Male | 2322 | 
| Chris | Unknown | 2202 | 
| Mike | Male | 2037 | 
| Jennifer | Female | 1867 | 
| Jessica | Female | 1463 | 
| Jason | Male | 1457 | 
| Michelle | Female | 1439 | 
| Brian | Male | 1436 | 
+------------+------------+------------+
Who Rates Higher – Men or Women? 
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars 
FROM mongo.yelp.users u, dfs.tmp.names n 
WHERE u.name = n.name 
GROUP BY n.gender; 
+------------+------------+------------+ 
| gender | users | stars | 
+------------+------------+------------+ 
| Female | 103684 | 3.77 | 
| Male | 97430 | 3.696 | 
| Unknown | 18409 | 3.727 | 
+------------+------------+------------+
Who Writes More – Men or Women? 
It takes a 3-way join to find out… 
> SELECT n.gender, round(avg(length(r.text))) AS review_length 
FROM dfs.demo.`yelp/review.json` r, 
mongo.yelp.users u, 
dfs.tmp.names n 
WHERE u.name = n.name AND r.user_id = u.user_id 
GROUP BY n.gender; 
+------------+---------------+ 
| gender | review_length | 
+------------+---------------+ 
| Male | 665 | 
| Female | 730 | 
| Unknown | 711 | 
+------------+---------------+
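The shape of this three-way join (reviews to users on `user_id`, users to names on `name`, then an average per gender) can be sketched with dictionaries as lookup tables. All rows below are hypothetical miniature inputs invented for illustration, so the resulting numbers have no relation to the real output above, and the lookup approach is not a description of how Drill executes the join. It also assumes each user_id and each name appears only once in its lookup table.

```python
from collections import defaultdict

# Hypothetical miniature inputs standing in for review.json, the MongoDB
# users collection and the names view.
reviews = [
    {"user_id": "u1", "text": "Great food"},
    {"user_id": "u1", "text": "Slow service, would not return"},
    {"user_id": "u2", "text": "Lovely patio"},
]
users = [{"user_id": "u1", "name": "John"},
         {"user_id": "u2", "name": "Jennifer"}]
names = [{"name": "John", "gender": "Male"},
         {"name": "Jennifer", "gender": "Female"}]

# Lookup tables for the two join keys.
name_by_user = {u["user_id"]: u["name"] for u in users}
gender_by_name = {n["name"]: n["gender"] for n in names}

lengths = defaultdict(list)
for review in reviews:
    name = name_by_user.get(review["user_id"])
    gender = gender_by_name.get(name)
    if gender is not None:  # inner join: drop reviews without a match
        lengths[gender].append(len(review["text"]))

# round(avg(length(r.text))) grouped by gender
review_length = {g: round(sum(v) / len(v)) for g, v in lengths.items()}
print(review_length)
```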
Drill Tweets (@ApacheDrill)
Thank You 
• Learn: incubator.apache.org/drill/ 
• Download: incubator.apache.org/drill/download/ 
• Ask questions: drill-user@incubator.apache.org 
• Contact me: tshiran@apache.org
Tomer Shiran, VP Product Management
tshiran@mapr.com
MapR: @mapr, maprtech, MapRTechnologies, mapr-technologies

More Related Content

What's hot

Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Gera Shegalov
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillMapR Technologies
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...The Hive
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache DrillCharles Givre
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Vince Gonzalez
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBradford Stephens
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Apache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseApache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseNag Arvind Gudiseva
 

What's hot (20)

Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Apache drill
Apache drillApache drill
Apache drill
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache Drill
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
HBase with MapR
HBase with MapRHBase with MapR
HBase with MapR
 
Apache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseApache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBase
 

Viewers also liked

Merlin: The Ultimate Data Science Environment
Merlin: The Ultimate Data Science EnvironmentMerlin: The Ultimate Data Science Environment
Merlin: The Ultimate Data Science EnvironmentCharles Givre
 
Strata NYC 2015 What does your smart device know about you?
Strata NYC 2015 What does your smart device know about you?Strata NYC 2015 What does your smart device know about you?
Strata NYC 2015 What does your smart device know about you?Charles Givre
 
What Does Your Smart Car Know About You? Strata London 2016
What Does Your Smart Car Know About You?  Strata London 2016What Does Your Smart Car Know About You?  Strata London 2016
What Does Your Smart Car Know About You? Strata London 2016Charles Givre
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill WorkshopCharles Givre
 
Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Charles Givre
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1Charles Givre
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsMapR Technologies
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies
 
Km 65 tahun 2002
Km 65 tahun 2002Km 65 tahun 2002
Km 65 tahun 2002Bp Nafri
 
RAPIM 2011
RAPIM 2011RAPIM 2011
RAPIM 2011Bp Nafri
 
Apache Storm - Minando redes sociales y medios en tiempo real
Apache Storm - Minando redes sociales y medios en tiempo realApache Storm - Minando redes sociales y medios en tiempo real
Apache Storm - Minando redes sociales y medios en tiempo realAndrés Mauricio Palacios
 
RAKORNIS 2010
RAKORNIS 2010RAKORNIS 2010
RAKORNIS 2010Bp Nafri
 
Pristine Advisers Presentation
Pristine Advisers PresentationPristine Advisers Presentation
Pristine Advisers PresentationPattyBaronowski
 

Viewers also liked (17)

Merlin: The Ultimate Data Science Environment
Merlin: The Ultimate Data Science EnvironmentMerlin: The Ultimate Data Science Environment
Merlin: The Ultimate Data Science Environment
 
Strata NYC 2015 What does your smart device know about you?
Strata NYC 2015 What does your smart device know about you?Strata NYC 2015 What does your smart device know about you?
Strata NYC 2015 What does your smart device know about you?
 
What Does Your Smart Car Know About You? Strata London 2016
What Does Your Smart Car Know About You?  Strata London 2016What Does Your Smart Car Know About You?  Strata London 2016
What Does Your Smart Car Know About You? Strata London 2016
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
 
Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
High-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in HadoopHigh-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in Hadoop
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Km 65 tahun 2002
Km 65 tahun 2002Km 65 tahun 2002
Km 65 tahun 2002
 
RAPIM 2011
RAPIM 2011RAPIM 2011
RAPIM 2011
 
Narkoba
NarkobaNarkoba
Narkoba
 
PSCO
PSCOPSCO
PSCO
 
Apache Storm - Minando redes sociales y medios en tiempo real
Apache Storm - Minando redes sociales y medios en tiempo realApache Storm - Minando redes sociales y medios en tiempo real
Apache Storm - Minando redes sociales y medios en tiempo real
 
RAKORNIS 2010
RAKORNIS 2010RAKORNIS 2010
RAKORNIS 2010
 
Pristine Advisers Presentation
Pristine Advisers PresentationPristine Advisers Presentation
Pristine Advisers Presentation
 

Analyzing Real-World Data with Apache Drill

  • 1. Analyzing Real-World Data with Apache Drill. © 2014 MapR Technologies
  • 2. Data is doubling in size every two years
  • 3. IDC estimates that in 2020, there will be 44 zettabytes of data in the world. Chart: 1.8 zettabytes (2011), 4.4 zettabytes (2013), 44 zettabytes (2020). Source: IDC Digital Universe
  • 4. Unstructured data will account for more than 80% of the data collected by organizations. Chart: total data stored, structured vs. unstructured, 1980-2020. Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
  • 5. NoSchema Datastores are Capturing this Data (timeline 1980-2020)
      Relational databases vs. “NoSchema” datastores:
      – Volume: MBs-GBs vs. TBs-PBs
      – Database: fixed schema, DBA controls structure vs. dynamic schema (schema-free), application controls structure
      – Structure: structured vs. structured, semi-structured and unstructured
      – Development: planned (release cycle = months-years) vs. iterative (release cycle = days-weeks)
  • 6. SQL in the Big Data World
      WANT:
      • SQL
      • BI (Tableau, MicroStrategy, etc.)
      • Low latency
      • Scalability
      DON’T WANT:
      • Create and maintain schemas on:
        – HDFS (Parquet, JSON, etc.)
        – HBase
        – MongoDB
      • Transform or copy data
      We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores
  • 7. APACHE DRILL
      • Schema-free scale-out query engine for Hadoop and NoSQL
      • Point-and-query vs. schema-first
      • Low latency
      • Extreme ease of use
      • Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
      40+ contributors. 150+ years of experience building databases and distributed systems
  • 8. Evolution Towards Self-Service Data Exploration
      – Traditional BI w/ RDBMS: data modeling and transformation IT-driven; data visualization IT-driven
      – Self-Service BI w/ RDBMS: data modeling and transformation IT-driven; data visualization self-service
      – SQL-on-Hadoop: data modeling and transformation IT-driven; data visualization self-service
      – Self-Service Data Exploration (zero-day analytics): data modeling and transformation not needed; data visualization self-service
  • 9. © 2014 MapR Technologies 9
  • 10. Drill’s Data Model is Flexible
      Axes: fixed schema to schema-less; flat to complex. Formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro
      RDBMS/SQL-on-Hadoop table:
      Name | Gender | Age
      Michael | M | 6
      Jennifer | F | 3
      Apache Drill table:
      { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
      { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 11. Drill Supports Schema Discovery On-The-Fly
      Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ):
      • Fixed schema
      • Leverage schema in centralized repository (Hive Metastore)
      Schema Discovered On-The-Fly (SCHEMA ON THE FLY):
      • Fixed schema, evolving schema or schema-less
      • Leverage schema in centralized repository or self-describing data
  • 12. Native JSON
      JSON query with Drill:
      SELECT po_document.AllowPartialShipment FROM j_purchaseorder;
      JSON query with Oracle:
      SELECT json_value(po_document, '$.AllowPartialShipment' RETURNING NUMBER) FROM j_purchaseorder;
      Relational databases cannot provide true schema-free JSON support.
  • 13. Architecture
  • 14. High Level Architecture
      • Cluster of commodity servers
        – Daemon (drillbit) on each node
      • No dependency on other execution engines (MapReduce, Spark, Tez)
        – Better performance and manageability
      • ZooKeeper maintains ephemeral cluster membership information
        – drillbit uses ZooKeeper to find other drillbits in the cluster
        – Client uses ZooKeeper to find drillbits
      • Data processing unit is columnar record batches
        – Enables schema flexibility with negligible performance impact
  • 15. Drill Maximizes Data Locality
      Diagram: a ZooKeeper ensemble and a set of nodes, each running a drillbit next to a DataNode/RegionServer/mongod
      Data Source | Best Practice
      HDFS or MapR-FS | drillbit on each DataNode
      HBase or MapR-DB | drillbit on each RegionServer
      MongoDB | drillbit on each mongod node (when using replicas, run it on the replica node)
  • 16. SELECT* Query Execution
      Components: client (JDBC, ODBC, REST), ZooKeeper, drillbits
      1. Find drillbits (once per session)
      2. Submit query to drillbit
      3. Create logical and physical execution plans
      4. Farm out execution of fragments to cluster (completely distributed execution)
      5. Return results to client
      * CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
  • 17. Core Modules within drillbit: RPC Endpoint, SQL Parser, Logical Plan, Optimizer, Physical Plan, Execution, Distributed Cache, Storage Plugins (DFS, Hive, HBase, MongoDB)
  • 18. Example: Analyzing Real-World Data
  • 19. Demo Plan
      1. Run Drill
      2. Configure DFS and MongoDB storage plugins
      3. Explore the data
        – Basics
        – Complex data
        – Views
  • 20. Run Drill
  • 21. Run Drill in Embedded Mode (sqlline)
      $ tar xf apache-drill-0.7.0.tar.gz
      $ cd apache-drill-0.7.0
      $ bin/sqlline -u jdbc:drill:zk=local
      > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json` LIMIT 1;
      +---------------+------------+--------------+------------+------------+
      | yelping_since | votes | review_count | name | user_id |
      +---------------+------------+--------------+------------+------------+
      | 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee | qtrmBGNqCvupHMHL_bKFgQ |
      You can now access the Web UI: http://localhost:8047
      • drillbit (Drill daemon) starts automatically in embedded mode
      • No ZooKeeper in embedded mode (hence zk=local)
      • Can’t use BI clients (JDBC/ODBC) in embedded mode
  • 22. Or Run Drill in Distributed Mode…
      • Make sure ZooKeeper (zkServer) is running: $ zkServer start
      • Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
      • Start drillbit: $ bin/drillbit.sh start
      • Access the Web UI: http://localhost:8047
      • Connect a client to the cluster (e.g., sqlline): $ bin/sqlline -u jdbc:drill:zk=localhost:2181
      • Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
      • If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
      • Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
  • 23. Configure Storage Plugins
  • 24. Enable MongoDB Storage Plugin
  • 25. Define Workspaces in the DFS Storage Plugin
  • 26. Explore the Data: Basics
  • 27. Inventory: DFS Files
      {
        "votes": {"funny": 0, "useful": 2, "cool": 1},
        "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
        "review_id": "15SdjuK7DmYqUAj6rjGowg",
        "stars": 5,
        "date": "2007-05-17",
        "text": "dr. goldberg offers everything ...",
        "type": "review",
        "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
      }
  • 28. Inventory: MongoDB Collections
      $ mongo
      MongoDB shell version: 2.6.5
      > show databases;
      admin (empty)
      local 0.078GB
      yelp 0.453GB
      > use yelp
      > db.users.findOne()
      {
        "_id" : ObjectId("54566cdf3237149de181a92a"),
        "yelping_since" : "2012-02",
        "votes" : { "funny" : 1, "useful" : 5, "cool" : 0 },
        "review_count" : 6,
        "name" : "Lee",
        "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
        "friends" : [ ]
      }
  • 29. Let’s Go!
      > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` WHERE stars = 1 LIMIT 1;
      +------------+------------+------------+------------+------------+------------+------------+-------------+
      | votes | user_id | review_id | stars | date | text | type | business_id |
      +------------+------------+------------+------------+------------+------------+------------+-------------+
      | {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review | vcNAWiLM4dR7D2nwwJ7nCA |
      +------------+------------+------------+------------+------------+------------+------------+-------------+
  • 30. Using Storage Plugins and Workspaces
      Callouts on the first query: storage plugin, workspace, path relative to workspace
      > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` LIMIT 1;
      > SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
      > SELECT * FROM mongo.yelp.users LIMIT 1;
      > USE mongo.yelp;
      > SELECT * FROM users LIMIT 1;
      Storage Plugin | Workspace | Table
      dfs | Path | Path relative to workspace
      mongo | Database | Collection
      hive | Database | Table
      hbase | Namespace | Table
  • 31. Most Common User Names (MongoDB)
      > SELECT name, count(*) AS users FROM mongo.yelp.users GROUP BY name ORDER BY users DESC LIMIT 10;
      +------------+------------+
      | name | users |
      +------------+------------+
      | David | 2453 |
      | John | 2378 |
      | Michael | 2322 |
      | Chris | 2202 |
      | Mike | 2037 |
      | Jennifer | 1867 |
      | Jessica | 1463 |
      | Jason | 1457 |
      | Michelle | 1439 |
      | Brian | 1436 |
      +------------+------------+
  • 32. Cities with the Most Businesses
      > SELECT state, city, count(*) AS businesses FROM dfs.demo.`/yelp/business.json` GROUP BY state, city ORDER BY businesses DESC LIMIT 10;
      +------------+------------+-------------+
      | state | city | businesses |
      +------------+------------+-------------+
      | NV | Las Vegas | 12021 |
      | AZ | Phoenix | 7499 |
      | AZ | Scottsdale | 3605 |
      | EDH | Edinburgh | 2804 |
      | AZ | Mesa | 2041 |
      | AZ | Tempe | 2025 |
      | NV | Henderson | 1914 |
      | AZ | Chandler | 1637 |
      | WI | Madison | 1630 |
      | AZ | Glendale | 1196 |
      +------------+------------+-------------+
  • 33. Explore the Data: Complex Data
  • 34. business.json (1)
      {
        "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
        "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
        "hours": {
          "Monday": {"close": "23:00", "open": "07:00"},
          "Tuesday": {"close": "23:00", "open": "07:00"},
          "Friday": {"close": "00:00", "open": "07:00"},
          "Wednesday": {"close": "23:00", "open": "07:00"},
          "Thursday": {"close": "23:00", "open": "07:00"},
          "Sunday": {"close": "23:00", "open": "07:00"},
          "Saturday": {"close": "00:00", "open": "07:00"}
        },
        "open": true,
        "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
        "city": "Las Vegas",
        "review_count": 4084,
        "name": "Mon Ami Gabi",
        "neighborhoods": ["The Strip"],
        "longitude": -115.172588519464,
  • 35. business.json (2)
        "state": "NV",
        "stars": 4.0,
        "attributes": {
          "Alcohol": "full_bar",
          "Noise Level": "average",
          "Has TV": false,
          "Attire": "casual",
          "Ambience": {
            "romantic": true, "intimate": false, "touristy": false, "hipster": false,
            "classy": true, "trendy": false, "casual": false
          },
          "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": true, "breakfast": false, "brunch": false},
        }
      }
  • 36. Which Places Are Open Right Now (22:00)?
      > SELECT name, b.hours FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Saturday.`open` < '22:00' AND b.hours.Saturday.`close` > '22:00' LIMIT 2;
      +------------+------------+
      | name | hours |
      +------------+------------+
      | Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
      | Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
      +------------+------------+
  • 37. It’s 10pm in Vegas and I Want Good Hummus!
      > SELECT name, stars, b.hours.Friday, categories FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas' ORDER BY stars DESC LIMIT 2;
      +------------+------------+------------+------------+
      | name | stars | EXPR$2 | categories |
      +------------+------------+------------+------------+
      | Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
      | Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
      +------------+------------+------------+------------+
  • 38. Flatten Repeated Values
      > SELECT name, categories FROM dfs.demo.`yelp/business.json` LIMIT 3;
      +------------+------------+
      | name | categories |
      +------------+------------+
      | Eric Goldberg, MD | ["Doctors","Health & Medical"] |
      | Pine Cone Restaurant | ["Restaurants"] |
      | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
      +------------+------------+
      > SELECT name, FLATTEN(categories) AS categories FROM dfs.demo.`yelp/business.json` LIMIT 5;
      +------------+------------+
      | name | categories |
      +------------+------------+
      | Eric Goldberg, MD | Doctors |
      | Eric Goldberg, MD | Health & Medical |
      | Pine Cone Restaurant | Restaurants |
      | Deforest Family Restaurant | American (Traditional) |
      | Deforest Family Restaurant | Restaurants |
      +------------+------------+
  • 39. Most and Least Common Business Categories
      > SELECT category, count(*) AS businesses FROM (SELECT name, FLATTEN(categories) AS category FROM dfs.demo.`yelp/business.json`) c GROUP BY category ORDER BY businesses DESC;
      +------------+------------+
      | category | businesses |
      +------------+------------+
      | Restaurants | 14303 |
      …
      | Australian | 1 |
      | Boat Dealers | 1 |
      | Firewood | 1 |
      +------------+------------+
      715 rows selected (3.439 seconds)
      > SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and REPEATED_CONTAINS(categories, 'Australian');
      +------------+------------+
      | name | categories |
      +------------+------------+
      | The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
      +------------+------------+
  • 40. Explore the Data: Views
  • 41. Create a View for Name-Gender Mapping
      names.csv (callouts: columns[0], columns[4])
      > CREATE VIEW dfs.tmp.`names` AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
      > USE dfs.tmp;
      > CREATE VIEW names1 AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
      > SELECT * FROM dfs.tmp.names WHERE name = 'John';
      +------------+------------+
      | name | gender |
      +------------+------------+
      | John | Male |
      +------------+------------+
  • 42. Most Common Names (and their Genders) on Yelp
      > SELECT u.name, n.gender, count(*) AS number FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY u.name, n.gender ORDER BY number DESC LIMIT 10;
      +------------+------------+------------+
      | name | gender | number |
      +------------+------------+------------+
      | David | Male | 2453 |
      | John | Male | 2378 |
      | Michael | Male | 2322 |
      | Chris | Unknown | 2202 |
      | Mike | Male | 2037 |
      | Jennifer | Female | 1867 |
      | Jessica | Female | 1463 |
      | Jason | Male | 1457 |
      | Michelle | Female | 1439 |
      | Brian | Male | 1436 |
      +------------+------------+------------+
  • 43. Who Rates Higher – Men or Women?
      > SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY n.gender;
      +------------+------------+------------+
      | gender | users | stars |
      +------------+------------+------------+
      | Female | 103684 | 3.77 |
      | Male | 97430 | 3.696 |
      | Unknown | 18409 | 3.727 |
      +------------+------------+------------+
  • 44. Who Writes More – Men or Women? It takes a 3-way join to find out…
      > SELECT n.gender, round(avg(length(r.text))) AS review_length FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name AND r.user_id = u.user_id GROUP BY n.gender;
      +------------+---------------+
      | gender | review_length |
      +------------+---------------+
      | Male | 665 |
      | Female | 730 |
      | Unknown | 711 |
      +------------+---------------+
  • 45. Drill Tweets (@ApacheDrill)
  • 46. Thank You
      • Learn: incubator.apache.org/drill/
      • Download: incubator.apache.org/drill/download/
      • Ask questions: drill-user@incubator.apache.org
      • Contact me: tshiran@apache.org
  • 47. Thank You. Tomer Shiran, VP Product Management, tshiran@mapr.com. Social handles: @mapr, maprtech, MapRTechnologies, maprtech, mapr-technologies
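
The complex-data queries on slides 36 to 39 rely on three ideas: reaching into nested fields (b.hours.Friday.`open`), testing membership in a repeated field (REPEATED_CONTAINS), and expanding a repeated field into one row per element (FLATTEN). The following is a minimal Python sketch of those semantics, not Drill's implementation. The helper names flatten, repeated_contains and open_at are hypothetical, and the sample records are copied from the result rows shown on slides 37 and 38.

```python
from collections import Counter

# Sample records shaped like business.json rows (values taken from slides 37-38).
# Only "Olives" carries hours here, because slide 38 does not show hours.
businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "Olives", "categories": ["Mediterranean", "Restaurants"],
     "hours": {"Friday": {"open": "11:00", "close": "22:30"}}},
]


def flatten(records, field):
    """Yield one output row per element of a repeated field (FLATTEN-like)."""
    for record in records:
        for value in record.get(field, []):
            yield {"name": record["name"], field: value}


def repeated_contains(record, field, value):
    """Return True if the repeated field holds the value (REPEATED_CONTAINS-like)."""
    return value in record.get(field, [])


def open_at(record, day, time):
    """Compare zero-padded HH:MM strings lexically, as the slide queries do."""
    hours = record.get("hours", {}).get(day)
    return hours is not None and hours["open"] < time < hours["close"]


# FLATTEN followed by GROUP BY category, as on slide 39.
flat_rows = list(flatten(businesses, "categories"))
category_counts = Counter(row["categories"] for row in flat_rows)

# Membership test plus nested-field filter, as on slide 37.
mediterranean_open = [
    b["name"] for b in businesses
    if repeated_contains(b, "categories", "Mediterranean")
    and open_at(b, "Friday", "22:00")
]
```

One caveat the sketch makes visible: the string comparison only works when closing time is later in the same day. A business that closes at "00:00", like the Friday hours in the business.json sample on slide 34, would fail the `close` > '22:00' test even though it is open at 22:00.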

Editor's Notes

  1. Have someone introduce me. Thank audience (tie to morning activities), sponsors, HP, etc. We’re here because this is the biggest thing that has happened to Hadoop…
  2. Here at the conference we’re talking about data science. But before we can appreciate the changes happening in data science, we must first talk about data. Data is doubling every two years. The fast-growing volume, variety and velocity of data are overwhelming traditional systems and approaches. A revolutionary approach is required to leverage this data. And with this new technology, data science as we know it is undergoing tremendous change.
  3. To give you a sense of the data volumes that we’re talking about, I’ve included this chart that shows why a revolutionary approach is needed. You can see the amount of data growing from 1.8 zettabytes in 2011 to 44 zettabytes in 2020. To put this into perspective, a large data warehouse contains terabytes of data. A zettabyte is 1 billion terabytes. Numbers in the chart are from two IDC reports (sponsored by EMC). http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
  4. What is the source of this data growth? While structured data growth has been relatively modest, the growth in unstructured data has been exponential. Source of statistic: http://link.springer.com/chapter/10.1007/978-3-642-39146-0_2
  5. The database/datastore landscape is evolving to meet the new requirements. 2009 was the inflection point: the rise of NoSchema systems, in which applications control structure. Developers are being empowered, and they are voting for the agility offered by these systems. In the early days of this revolution we sacrificed the query language, and we eliminated the ability to leverage the knowledge and tools available to millions of people. We’re changing that with a distributed SQL engine. But when we do that, we have to keep in mind that this transition to a NoSchema world happened for a reason, and we don’t want to reintroduce the centralized, DBA-managed schema.
  6. TODO: Add Impala and Splunk logos
  7. IT-driven = months of delay, unnecessary work (data is no longer relevant, etc.) The so-what needs to be conveyed. Why does it matter that it’s not needed? 6 months -> 3 months -> 3 months -> day zero. So imagine now what you can get… Data Agility is needed for Business Agility >>> Stand still during slide, move in at the punchline (why does this matter to YOU)
  8. Organizations are realizing that they have to move towards self-service
  9. All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON documents (with additional types). Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before. If you consider the four data models shown in the 2x2, every model can be represented by the complex, no-schema model (JSON) because it is the most flexible. The reverse does not hold: the flat, fixed-schema model cannot represent any of the other three. Therefore, when using any SQL engine except Drill, the data has to be transformed before it can be available to queries.
  10. TODO: Add Impala and Splunk logos
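
Note 9 above says that nested, variable-structure records must be transformed before a flat, fixed-schema engine can query them. A minimal sketch of what such a transformation involves, assuming two made-up business records and a hypothetical helper named `flatten` (neither comes from the talk, and this is not how Drill itself works):

```python
import json

# Two hypothetical JSON documents with different structures:
# only the first has the nested "hours" field.
records = [json.loads(line) for line in [
    '{"name": "Cafe A", "hours": {"Mon": "8-5"}, "categories": ["Coffee", "Bakery"]}',
    '{"name": "Bar B", "categories": ["Bars"]}',
]]


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names.

    Lists are joined into one comma-separated string, so the repeated
    structure is lost. That loss is part of what the sketch illustrates.
    """
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=column + "."))
        elif isinstance(value, list):
            flat[column] = ",".join(map(str, value))
        else:
            flat[column] = value
    return flat


flat_records = [flatten(r) for r in records]

# A fixed schema needs one column list for every row, so take the union
# of all columns and fill the gaps with None (NULL).
columns = sorted({column for r in flat_records for column in r})
rows = [tuple(r.get(column) for column in columns) for r in flat_records]

print(columns)
for row in rows:
    print(row)
```

The extra step, the NULL padding and the flattened list are the cost of fitting documents into rows and columns; querying the documents in their original form avoids that step.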