SlideShare a Scribd company logo
1 of 109
Download to read offline
Killing ETL with Drill
Charles S. Givre
@cgivre
cgivre@thedataist.com
The problems
We want SQL and BI support
without compromising flexibility
and ability of NoSchema datastores.
Data is not arranged in an
optimal way for ad-hoc analysis
Data is not arranged in an
optimal way for ad-hoc analysis
ETL Data Warehouse
Analytics teams spend between
50%-90% of their time preparing
their data.
76% of Data Scientist say this is the
least enjoyable part of their job.
http://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
The ETL Process consumes the
most time and contributes almost
no value to the end product.
ETL Data Warehouse
“Any sufficiently advanced
technology is indistinguishable
from magic”
—Arthur C. Clarke
You just query the data…
no schema
Drill is NOT just SQL on Hadoop
Drill scales
Drill is open source
Download Drill at: drill.apache.org
Why should you use Drill?
Why should you use Drill?
Drill is easy to use
Drill is easy to use
Drill uses standard ANSI SQL
Drill is FAST!!
https://www.mapr.com/blog/comparing-sql-functions-and-performance-apache-spark-and-apache-drill
https://www.mapr.com/blog/comparing-sql-functions-and-performance-apache-spark-and-apache-drill
https://www.mapr.com/blog/comparing-sql-functions-and-performance-apache-spark-and-apache-drill
Quick Demo
Thank you Jair Aguirre!!
Quick Demo
seanlahman.com/baseball-archive/statistics
Quick Demo
data = load '/user/cloudera/data/baseball_csv/Teams.csv' using PigStorage(',');
filtered = filter data by ($0 == '1988');
tm_hr = foreach filtered generate (chararray) $40 as team, (int) $19 as hrs;
ordered = order tm_hr by hrs desc;
dump ordered;
Execution Time:
1 minute, 38 seconds
Quick Demo
SELECT columns[40], cast(columns[19] as int) AS HR
FROM `baseball_csv/Teams.csv`
WHERE columns[0] = '1988'
ORDER BY HR desc;
Execution Time:
0232 seconds!!
Drill is Versatile
NoSQL, No Problem
NoSQL, No Problem
https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
NoSQL, No Problem
https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
SELECT t.address.zipcode AS zip, count(name) AS rests
FROM `restaurants` t
GROUP BY t.address.zipcode
ORDER BY rests DESC
LIMIT 10;
Querying Across Silos
Querying Across Silos
Farmers Market Data Restaurant Data
Querying Across Silos
SELECT t1.Borough, t1.markets, t2.rests, cast(t1.markets AS
FLOAT)/ cast(t2.rests AS FLOAT) AS ratio
FROM (
SELECT Borough, count(`Farmers Markets Name`) AS markets
FROM `farmers_markets.csv`
GROUP BY Borough ) t1
JOIN (
SELECT borough, count(name) AS rests
FROM mongo.test.`restaurants`
GROUP BY borough
) t2
ON t1.Borough=t2.borough
ORDER BY ratio DESC;
Querying Across Silos
Execution Time: 0.502 Seconds
To follow along, please download the
files at:
https://github.com/cgivre/drillworkshop
Querying Drill
Querying Drill
SELECT DISTINCT management_role FROM cp.`employee.json`;
Querying Drill
http://localhost:8047
Querying Drill
SELECT * FROM cp.`employee.json` LIMIT 20
Querying Drill
SELECT * FROM cp.`employee.json` LIMIT 20
Querying Drill
SELECT <fields>
FROM <table>
WHERE <optional logical condition>
Querying Drill
SELECT name, address, email
FROM customerData
WHERE age > 20
Querying Drill
SELECT name, address, email
FROM dfs.logs.`/data/customers.csv`
WHERE age > 20
Querying Drill
FROM dfs.logs.`/data/customers.csv`
Storage Plugin Workspace Table
Querying Drill
Plugins Supported Description
cp Queries files in the Java ClassPath
dfs
File System. Can connect to remote filesystems such as
Hadoop
hbase Connects to HBase
hive Integrates Drill with the Apache Hive metastore
kudu Provides a connection to Apache Kudu
mongo Connects to mongoDB
RDBMS
Provides a connection to relational databases such as MySQL,
Postgres, Oracle and others.
S3 Provides a connection to an S3 cluster
Problem: You have multiple log files
which you would like to analyze
Problem: You have multiple log files which you
would like to analyze
• In the sample data files, there is a folder called ‘logs’ which
contains the following structure:
SELECT *
FROM dfs.drillworkshop.`logs/`
LIMIT 10
SELECT *
FROM dfs.drillworkshop.`logs/`
LIMIT 10
dirn accesses the
subdirectories
dirn accesses the
subdirectories
SELECT *
FROM dfs.drilldata.`logs/`
WHERE dir0 = ‘2013’
Function Description
MAXDIR(), MINDIR() Limit query to the first or last directory
IMAXDIR(), IMINDIR()
Limit query to the first or last directory in
case insensitive order.
Directory Functions
WHERE dir<n> = MAXDIR('<plugin>.<workspace>', '<filename>')
In Class Exercise:
Find the total number of items sold by year and the total
dollar sales in each year.
HINT: Don’t forget to CAST() the fields to appropriate data
types
SELECT dir0 AS data_year,
SUM( CAST( item_count AS INTEGER ) ) as total_items,
SUM( CAST( amount_spent AS FLOAT ) ) as total_sales
FROM dfs.drillworkshop.`logs/`
GROUP BY dir0
Let’s look at JSON data
Let’s look at JSON data
[
{
"name": "Farley, Colette L.",
"email": "iaculis@atarcu.ca",
"DOB": "2011-08-14",
"phone": "1-758-453-3833"
},
{
"name": "Kelley, Cherokee R.",
"email": "ante.blandit@malesuadafringilla.edu",
"DOB": "1992-09-01",
"phone": "1-595-478-7825"
}
…
]
Let’s look at JSON data
SELECT *
FROM dfs.drillworkshop.`json/customers.json`
Let’s look at JSON data
SELECT *
FROM dfs.drillworkshop.`json/customers.json`
Let’s look at JSON data
SELECT *
FROM dfs.drillworkshop.`json/customers.json`
What about nested data?
Please open
baltimore_salaries.json
in a text editor
{
"meta" : {
"view" : {
"id" : "nsfe-bg53",
"name" : "Baltimore City Employee Salaries FY2015",
"attribution" : "Mayor's Office",
"averageRating" : 0,
"category" : "City Government",
…
" "format" : { }
},
},
"data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202",
1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II",
"A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ]
, [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843, "393202", 1438255843,
"393202", null, "Aaron,Petra L", "ASSISTANT STATE'S ATTORNEY", "A29045", "States
Attorneys Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ]
{
"meta" : {
"view" : {
"id" : "nsfe-bg53",
"name" : "Baltimore City Employee Salaries FY2015",
"attribution" : "Mayor's Office",
"averageRating" : 0,
"category" : "City Government",
…
" "format" : { }
},
},
"data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202",
1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II",
"A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ]
, [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843, "393202", 1438255843,
"393202", null, "Aaron,Petra L", "ASSISTANT STATE'S ATTORNEY", "A29045", "States
Attorneys Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ]
{
"meta" : {
"view" : {
"id" : "nsfe-bg53",
"name" : "Baltimore City Employee Salaries FY2015",
"attribution" : "Mayor's Office",
"averageRating" : 0,
"category" : "City Government",
…
" "format" : { }
},
},
"data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1,
1438255843, "393202", 1438255843, "393202", null,
"Aaron,Patricia G", "Facilities/Office Services II", "A03031",
"OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00",
"53626.04" ]
, [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843,
"393202", 1438255843, "393202", null, "Aaron,Petra L",
"ASSISTANT STATE'S ATTORNEY", "A29045", "States Attorneys
Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ]
"data" : [
[ 1,
"66020CF9-8449-4464-AE61-B2292C7A0F2D",
1,
1438255843,
"393202",
1438255843,
“393202",
null,
"Aaron,Patricia G",
"Facilities/Office Services II",
"A03031",
"OED-Employment Dev (031)",
"1979-10-24T00:00:00",
“55314.00",
“53626.04"
]
Drill has a series of functions
for nested data
Let’s look at this data in Drill
Let’s look at this data in Drill
SELECT *
FROM dfs.drillworkshop.`baltimore_salaries.json`
Let’s look at this data in Drill
SELECT *
FROM dfs.drillworkshop.`baltimore_salaries.json`
Let’s look at this data in Drill
SELECT data
FROM dfs.drillworkshop.`baltimore_salaries.json`
FLATTEN( <json array> )
separates elements in a repeated
field into individual records.
SELECT FLATTEN( data ) AS raw_data
FROM dfs.drillworkshop.`baltimore_salaries.json`
SELECT FLATTEN( data ) AS raw_data
FROM dfs.drillworkshop.`baltimore_salaries.json`
SELECT FLATTEN( data ) AS raw_data
FROM dfs.drillworkshop.`baltimore_salaries.json`
SELECT raw_data[8] AS name …
FROM
(
SELECT FLATTEN( data ) AS raw_data
FROM dfs.drillworkshop.`baltimore_salaries.json`
)
SELECT raw_data[8] AS name, raw_data[9] AS job_title
FROM
(
SELECT FLATTEN( data ) AS raw_data
FROM dfs.drillworkshop.`baltimore_salaries.json`
)
SELECT raw_data[9] AS job_title,
AVG( CAST( raw_data[13] AS DOUBLE ) ) AS avg_salary,
COUNT( DISTINCT raw_data[8] ) AS person_count
FROM
(
SELECT FLATTEN( data ) AS raw_data
FROM dfs.drillworkshop.`json/baltimore_salaries.json`
)
GROUP BY raw_data[9]
ORDER BY avg_salary DESC
Using the JSON file, recreate the earlier query to find the average
salary by job title and how many people have each job title.
Log Files
Log Files
• Drill does not natively support reading log files… yet
• If you are NOT using Merlin, included in the GitHub repo are several .jar files.
Please take a second and copy them to <drill directory>/jars/3rdparty
Log Files
070823 21:00:32 1 Connect root@localhost on test1
070823 21:00:48 1 Query show tables
070823 21:00:56 1 Query select * from category
070917 16:29:01 21 Query select * from location
070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1
log": {
"type": "log",
"extensions": [
"log"
],
"fieldNames": [
"date",
"time",
"pid",
"action",
"query"
],
"pattern": "(d{6})s(d{2}:d{2}:d{2})s+(d+)s(w+)s+(.+)"
}
}
SELECT *
FROM dfs.drillworkshop.`log_files/mysql.log`
SELECT *
FROM dfs.drillworkshop.`log_files/mysql.log`
HTTPD Log Files
HTTPD Log Files
For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423
195.154.46.135 - - [25/Oct/2015:04:11:25 +0100] "GET /linux/doing-pxe-without-dhcp-control HTTP/1.1" 200 24323 "http://howto.basjes.nl/"
"Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
23.95.237.180 - - [25/Oct/2015:04:11:26 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1;
rv:35.0) Gecko/20100101 Firefox/35.0"
23.95.237.180 - - [25/Oct/2015:04:11:27 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0
(Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
158.222.5.157 - - [25/Oct/2015:04:24:31 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 6.3;
WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
158.222.5.157 - - [25/Oct/2015:04:24:32 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0
(Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
HTTPD Log Files
For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423
195.154.46.135 - - [25/Oct/2015:04:11:25 +0100] "GET /linux/doing-pxe-without-dhcp-control HTTP/1.1" 200 24323 "http://howto.basjes.nl/"
"Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
23.95.237.180 - - [25/Oct/2015:04:11:26 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1;
rv:35.0) Gecko/20100101 Firefox/35.0"
23.95.237.180 - - [25/Oct/2015:04:11:27 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0
(Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
158.222.5.157 - - [25/Oct/2015:04:24:31 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 6.3;
WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
158.222.5.157 - - [25/Oct/2015:04:24:32 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0
(Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
"httpd": {
"type": "httpd",
"logFormat": "%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"",
"timestampFormat": null
},
HTTPD Log Files
For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423
SELECT *
FROM dfs.drillworkshop.`data_files/log_files/small-server-
log.httpd`
HTTPD Log Files
For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423
SELECT *
FROM dfs.drillworkshop.`data_files/log_files/small-server-
log.httpd`
HTTPD Log Files
SELECT request_referer, parse_url( request_referer ) AS url_data
FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
HTTPD Log Files
SELECT request_referer, parse_url( request_referer ) AS url_data
FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
Networking Functions
Networking Functions
• inet_aton( <ip> ): Converts an IPv4 Address to an integer
• inet_ntoa( <int> ): Converts an integer to an IPv4 address
• is_private(<ip>): Returns true if the IP is private
• in_network(<ip>,<cidr>): Returns true if the IP is in the CIDR block
• getAddressCount( <cidr> ): Returns the number of IPs in a CIDR block
• getBroadcastAddress(<cidr>): Returns the broadcast address of a CIDR block
• getNetmast( <cidr> ): Returns the net mask of a CIDR block
• getLowAddress( <cidr>): Returns the low IP of a CIDR block
• getHighAddress(<cidr>): Returns the high IP of a CIDR block
• parse_user_agent( <ua_string> ): Returns a map of user agent information
PCAP Files
SELECT *
FROM dfs.test.`dns-zone-transfer-ixfr.pcap`
SELECT *
FROM dfs.test.`dns-zone-transfer-ixfr.pcap`
Connecting other Data Sources
Connecting other Data Sources
Connecting other Data Sources
Connecting other Data Sources
SELECT teams.name, SUM( batting.HR ) as hr_total
FROM batting
INNER JOIN teams ON batting.teamID=teams.teamID
WHERE batting.yearID = 1988 AND teams.yearID = 1988
GROUP BY batting.teamID
ORDER BY hr_total DESC
Connecting other Data Sources
SELECT teams.name, SUM( batting.HR ) as hr_total
FROM batting
INNER JOIN teams ON batting.teamID=teams.teamID
WHERE batting.yearID = 1988 AND teams.yearID = 1988
GROUP BY batting.teamID
ORDER BY hr_total DESC
Connecting other Data Sources
SELECT teams.name, SUM( batting.HR ) as hr_total
FROM batting
INNER JOIN teams ON batting.teamID=teams.teamID
WHERE batting.yearID = 1988 AND teams.yearID = 1988
GROUP BY batting.teamID
ORDER BY hr_total DESC
MySQL: 0.047 seconds
Connecting other Data Sources
MySQL: 0.047 seconds
Drill: 0.366 seconds
SELECT teams.name, SUM( batting.HR ) as hr_total
FROM mysql.stats.batting
INNER JOIN mysql.stats.teams ON batting.teamID=teams.teamID
WHERE batting.yearID = 1988 AND teams.yearID = 1988
GROUP BY teams.name
ORDER BY hr_total DESC
Conclusion
• Drill is easy to use
• Drill scales
• Drill is open source
• Drill is versatile
Why aren’t you using Drill?
Thank you!
Charles Givre
@cgivre
givre_charles@bah.com
thedataist.com

More Related Content

What's hot

Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
DataWorks Summit
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
MapR Technologies
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
DataWorks Summit
 

What's hot (20)

Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Apache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseApache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBase
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache Drill
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
 
Apache drill
Apache drillApache drill
Apache drill
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 

Similar to Killing ETL with Apache Drill

IndyCodeCamp SDS May 16th 2009
IndyCodeCamp SDS May 16th 2009IndyCodeCamp SDS May 16th 2009
IndyCodeCamp SDS May 16th 2009
Aaron King
 
Data Analysis using Data Flux
Data Analysis using Data FluxData Analysis using Data Flux
Data Analysis using Data Flux
Sunil Pai
 
[C12]元気Hadoop! OracleをHadoopで分析しちゃうぜ by Daisuke Hirama
[C12]元気Hadoop! OracleをHadoopで分析しちゃうぜ by Daisuke Hirama[C12]元気Hadoop! OracleをHadoopで分析しちゃうぜ by Daisuke Hirama
[C12]元気Hadoop! OracleをHadoopで分析しちゃうぜ by Daisuke Hirama
Insight Technology, Inc.
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
DataStax Academy
 
Wrangling data like a boss
Wrangling data like a bossWrangling data like a boss
Wrangling data like a boss
Stephanie Locke
 

Similar to Killing ETL with Apache Drill (20)

Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data Integrations
 
Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014
 
IndyCodeCamp SDS May 16th 2009
IndyCodeCamp SDS May 16th 2009IndyCodeCamp SDS May 16th 2009
IndyCodeCamp SDS May 16th 2009
 
Data Analysis using Data Flux
Data Analysis using Data FluxData Analysis using Data Flux
Data Analysis using Data Flux
 
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
 
[C12]元気Hadoop! OracleをHadoopで分析しちゃうぜ by Daisuke Hirama
[C12]元気Hadoop! OracleをHadoopで分析しちゃうぜ by Daisuke Hirama[C12]元気Hadoop! OracleをHadoopで分析しちゃうぜ by Daisuke Hirama
[C12]元気Hadoop! OracleをHadoopで分析しちゃうぜ by Daisuke Hirama
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Application Development & Database Choices: Postgres Support for non Relation...
Application Development & Database Choices: Postgres Support for non Relation...Application Development & Database Choices: Postgres Support for non Relation...
Application Development & Database Choices: Postgres Support for non Relation...
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science Meetup
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Wrangling data like a boss
Wrangling data like a bossWrangling data like a boss
Wrangling data like a boss
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLBit...
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLBit...Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLBit...
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLBit...
 
Distributed Queries in IDS: New features.
Distributed Queries in IDS: New features.Distributed Queries in IDS: New features.
Distributed Queries in IDS: New features.
 
GreenDao Introduction
GreenDao IntroductionGreenDao Introduction
GreenDao Introduction
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015
 
Elasticsearch in 15 Minutes
Elasticsearch in 15 MinutesElasticsearch in 15 Minutes
Elasticsearch in 15 Minutes
 

Recently uploaded

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 

Killing ETL with Apache Drill

  • 1. Killing ETL with Drill Charles S. Givre @cgivre cgivre@thedataist.com
  • 3. We want SQL and BI support without compromising flexibility and ability of NoSchema datastores.
  • 4. Data is not arranged in an optimal way for ad-hoc analysis
  • 5. Data is not arranged in an optimal way for ad-hoc analysis ETL Data Warehouse
  • 6. Analytics teams spend between 50%-90% of their time preparing their data.
  • 7. 76% of Data Scientist say this is the least enjoyable part of their job. http://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
  • 8. The ETL Process consumes the most time and contributes almost no value to the end product.
  • 9.
  • 11.
  • 12. “Any sufficiently advanced technology is indistinguishable from magic” —Arthur C. Clarke
  • 13. You just query the data… no schema
  • 14. Drill is NOT just SQL on Hadoop
  • 16. Drill is open source Download Drill at: drill.apache.org
  • 17. Why should you use Drill?
  • 18. Why should you use Drill? Drill is easy to use
  • 19. Drill is easy to use Drill uses standard ANSI SQL
  • 24. Quick Demo Thank you Jair Aguirre!!
  • 26. Quick Demo data = load '/user/cloudera/data/baseball_csv/Teams.csv' using PigStorage(','); filtered = filter data by ($0 == '1988'); tm_hr = foreach filtered generate (chararray) $40 as team, (int) $19 as hrs; ordered = order tm_hr by hrs desc; dump ordered; Execution Time: 1 minute, 38 seconds
  • 27.
  • 28. Quick Demo SELECT columns[40], cast(columns[19] as int) AS HR FROM `baseball_csv/Teams.csv` WHERE columns[0] = '1988' ORDER BY HR desc; Execution Time: 0232 seconds!!
  • 30.
  • 33. NoSQL, No Problem https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json SELECT t.address.zipcode AS zip, count(name) AS rests FROM `restaurants` t GROUP BY t.address.zipcode ORDER BY rests DESC LIMIT 10;
  • 35. Querying Across Silos Farmers Market Data Restaurant Data
  • 36. Querying Across Silos SELECT t1.Borough, t1.markets, t2.rests, cast(t1.markets AS FLOAT)/ cast(t2.rests AS FLOAT) AS ratio FROM ( SELECT Borough, count(`Farmers Markets Name`) AS markets FROM `farmers_markets.csv` GROUP BY Borough ) t1 JOIN ( SELECT borough, count(name) AS rests FROM mongo.test.`restaurants` GROUP BY borough ) t2 ON t1.Borough=t2.borough ORDER BY ratio DESC;
  • 37. Querying Across Silos Execution Time: 0.502 Seconds
  • 38.
  • 39. To follow along, please download the files at: https://github.com/cgivre/drillworkshop
  • 41. Querying Drill SELECT DISTINCT management_role FROM cp.`employee.json`;
  • 43. Querying Drill SELECT * FROM cp.`employee.json` LIMIT 20
  • 44. Querying Drill SELECT * FROM cp.`employee.json` LIMIT 20
  • 45. Querying Drill SELECT <fields> FROM <table> WHERE <optional logical condition>
  • 46. Querying Drill SELECT name, address, email FROM customerData WHERE age > 20
  • 47. Querying Drill SELECT name, address, email FROM dfs.logs.`/data/customers.csv` WHERE age > 20
  • 49. Querying Drill Plugins Supported Description cp Queries files in the Java ClassPath dfs File System. Can connect to remote filesystems such as Hadoop hbase Connects to HBase hive Integrates Drill with the Apache Hive metastore kudu Provides a connection to Apache Kudu mongo Connects to mongoDB RDBMS Provides a connection to relational databases such as MySQL, Postgres, Oracle and others. S3 Provides a connection to an S3 cluster
  • 50. Problem: You have multiple log files which you would like to analyze
  • 51. Problem: You have multiple log files which you would like to analyze • In the sample data files, there is a folder called ‘logs’ which contains the following structure:
  • 55. dirn accesses the subdirectories SELECT * FROM dfs.drilldata.`logs/` WHERE dir0 = ‘2013’
  • 56. Function Description MAXDIR(), MINDIR() Limit query to the first or last directory IMAXDIR(), IMINDIR() Limit query to the first or last directory in case insensitive order. Directory Functions WHERE dir<n> = MAXDIR('<plugin>.<workspace>', '<filename>')
  • 57. In Class Exercise: Find the total number of items sold by year and the total dollar sales in each year. HINT: Don’t forget to CAST() the fields to appropriate data types SELECT dir0 AS data_year, SUM( CAST( item_count AS INTEGER ) ) as total_items, SUM( CAST( amount_spent AS FLOAT ) ) as total_sales FROM dfs.drillworkshop.`logs/` GROUP BY dir0
  • 58. Let’s look at JSON data
  • 59. Let’s look at JSON data [ { "name": "Farley, Colette L.", "email": "iaculis@atarcu.ca", "DOB": "2011-08-14", "phone": "1-758-453-3833" }, { "name": "Kelley, Cherokee R.", "email": "ante.blandit@malesuadafringilla.edu", "DOB": "1992-09-01", "phone": "1-595-478-7825" } … ]
  • 60. Let’s look at JSON data SELECT * FROM dfs.drillworkshop.`json/customers.json`
  • 61. Let’s look at JSON data SELECT * FROM dfs.drillworkshop.`json/customers.json`
  • 62. Let’s look at JSON data SELECT * FROM dfs.drillworkshop.`json/customers.json`
  • 65. { "meta" : { "view" : { "id" : "nsfe-bg53", "name" : "Baltimore City Employee Salaries FY2015", "attribution" : "Mayor's Office", "averageRating" : 0, "category" : "City Government", … " "format" : { } }, }, "data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ] , [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Petra L", "ASSISTANT STATE'S ATTORNEY", "A29045", "States Attorneys Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ]
  • 66. { "meta" : { "view" : { "id" : "nsfe-bg53", "name" : "Baltimore City Employee Salaries FY2015", "attribution" : "Mayor's Office", "averageRating" : 0, "category" : "City Government", … " "format" : { } }, }, "data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ] , [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Petra L", "ASSISTANT STATE'S ATTORNEY", "A29045", "States Attorneys Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ]
  • 67. { "meta" : { "view" : { "id" : "nsfe-bg53", "name" : "Baltimore City Employee Salaries FY2015", "attribution" : "Mayor's Office", "averageRating" : 0, "category" : "City Government", … " "format" : { } }, }, "data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ] , [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Petra L", "ASSISTANT STATE'S ATTORNEY", "A29045", "States Attorneys Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ]
  • 68. "data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, “393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", “55314.00", “53626.04" ]
  • 69. Drill has a series of functions for nested data
  • 70. Let’s look at this data in Drill
  • 71. Let’s look at this data in Drill SELECT * FROM dfs.drillworkshop.`baltimore_salaries.json`
  • 72. Let’s look at this data in Drill SELECT * FROM dfs.drillworkshop.`baltimore_salaries.json`
  • 73. Let’s look at this data in Drill SELECT data FROM dfs.drillworkshop.`baltimore_salaries.json`
  • 74. FLATTEN( <json array> ) separates elements in a repeated field into individual records.
  • 75. SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json`
  • 76. SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json`
  • 77. SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json`
  • 78. SELECT raw_data[8] AS name … FROM ( SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json` )
  • 79. SELECT raw_data[8] AS name, raw_data[9] AS job_title FROM ( SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json` )
  • 80. SELECT raw_data[9] AS job_title, AVG( CAST( raw_data[13] AS DOUBLE ) ) AS avg_salary, COUNT( DISTINCT raw_data[8] ) AS person_count FROM ( SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`json/baltimore_salaries.json` ) GROUP BY raw_data[9] ORDER BY avg_salary DESC
  • 81. Using the JSON file, recreate the earlier query to find the average salary by job title and how many people have each job title.
  • 83. Log Files • Drill does not natively support reading log files… yet • If you are NOT using Merlin, included in the GitHub repo are several .jar files. Please take a second and copy them to <drill directory>/jars/3rdparty
  • 84. Log Files 070823 21:00:32 1 Connect root@localhost on test1 070823 21:00:48 1 Query show tables 070823 21:00:56 1 Query select * from category 070917 16:29:01 21 Query select * from location 070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1
  • 85. log": { "type": "log", "extensions": [ "log" ], "fieldNames": [ "date", "time", "pid", "action", "query" ], "pattern": "(d{6})s(d{2}:d{2}:d{2})s+(d+)s(w+)s+(.+)" } }
  • 89. HTTPD Log Files For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423 195.154.46.135 - - [25/Oct/2015:04:11:25 +0100] "GET /linux/doing-pxe-without-dhcp-control HTTP/1.1" 200 24323 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0" 23.95.237.180 - - [25/Oct/2015:04:11:26 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0" 23.95.237.180 - - [25/Oct/2015:04:11:27 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0" 158.222.5.157 - - [25/Oct/2015:04:24:31 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21" 158.222.5.157 - - [25/Oct/2015:04:24:32 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
  • 90. HTTPD Log Files For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423 195.154.46.135 - - [25/Oct/2015:04:11:25 +0100] "GET /linux/doing-pxe-without-dhcp-control HTTP/1.1" 200 24323 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0" 23.95.237.180 - - [25/Oct/2015:04:11:26 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0" 23.95.237.180 - - [25/Oct/2015:04:11:27 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0" 158.222.5.157 - - [25/Oct/2015:04:24:31 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21" 158.222.5.157 - - [25/Oct/2015:04:24:32 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21" "httpd": { "type": "httpd", "logFormat": "%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"", "timestampFormat": null },
  • 91. HTTPD Log Files For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423 SELECT * FROM dfs.drillworkshop.`data_files/log_files/small-server- log.httpd`
  • 92. HTTPD Log Files For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423 SELECT * FROM dfs.drillworkshop.`data_files/log_files/small-server- log.httpd`
  • 93. HTTPD Log Files SELECT request_referer, parse_url( request_referer ) AS url_data FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
  • 94. HTTPD Log Files SELECT request_referer, parse_url( request_referer ) AS url_data FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
  • 96. Networking Functions • inet_aton( <ip> ): Converts an IPv4 Address to an integer • inet_ntoa( <int> ): Converts an integer to an IPv4 address • is_private(<ip>): Returns true if the IP is private • in_network(<ip>,<cidr>): Returns true if the IP is in the CIDR block • getAddressCount( <cidr> ): Returns the number of IPs in a CIDR block • getBroadcastAddress(<cidr>): Returns the broadcast address of a CIDR block • getNetmast( <cidr> ): Returns the net mask of a CIDR block • getLowAddress( <cidr>): Returns the low IP of a CIDR block • getHighAddress(<cidr>): Returns the high IP of a CIDR block • parse_user_agent( <ua_string> ): Returns a map of user agent information
  • 103. Connecting other Data Sources SELECT teams.name, SUM( batting.HR ) as hr_total FROM batting INNER JOIN teams ON batting.teamID=teams.teamID WHERE batting.yearID = 1988 AND teams.yearID = 1988 GROUP BY batting.teamID ORDER BY hr_total DESC
  • 104. Connecting other Data Sources SELECT teams.name, SUM( batting.HR ) as hr_total FROM batting INNER JOIN teams ON batting.teamID=teams.teamID WHERE batting.yearID = 1988 AND teams.yearID = 1988 GROUP BY batting.teamID ORDER BY hr_total DESC
  • 105. Connecting other Data Sources SELECT teams.name, SUM( batting.HR ) as hr_total FROM batting INNER JOIN teams ON batting.teamID=teams.teamID WHERE batting.yearID = 1988 AND teams.yearID = 1988 GROUP BY batting.teamID ORDER BY hr_total DESC MySQL: 0.047 seconds
  • 106. Connecting other Data Sources MySQL: 0.047 seconds Drill: 0.366 seconds SELECT teams.name, SUM( batting.HR ) as hr_total FROM mysql.stats.batting INNER JOIN mysql.stats.teams ON batting.teamID=teams.teamID WHERE batting.yearID = 1988 AND teams.yearID = 1988 GROUP BY teams.name ORDER BY hr_total DESC
  • 107. Conclusion • Drill is easy to use • Drill scales • Drill is open source • Drill is versatile
  • 108. Why aren’t you using Drill?