Interactive SQL POC on Hadoop
Hive 13, Hive-on-Tez, Presto
Storage: RCFile, ORC, Parquet and Avro
Sudhir Mallem
Alex Bain
George Zhang
Team
Introduction
• Interactive SQL POC on HDFS
• Hive (version 13), Hive-on-Tez, Presto
• Storage formats:
• RCFile
• ORC
• Parquet
• Avro
• Compression
• Snappy
• Zlib
• Native gzip compression
Goal of Benchmark
• The goals of this benchmark are:
• To provide a comprehensive overview and evaluation of interactive SQL on Hadoop
• To measure query response time on each platform across the different storage formats
• To measure compression – i.e. data size – across different dimensions
• To get a better understanding of the performance gain we may potentially see with queries on each of these platforms
• Avro is widely used across LinkedIn, and testing it on the newer platform (Hive 13) with other tools (Tez, Presto, etc.) will give us a good understanding of the performance gain we may potentially see.
System, Storage formats and compression
Systems chosen:
Hive - version 13.1
Hive13 on Tez + Yarn - Tez version: 0.4.1
Presto - versions 0.74, 0.79 and 0.80
Storage Formats and compression:
ORC + zlib compression
RCFile + snappy
Parquet + snappy
Avro + Avro native compression - Deflate level 9
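For reference, tables for each run can be declared along the following lines in Hive 13. This is a minimal sketch with placeholder table and column names and illustrative property values, not the actual benchmark DDL:
-- ORC with zlib compression
CREATE TABLE pve_orc (trackingcode STRING, memberid BIGINT)
PARTITIONED BY (datepartition STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
-- RCFile with snappy (compression comes from the job-level codec settings)
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
CREATE TABLE pve_rc (trackingcode STRING, memberid BIGINT)
PARTITIONED BY (datepartition STRING)
STORED AS RCFILE;
-- Parquet with snappy
SET parquet.compression=SNAPPY;
CREATE TABLE pve_parquet (trackingcode STRING, memberid BIGINT)
PARTITIONED BY (datepartition STRING)
STORED AS PARQUET;
-- Avro with its native deflate codec at level 9 (written through the Avro SerDe)
SET hive.exec.compress.output=true;
SET avro.output.codec=deflate;
SET avro.mapred.deflate.level=9;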
System, Storage formats and compression
• Presto - the dataset was created in RCFile.
• RCFile was the storage format recommended for Presto at the time of the evaluation.
• At the time of the evaluation, Presto had issues working with Avro and Parquet: queries either did not run or were not well optimized.
• With the release of Presto v0.80, we also tested the ORC file format.
• We flattened certain data (pageviewevent) to support the benchmark on Presto.
• Currently Presto supports only the Map complex datatype; Structs and other complex types have to be read with the json_extract function (see the sketch below).
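As an illustration of the last point, a field that is read as header.memberid in Hive would be read in Presto roughly as follows. This is a sketch that assumes the nested struct is exposed to Presto as a JSON string column named header:
SELECT json_extract_scalar(header, '$.memberId') AS memberid
FROM pageviewevent_flat
WHERE datepartition = '2014-07-15'
LIMIT 100;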
About Hive 13
• This is the next version of Hive at LinkedIn.
• Hive is used heavily at LinkedIn for interactive SQL by users who are not Pig Latin savvy and prefer a SQL solution.
• Hive is generally slow, as it runs on MapReduce and competes for mappers and reducers on the cluster along with Pig Latin and vanilla MapReduce jobs.
About Hive 13-on-Tez
• Tez is a new application framework built on Hadoop YARN that can execute complex directed acyclic graphs of general data processing tasks. In many ways it can be thought of as a more flexible and powerful successor to the MapReduce framework, built by Hortonworks.
• It generalizes map and reduce tasks by exposing interfaces for generic data processing tasks, which consist of a triplet of interfaces: input, output and processor. These tasks are the vertices in the execution graph. Edges (i.e. data connections between tasks) are first-class citizens in Tez and, together with the input/output interfaces, greatly increase the flexibility of how data is transferred between tasks.
• Tez also greatly extends the possible ways in which individual tasks can be linked together; in fact, any arbitrary DAG can be executed directly in Tez. In Tez parlance, a map-reduce job is basically a simple DAG consisting of a single map vertex and a single reduce vertex connected by a “bipartite” edge (i.e. the edge connects every map task to every reduce task). Map inputs and reduce outputs are HDFS inputs and outputs, respectively. The map output class locally sorts and partitions the data by a certain key, while the reduce input class merge-sorts its data on the same key.
• Tez also provides what is basically a map-reduce compatibility layer that lets one run MR jobs on top of the new execution layer by implementing the Map/Reduce concepts on the new execution framework.
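As a minimal sketch of what switching a Hive session to Tez looked like in this POC (the tuning properties and values below are illustrative examples of the kind of per-query knobs we touched, not a recommended configuration):
-- switch the execution engine for the current session
set hive.execution.engine=tez;
-- examples of per-query tuning properties (illustrative values)
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=128000000;
set tez.am.resource.memory.mb=4096;
-- then run the benchmark query as usual, e.g. Query 1
select trackingcode, count(1)
from pageviewevent
where datepartition='2014-07-15'
group by trackingcode
limit 100;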
About Presto
• Presto is an open source distributed SQL query engine for running
interactive analytic queries against data sources of all sizes ranging from
gigabytes to petabytes.
• Presto was designed and written from the ground up for interactive
analytics and approaches the speed of commercial data warehouses
while scaling to the data size of organizations like LinkedIn and
Facebook.
About Dataset
• The input dataset was carefully chosen to cover not only the performance perspective of benchmarking, but also to gain better insight into each of the systems. It gives a good understanding of the query patterns they support, the functions, ease of use, etc.
• Different dimension tables, facts and aggregates.
• Data ranges anywhere from 20k rows to 80+ billion.
• Hive supports complex datatypes like Struct, Array, Union and Map. The data that we chose has nested structures, key values and binary data.
• We flattened the data for use in Presto, as version 0.74 of Presto supports only the Array and Map datatypes (a rough sketch of this flattening follows below). The underlying data is stored as JSON, so we had to use JSON functions to extract and refer to the data.
• One of the datasets is a flat table with 600+ columns, chosen specifically to test the columnar functionality of the Parquet, RCFile and ORC file formats.
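A rough sketch of the flattening step for pageviewevent (the flattened table name matches the one used in the Presto queries later; the column list is illustrative, not the full schema):
-- Hive CTAS that pulls nested struct fields up to top-level columns
create table pageviewevent_flat
stored as rcfile
as
select
  header.memberid       as memberid,
  requestheader.pagekey as pagekey,
  trackingcode,
  datepartition
from pageviewevent;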
Evaluation Criteria
• We chose 15 queries for our testing and benchmarking. These SQL statements are some of the queries users commonly run in the DWH at LinkedIn.
• The queries test the following functionality:
• date and time manipulations
• nested SQLs, wildcard searches
• filter predicates, partition pruning, full table scans and joins (3-way, 2-way, etc.)
• exists, in, not exists, not in
• aggregate functions like sum, max, count(distinct), count(1)
• extracting keys from map and struct datatypes
Query 1 – simple groupby and count
select
trackingcode,
count(1)
from pageviewevent
where
datepartition='2014-07-15'
group by
trackingcode
limit 100;
Query 2 - case expression with filter predicates
SELECT
datepartition,
SUM(CASE when requestheader.pagekey in ('pulse-saved-articles','pulse-settings','pulse-pbar','pulse-slice-internal','pulse-share-hub','pulse-special-jobs-economy','pulse-browse','pulse-slice-connections') then 1
when requestheader.pagekey in ('pulse-slice','pulse-top-news') and (trackingcode NOT LIKE 'eml-tod%' OR trackingcode IS NULL) then 1 else 0 end) AS TODAY_PV
FROM
pageviewevent
where
datepartition = '2014-07-15'
and header.memberid > 0
group by datepartition;
Query 3 – check count(distinct) with wildcard search
SELECT
a.datepartition,
d.country_sk,
COUNT(1) AS total_count,
count(distinct a.header.memberid) as unique_count
FROM pageviewevent a
INNER JOIN dim_tracking_code b ON
a.trackingcode=b.tracking_code
INNER JOIN dim_page_key c ON
a.requestheader.pagekey=c.page_key AND c.is_aggregate =
1
left outer join dim_member_cntry_lcl d on a.header.memberId
= d.member_sk
WHERE a.datepartition = '2014-07-18'
AND (
LOWER(a.trackingcode) LIKE 'eml_bt1%'
OR LOWER(a.trackingcode) LIKE 'emlt_bt1%'
OR LOWER(a.trackingcode) LIKE 'eml-bt1%'
OR LOWER(a.trackingcode) LIKE 'emlt-bt1%'
)
GROUP BY a.datepartition, country_sk ;
Query 4 – Joins, filter predicates with count(distinct)
SELECT
datepartition,
coalesce(c.country_sk,-9),
COUNT(DISTINCT a.header.memberid)
FROM pageviewevent a
inner join dim_page_key b on a.requestheader.pagekey =
b.page_key
and b.page_key_group_sk = 39
and b.is_aggregate = 1
left outer join
dim_member_cntry_lcl c on a.header.memberid=
c.member_sk
where a.datepartition = '2014-07-19'
and a.header.memberid > 0
group by
datepartition, coalesce(c.country_sk,-9);
Query 5 – test map datatype with filter predicates
select
substr(datepartition,1,10) as date_data,
campaigntypeint,
header.memberid,
channelid,
`format` as ad_format,
publisherid, campaignid,
advertiserid,
creativeid,
parameters['organicActivityId'] as activityid,
parameters['activityType'] as socialflag,
'0' as feedposition,
sum(case when statusint in (1,4) and channelid in (2,1) then 1 when statusint in (1,4)
and channelid in (2000, 3000) and parameters['sequence'] = 0 then 1 else 0 end) as
imp,
sum(case when statusint = 1 and channelid in (2,1) then 1 when statusint = 1 and
channelid in (2000, 3000) and parameters['sequence'] = 0 then 1 else 0 end) as
imp_sas,
sum(case when channelid in (2000, 3000) and parameters['sequence'] > 0 then 1 else 0
end) as view_other,
sum(case when statusint = 1 then cost else 0.0 end) as rev_imp
from adimpressionevent
where
datepartition = '2014-07-20' and campaignTypeInt = 14
group by
substr(datepartition,1,10),
campaigntypeint, header.memberid, channelid, `format`, publisherid, campaignid,
advertiserid, creativeid,
parameters['organicActivityId'],
parameters['activityType']
limit 1000;
Query 6 – 2 table join with count(distinct)
select
count(distinct member_sk)
from dim_position p
join dim_company c
on c.company_sk=p.std_company_sk
and c.active='Y'
and c.company_type_sk=4
where
end_date is null
and is_primary ='Y';
Query 7 – 600+ column table test
select om.current_company as Company,
om.industry as Industry,
om.company_size as Company_Size,
om.current_title as Job_Title,
om.member_sk as Member_SK,
om.first_name as First_Name,
om.last_name as Last_Name,
om.email_address as Email,
om.connections as Connections ,
om.country as Country,
om.region as Region,
om.cropped_picture_id as Profile_Picture,
om.pref_locale as Pref_Locale,
om.headline as Headline
from om_segment om
where
om.ACTIVE_FLAG = 1
and om.country_sk in (162,78,75,2,57)
and om.connections > 99
and om.pageview_l30d > 0
and
(
( om.headline like '%linkedin%')
or (om.current_title like '%linkedin%')
or (
(om.headline like '%social media%' or om.headline like
'%social consultant%' or om.headline like '%social recruit%' or
om.headline like '%employer brand%')
and
(om.headline like '%train%' or om.headline like '%consult%' or
om.headline like '%advis%' or om.headline like '%recruit%')
)
or (
(om.current_title like '%social media%' or om.current_title like
'%social consultant%' or om.current_title like '%social recruit%' or
om.current_title like '%employer brand%')
and
(om.current_title like '%train%' or om.current_title like
'%consult%' or om.current_title like
'%advis%' or om.current_title like '%recruit%'
)
)
) ;
Query 8 – 3 table joins with uniques
select distinct f.member_sk
FROM
dim_education e join
dim_member_flat f on (e.member_sk = f.member_sk)
join
dim_school s on (e.school_sk = s.school_sk)
WHERE f.active_flag = 'Y' and (
( e.country_sk = 167 ) OR
( s.country_sk = 167 ) ) limit 1000;
Query 9 – wide table test (600+ columns - test columnar)
select member_sk
from om_segment
where
(lss_decision_maker_flag like 'DM' or
lss_decision_maker_flag like 'IC')
and (lss_company_tier like 'Enterprise' or
lss_company_tier like 'SMB' or lss_company_tier like
'SRM')
and (lss_customer_status like 'Prospect' or
lss_customer_status like 'Customer')
and (lss_subscriber_status like 'Online Gen Subscriber' or
lss_subscriber_status like 'Not a Subscriber')
and country_sk in
(14,194,174,95,154,167,227,37,102,78,162,163,70,193,21,132,59,101,2,242) limit 1000;
Query 10 – using sub-queries joins – push down
select
p.member_sk
from dim_position p
inner join (
select
position_sk,
std_title_2_sk,
member_sk
from
dim_position_std_title_2) pt
on p.position_sk = pt.position_sk
and p.member_sk = pt.member_sk
inner join (
select std_title_2_sk
from
dim_std_title_2 where std_title_2_id in
(17801,20923,11001,21845,8206,8136,22224,5204,13257,5642,8,16565,792,12949,13758)) t
on pt.std_title_2_sk = t.std_title_2_sk
inner join (
select company_sk
from dim_company
where company_size_sk > 2) c
on p.std_company_sk = c.company_sk
where p.end_date is null
and p.is_primary = 'Y' limit 1000;
Query 11 – test unionall
select
distinct member_sk
from (
select member_sk
from dim_education
where school_sk in (
9873, 10065, 10388, 9872, 7916, 10241, 10242, 9900,
10377, 10719, 10637, 8534, 8535, 9906)
union all
select member_sk
from dim_position
where final_company_sk in
(74701,74702,12831,159378,62771,67754,
75480,79641,73975,87156,1895741,147775)
or company_sk in
(74701,74702,12831,159378,62771,67754,75480,79641,73975,87156,1895741,147775)
) x
limit 1000;
Query 12 – 3 table joins
create table u_smallem.retirement_members as
select distinct sds.member_sk
from u_smallem.v_retirement_dm sds inner join
dim_member_flat mem on mem.member_sk=sds.member_sk and
active_flag='Y' inner join
dim_position pos on sds.member_sk=pos.member_sk
where
(pos.final_seniority_2_sk in (6,7,9,10) OR
pos.user_supplied_title like '%senior consultant%')
UNION
select distinct current_date, mem.member_sk, 739, 4
from dim_position pos inner join
dim_member_flat mem on mem.member_sk=pos.member_sk and
active_flag='Y'
where
pos.final_company_sk
in (12254,24672,12694,16583,21410,38641,145164,32346,20918,35083,96824,49506,159381,48201,45860,215432,53484,327842,63747,78721,139406,778800)
and (final_std_title_2_sk in (select std_title_2_sk as final_st_title_2_sk
from dim_std_title_2 where occupation_id=235)
or pos.user_supplied_title like '%benefit consultant%');
Query 13 – time based calculations
select distinct member_sk from (
select
member_sk,
start_date,
end_date, cast(from_unixtime(unix_timestamp()-
24*3600*90,'yyyyMM') as int) d1,
cast(year(from_unixtime(unix_timestamp())) as int)*100 d2,
source_created_ts
from dim_position ) x
where
start_date >= d1 or end_date >= d1
or ((start_date = d2 or end_date = d2)
and source_created_ts >= unix_timestamp()-24*3600*90)
limit 1000;
Query 14 – many small table joins
create table u_smallem.vs_rti_ad_order
as
select
o.ad_order_sk,
sum (r.ad_impressions) as impressions,
sum (r.ad_clicks) as clicks
from agg_daily_ad_revenue r
inner join dim_ad a on r.ad_sk = a.ad_sk
inner join dim_ad_order o on r.ad_order_sk = o.ad_order_sk
inner join dim_advertiser v on v.advertiser_sk = o.advertiser_sk
where r.datepartition >= '2014-07-01' and r.datepartition <= '2014-07-31'
and r.ad_creative_size_sk in (6,8,17,29)
and v.adv_saleschannel_name like 'Field%'
and o.lars_sales_channel_name like 'Advertising Field'
and r.ad_site_sk = 1
and r.ad_zone_sk <> 1175
and o.proposal_bind_id is not null
and
(coalesce(a.lars_product_type, 'n/a') not like 'Click Tracker' or coalesce(a.lars_product_type, 'n/a') not like 'inMail'
or coalesce(a.lars_target_type, 'n/a') not like 'Partner Message' or coalesce(a.lars_target_type, 'n/a') not like 'Polls'
)
group by o.ad_order_sk
having sum(r.ad_impressions) > 9999;
drop table if exists u_smallem.vs_final;
create table u_smallem.vs_final
as
select distinct i.member_sk from (
select member_sk, f.ad_order_sk, count(1) as impr from fact_detail_ad_impressions f join u_smallem.vs_rti_ad_order u on
f.ad_order_sk = u.ad_order_sk
where date_sk >= '2014-07-01' and date_sk <= '2014-07-07'
and ad_creative_size_sk in (6,17)
--and ad_order_sk in (select distinct ad_order_sk from u_smallem.vs_rti_ad_order)
group by member_sk, f.ad_order_sk
having count(1) > 10) i join om_segment o on i.member_sk = o.member_sk
where i.member_sk > 0 and o.pageview_l30d < 3000;
Query 15 – check not exists
drop table if exists u_smallem.tmp_SDS_AU;
create table u_smallem.tmp_SDS_AU
AS select distinct member_sk
from fact_bzops_follower f1
where
company_id = 2584270
and status='A'
and not exists (
select 1
from fact_bzops_follower f2
where
company_id = 3600
and status='A'
and f2.member_sk = f1.member_sk) ;
Query1 – Test concurrent users (Presto only)
• This exercise was performed for Presto only.
• Concurrency is measured by the number of users running the query in parallel.
• For simplicity's sake, we ran the same query with 1, 2, 4, 8 and 12 users at the same time.
• Queries 3 and 4 failed with multiple concurrent users, which clearly indicates that more memory is required on the system.
• Multiple big-table joins would fail on the system when run concurrently.
Query3:
SELECT
datepartition,
coalesce(c.country_sk,-9),
COUNT(DISTINCT a.memberid)
FROM pageviewevent_flat a
inner join dim_page_key b on a.pagekey = b.page_key and b.page_key_group_sk = 39 and b.is_aggregate = 1
left outer join dim_member_cntry_lcl c on a.memberid= c.member_sk
where a.datepartition = '2014-07-11'
and a.memberid > 0
group by datepartition, coalesce(c.country_sk,-9);
Query1 – Test linear growth (7 day window)
Query:
select trackingcode, count(1) from pageviewevent_flat
where
datepartition >= '2014-07-15' and datepartition <= '2014-07-16'
group by trackingcode limit 100;
• This exercise was performed on Presto and Hive-on-Tez.
• We chose Query 1 for this test.
• Query 1 was run with 1, 2, 4 and 7 day ranges.
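For reference, the 7-day variant simply widens the partition range; assuming the window starts on the same date, it looks roughly like:
select trackingcode, count(1) from pageviewevent_flat
where
datepartition >= '2014-07-15' and datepartition <= '2014-07-21'
group by trackingcode limit 100;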
Query3 – Test linear growth (7 day window)
Query:
SELECT
a.datepartition,
d.country_sk,
COUNT(1) AS total_count,
count(distinct a.memberid) as unique_count
FROM pageviewevent_flat a
INNER JOIN dim_page_key c
ON a.pagekey=c.page_key AND c.is_aggregate = 1
left outer join dim_member_cntry_lcl d
on a.memberId = d.member_sk
WHERE a.datepartition >= '2014-07-18'
and a.datepartition <= '2014-07-19'
AND (
LOWER(a.trackingcode) LIKE 'eml_bt1%'
OR LOWER(a.trackingcode) LIKE 'emlt_bt1%'
OR LOWER(a.trackingcode) LIKE 'eml-bt1%'
OR LOWER(a.trackingcode) LIKE 'emlt-bt1%'
)
GROUP BY a.datepartition, country_sk ;
• This exercise was performed on Presto and Hive-on-Tez.
• We chose Query 3 for this test.
• Query 3 was run with 1, 2, 4 and 7 day ranges.
All queries – Holistic view
Conclusion
• Hive-on-Tez
• Pros:
• Environments that are running Hive only can benefit from Hive-on-Tez.
• Hive-on-Tez offers a considerable improvement in query performance and provides an alternative to MapReduce.
• In many cases we have seen queries speed up at least 3x-8x compared to Hive.
• Switching to Hive-on-Tez is extremely simple (set hive.execution.engine=tez).
• Cons:
• For this POC, we had to tweak many Hive configuration properties to get optimal performance for queries running on Tez. We felt this was a drawback, as we had to tune parameters specific to certain queries. This may be a hindrance for ad-hoc queries.
• There were a couple of queries that ran indefinitely and had to be terminated.
Conclusion
• Presto
• Pros:
• Presto proved to be fast and is a very good solution for ad-hoc analysis and fast table scans.
• Presto was 3x to 10x faster than Hive on MapReduce on most queries.
• Query federation is an amazing feature for joining MySQL or Teradata tables to Hive tables using Presto. This is similar to the Aster Data SQL-H feature.
• Cons:
• Requires a separate installation.
• Memory was a big issue with Presto. The concurrency test we did with multiple users clearly indicates that memory was insufficient. Also, joining two big tables requires a lot of memory, and running such joins on Presto clearly shows that this is not going to work, as it doesn't support distributed hash joins.
• DDLs are not supported.