Apache Hive
Big Data Webinar Session 3
Presenter: Amit Khandelwal
Agenda
• Introduction
• Where Hive falls in the Big Data stack
• Hive Architecture
• Hive Components
• Job Execution Flow
• Different modes of Hive
• HQL
• Hive Data Model
• Tables
• Partitioning
• Bucketing
Introduction
• What’s Hive?
• A data warehousing tool built on top of Hadoop.
• Provides a high-level abstraction: users write SQL-like queries, which Hive
translates into MapReduce, Spark, or Tez jobs.
• Designed for OLAP.
• Familiar, fast, scalable, and extensible.
• What Hive is not
• A relational database
• A design for Online Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Where does Hive Fall in the Stack?
• Data Sources
• Ingestion Layer
• Data Storage Layer
• Data Processing Layer
• Data Query Layer / Consumption Layer (exposed over JDBC and ODBC)
Hive Architecture
(Slide diagram; the components shown are:)
• Client: CLI, JDBC/ODBC, Thrift Server, Web Interface
• Driver: Parser, Compiler, Optimizer, Executor
• Metastore: backed by an RDBMS
• Execution frameworks: MapReduce, Spark, Tez, running over HDFS
Hive Components
• Hive Client or Shell Interface – CLI (Command Line Interface)
• Driver:
 Handles sessions, fetch, and execution
 Parses, plans, and optimizes queries
• Execution Engine:
 Query compilation/validation
 Query planning
 Optimizing the query plan
 Running the map and reduce tasks
• Metastore database (default is Derby)
Job Execution Flow
Different modes of Hive
Hive can operate in two modes, depending on the size of the data and the number of
data nodes in the Hadoop cluster:
I. Local mode
II. MapReduce mode
By default, Hive runs in MapReduce mode.
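Switching between the two modes is a matter of configuration. A minimal sketch, using the pre-YARN property name from the Hive versions this deck targets:

```sql
-- Force local mode for small inputs on the local machine:
SET mapred.job.tracker=local;

-- From Hive 0.7 onward, Hive can choose local mode automatically
-- when the input is small enough:
SET hive.exec.mode.local.auto=true;
```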
Hive Query Language (HQL)
• Hive provides a SQL dialect known as Hive Query Language (HQL).
• Its default database is named “default”.
• Hive stores table metadata in a Derby database by default (Derby ships with Hive).
• Example: SELECT * FROM <TableName>;
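A few representative HQL statements beyond the sample above (the table name `sales` is illustrative):

```sql
SHOW DATABASES;                -- "default" is listed among them
USE default;
SHOW TABLES;
DESCRIBE sales;                -- column names and types
SELECT * FROM sales LIMIT 10;  -- inspect a few rows
```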
Hive Data Model
• Tables
• Partitions
• Buckets
Hive Tables
• Analogous to relational database tables.
• Each table has a corresponding directory in HDFS.
• Data is stored as files within that directory.
• Types of Hive tables:
I. Internal (managed) tables
II. External tables
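The difference shows up in the DDL as the EXTERNAL keyword and an explicit LOCATION. A sketch, with illustrative table names and paths:

```sql
-- Internal (managed) table: Hive moves the data into its warehouse
-- directory and deletes the data on DROP TABLE.
CREATE TABLE managed_sales (name STRING, totalsales FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive only records the location; DROP TABLE removes
-- the metadata but leaves the files in place.
CREATE EXTERNAL TABLE external_sales (name STRING, totalsales FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales/';
```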
Partitions
• Partitioning divides a table into parts based on the values of partition columns.
• It reduces the amount of data a query must scan, which increases performance.
• A partition is usually represented as a directory on HDFS.
• Example: CREATE TABLE sales (name STRING, totalsales FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
Partitions - II
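The speaker notes give this slide's worked example: creating a partitioned table and loading a file into one partition (paths are illustrative). Each loaded partition lands under its own directory, e.g. `.../newsales/country=US/year=2012/month=10/`:

```sql
CREATE TABLE newsales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Static-partition load: the file goes into exactly this partition.
LOAD DATA LOCAL INPATH '/hivePath/sales.csv'
INTO TABLE newsales PARTITION (country='US', year=2012, month=10);
```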
Buckets
• Partitions are subdivided into buckets, to give the data extra structure that
can be used for more efficient querying.
• Bucketing assigns rows to buckets based on a hash function of a table column.
• set hive.enforce.bucketing=true;
• CREATE TABLE sales
(openingbid FLOAT, finalbid FLOAT, itemtype STRING, days INT)
CLUSTERED BY (openingbid) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
• INSERT OVERWRITE TABLE sales SELECT * FROM nteg_demo.testing;
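One place bucketing pays off is sampling: Hive's standard TABLESAMPLE clause can read a single bucket instead of scanning the whole table. A sketch against the bucketed `sales` table above:

```sql
-- Reads only the first of the 5 buckets (roughly 20% of the rows).
SELECT AVG(finalbid)
FROM sales TABLESAMPLE (BUCKET 1 OUT OF 5 ON openingbid);
```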
Pros
• Provides an easy way to process large-scale data.
• Distributed data warehouse.
• Supports a SQL-like language called HiveQL (HQL).
• Efficient execution plans for performance.
• Interoperability with other databases.
Limitations
• Not designed for online (real-time) data processing.
• High latency.
• Limited support for transaction processing.
Thank you
Amit Khandelwal
Editor's Notes
  1. Apache Hive is an ETL and data warehousing tool built on top of Hadoop for summarizing, analyzing, and querying large data sets on the open-source Hadoop platform. Tables in Hive are similar to tables in a relational database, and data can be organized from coarser to more granular units with Partitioning and Bucketing, which we will see in the coming slides. Because Hadoop is a batch-oriented system, Hive does not support OLTP (Online Transaction Processing); it is closer to OLAP (Online Analytical Processing), though not ideal even there, since there is significant latency between issuing a query and receiving a reply, due to the overhead of MapReduce jobs and the size of the data sets Hadoop was designed to serve. Hive is based on the notion of write once, read many times, whereas an RDBMS is designed for reading and writing many times.
  2. MapReduce is a programming paradigm for processing data and one of the core components of Hadoop. A MapReduce program is composed of three operations. Map: each worker node applies the map function to its local data and writes the output to temporary storage; a master node ensures that only one copy of redundant input data is processed. Shuffle: worker nodes redistribute data based on the output keys produced by the map function, so that all data belonging to one key ends up on the same worker node. Reduce: worker nodes process each group of output data, per key, in parallel and produce the required output. Now think of a situation where you have to analyze a large distributed data set: each time you want to execute a query, you have to write a custom MapReduce Java program. You already know how long the Java code gets and how time-consuming it is to write a MapReduce job for every query. Can you imagine how uncomfortable that would be? This was the situation at the beginning of the big data world: every SQL-style query had to be implemented against the MapReduce Java API in order to execute over the distributed data. This is what changed with the arrival of Hive. Hive provides the required SQL abstraction by integrating SQL-like queries (HiveQL) with the underlying Java, without the need to implement queries in the low-level Java API. HQL separates the user from the complexity of MapReduce programming and reuses familiar concepts from the relational database world, such as tables, rows, columns, and schemas, for ease of learning.
  4. Hadoop's key components are YARN and HDFS. YARN is the resource manager, while HDFS is Hadoop's distributed file system.
  5. 1. Most interactions take place over a command line interface (CLI); Hive provides a CLI for writing queries in Hive Query Language (HQL). 2. Driver: communicates with JDBC, ODBC, and other client applications, and processes their requests against the metastore and file systems for further processing. 3. Metastore: Hive keeps a relational database on the master node to track state. For instance, when you run CREATE TABLE FOO(foo string) LOCATION 'hdfs://tmp/';, the table schema is stored in this database. If you have a partitioned table, the partitions are stored there too (this lets Hive enumerate partitions without going to the file system to find them). These sorts of things are the 'metadata'. Hive ships with a default metastore (Derby), but you can switch it to another RDBMS. Derby allows only one connection, which is why you don't see Hive use Derby in a production environment: for single-user metadata storage Hive uses Derby, and for multi-user or shared metadata it typically uses MySQL. Connecting to Hive via the CLI also establishes a connection to the metastore.
  6. From the diagram we can follow the job execution flow in Hive on Hadoop. 1. A query is submitted from the UI (user interface). 2. The driver asks the compiler for a plan (the query execution plan and the related metadata to gather). 3.–5. The compiler requests metadata from the metastore, the metastore returns it, and the compiler builds the plan for the job to be executed. 6. The compiler sends the proposed plan back to the driver. 7. The driver sends the execution plan to the execution engine. 8.–10. The execution engine (EE) acts as a bridge between Hive and Hadoop: for DFS operations it first contacts the NameNode (for metadata only) and then the DataNodes to fetch the desired records, since the actual table data resides only on the DataNodes. 11.–12. The EE also communicates bidirectionally with the metastore to perform DDL (Data Definition Language) operations such as CREATE, DROP, and ALTER on tables and databases; the metastore stores only database names, table names, and column names. 13.–14. The EE communicates with the Hadoop daemons (NameNode, DataNodes, JobTracker) to execute the query on top of the Hadoop file system; once the results are fetched from the DataNodes, it sends them back to the driver and on to the UI (front end). Hive stays in continuous contact with the Hadoop file system and its daemons via the execution engine, shown by the dotted arrows in the job flow diagram.
  7. Local mode: processing is very fast on small data sets held on the local machine. MapReduce mode: queries execute in parallel, giving better performance on large data sets. You can also set which mode you want Hive to work in: by default it works in MapReduce mode, and for local mode you can set SET mapred.job.tracker=local;. From Hive version 0.7 it supports a mode that runs MapReduce jobs in local mode automatically.
  8. HQL syntax is similar to the SQL syntax that most of us are familiar with. Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution. The sample query on the slide displays all records in the named table. One more term appears here: HiveServer2 (HS2), a server interface that enables remote clients to execute queries against Hive and retrieve the results. Recent versions add advanced features based on Thrift RPC, such as multi-client concurrency and authentication.
  9. Tables in Hive are a logical view of the stored data. Decide first how you want to access the data; that determines how you partition and bucket it. Tables are stored in HDFS as files, and data loaded into tables is stored on HDFS across the Hadoop cluster. Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC (Optimized Row Columnar), and RCFILE (Record Columnar File); the binary file formats offer high compression rates. By default Hive creates internal (managed) tables and manages their data, meaning Hive moves the data into its warehouse directory. An external table, by contrast, tells Hive to refer to data at an existing location outside the warehouse directory. The metastore is the link that tracks this table metadata.
  10. Hive has been one of the preferred tools for running queries on large data sets, especially when a full table scan is done. Consider a sales table with millions of records and columns for commodity, totalsales, country, year, and month. To get the total sales of a commodity x in the US in March 2012, how much time would it take to scan millions of records? Now consider another scenario, where the table is partitioned on the country, year, and month columns. Partitioning organizes a big table by dividing it into parts based on partition keys, grouping data of the same type together. A partition is usually represented as a directory on HDFS, so each partition's rows are grouped under its own directory. For tables that are not partitioned, all the files in the table's data directory are read and filters are applied as a subsequent phase; this becomes slow and expensive, especially for large tables. Partitioning is therefore often used to distribute load horizontally: if you have a big table, partitioning helps by reducing the amount of data your queries read.
  11. CREATE TABLE newsales (sale_id INT, amount FLOAT) PARTITIONED BY (country STRING, year INT, month INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; LOAD DATA LOCAL INPATH '/hivePath/sales.csv' INTO TABLE newsales PARTITION (country='US', year=2012, month=10); CREATE TABLE auctionwithpartition (openingbid FLOAT, finalbid FLOAT, itemtype STRING) PARTITIONED BY (days INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA LOCAL INPATH '/hivePath/auctiondata.csv' INTO TABLE auctionwithpartition PARTITION (days=7);
  12. Data in each partition can be divided into buckets based on a hash function of a column. Each bucket is stored as a file in the partition directory.
  13. HBASE
  14. Not designed for OLTP, hence no real-time access to data. High latency: Hive takes little time to load data because of its schema-on-read property, but queries take longer because data has to be verified against the schema at query time. Hive previously did not support transaction processing because it had no support for ACID properties; ACID support was added in Hive 0.14, but it leads to performance degradation.