Apache Hive
Big Data Webinar Session 3
Presenter: Amit Khandelwal
Agenda
• Introduction
• Where Hive falls in the Big Data stack
• Hive Architecture
• Hive Components
• Job Execution Flow
• Different modes of Hive
• HQL
• Hive Data Model
• Tables
• Partitioning
• Bucketing
Introduction
• What’s Hive?
• A data warehousing tool built on top of Hadoop.
• Provides a high-level abstraction: users write SQL-like queries, which Hive
translates into MapReduce, Spark, or Tez jobs.
• Designed for OLAP.
• Familiar, fast, scalable, and extensible.
• What Hive is not
• A relational database
• A design for Online Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Where does Hive Fall in the Stack?
• Data Sources
• Ingestion Layer
• Data Storage Layer
• Data Processing Layer
• Data Query Layer / Consumption Layer (exposed over JDBC and ODBC)
Hive Architecture
(Slide diagram; the components shown are:)
• Client: CLI, JDBC/ODBC, Thrift Server, Web Interface
• Driver: Parser, Compiler, Optimizer, Executor
• Metastore: backed by an RDBMS
• Execution frameworks: MapReduce, Spark, Tez, running over HDFS
Hive Components
• Hive Client or Shell Interface – CLI (Command Line Interface)
• Driver:
 Handles sessions, fetch, and execution
 Parses, plans, and optimizes queries
• Execution Engine:
 Query compilation/validation
 Query planning
 Optimizing the query plan
 Running the map and reduce tasks
• Metastore database (default is Derby)
Job Execution Flow
Different modes of Hive
Hive can operate in two modes, depending on the size of the data and the number of
data nodes in the Hadoop cluster:
I. Local mode
II. MapReduce mode
By default, Hive runs in MapReduce mode.
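Switching between the two modes is a matter of configuration. A minimal sketch, using the pre-YARN property name from the Hive versions this deck targets:

```sql
-- Force local mode for small inputs on the local machine:
SET mapred.job.tracker=local;

-- From Hive 0.7 onward, Hive can choose local mode automatically
-- when the input is small enough:
SET hive.exec.mode.local.auto=true;
```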
Hive Query Language (HQL)
• Hive provides a SQL dialect known as Hive Query Language (HQL).
• Its default database is named “default”.
• Hive stores table metadata in a Derby database by default (Derby ships with Hive).
• Example: SELECT * FROM <TableName>;
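A few representative HQL statements beyond the sample above (the table name `sales` is illustrative):

```sql
SHOW DATABASES;                -- "default" is listed among them
USE default;
SHOW TABLES;
DESCRIBE sales;                -- column names and types
SELECT * FROM sales LIMIT 10;  -- inspect a few rows
```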
Hive Data Model
• Tables
• Partitions
• Buckets
Hive Tables
• Analogous to relational database tables.
• Each table has a corresponding directory in HDFS.
• Data is stored as files within that directory.
• Types of Hive tables:
I. Internal (managed) tables
II. External tables
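The difference shows up in the DDL as the EXTERNAL keyword and an explicit LOCATION. A sketch, with illustrative table names and paths:

```sql
-- Internal (managed) table: Hive moves the data into its warehouse
-- directory and deletes the data on DROP TABLE.
CREATE TABLE managed_sales (name STRING, totalsales FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive only records the location; DROP TABLE removes
-- the metadata but leaves the files in place.
CREATE EXTERNAL TABLE external_sales (name STRING, totalsales FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales/';
```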
Partitions
• Partitioning divides a table into parts based on the values of partition columns.
• It reduces the amount of data a query must scan, which increases performance.
• A partition is usually represented as a directory on HDFS.
• Example: CREATE TABLE sales (name STRING, totalsales FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
Partitions - II
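The speaker notes give this slide's worked example: creating a partitioned table and loading a file into one partition (paths are illustrative). Each loaded partition lands under its own directory, e.g. `.../newsales/country=US/year=2012/month=10/`:

```sql
CREATE TABLE newsales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Static-partition load: the file goes into exactly this partition.
LOAD DATA LOCAL INPATH '/hivePath/sales.csv'
INTO TABLE newsales PARTITION (country='US', year=2012, month=10);
```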
Buckets
• Partitions are subdivided into buckets, to give the data extra structure that
can be used for more efficient querying.
• Bucketing assigns rows to buckets based on a hash function of a table column.
• set hive.enforce.bucketing=true;
• CREATE TABLE sales
(openingbid FLOAT, finalbid FLOAT, itemtype STRING, days INT)
CLUSTERED BY (openingbid) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
• INSERT OVERWRITE TABLE sales SELECT * FROM nteg_demo.testing;
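One place bucketing pays off is sampling: Hive's standard TABLESAMPLE clause can read a single bucket instead of scanning the whole table. A sketch against the bucketed `sales` table above:

```sql
-- Reads only the first of the 5 buckets (roughly 20% of the rows).
SELECT AVG(finalbid)
FROM sales TABLESAMPLE (BUCKET 1 OUT OF 5 ON openingbid);
```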
Pros
• Provides an easy way to process large-scale data.
• Distributed data warehouse.
• Supports a SQL-like language called HiveQL (HQL).
• Efficient execution plans for performance.
• Interoperability with other databases.
Limitations
• Not designed for online (real-time) data processing.
• High latency.
• Limited support for transaction processing.
Thank you
Amit Khandelwal
Editor's Notes
  1. Apache Hive is an ETL and data warehousing tool built on top of Hadoop for summarizing, analyzing, and querying large data sets on the open-source Hadoop platform. Tables in Hive are similar to tables in a relational database, and data can be organized from coarser to more granular units with Partitioning and Bucketing, which we will see in the coming slides. Because Hadoop is a batch-oriented system, Hive does not support OLTP (Online Transaction Processing); it is closer to OLAP (Online Analytical Processing), though not ideal even there, since there is significant latency between issuing a query and receiving a reply, due to the overhead of MapReduce jobs and the size of the data sets Hadoop was designed to serve. Hive is based on the notion of write once, read many times, whereas an RDBMS is designed for reading and writing many times.
  2. MapReduce is a programming paradigm for processing data and one of the core components of Hadoop. A MapReduce program is composed of three operations. Map: each worker node applies the map function to its local data and writes the output to temporary storage; a master node ensures that only one copy of redundant input data is processed. Shuffle: worker nodes redistribute data based on the output keys produced by the map function, so that all data belonging to one key ends up on the same worker node. Reduce: worker nodes process each group of output data, per key, in parallel and produce the required output. Now think of a situation where you have to analyze a large distributed data set: each time you want to execute a query, you have to write a custom MapReduce Java program. You already know how long the Java code gets and how time-consuming it is to write a MapReduce job for every query. Can you imagine how uncomfortable that would be? This was the situation at the beginning of the big data world: every SQL-style query had to be implemented against the MapReduce Java API in order to execute over the distributed data. This is what changed with the arrival of Hive. Hive provides the required SQL abstraction by integrating SQL-like queries (HiveQL) with the underlying Java, without the need to implement queries in the low-level Java API. HQL separates the user from the complexity of MapReduce programming and reuses familiar concepts from the relational database world, such as tables, rows, columns, and schemas, for ease of learning.
  4. Hadoop's key components are YARN and HDFS. YARN is the resource manager, while HDFS is Hadoop's distributed file system.
  5. 1. Most interactions take place over a command line interface (CLI); Hive provides a CLI for writing queries in Hive Query Language (HQL). 2. Driver: communicates with JDBC, ODBC, and other client applications, and processes their requests against the metastore and file systems for further processing. 3. Metastore: Hive keeps a relational database on the master node to track state. For instance, when you run CREATE TABLE FOO(foo string) LOCATION 'hdfs://tmp/';, the table schema is stored in this database. If you have a partitioned table, the partitions are stored there too (this lets Hive enumerate partitions without going to the file system to find them). These sorts of things are the 'metadata'. Hive ships with a default metastore (Derby), but you can switch it to another RDBMS. Derby allows only one connection, which is why you don't see Hive use Derby in a production environment: for single-user metadata storage Hive uses Derby, and for multi-user or shared metadata it typically uses MySQL. Connecting to Hive via the CLI also establishes a connection to the metastore.
  6. From the diagram we can follow the job execution flow in Hive on Hadoop. 1. A query is submitted from the UI (user interface). 2. The driver asks the compiler for a plan (the query execution plan and the related metadata to gather). 3.–5. The compiler requests metadata from the metastore, the metastore returns it, and the compiler builds the plan for the job to be executed. 6. The compiler sends the proposed plan back to the driver. 7. The driver sends the execution plan to the execution engine. 8.–10. The execution engine (EE) acts as a bridge between Hive and Hadoop: for DFS operations it first contacts the NameNode (for metadata only) and then the DataNodes to fetch the desired records, since the actual table data resides only on the DataNodes. 11.–12. The EE also communicates bidirectionally with the metastore to perform DDL (Data Definition Language) operations such as CREATE, DROP, and ALTER on tables and databases; the metastore stores only database names, table names, and column names. 13.–14. The EE communicates with the Hadoop daemons (NameNode, DataNodes, JobTracker) to execute the query on top of the Hadoop file system; once the results are fetched from the DataNodes, it sends them back to the driver and on to the UI (front end). Hive stays in continuous contact with the Hadoop file system and its daemons via the execution engine, shown by the dotted arrows in the job flow diagram.
  7. Local mode: processing is very fast on small data sets held on the local machine. MapReduce mode: queries execute in parallel, giving better performance on large data sets. You can also set which mode you want Hive to work in: by default it works in MapReduce mode, and for local mode you can set SET mapred.job.tracker=local;. From Hive version 0.7 it supports a mode that runs MapReduce jobs in local mode automatically.
  8. HQL syntax is similar to the SQL syntax that most of us are familiar with. Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution. The sample query on the slide displays all records in the named table. One more term appears here: HiveServer2 (HS2), a server interface that enables remote clients to execute queries against Hive and retrieve the results. Recent versions add advanced features based on Thrift RPC, such as multi-client concurrency and authentication.
  9. Tables in Hive are a logical view of the stored data. Decide first how you want to access the data; that determines how you partition and bucket it. Tables are stored in HDFS as files, and data loaded into tables is stored on HDFS across the Hadoop cluster. Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC (Optimized Row Columnar), and RCFILE (Record Columnar File); the binary file formats offer high compression rates. By default Hive creates internal (managed) tables and manages their data, meaning Hive moves the data into its warehouse directory. An external table, by contrast, tells Hive to refer to data at an existing location outside the warehouse directory. The metastore is the link that tracks this table metadata.
  10. Hive has been one of the preferred tools for running queries on large data sets, especially when a full table scan is done. Consider a sales table with millions of records and columns for commodity, totalsales, country, year, and month. To get the total sales of a commodity x in the US in March 2012, how much time would it take to scan millions of records? Now consider another scenario, where the table is partitioned on the country, year, and month columns. Partitioning organizes a big table by dividing it into parts based on partition keys, grouping data of the same type together. A partition is usually represented as a directory on HDFS, so each partition's rows are grouped under its own directory. For tables that are not partitioned, all the files in the table's data directory are read and filters are applied as a subsequent phase; this becomes slow and expensive, especially for large tables. Partitioning is therefore often used to distribute load horizontally: if you have a big table, partitioning helps by reducing the amount of data your queries read.
  11. CREATE TABLE newsales (sale_id INT, amount FLOAT) PARTITIONED BY (country STRING, year INT, month INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; LOAD DATA LOCAL INPATH '/hivePath/sales.csv' INTO TABLE newsales PARTITION (country='US', year=2012, month=10); CREATE TABLE auctionwithpartition (openingbid FLOAT, finalbid FLOAT, itemtype STRING) PARTITIONED BY (days INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA LOCAL INPATH '/hivePath/auctiondata.csv' INTO TABLE auctionwithpartition PARTITION (days=7);
  12. Data in each partition can be divided into buckets based on a hash function of a column. Each bucket is stored as a file in the partition directory.
  13. HBASE
  14. Not designed for OLTP, hence no real-time access to data. High latency: Hive takes little time to load data because of its schema-on-read property, but queries take longer because data has to be verified against the schema at query time. Hive previously did not support transaction processing because it had no support for ACID properties; ACID support was added in Hive 0.14, but it leads to performance degradation.