A tutorial presentation based on the github.com/amplab/shark documentation.
I gave this presentation at Amirkabir University of Technology as the teaching assistant for Dr. Amir H. Payberah's Cloud Computing course in the spring semester of 2015.
2. Purpose
This guide describes how to get Shark running locally. It creates a small Hive
installation on one machine and allows you to execute simple queries.
The only prerequisite for this guide is that you have Java and Scala
2.9.3 installed on your machine. If you don't have Scala 2.9.3, you can
download it by running:
$ wget http://www.scala-lang.org/files/archive/scala-2.9.3.tgz
$ tar xvfz scala-2.9.3.tgz
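Before going further, it can help to confirm that the prerequisites are visible from your shell. The following is only a minimal sketch: have_tool is a hypothetical helper written for this guide, not part of Shark or Scala. If you unpacked the Scala tarball as shown above, scala does not have to be on your PATH, because Shark is pointed at it through SCALA_HOME later on.

```shell
# Hypothetical helper (not part of Shark): report whether a command is on PATH.
have_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

# Check both prerequisites; '|| true' keeps going if one is missing.
have_tool java || true
have_tool scala || true
```

If scala is found, running scala -version lets you confirm that it is 2.9.3.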
3. Running Shark In Other Modes
• You can also run Shark in one of three other supported modes:
• Running Shark on EC2
• Running Shark on a Cluster
• Running Shark with Tachyon
4. Let’s Start…(1/3)
• Download the binary distribution of Shark 0.8.
• The package contains two folders, shark-0.8.0 and hive-0.9.0-shark-0.8.0-bin.
$ wget https://github.com/amplab/shark/releases/download/v0.8.0/shark-0.8.0-bin-hadoop1.tgz # Hadoop 1/CDH3 - or -
$ wget https://github.com/amplab/shark/releases/download/v0.8.0/shark-0.8.0-bin-cdh4.tgz # Hadoop 2/CDH4
$ tar xvfz shark-*-bin-*.tgz
$ cd shark-*-bin-*/ # The trailing slash matches only the extracted directory, not the .tgz file
• The Shark code is in the shark-0.8.0/ directory.
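If you script the download, the choice between the two packages can be made explicit. This is a sketch only: shark_url is a hypothetical helper, and the flavor names hadoop1 and cdh4 are simply taken from the two file names above.

```shell
# Hypothetical helper: print the Shark 0.8.0 download URL for a Hadoop flavor.
shark_url() {
  base="https://github.com/amplab/shark/releases/download/v0.8.0"
  case "$1" in
    hadoop1) echo "$base/shark-0.8.0-bin-hadoop1.tgz" ;;  # Hadoop 1/CDH3
    cdh4)    echo "$base/shark-0.8.0-bin-cdh4.tgz" ;;     # Hadoop 2/CDH4
    *)       echo "unknown flavor: $1" >&2; return 1 ;;
  esac
}

# Print the URL only; nothing is downloaded here.
shark_url hadoop1
```

You could then fetch the package with wget "$(shark_url cdh4)".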
5. Let’s Start…(2/3)
• To set up your environment to run Shark locally, you need to set the
HIVE_HOME and SCALA_HOME environment variables in the file
shark-0.8.0/conf/shark-env.sh to point to the folders you just downloaded.
• Shark comes with a template file shark-env.sh.template that you can
copy and modify to get started:
$ cp shark-0.8.0/conf/shark-env.sh.template shark-0.8.0/conf/shark-env.sh
• Now edit the following two lines in shark-env.sh:
export HIVE_HOME=/path/to/hive-0.9.0-shark-0.8.0-bin
export SCALA_HOME=/path/to/scala-2.9.3
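If you prefer not to edit the file by hand, the two exports can be set from a script. The sketch below is an assumption-laden illustration: set_export is a hypothetical helper, and it assumes the variable either already appears as an uncommented "export NAME=..." line or is absent (in which case a line is appended). It runs against a scratch file here, not your real shark-env.sh.

```shell
# Hypothetical helper: set "export NAME=value" in a config file,
# replacing an existing export line for NAME or appending a new one.
set_export() {
  file=$1
  name=$2
  value=$3
  if grep -q "^export $name=" "$file"; then
    # '|' is the sed delimiter because the values are paths containing '/'.
    sed "s|^export $name=.*|export $name=$value|" "$file" > "$file.tmp" &&
      mv "$file.tmp" "$file"
  else
    echo "export $name=$value" >> "$file"
  fi
}

# Demo on a scratch file that already has an empty HIVE_HOME line.
demo=$(mktemp)
printf 'export HIVE_HOME=""\n' > "$demo"
set_export "$demo" HIVE_HOME /path/to/hive-0.9.0-shark-0.8.0-bin
set_export "$demo" SCALA_HOME /path/to/scala-2.9.3
cat "$demo"
```

The helper does not escape sed metacharacters, so it is only suitable for plain paths without '|' or '&'.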
6. Let’s Start…(3/3)
• Next, create the default Hive warehouse directory. This is where Hive will
store table data for native tables:
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod 0777 /user/hive/warehouse # Or make your username the owner
• You can now start the Shark CLI from the shark-0.8.0/ directory:
$ ./bin/shark
• In addition to the Shark CLI, there are several executables in shark-0.8.0/bin:
bin/shark-withdebug: Runs Shark CLI with DEBUG level logs printed to the console.
bin/shark-withinfo: Runs Shark CLI with INFO level logs printed to the console.
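If the CLI later fails to create tables, one thing worth ruling out is the warehouse directory from the first step on this slide. The following sketch uses a hypothetical helper, check_warehouse, which only tests that the directory exists and is writable by the current user; it says nothing about Shark or Hive themselves.

```shell
# Hypothetical helper: report whether a directory exists and is writable.
# Defaults to the Hive warehouse path used in this guide.
check_warehouse() {
  dir=${1:-/user/hive/warehouse}
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "ok: $dir"
  else
    echo "not ready: $dir"
    return 1
  fi
}

# '|| true' keeps a script going even if the directory is not ready yet.
check_warehouse || true
```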
7. Lab Assignment
1. Launch the Shark shell.
2. Create a table called book … .
3. List all the columns of the table book.
4. Load the book table from the file books in
the local filesystem.
5. Create a table called novel, containing
those records from table book … .
6. Print out the list of available tables.
7. Count the number of records from the
table book.
8. Print out the total cost of the books with
authors who have the same last name.
9. Count the number of distinct last names.
10. Drop the tables.
8. Lab Assignment 5 (1/5)
1. Launch the Shark shell.
2. Create a table called book whose schema includes book's title,
description, author's first name, last name, and cost.
3. List all the columns of the table book.
$ ./bin/shark
create table
book(title string, description string, firstname string, lastname string, cost int)
row format delimited fields terminated by '\t';
describe book;
9. Lab Assignment 5 (2/5)
4. Load the book table from the file books in the local filesystem. The books
file has the following format (one record per line, fields separated by tabs):
Speed love Long book about love Brian Dog 10
Long day Story about Monday Emily Blue 20
Flying Car Novel about airplanes Phil High 5
Short day Novel about a day Phil Dog 30
load data local inpath 'books' into table book;
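Because several fields contain spaces, the books file needs a real tab character between each of the five fields for the table definition to split them correctly. One way to create it, as a sketch, is to let printf insert the tabs; the file name books matches the load statement, and the records are the four sample rows from this slide.

```shell
# Write the four sample records with real tab characters between fields.
# printf reuses the format string for each group of five arguments.
printf '%s\t%s\t%s\t%s\t%s\n' \
  'Speed love' 'Long book about love' Brian Dog 10 \
  'Long day' 'Story about Monday' Emily Blue 20 \
  'Flying Car' 'Novel about airplanes' Phil High 5 \
  'Short day' 'Novel about a day' Phil Dog 30 > books

# Sanity check: every line should have exactly 5 tab-separated fields.
awk -F '\t' 'NF != 5 { bad++ } END { print NR " records, " bad+0 " malformed" }' books
```

Run this in the directory from which you start the Shark shell, since the load statement refers to books by a relative path.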
10. Lab Assignment 5 (3/5)
As an alternative solution, you can create an external table. The
external keyword lets you create a table and provide a location, so that
Hive does not use its default location for this table. This is useful if
you already have generated data.
create external table
exbook(title string, description string, firstname string, lastname string, cost int)
row format delimited fields terminated by '\t'
location '<file location, excluding the name of the file>';
5. Create a table called novel, containing those records from table book
that have keyword “novel” in their description and cache it in memory.
create table novel TBLPROPERTIES('shark.cache'='MEMORY_ONLY')
as select * from book where description like "%Novel%";
11. Lab Assignment 5 (4/5)
6. Print out the list of available tables.
show tables;
7. Count the number of records from the table book.
select count(*) from book;
8. Print out the total cost of the books with authors who have the same last
name.
select lastname, sum(cost) from book group by lastname;
9. Count the number of distinct last names.
select count(distinct lastname) from book;
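To cross-check what the queries for steps 7 to 9 should return, the same aggregations can be approximated outside Shark with standard Unix tools. This is only a sketch over the four sample records from the books file, recreated here in a scratch file; it does not run Shark and is not a substitute for the queries themselves.

```shell
# Recreate the sample data in a scratch file (tab-separated fields).
data=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  'Speed love' 'Long book about love' Brian Dog 10 \
  'Long day' 'Story about Monday' Emily Blue 20 \
  'Flying Car' 'Novel about airplanes' Phil High 5 \
  'Short day' 'Novel about a day' Phil Dog 30 > "$data"

# Step 7 equivalent: number of records.
wc -l < "$data"

# Step 8 equivalent: total cost per last name (field 4 = lastname, field 5 = cost).
awk -F '\t' '{ total[$4] += $5 } END { for (n in total) print n "\t" total[n] }' "$data" | sort

# Step 9 equivalent: number of distinct last names.
cut -f4 "$data" | sort -u | wc -l
```

For the sample data this gives 4 records, totals of 20 for Blue, 40 for Dog and 5 for High, and 3 distinct last names, which is what the Shark queries should report as well.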
12. Lab Assignment 5 (5/5)
10. Drop the tables.
drop table book;
drop table novel;