SlideShare a Scribd company logo
1 of 33
HADOOP WITH PYTHON
Donald Miner
@donaldpminer
DC Python Meetup
3/10/15
Agenda
• Introduction to Hadoop
• MapReduce with mrjob
• Pig with Python UDFs
• snakebite for HDFS
• HBase and python clients
• Spark and PySpark
Hadoop Distributed File System (HDFS)
• Stores files in folders (that’s it)
• Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicas of each block (better safe than sorry)
• Blocks are scattered all over the place
FILE BLOCKS
MapReduce
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers
Reducers (you code this, too)
Automatically Groups by the
mapper’s output key
Aggregate, count, statistics
Outputs to HDFS
Mappers (you code this)
Loads data from HDFS
Filter, transform, parse
Outputs (key, value)
pairs
Hadoop Ecosystem
• Higher-level languages like Pig and Hive
• HDFS Data systems like HBase and Accumulo
• Alternative execution engines like Storm and Spark
• Close friends like ZooKeeper, Flume, Avro, Kafka
Cool Thing #1: Linear Scalability
• HDFS and MapReduce
scale linearly
• If you have twice as many
computers, jobs run twice
as fast
• If you have twice as much
data, jobs run twice as
slow
• If you have twice as many
computers, you can store
twice as much data
DATA LOCALITY!!
Cool Thing #2: Schema on Read
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
What implications does this have?
BEFORE:
ETL, schema design upfront,
tossing out original data,
comprehensive data study
Keep original data around!
Have multiple views of the same data!
Work with unstructured data sooner!
Store first, figure out what to do with it later!
WITH HADOOP:
Cool Thing #3: Transparent Parallelism
Network programming?
Inter-process communication?
Threading?
Distributed stuff?
With MapReduce, I DON’T CARE
Your solution
… I just have to be sure my solution fits into this tiny box
Fault tolerance?
Code deployment?
RPC?
Message passing?
Locking?
MapReduce
Framework
Data storage?
Scalability?
Data center fires?
Cool Thing #4: Unstructured Data
• Unstructured data:
media, text,
forms, log data
lumped structured data
• Query languages like SQL and
Pig assume some sort of
“structure”
• MapReduce is just Java:
You can do anything Java can
do in a Mapper or Reducer
Why Python?
• Python vs. Java
• Compiled vs. scripts
• Python libraries we all love
• Integration with other things
Why Not?
• Python vs. Java
• Almost nothing is native
• Performance
• Being out of date
• Being “weird”
• Smaller community, almost no official support
Questions about Hadoop?
mrjob
• Write MapReduce jobs in Python!
• Open sourced and maintained by Yelp
• Wraps “Hadoop Streaming” in cpython Python 2.5+
• Well documented
• Can run locally, in Amazon EMR, or Hadoop
Canonical Word Count
from mrjob.job import MRJob
import re
WORD_RE = re.compile(r"[w']+")
class MRWordFreqCount(MRJob):
def mapper(self, _, line):
for word in WORD_RE.findall(line):
yield (word.lower(), 1)
def reducer(self, word, counts):
yield (word, sum(counts))
if __name__ == '__main__':
MRWordFreqCount.run()
Canonical Word Count
from mrjob.job import MRJob
import re
WORD_RE = re.compile(r"[w']+")
class MRWordFreqCount(MRJob):
def mapper(self, _, line):
for word in WORD_RE.findall(line):
yield (word.lower(), 1)
def reducer(self, word, counts):
yield (word, sum(counts))
if __name__ == '__main__':
MRWordFreqCount.run()
The quick brown fox jumps over the lazy dog
the, 1
quick, 1
brown, 1
fox, 1
jumps, 1
over, 1
the, 1
lazy, 1
dog, 1
Canonical Word Count
from mrjob.job import MRJob
import re
WORD_RE = re.compile(r"[w']+")
class MRWordFreqCount(MRJob):
def mapper(self, _, line):
for word in WORD_RE.findall(line):
yield (word.lower(), 1)
def reducer(self, word, counts):
yield (word, sum(counts))
if __name__ == '__main__':
MRWordFreqCount.run()
I like this Hadoop thing
i, 1
like, 1
this, 1
hadoop, 1
thing, 1
Canonical Word Count
from mrjob.job import MRJob
import re
WORD_RE = re.compile(r"[w']+")
class MRWordFreqCount(MRJob):
def mapper(self, _, line):
for word in WORD_RE.findall(line):
yield (word.lower(), 1)
def reducer(self, word, counts):
yield (word, sum(counts))
if __name__ == '__main__':
MRWordFreqCount.run()
dog, [1, 1, 1, 1, 1, 1]
dog, 6
Canonical Word Count
from mrjob.job import MRJob
import re
WORD_RE = re.compile(r"[w']+")
class MRWordFreqCount(MRJob):
def mapper(self, _, line):
for word in WORD_RE.findall(line):
yield (word.lower(), 1)
def reducer(self, word, counts):
yield (word, sum(counts))
if __name__ == '__main__':
MRWordFreqCount.run()
cat, [1, 1, 1, 1, 1, 1, 1, 1]
cat, 8
MRJOB DEMO!
Other options
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Hadoop Streaming – More manual but faster
Hadoopy, Dumbo, haven’t seen commits in years, mrjob in the past 12 hours
Pydoop is main competitor (not in this list)
Pydoop
• Write MapReduce jobs in Python!
• Uses Hadoop C++ Pipes, which should be faster than
wrapping streaming
• Actively being worked on
• I’m not sure which is better
Pydoop Word Count
with open('stop.txt') as f:
STOP_WORDS = set(l.strip() for l in f if not l.isspace())
def mapper(_, v, writer):
for word in v.split():
if word in STOP_WORDS:
writer.count("STOP_WORDS", 1)
else:
writer.emit(word, 1)
def reducer(word, icounts, writer):
writer.emit(word, sum(map(int, icounts)))
$ pydoop script wc.py hdfs_input hdfs_output --upload-
file-to-cache stop.txt
Pig
• Pig is a higher-level platform and language for analyzing
data that happens to run MapReduce underneath
a = LOAD ’inputdata.txt’;
b = FOREACH a GENERATE
FLATTEN(TOKENIZE((chararray)$0)) as word;
c = GROUP b BY word;
d = FOREACH c GENERATE group, COUNT(c);
STORE d INTO ‘wc';
Pig UDFs
Users can write user-defined functions to extend the
functionality of Pig
Can use jython (faster) or cpython (access to more libs)
b = FOREACH a GENERATE revster(phonenum);
...
m = GROUP j BY username;
n = FOREACH m GENERATE group, sortedconcat(j.tags);
@outputSchema(“tags:chararray")
def sortedconcat(bag):
out = set()
for tag in bag:
out.add(tag)
return ‘-’.join(sorted(out))
@outputSchema(“rev:chararray")
def revstr(instr):
return instr[::-1]
• A pure Python client
• Handles most NameNode ops (moving/renaming files,
deleting files)
• Handles most DataNode reading ops (reading files,
getmerge)
• Doesn’t handle writing to DataNodes yet
• Two ways to use: library and command line interface
- Library
from snakebite.client import Client
client = Client(”1.2.3.4", 54310, use_trash=False)
for x in client.ls(['/data']):
print x
print ‘’.join(client.cat(‘/data/ref/refdata*.csv’))
Useful for doing HDFS file manipulation in data flows or job setups
Can be used to read reference data from MapReduce jobs
- CLI
$ snakebite get /path/in/hdfs/mydata.txt /local/path/data.txt
$ snakebite rm /path/in/hdfs/mydata.txt
$ for fp in `snakebite ls /data/new/`; do
snakebite mv “/data/new/$fp” “/data/in/`date ‘+%Y/%m/%d/’$fp
done
The “hadoop” CLI client is written in Java and spins up a new JVM every time (1-3 sec)
Snakebite doesn’t have that problem, making it good for lots of programmatic
interactions with HDFS.
From the website:
Apache HBase is the Hadoop database, a distributed, scalable, big data store.
When Would I Use Apache HBase?
Use Apache HBase when you need random, realtime read/write access to
your Big Data. This project's goal is the hosting of very large tables --
billions of rows X millions of columns -- atop clusters of commodity
hardware. Apache HBase is an open-source, distributed, versioned, non-
relational database modeled after Google's Bigtable: A Distributed Storage
System for Structured Data by Chang et al. Just as Bigtable leverages the
distributed data storage provided by the Google File System, Apache HBase
provides Bigtable-like capabilities on top of Hadoop and HDFS.
Python clients
Starbase or Happybase
Uses the HBase Thrift gateway interface (slow)
Last commit 6 months ago
Appears to be fully featured
Not really there yet and have failed to gain community momentum. Java is still
king.
From the website:
Apache Spark is a fast and general-purpose cluster
computing system. It provides high-level APIs in Scala,
Java, and Python that make parallel jobs easy to write,
and an optimized engine that supports general computation
graphs. It also supports a rich set of higher-level tools
including Shark (Hive on Spark), MLlib for machine
learning, GraphX for graph processing, and Spark
Streaming.
In general, Spark is faster than MapReduce and
easier to write than MapReduce
PySpark
• Spark’s native language is Scala, but it also supports Java
and Python
• Python API is always a tad behind Scala
• Programming in Spark (and PySpark) is in the form of
chaining transformations and actions on RDDs
• RDDs are “Resilient Distributed Datasets”
• RDDs are kept in memory for the most part
PySpark Word Count Example
import sys
from operator import add
from pyspark import SparkContext
if __name__ == "__main__":
if len(sys.argv) != 2:
print >> sys.stderr, "Usage: wordcount <file>"
exit(-1)
sc = SparkContext(appName="PythonWordCount")
lines = sc.textFile(sys.argv[1], 1)
counts = lines.flatMap(lambda x: x.split(' ')) 
.map(lambda x: (x, 1)) 
.reduceByKey(add)
output = counts.collect()
for (word, count) in output:
print "%s: %i" % (word, count)
sc.stop()
HADOOP WITH PYTHON
Donald Miner
@donaldpminer
DC Python Meetup
3/10/15

More Related Content

What's hot

Stateful stream processing with Apache Flink
Stateful stream processing with Apache FlinkStateful stream processing with Apache Flink
Stateful stream processing with Apache FlinkKnoldus Inc.
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Data Streaming For Big Data
Data Streaming For Big DataData Streaming For Big Data
Data Streaming For Big DataSeval Çapraz
 
MySQL Hash Table
MySQL Hash TableMySQL Hash Table
MySQL Hash TableMIJIN AN
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Jay Patel
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Introduction to Big Data and hadoop
Introduction to Big Data and hadoopIntroduction to Big Data and hadoop
Introduction to Big Data and hadoopSandeep Patil
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Building Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBuilding Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBryan Bende
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Kai Wähner
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...Databricks
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 

What's hot (20)

Stateful stream processing with Apache Flink
Stateful stream processing with Apache FlinkStateful stream processing with Apache Flink
Stateful stream processing with Apache Flink
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Data Streaming For Big Data
Data Streaming For Big DataData Streaming For Big Data
Data Streaming For Big Data
 
MySQL Hash Table
MySQL Hash TableMySQL Hash Table
MySQL Hash Table
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Introduction to Big Data and hadoop
Introduction to Big Data and hadoopIntroduction to Big Data and hadoop
Introduction to Big Data and hadoop
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Building Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBuilding Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFi
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 

Similar to Hadoop with Python

Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsDataWorks Summit
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerCodemotion
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 

Similar to Hadoop with Python (20)

Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worlds
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Data Science
Data ScienceData Science
Data Science
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe Seiler
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 

More from Donald Miner

Machine Learning Vital Signs
Machine Learning Vital SignsMachine Learning Vital Signs
Machine Learning Vital SignsDonald Miner
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about HadoopDonald Miner
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MDDonald Miner
 
Survey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing DataSurvey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing DataDonald Miner
 
An Introduction to Accumulo
An Introduction to AccumuloAn Introduction to Accumulo
An Introduction to AccumuloDonald Miner
 
Data, The New Currency
Data, The New CurrencyData, The New Currency
Data, The New CurrencyDonald Miner
 
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest Donald Miner
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 

More from Donald Miner (11)

Machine Learning Vital Signs
Machine Learning Vital SignsMachine Learning Vital Signs
Machine Learning Vital Signs
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MD
 
Survey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing DataSurvey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing Data
 
An Introduction to Accumulo
An Introduction to AccumuloAn Introduction to Accumulo
An Introduction to Accumulo
 
SQL on Accumulo
SQL on AccumuloSQL on Accumulo
SQL on Accumulo
 
Data, The New Currency
Data, The New CurrencyData, The New Currency
Data, The New Currency
 
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 

Recently uploaded

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 

Recently uploaded (20)

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 

Hadoop with Python

  • 1. HADOOP WITH PYTHON Donald Miner @donaldpminer DC Python Meetup 3/10/15
  • 2. Agenda • Introduction to Hadoop • MapReduce with mrjob • Pig with Python UDFs • snakebite for HDFS • HBase and python clients • Spark and PySpark
  • 3. Hadoop Distributed File System (HDFS) • Stores files in folders (that’s it) • Nobody cares what’s in your files • Chunks large files into blocks (~64MB-2GB) • 3 replicas of each block (better safe than sorry) • Blocks are scattered all over the place FILE BLOCKS
  • 4. MapReduce • Analyzes raw data in HDFS where the data is • Jobs are split into Mappers and Reducers Reducers (you code this, too) Automatically Groups by the mapper’s output key Aggregate, count, statistics Outputs to HDFS Mappers (you code this) Loads data from HDFS Filter, transform, parse Outputs (key, value) pairs
  • 5. Hadoop Ecosystem • Higher-level languages like Pig and Hive • HDFS Data systems like HBase and Accumulo • Alternative execution engines like Storm and Spark • Close friends like ZooKeeper, Flume, Avro, Kafka
  • 6. Cool Thing #1: Linear Scalability • HDFS and MapReduce scale linearly • If you have twice as many computers, jobs run twice as fast • If you have twice as much data, jobs run twice as slow • If you have twice as many computers, you can store twice as much data DATA LOCALITY!!
  • 7. Cool Thing #2: Schema on Read LOAD DATA FIRST, ASK QUESTIONS LATER Data is parsed/interpreted as it is loaded out of HDFS What implications does this have? BEFORE: ETL, schema design upfront, tossing out original data, comprehensive data study Keep original data around! Have multiple views of the same data! Work with unstructured data sooner! Store first, figure out what to do with it later! WITH HADOOP:
  • 8. Cool Thing #3: Transparent Parallelism Network programming? Inter-process communication? Threading? Distributed stuff? With MapReduce, I DON’T CARE Your solution … I just have to be sure my solution fits into this tiny box Fault tolerance? Code deployment? RPC? Message passing? Locking? MapReduce Framework Data storage? Scalability? Data center fires?
  • 9. Cool Thing #4: Unstructured Data • Unstructured data: media, text, forms, log data lumped structured data • Query languages like SQL and Pig assume some sort of “structure” • MapReduce is just Java: You can do anything Java can do in a Mapper or Reducer
  • 10. Why Python? • Python vs. Java • Compiled vs. scripts • Python libraries we all love • Integration with other things
  • 11. Why Not? • Python vs. Java • Almost nothing is native • Performance • Being out of date • Being “weird” • Smaller community, almost no official support
  • 13. mrjob • Write MapReduce jobs in Python! • Open sourced and maintained by Yelp • Wraps “Hadoop Streaming” in cpython Python 2.5+ • Well documented • Can run locally, in Amazon EMR, or Hadoop
  • 14. Canonical Word Count from mrjob.job import MRJob import re WORD_RE = re.compile(r"[w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run()
  • 15. Canonical Word Count from mrjob.job import MRJob import re WORD_RE = re.compile(r"[w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run() The quick brown fox jumps over the lazy dog the, 1 quick, 1 brown, 1 fox, 1 jumps, 1 over, 1 the, 1 lazy, 1 dog, 1
  • 16. Canonical Word Count from mrjob.job import MRJob import re WORD_RE = re.compile(r"[w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run() I like this Hadoop thing i, 1 like, 1 this, 1 hadoop, 1 thing, 1
  • 17. Canonical Word Count from mrjob.job import MRJob import re WORD_RE = re.compile(r"[w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run() dog, [1, 1, 1, 1, 1, 1] dog, 6
  • 18. Canonical Word Count from mrjob.job import MRJob import re WORD_RE = re.compile(r"[w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run() cat, [1, 1, 1, 1, 1, 1, 1, 1] cat, 8
  • 20. Other options http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/ Hadoop Streaming – More manual but faster Hadoopy, Dumbo, haven’t seen commits in years, mrjob in the past 12 hours Pydoop is main competitor (not in this list)
  • 21. Pydoop • Write MapReduce jobs in Python! • Uses Hadoop C++ Pipes, which should be faster than wrapping streaming • Actively being worked on • I’m not sure which is better
  • 22. Pydoop Word Count with open('stop.txt') as f: STOP_WORDS = set(l.strip() for l in f if not l.isspace()) def mapper(_, v, writer): for word in v.split(): if word in STOP_WORDS: writer.count("STOP_WORDS", 1) else: writer.emit(word, 1) def reducer(word, icounts, writer): writer.emit(word, sum(map(int, icounts))) $ pydoop script wc.py hdfs_input hdfs_output --upload- file-to-cache stop.txt
  • 23. Pig • Pig is a higher-level platform and language for analyzing data that happens to run MapReduce underneath a = LOAD ’inputdata.txt’; b = FOREACH a GENERATE FLATTEN(TOKENIZE((chararray)$0)) as word; c = GROUP b BY word; d = FOREACH c GENERATE group, COUNT(c); STORE d INTO ‘wc';
  • 24. Pig UDFs Users can write user-defined functions to extend the functionality of Pig Can use jython (faster) or cpython (access to more libs) b = FOREACH a GENERATE revster(phonenum); ... m = GROUP j BY username; n = FOREACH m GENERATE group, sortedconcat(j.tags); @outputSchema(“tags:chararray") def sortedconcat(bag): out = set() for tag in bag: out.add(tag) return ‘-’.join(sorted(out)) @outputSchema(“rev:chararray") def revstr(instr): return instr[::-1]
  • 25. • A pure Python client • Handles most NameNode ops (moving/renaming files, deleting files) • Handles most DataNode reading ops (reading files, getmerge) • Doesn’t handle writing to DataNodes yet • Two ways to use: library and command line interface
  • 26. - Library from snakebite.client import Client client = Client(”1.2.3.4", 54310, use_trash=False) for x in client.ls(['/data']): print x print ‘’.join(client.cat(‘/data/ref/refdata*.csv’)) Useful for doing HDFS file manipulation in data flows or job setups Can be used to read reference data from MapReduce jobs
  • 27. - CLI $ snakebite get /path/in/hdfs/mydata.txt /local/path/data.txt $ snakebite rm /path/in/hdfs/mydata.txt $ for fp in `snakebite ls /data/new/`; do snakebite mv “/data/new/$fp” “/data/in/`date ‘+%Y/%m/%d/’$fp done The “hadoop” CLI client is written in Java and spins up a new JVM every time (1-3 sec) Snakebite doesn’t have that problem, making it good for lots of programmatic interactions with HDFS.
  • 28. From the website: Apache HBase is the Hadoop database, a distributed, scalable, big data store. When Would I Use Apache HBase? Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non- relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
  • 29. Python clients Starbase or Happybase Uses the HBase Thrift gateway interface (slow) Last commit 6 months ago Appears to be fully featured Not really there yet and have failed to gain community momentum. Java is still king.
  • 30. From the website: Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, GraphX for graph processing, and Spark Streaming. In general, Spark is faster than MapReduce and easier to write than MapReduce
  • 31. PySpark • Spark’s native language is Scala, but it also supports Java and Python • Python API is always a tad behind Scala • Programming in Spark (and PySpark) is in the form of chaining transformations and actions on RDDs • RDDs are “Resilient Distributed Datasets” • RDDs are kept in memory for the most part
  • 32. PySpark Word Count Example import sys from operator import add from pyspark import SparkContext if __name__ == "__main__": if len(sys.argv) != 2: print >> sys.stderr, "Usage: wordcount <file>" exit(-1) sc = SparkContext(appName="PythonWordCount") lines = sc.textFile(sys.argv[1], 1) counts = lines.flatMap(lambda x: x.split(' ')) .map(lambda x: (x, 1)) .reduceByKey(add) output = counts.collect() for (word, count) in output: print "%s: %i" % (word, count) sc.stop()
  • 33. HADOOP WITH PYTHON Donald Miner @donaldpminer DC Python Meetup 3/10/15

Editor's Notes

  1. Donald Miner will do a quick introduction to Apache Hadoop, then discuss the different ways Python can be used to get the job done in Hadoop. This includes writing MapReduce jobs in Python in various different ways, interacting with HBase, writing custom behavior in Pig and Hive, interacting with the Hadoop Distributed File System, using Spark, and integration with other corners of the Hadoop ecosystem. The state of Python with Hadoop is far from stable, so we'll spend some honest time talking about the state of these open source projects and what's missing will also be discussed.
  2. Donald Miner will do a quick introduction to Apache Hadoop, then discuss the different ways Python can be used to get the job done in Hadoop. This includes writing MapReduce jobs in Python in various different ways, interacting with HBase, writing custom behavior in Pig and Hive, interacting with the Hadoop Distributed File System, using Spark, and integration with other corners of the Hadoop ecosystem. The state of Python with Hadoop is far from stable, so we'll spend some honest time talking about the state of these open source projects and what's missing will also be discussed.