Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4, and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible compression ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
Compression Options in Hadoop - A Tale of Tradeoffs
1. Compression Options in Hadoop – A Tale of Tradeoffs
Govind Kamat, Sumeet Singh
Hadoop Summit (San Jose), June 27, 2013
2. Introduction

Sumeet Singh
Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
Leads the Hadoop products team at Yahoo!
Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!

Govind Kamat
Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
Member of Technical Staff in the Hadoop Services team at Yahoo!
Focuses on HBase and Hadoop performance
Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology, and electronic design
3. Agenda

1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work
4. Compression Needs and Tradeoffs in Hadoop

The Compression Tradeoff: storage, disk I/O, network bandwidth, and CPU time

Hadoop jobs are data-intensive; compressing data can speed up I/O operations
MapReduce jobs are almost always I/O bound
Compressed data can save storage space and speed up data transfers across the network
Capital allocation for hardware can go further
Reduced I/O and network load can bring significant performance improvements
MapReduce jobs can finish faster overall
On the other hand, CPU utilization and processing time increase during compression and decompression
Understanding the tradeoffs is important for the MapReduce pipeline's overall performance
5. Data Compression in Hadoop's MR Pipeline

[Diagram (source: Hadoop: The Definitive Guide, Tom White): input splits feed the Map phase; map output is buffered in memory, partitioned and sorted, spilled and merged on disk, fetched by reducers from the various maps during the shuffle, merged and sorted again, and processed by Reduce into the final output. Compression applies at three points: (1) compressed input is decompressed by the mapper, (2) mapper output is compressed for the shuffle and sort and decompressed as reducer input, (3) reducer output is compressed.]
6. Compression Options in Hadoop (1/2)

Format | Algorithm | Strategy | Emphasis | Comments
------ | --------- | -------- | -------- | --------
zlib | DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib | Codec operates on and produces standard gzip files; for data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy
7. Compression Options in Hadoop (2/2)

Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
------ | ---------------------------------------- | ---------- | ---------- | -----------
zlib/DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y/Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y/Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y/Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N/Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N/Y
NOTES:
Splittability – bzip2 is "splittable": it can be decompressed in parallel by multiple MapReduce tasks. The other algorithms require all blocks together for decompression by a single MapReduce task.
LZO – removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported, and the codec can be downloaded separately and enabled manually.
Native bzip2 codec – added by Yahoo! as part of this work in Hadoop 0.23.
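
Hadoop resolves the decompression codec for job input by matching these file extensions through its CompressionCodecFactory. A minimal sketch of that lookup, assuming a hypothetical bzip2-compressed input path:

```java
// Minimal sketch: resolve a codec from a file extension, the same mechanism
// MapReduce uses to decompress job input automatically. The path is hypothetical.
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path input = new Path("/data/logs/part-00000.bz2");   // hypothetical path
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(input);     // matched by the ".bz2" extension

    if (codec == null) {
      System.out.println("No codec matched; reading " + input + " uncompressed");
    } else {
      // Wrap the raw file stream with the codec's decompressing stream
      try (InputStream in = codec.createInputStream(fs.open(input))) {
        System.out.println("Decompressing with " + codec.getClass().getSimpleName());
        // ... consume decompressed bytes ...
      }
    }
  }
}
```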
8. Space-Time Tradeoff of Compression Options

[Scatter plot: Codec Performance on the Wikipedia Text Corpus. X-axis: space savings (40%–75%); Y-axis: CPU time in seconds (compress + decompress). Data points: bzip2 at 71% savings, 60.0 s; zlib (Deflate, gzip) at 64%, 32.3 s; LZO at 47%, 4.8 s; Snappy at 42%, 4.0 s; LZ4 at 44%, 2.4 s. Codecs toward the right of the plot offer high compression ratios; codecs toward the bottom offer high compression speeds.]

Note:
A 265 MB corpus from Wikipedia was used for the performance comparisons.
Space savings is defined as [1 – (Compressed/Uncompressed)].
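
As a worked example of the space-savings formula: bzip2's 71% figure implies the 265 MB corpus shrank to roughly 0.29 × 265 ≈ 77 MB, since 1 – (77/265) ≈ 0.71.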
9. Using Data Compression in Hadoop

Phase in MR Pipeline | Config | Values
-------------------- | ------ | ------
1. Input data to Map | File extension recognized automatically for decompression | File extensions for supported formats
1. Input data to Map | Codec | One of the supported codecs, as defined in io.compression.codecs
2. Intermediate (Map) Output | mapreduce.map.output.compress | false (default), true
2. Intermediate (Map) Output | mapreduce.map.output.compress.codec | One defined in io.compression.codecs
3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default), true
3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.codec | One defined in io.compression.codecs
3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.type | Type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK

Note: For SequenceFile, the headers carry the compression information: compression (boolean), block compression (boolean), and the compression codec.
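
These knobs can also be set programmatically when a job is submitted. A minimal sketch, choosing Snappy for the intermediate output and gzip for the final output purely for illustration (the class and job names are invented; only the property names come from the table above):

```java
// Illustrative job setup: Snappy for intermediate (map) output and gzip for
// the final (reduce) output. Class and job names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigDemo {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();

    // 2: Intermediate (Map) output
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compression-demo");

    // 3: Final (Reduce) output; equivalent to setting the
    // mapreduce.output.fileoutputformat.compress[.codec] properties
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    return job;
  }
}
```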
10. When to Use Compression and Which Codec

Input data to Map (1):
Compress the input data, if large
Use a splittable algorithm such as bzip2, or use zlib with the SequenceFile format (see the sketch after this list)

Intermediate (Map) Output (2):
Always use compression, particularly if there is spillage or network transfers are slow
Use faster codecs such as LZO, LZ4, or Snappy

Final (Reduce) Output (3):
Compress for storage/archival, better write speeds, or between MR jobs
Use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs
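
One way to follow the zlib-with-SequenceFile advice is to write block-compressed SequenceFiles, which remain splittable even though raw .deflate or .gz files are not. A minimal sketch with a hypothetical output path:

```java
// Minimal sketch: a block-compressed SequenceFile using the zlib-based
// DefaultCodec. The output path and key/value types are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SeqFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/demo.seq");   // hypothetical path

    SequenceFile.Writer writer = null;
    try {
      // BLOCK compression batches many records per compressed block,
      // which preserves splittability for downstream MapReduce jobs
      writer = SequenceFile.createWriter(fs, conf, out,
          LongWritable.class, Text.class,
          SequenceFile.CompressionType.BLOCK, new DefaultCodec());
      writer.append(new LongWritable(1), new Text("hello, compressed world"));
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
```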
11. Compression in the Hadoop Ecosystem

Component | When to Use | What to Use
--------- | ----------- | -----------
Pig | Compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size | Enable compression and select the codec: pig.tmpfilecompression = true; pig.tmpfilecompression.codec = gzip, lzo
Hive | Intermediate files produced by Hive between multiple map-reduce jobs; Hive writes output to a table | Enable intermediate or output compression: hive.exec.compress.intermediate = true; hive.exec.compress.output = true
HBase | Compress data at the column-family level (support for LZO, gzip, Snappy, and LZ4) | List required JNI libraries: hbase.regionserver.codecs. Enable compression: create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' } or alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
12. Compression in Hadoop at Yahoo!

[Three pie charts summarizing compression usage across the MapReduce pipeline at Yahoo!:

Input data to Map (380M files on Jun 16, 2013, under /data and /projects): 2% compressed, 98% uncompressed; codec share among compressed files: zlib/default 73%, gzip 22%, bzip2 4%, LZO 1%.

Intermediate (Map) Output (4.2M jobs, Jun 10–16, 2013): 99.8% compressed, 0.2% uncompressed; codec share: LZO 98.3%, gzip 1.1%, zlib/default 0.5%, bzip2 0.1%. Pig intermediate output is compressed.

Final Reduce Output (4.2M jobs, Jun 10–16, 2013): 39.0% compressed, 61.0% uncompressed; codec share: LZO 55%, gzip 35%, bzip2 5%, zlib/default 5%. Includes intermediate Pig/Hive compression.]
13. Compression for Data Storage Efficiency

DSE considerations at Yahoo!:
RCFile instead of SequenceFile
Faster implementation of bzip2
Native-code bzip2 codec (HADOOP-8462 [1]), available in 0.23.7
Substituting the IPP library

[1] Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member
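
The native codec can be selected per job. A hedged sketch, assuming the io.compression.codec.bzip2.library property introduced with HADOOP-8462 for switching between the built-in Java and system-native implementations (verify the key against your Hadoop version):

```java
// Hedged sketch: opt into the system-native bzip2 implementation.
// The property name is an assumption based on HADOOP-8462; confirm it
// exists in your Hadoop release before relying on it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;

public class NativeBzip2Demo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // "system-native" loads libbz2 through JNI; "java-builtin" forces pure Java
    conf.set("io.compression.codec.bzip2.library", "system-native");

    BZip2Codec codec = new BZip2Codec();
    codec.setConf(conf);
    System.out.println("bzip2 codec configured, extension: " + codec.getDefaultExtension());
  }
}
```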
14. IPP Libraries
Integrated Performance Primitives from Intel
Algorithmic and architectural optimizations
Processor-specific variants of each function
Applications remain processor-neutral
Compression: LZ, RLE, BWT, LZO
High-level formats include zlib, gzip, bzip2, and LZO
15. Measuring Standalone Performance
Standard programs (gzip, bzip2) used
Driver program written for other cases
32-bit mode
Single-threaded
JVM load overhead discounted
Default compression level
Quad-core Xeon machine
16. Data Corpuses Used
Binary files
Generated text from randomtextwriter
Wikipedia corpus
Silesia corpus
26. Future Work
Splittability support for native-code bzip2 codec
Enhancing Pig to use common bzip2 codec
Optimizing the JNI interface and buffer copies
Varying the compression effort parameter
Performance evaluation for 64-bit mode
Updating the zlib codec to specify alternative libraries
Other codec combinations, such as zlib for transient data
Other compression algorithms
27. Considerations in Selecting Compression Type
Nature of the data set
Chained jobs
Data-storage efficiency requirements
Frequency of compression vs. decompression
Requirement for compatibility with a standard data format
Splittability requirements
Size of the intermediate and final data
Alternative implementations of compression libraries
Editor's Notes
Time: 30 sec (Total: 30 sec) Good afternoon, and welcome to "Compression Options in Hadoop – A Tale of Tradeoffs". Before we get started, can I have a show of hands: how many of you use compression in Hadoop or are familiar with the compression options in Hadoop? Well, quite a few of you. I hope you will be able to augment your current understanding and gather further insights into compression in Hadoop from our talk.
Time: 30 sec (Total: 1 min) Before we start, let us introduce ourselves. Govind, who is going to talk about his compression work later in this talk, is a software developer in the Hadoop team at Yahoo!, where he focuses on HBase and Hadoop performance. My name is Sumeet Singh, and I lead the Hadoop products team at Yahoo!.
Time: 1 min (Total: 2 min) We have organized the talk into two broad groups. I will cover the first three topics: how data compression works in Hadoop, the available compression options in Hadoop, and understanding these options in terms of when and how to use them. Govind will then cover the compression problem he had at hand, results from his performance work on bzip2 and Intel's IPP libraries, and some of the future considerations in this area.
Time: 2 min (Total: 4 min) Given that Hadoop was designed to operate on large volumes of data, lossless data compression (the type of compression we deal with in Hadoop) is quite integral to MapReduce jobs. Compression provides several benefits. MR jobs are almost always I/O bound, and compressing data can speed up the I/O operations that are more often than not the performance bottleneck. You can do more with less with compression turned on, i.e., improve cluster utilization through space savings and faster data transfers across the network, since you send less data. This is particularly true as Hadoop uses 3x data replication by default for fault tolerance. As a user, you can improve your overall job performance, and your jobs may take less time to complete. Compression does not come for free, though. These benefits come at the cost of increased CPU utilization for compressing and decompressing data. So using compression presents a tradeoff between storage savings, faster I/O, and better use of network bandwidth on the one hand, and increased CPU load on the other. But given the nature of Hadoop, using compression generally turns out to be a good tradeoff to make.
Time: 2 min (Total: 6 min) Let's look at the compression and decompression process in a MapReduce pipeline. More often than not, you are dealing with large datasets that arrive compressed for processing. The Map phase detects a supported compression format (which we will talk about in detail here) and decompresses the data for processing. Map outputs to the reducers are sorted on their keys; the process of sorting and transferring data to the reduce phase is called the shuffle. Map writes are buffered in memory and spilled to disk when the buffer fills. (Data is partitioned before the spills according to the reducer it needs to go to.) Several spill files may be generated as part of this process, which are then merged on disk to create a bigger partition. The data written to disk can again be compressed. The reducer may need data from several map tasks on other nodes, and transferring this data compressed helps. The reduce phase merges the output from the map tasks, either in memory or in a combination of memory and disk, and feeds it to the reducer for further processing, after which the data is written to HDFS (and can be written compressed). As you can clearly see, compression is integral to a MapReduce pipeline and can have a significant impact on the job's overall performance.
Time: 2 min (Total: 8 min) Given the importance of compression in Hadoop, it provides support for multiple codecs as a result. This class of algorithms is called lossless, as the data needs to be precisely regenerated from the compressed form. There are six options available to you in the Hadoop 0.23 version that we run at Yahoo! right now. Compression algorithms operate by finding and eliminating redundancy and duplication in data; thus, truly random data can never be compressed. Compression strategies generally have three phases: a preprocessing or transform phase, followed by duplicate elimination, and finally a phase that focuses on bit reduction. The algorithms used in each compression format vary and have a strong impact on the efficacy and speed of compression.
Time: 2 min (Total: 10 min) Splittability was not there initially. The SequenceFile format was designed to tackle the splittability issue (it is aware of keys and values), whereas compression cares only about byte streams. Pig had the same problem and added split capability to its compression: it made a copy of the bzip2 code and added split capability there. Later on, Hadoop added split support and made it work for bzip2. Split capability could also be added to block-oriented compression algorithms such as LZO, Snappy, and LZ4.
Time: 1 min (Total: 11 min)
Time: 1 min (Total: 12 min)Try to add Yahoo! numbers here.
Time: 2 min (Total: 14 min)Input data is large (Govind to provide a rule of thumb)Intermediate (Spillage is one, network transfers are slow)Final (space, speed of writes, chained MR)I/P – O/P once that gives you better space savings | compression ratios such as Zlib or Bzip2.Intermediate (LZO type compression, faster codecs)
Time: 2 min (Total: 16 min)
Time: 1 min (Total: 17 min)
Circle around zlib and bzip2. Circle around LZO, Snappy, and LZ4.
Circle around zlib and IPP-zlib. Circle around bzip2 and IPP-bzip2.