Text compression in LZW and Flate

By
Subeer Rangra
(08EBKCS059)
&
Mukul Ranjan
(08EBKCS029)

Index
1. Introduction to Data Compression
2. Introduction to Text Compression
3. LZW
3.1 LZW Encoding Algorithm
3.2 Encoding a String Example
3.3 LZW Decoding Algorithm
3.4 Decoding a String Example
4. Flate Compression
4.1 Decomposition
4.1.1 Huffman Coding
4.1.2 LZ77 Compression
4.1.3 Putting both together
5. Advantages and Disadvantages
5.1 LZW
5.2 Flate
6. Conclusion

1. Introduction to Data Compression
 Encoding information using fewer bits than the
original representation.
 Data compression is achieved when redundancies are
reduced or eliminated.
 Lossless compression: no information is lost.

 Lossy compression: some information is lost.

 Compression reduces the data storage space.

Introduction to Data Compression (contd.)
 Reduces transmission time needed over the network.

 Data must be decompressed or decoded to be reused.

 Can be symmetrical (compression and decompression take
similar effort) or asymmetrical.

 Can be implemented in software or hardware.

2. Introduction to Text Compression
 The compression of text-based data.

 Text and image compression differ in a major way: text
must be restored exactly, while images can tolerate some loss.

 Databases, binary programs and text are on the lossless
side; sound, image and video signals can use lossy methods.

 Text compression needs lossless compression.

 Needed in literary works, product catalogues, genomic
databases, raw text databases.

3. LZW (Lempel-Ziv-Welch)
 Starts with a dictionary of all the single characters and gradually
builds the dictionary as the information is sent through.

 Lossless compression, hence it works well for text compression.

 A dictionary or code table based encoding algorithm.

 Uses a code table; 4096 entries (12-bit codes) is a common
choice for its size.

 It tries to identify repeated sequences of data and adds them to
the code table.

LZW (Lempel-Ziv-Welch)….contd.
 A general compression algorithm capable of working
on almost any type of data.

 Large text files in the English language can typically
be compressed to about half their size.

 Used in GIF (Graphics Interchange Format) to reduce
the size without degrading the visual quality.

3.1 LZW Encoding Algorithm
STRING = get input character
WHILE not end of input stream DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table THEN
        STRING = STRING+CHARACTER
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING
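
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name lzw_encode and the default alphabet (# = 0, A to Z = 1 to 26) are assumptions chosen for this example, and the codes are returned as plain integers without any bit packing.

```python
import string


def lzw_encode(text, alphabet="#" + string.ascii_uppercase):
    """Return the list of LZW codes for text (integers, no bit packing)."""
    # Initial string table: one code per single character, in alphabet order.
    table = {ch: code for code, ch in enumerate(alphabet)}
    if not text:
        return []
    codes = []
    current = text[0]                  # STRING = get input character
    for ch in text[1:]:                # WHILE not end of input stream
        if current + ch in table:
            current += ch              # keep extending the known sequence
        else:
            codes.append(table[current])       # output the code for STRING
            table[current + ch] = len(table)   # next free code
            current = ch
    codes.append(table[current])       # output the code for the final STRING
    return codes


print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))
```

Every character of the input must belong to the chosen alphabet; a byte-oriented encoder would normally start from all 256 byte values instead.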

3.2 Encoding a String example
 To encode a string of characters
1. First generate an initial dictionary of single characters

Symbol Binary Decimal
# 00000 0
A 00001 1
B 00010 2
C 00011 3
D 00100 4
E 00101 5
... and so on, up to
Z 11010 26
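
As a small sketch, the initial dictionary above could be generated like this. The variable names alphabet and dictionary are illustrative only; the 5-bit width follows from having 27 symbols (codes 0 to 26).

```python
import string

# "#" is the stop symbol (code 0); A..Z take codes 1..26.
alphabet = "#" + string.ascii_uppercase
dictionary = {symbol: code for code, symbol in enumerate(alphabet)}

# Print the first rows in the same Symbol / Binary / Decimal layout.
for symbol in "#ABCDE":
    code = dictionary[symbol]
    print(symbol, format(code, "05b"), code)
```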

Encoding a String Example (contd.)
2. Example: TOBEORNOTTOBEORTOBEORNOT, followed by the stop symbol #

Current   Next  Output         Extended
Sequence  Char  Code  Bits     Dictionary  Comments
NULL      T
T         O     20    10100    27: TO      27 = first available code after 0 through 26
O         B     15    01111    28: OB
B         E     2     00010    29: BE
E         O     5     00101    30: EO
O         R     15    01111    31: OR
R         N     18    10010    32: RN      32 requires 6 bits, so for next output use 6 bits
N         O     14    001110   33: NO
O         T     15    001111   34: OT
T         T     20    010100   35: TT
TO        B     27    011011   36: TOB
BE        O     29    011101   37: BEO
OR        T     31    011111   38: ORT
TOB       E     36    100100   39: TOBE
EO        R     30    011110   40: EOR
RN        O     32    100000   41: RNO
OT        #     34    100010               # stops the algorithm; send the cur seq
                0     000000               and the stop code
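
The switch from 5-bit to 6-bit codes in the table can be illustrated with a short sketch. It assumes, as a simplification, that the dictionary gains exactly one entry per emitted code, so the width of each output is the number of bits needed for the largest code defined so far. The function name pack_codes and the parameter first_free_code are hypothetical.

```python
def pack_codes(codes, first_free_code=27):
    """Pack LZW codes into a bit string using a growing code width."""
    next_code = first_free_code   # next dictionary entry to be created
    pieces = []
    for code in codes:
        # Width needed for the largest code that exists right now.
        width = (next_code - 1).bit_length()
        pieces.append(format(code, "0{}b".format(width)))
        next_code += 1            # assumption: one new entry per output
    return "".join(pieces)


# Output codes from the table above, ending with the stop code 0.
codes = [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0]
bitstream = pack_codes(codes)
print(len(bitstream), "bits")
```

For this message the packed output is 96 bits, against 125 bits for 25 symbols sent as fixed 5-bit codes.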

3.3 LZW Decoding Algorithm
Read OLD_CODE
output the translation of OLD_CODE
CHARACTER = translation of OLD_CODE
WHILE there are still input codes DO
    Read NEW_CODE
    IF NEW_CODE is not in the translation table THEN
        STRING = get translation of OLD_CODE
        STRING = STRING+CHARACTER
    ELSE
        STRING = get translation of NEW_CODE
    END of IF
    output STRING
    CHARACTER = first character in STRING
    add (translation of OLD_CODE) + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
END of WHILE
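
A possible Python rendering of the decoding pseudocode is sketched below. The function name lzw_decode and the default alphabet (# = 0, A to Z = 1 to 26) are assumptions for this example; the input is a list of integer codes rather than a packed bit stream.

```python
import string


def lzw_decode(codes, alphabet="#" + string.ascii_uppercase):
    """Rebuild the text from a list of LZW codes."""
    # Initial translation table: the same single characters the encoder used.
    table = {code: ch for code, ch in enumerate(alphabet)}
    if not codes:
        return ""
    old = codes[0]
    result = [table[old]]
    character = table[old]
    for new in codes[1:]:
        if new not in table:
            # The code is not defined yet: it must be the entry that is
            # about to be created, i.e. old string + its first character.
            entry = table[old] + character
        else:
            entry = table[new]
        result.append(entry)
        character = entry[0]
        table[len(table)] = table[old] + character   # next free code
        old = new
    return "".join(result)


codes = [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34]
print(lzw_decode(codes))
```

The "not in the translation table" branch is needed for inputs such as ABABABA, where the encoder emits a code in the very step that creates it.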

3.4 Decoding a String Example
 To decode an LZW-compressed archive, one needs to know
in advance the initial dictionary used, but additional
entries can be reconstructed as they are always simply
concatenations of previous entries.
Input          Output    New Dictionary Entry
Bits    Code   Sequence  Full     Conjecture  Comments
10100   20     T                  27: T?
01111   15     O         27: TO   28: O?
00010   2      B         28: OB   29: B?
00101   5      E         29: BE   30: E?
01111   15     O         30: EO   31: O?
10010   18     R         31: OR   32: R?      created code 31 (last to fit in 5 bits)
001110  14     N         32: RN   33: N?      so start reading input at 6 bits

4. Flate Compression
 A lossless data compression method.
 Can discover and exploit many patterns in the input
data.
 An improvement over LZW compression: Flate-encoded
data is usually much more compact than LZW-encoded
output.
 It was originally defined by Phil Katz for version 2 of
his PKZIP archiving tool and was later specified in RFC
1951.
 Used in PDF compression: Adobe's PDF format supports
Flate compression for the data stored in PDF files.
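
As a quick, hedged illustration, Python's standard zlib module can produce and read raw Flate (DEFLATE) data when a negative wbits value is used. The sample text is made up for the example, and the size printed will depend on the zlib version.

```python
import zlib

# Highly repetitive sample text (an assumption for this example).
text = b"TOBEORNOT" * 200

# wbits=-15 selects a raw DEFLATE stream without the zlib header/trailer.
compressor = zlib.compressobj(level=9, wbits=-15)
compressed = compressor.compress(text) + compressor.flush()

# Lossless: decompression must give back exactly the original bytes.
restored = zlib.decompress(compressed, -15)

print(len(text), "bytes ->", len(compressed), "bytes")
```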

4.1 Decomposition
 The Flate specification defines a lossless data format that
compresses data using a combination of the LZ77 algorithm
and Huffman coding.
 The format can be implemented readily in a manner not
covered by patents.
 The two algorithms are explained below, followed by the
way they combine to produce Flate compression.

4.1.1 Huffman Coding
 A type of entropy encoding algorithm.

 Used for lossless data compression.

 Can be used to generate variable-length codes.

 The variable length codes are generated based on the
frequency of the occurrence of the characters.
 The idea is to assign the shortest code to the character
with the highest probability of occurrence.

Huffman Coding…. contd.
 The algorithm starts by assigning each element a
'weight', a number that represents its relative
frequency within the data to be compressed.
Take, for example, the set of weights {1, 2, 3, 3, 4}:

1. The weights are assigned to the leaves of the
Huffman tree to be formed.

2. In the first step, the two nodes with the lowest
weights (lowest probability, hence highest priority), 1
and 2, are merged to create a new tree with a root of
weight 3.

3. Three trees now have weight 3 at their roots, and any
two of them may be merged. Here the newly created tree
is merged with one of the singleton nodes of weight 3,
giving a tree of weight 6.

4. The two minimum trees are now the singleton nodes of
weights 3 and 4. They are combined to form a new tree
of weight 7.

5. Finally, the last two remaining trees (weights 6 and
7) are merged into a single tree of weight 13.

 When all nodes have been combined into a single
Huffman tree, any element can be reached by starting
at the root and selecting 0 or 1 at each step.
 Each element now has a Huffman code, which is the
sequence of 0's and 1's that represents its path
through the tree.
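
The merging procedure can be sketched with a priority queue, as below. This is one possible construction, not the only valid one: the symbol names a to e are invented for the weights {1, 2, 3, 3, 4}, and ties between equal weights are broken by insertion order, so the tree may differ in shape from the one described above while giving the same total encoded length.

```python
import heapq
from itertools import count


def huffman_codes(weights):
    """Map each symbol to a bit string, given a dict of symbol -> weight.

    Symbols must not be tuples, since tuples mark internal tree nodes.
    """
    tiebreak = count()   # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    if len(heap) == 1:   # a single symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two trees with the smallest root weights.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: 0 = left, 1 = right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                             # leaf: record the path as the code
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


weights = {"a": 1, "b": 2, "c": 3, "d": 3, "e": 4}
codes = huffman_codes(weights)
total_bits = sum(weights[s] * len(codes[s]) for s in weights)
print(codes, total_bits)
```

The total of 29 bits equals the sum of the merged root weights 3 + 6 + 7 + 13, whichever of the tied weight-3 trees is merged first.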

4.1.2 LZ77 Compression
 Works by finding sequences of data that are
repeated.
 A lossless data compression algorithm.
 Maintains a 'sliding window' during compression,
which means that the compressor keeps a record of
the most recent characters it has seen.
 Goes through the text in a sliding window consisting
of a search buffer and a look ahead buffer.
 The search buffer is used as dictionary.
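
A minimal sketch of the sliding-window idea follows. It assumes the classic LZ77 output format of (offset, length, next character) triples and uses a naive search of the search buffer; the function names and the window and lookahead sizes are illustrative, and real implementations use much larger buffers and faster matching.

```python
def lz77_compress(data, window=16, lookahead=8):
    """Encode data as (offset, length, next_char) triples."""
    i, triples = 0, []
    while i < len(data):
        best_offset, best_length = 0, 0
        # Try every start position inside the search buffer.
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one short of the end so a literal next_char always exists.
            while (length < lookahead
                   and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        triples.append((best_offset, best_length, data[i + best_length]))
        i += best_length + 1
    return triples


def lz77_decompress(triples):
    """Rebuild the text by copying from the already decoded output."""
    out = []
    for offset, length, next_char in triples:
        for _ in range(length):
            out.append(out[-offset])   # one at a time, so overlaps work
        out.append(next_char)
    return "".join(out)


sample = "AABABBBABAABABBBABBABB"
triples = lz77_compress(sample)
print(triples[:3])
```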

LZ77 Compression…. contd.
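 Note: the example below builds an explicit phrase
dictionary and emits (reference, character) pairs. This is
the parsing used by LZ78, the dictionary-based successor
of LZ77. LZ77 itself emits (offset, length, next character)
triples that point back into the search buffer.
1. Suppose the input text is
AABABBBABAABABBBABBABB
2. The first block found is simply A, encoded as (0,A).
The next is AB, encoded as (1,B), where 1 is a reference
to A:
A|AB|ABBBABAABABBBABBABB
3. The next block is ABB, which is encoded as (2,B),
where 2 is a reference to AB, entered in the
dictionary one iteration ago. Continuing this way, the
string parses into
A|AB|ABB|B|ABA|ABAB|BB|ABBA|BB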
LZ77 Compression…. Contd.
 At the end of the algorithm, the dictionary is:
Reference Phrase Encoding
1 A (0,A)
2 AB (1,B)
3 ABB (2,B)
4 B (0,B)
5 ABA (2,A)
6 ABAB (5,B)
7 BB (4,B)
8 ABBA (3,A)
9 BB (7,0)
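
The dictionary above can be reproduced with a short sketch of this phrase-by-phrase parsing (the (reference, character) scheme associated with LZ78). The function name parse_phrases is hypothetical, and an empty string stands in for the 0 written in the last row, where the input ends on a phrase that is already in the dictionary.

```python
def parse_phrases(text):
    """Split text into new phrases and return (reference, char) pairs."""
    dictionary = {}   # phrase -> reference number, starting at 1
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # still a known phrase
        else:
            # Reference 0 means "no prefix"; otherwise point at the prefix.
            pairs.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit it with no new character.
        pairs.append((dictionary[phrase], ""))
    return pairs


pairs = parse_phrases("AABABBBABAABABBBABBABB")
for reference, (prefix, ch) in enumerate(pairs, start=1):
    print(reference, (prefix, ch))
```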

4.1.3 Putting Both Together
Flate adapts the way it compresses data to the data
itself. The compressor has three modes of compression
available:
1. No compression at all: an intelligent choice when the
data has already been compressed.
2. Compression, first with LZ77 and then with a slightly
modified version of Huffman coding. The trees that
are used are defined by the Flate specification itself.

Putting Both Together….contd.
3. Compression first with LZ77 and then with Huffman
coding, using trees that the compressor creates and stores
along with the data.
The data is broken up into blocks, and each block uses a
single mode of compression.
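
The mode of a block can be observed in a raw Flate stream: per RFC 1951, each block starts with a 1-bit final-block flag followed by a 2-bit type (0 = not compressed, 1 = predefined Huffman trees, 2 = trees stored with the data), packed from the least significant bit. The sketch below reads that field from the first block produced by Python's zlib module; the helper name first_block_type and the sample text are assumptions, and which of modes 1 and 2 the compressor picks depends on the input and the zlib version.

```python
import zlib

MODE_NAMES = {0: "not compressed", 1: "fixed Huffman", 2: "dynamic Huffman"}


def first_block_type(data, level):
    """Return the 2-bit block type of the first block of a raw stream."""
    compressor = zlib.compressobj(level=level, wbits=-15)   # raw DEFLATE
    stream = compressor.compress(data) + compressor.flush()
    # Bit 0 of the first byte is the final-block flag; bits 1-2 are the type.
    return (stream[0] >> 1) & 0b11


text = b"TOBEORNOTTOBEORTOBEORNOT" * 100

for level in (0, 9):
    print("level", level, "->", MODE_NAMES[first_block_type(text, level)])
```

Level 0 asks zlib to store the data without compression, which corresponds to the first mode in the list.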

5. Advantages & Disadvantages
5.1 LZW
Advantage
 It is a lossless compression algorithm, so no information is lost.
 The code table need not be passed between the compressor
and the decompressor.
 Simple, fast and good compression.
Disadvantage
 The dictionary can become too large.
 One approach is to throw the dictionary away when it reaches
a certain size.
 Useful only for a large amount of text data where redundancy
is high.

Advantages & Disadvantages
5.2 Flate Compression
Advantage
 Huffman is easy to implement.
 Flate is a lossless compression technique hence no loss of text.
 Simple, fast and good compression.
 Freedom to choose the type of compression based on the needs of the
content.
Disadvantage
 Overhead is generated due to Huffman tree generation.
 The resulting compression code becomes complex, as it
combines LZ77 and Huffman coding.
 It is quite tricky to understand and apply the right
combination of LZ77 and Huffman coding.

6. Conclusion
 LZW has various advantages when used to compress
large amounts of English text, which has high
redundancy.
 Both LZW and Flate are software-based, dictionary-based,
lossless methods of compression.
 Text compression needs a lossless technique of
compression.
 Flate, which is widely used in PDF files, is an adaptive,
flexible and complex way to compress text.
