SlideShare a Scribd company logo
1 of 20
Download to read offline
An Empirical Evaluation of XML Compression Tools

                                    Sherif Sakr
                       School of Computer Science and Engineering
                             University of New South Wales



1st International Workshop on Benchmarking of XML and Semantic Web Applications
                                 (BenchmarX’09)


                                   20 April 2009




 S. Sakr (CSE, UNSW)                  BenchmarX’09                  20 April 2009   1 / 20
XML Compression: Why?

   XML has become a popular standard with many useful applications.

   XML is often referred as self-describing data.
         On one hand, this self-describing feature grants the XML great
         flexibility.
         On the other hand, it introduces the main problem of verbosity.


   XML compression has many advantages such as:
         Reducing the network bandwidth required for data exchange.
         Reducing the disk space required for storage.
         Minimizing the main memory requirements of processing and querying
         XML documents.



  S. Sakr (CSE, UNSW)             BenchmarX’09                  20 April 2009   2 / 20
XML Compressors: Classifications I

With respect to the awareness of the structure of the XML documents:

    General Text Compressors: They are XML-Blind, treats XML
    documents as usual plain text documents and applies the traditional
    text compression techniques.

    XML Conscious Compressors: They are designed to take the
    advantage of the awareness of the XML document structure to
    achieve better compression ratios over the general text compressors.
          Schema dependent compressors: Both of the encoder and decoder
          must have access to the document schema information achieve the
          compression process. They are not commonly used in practice.
          Schema independent compressors: The availability of the schema
          information is not required to achieve the encoding and decoding
          processes.


   S. Sakr (CSE, UNSW)            BenchmarX’09                20 April 2009   3 / 20
XML Compressors: Classifications II

With respect to the ability of supporting queries:

     Non-Queriable (Archival) XML Compressors: They do not allow
     any queries to be processed over the compressed format. They are
     mainly focusing to achieve the highest compression ratio.
           By default, general purpose text compressors belong to the
           non-queriable group of compressors.

     Queriable XML Compressors: They allow queries to be processed
     over their compressed formats. The compression ratio is usually worse
     than that of the archival XML compressors. The main focus is to
     avoid full document decompression during query execution.
           The ability to perform direct queries on compressed XML formats is
           important for many applications which are hosted on resource-limited
           computing devices such as: mobile devices and GPS systems .
           By default, all queriable compressors are XML conscious compressors
           as well.

    S. Sakr (CSE, UNSW)             BenchmarX’09                  20 April 2009   4 / 20
Examination Criteria

In our study we considered, to the best of our knowledge, all XML
compressors which are fulfilling the following conditions:

    Is publicly and freely available either in the form of open source codes
    or binary versions.

    Is schema-independent.

    Be able to run under our Linux version of operating system.
          Ubuntu 7.10 (Linux 2.6.20 Kernel)
          Ubuntu 7.10 (Linux 2.6.22 Kernel)




   S. Sakr (CSE, UNSW)            BenchmarX’09                20 April 2009   5 / 20
Examined Compressors

                             XML Compressors List
   Compressor            Features    Code           Compressor   Features       Code
                                    Available                                  Available
   GZIP (1.3.12)          GAI          Y            XGrind        SQI             Y
   BZIP2 (1.0.4)          GAI          Y            XBzip         SQI             N
   PPM (j.1)              GAI          Y            XQueC         SQI             N
   XMill (0.7)            SAI          Y            XCQ           SQI             N
   XMLPPM (0.98.3)        SAI          Y            XPress        SQI             N
   SCMPPM (0.93.3)        SAI          Y            XQzip         SQI             N
   XWRT (3.2)             SAI          Y            XSeq          SQI             N
   Exalt (0.1.0)          SAI          Y            QXT           SQI             N
   AXECHOP                SAI          Y            ISX           SQI             N
   DTDPPM                 SAD          Y            XAUST         SAD             Y
   rngzip                 SQD          Y            Millau        SAD             N

                    Symbols list of XML compressors features
  Symbol    Description                     Symbol       Description
  G         General Text Compressor         S            Specific XML Compressor
  D         Schema dependent Compressor     I            Schema Independent Compressor
  A         Archival XML Compressor         Q            Queriable XML Compressor

   S. Sakr (CSE, UNSW)               BenchmarX’09                           20 April 2009   6 / 20
Testing Corpus (Data Sets)

    Determining the XML files that should be used for evaluating the set
    of XML compression tools is not a simple task.

    The documents of our corpus are classified into four categories:
          Regular Documents: Regular document structure and short data
          contents. They reflect the XML view of relational data. The data ratio
          of these documents is in the range of between 40% and 60%.
          Irregular documents: Very deep, complex and irregular structure.
          More challenging in terms of compression efficiency.
          Textual documents: Simple structure and high ratio of the contents
          is preserved to the data values. The ratio of the data contents of these
          documents represent more than 70% of the document size.
          Structural documents: No data contents at all. 100% of each
          document size is preserved to its structure information. They are used
          to assess the claim of XML conscious compressors on using the well
          known structure of XML documents for achieving higher compression
          ratios on the structural parts of XML documents.
   S. Sakr (CSE, UNSW)              BenchmarX’09                   20 April 2009   7 / 20
Testing Corpus (Data Sets)
 Data Set Name      Document Name                Size (MB)    Tags   Number of Nodes   Depth    Data Ratio
                    Telecomp.xml                       0.65     39            651398       7          0.48
                    Weblog.xml                         2.60     12            178419       3          0.31
 EXI                Invoice.xml                        0.93     52             78377       7          0.57
                    Array.xml                         22.18     47           1168115      10          0.68
                    Factbook.xml                       4.12    199            104117       5          0.53
                    Geographic Coordinates.xml        16.20     17                55       3             1
                    XMark1.xml                        11.40     74            520546      12          0.74
 XMark              XMark2.xml                       113.80     74           5167121      12          0.74
                    XMark3.xml                       571.75     74          25900899      12          0.74
                    DCSD-Small.xml                    10.60     50           6190628       8          0.45
                    DCSD-Normal.xml                  105.60     50           6190628       8          0.45
 XBench
                    TCSD-Small.xml                    10.95     24            831393       8          0.78
                    TCSD-Normal.xml                  106.25     24           8085816       8          0.78
                    EnWikiNews.xml                    71.09     20           2013778       5          0.91
                    EnWikiQuote.xml                  127.25     20           2672870       5          0.97
 Wikipedia          EnWikiSource.xml                1036.66     20          13423014       5          0.98
                    EnWikiVersity.xml                 83.35     20           3333622       5          0.91
                    EnWikTionary.xml                 570.00     20          28656178       5          0.77
 DBLP               DBLP.xml                         130.72     32           4718588       5          0.58
 U.S House          USHouse.xml                        0.52     43             16963      16          0.77
 SwissProt          SwissProt.xml                    112.13     85          13917441       5          0.60
 NASA               NASA.xml                          24.45     61           2278447       8          0.66
 Shakespeare        Shakespeare.xml                    7.47     22            574156       7          0.64
 Lineitem           Lineitem.xml                      31.48     18           2045953       3          0.19
 Mondial            Mondial.xml                        1.75     23            147207       5          0.77
 BaseBall           BaseBall.xml                       0.65     46             57812       6          0.11
 Treebank           Treebank.xml                      84.06    250          10795711      36          0.70
                    Random-R1.xml                     14.20    100           1249997      28             0
 Random             Random-R2.xml                     53.90    200           3750002      34             0
                    Random-R3.xml                     97.85    300           7500017      30             0

       S. Sakr (CSE, UNSW)                       BenchmarX’09                           20 April 2009   8 / 20
Testing Environments

    To ensure the consistency of the performance behaviors of the
    evaluated XML compressors, we ran our experiments on two different
    environments.

    One environment with high computing resources and the other with
    considerably limited computing resources.

              High Resources Setup                Limited Resources Setup
    OS        Ubuntu 7.10 (Linux 2.6.22 Kernel)   Ubuntu 7.10 (Linux 2.6.20 Kernel)
    CPU       Intel Core 2 Duo E6850              Intel Pentium 4
              3.00 GHz, FSB 1333MHz               2.66GHz, FSB 533MHz
              4MB L2 Cache                        512KB L2 Cache
    HD        Seagate ST3250820AS - 250 GB        Western Digital WD400BB - 40 GB
    RAM       4 GB                                512 MB




   S. Sakr (CSE, UNSW)                 BenchmarX’09                      20 April 2009   9 / 20
Performance Criteria

We measure and compare the performance of the XML compression tools
using the following metrics:

    Compression Ratio: represents the ratio between the sizes of
    compressed and uncompressed XML documents.
        Compression Ratio = (Compressed Size) / (Uncompressed Size)

    Compression Time: represents the elapsed time during the
    compression process.

    Decompression Time: represents the elapsed time during the
    decompression process.

For all metrics: the lower the metric value, the better the compressor.



   S. Sakr (CSE, UNSW)         BenchmarX’09              20 April 2009   10 / 20
Experimental Framework

    We evaluated 11 XML compressors: 3 general purpose text
    compressors and 8 XML conscious compressors.
    Our corpus consists of 57 documents: 27 original documents, 27
    structural copies and 3 randomly generated structural documents.
    We run the experiments on two different platforms.
    For each combination of an XML test document and an XML
    compressor, we run two different operations (compression -
    decompression).
    To ensure accuracy, all reported numbers for our time metrics are the
    average of five executions with the highest and the lowest values
    removed.
    We created our own mix of Unix shell and Perl scripts to run and
    collect the results of these huge number of runs.
    The web page of this study provides access to the test files, examined
    XML compressors and the detailed results of this study.
    http://xmlcompbench.sourceforge.net/.
   S. Sakr (CSE, UNSW)         BenchmarX’09                20 April 2009   11 / 20
Compression Ratio




                                       0.00
                                              0.01
                                                     0.02
                                                            0.03
                                                                   0.04
                                                                          0.05
                                                                                 0.06
                                                                                        0.07
                                                                                               0.08
                                                                                                      0.09
                                                                                                             0.10
                           BaseBall
                               DBLP
                         EnWikiNew
                       EnWikiQuote




S. Sakr (CSE, UNSW)
                      EnWikiSource
                      EnWikiVersity
                      EnWikTionary
                          EXI-Array
                       EXI-factbook
                         EXI-Invoice
                                                                                                                    Structural Documents




                      EXI-Telecomp
                         EXI-weblog
                           Lineitem
                            Mondial
                               Nasa
                       Shakespeare
                          SwissProt
                          Treebank




BenchmarX’09
                          USHouse
                      DCSD-Normal
                       DCSD-Small
                      TCSD-Normal
                        TCSD-Small
                            XMark1
                            XMark2
                            XMark3
                        Random-R1
                        Random-R2
                        Random-R3
                                                            Gzip




20 April 2009
                                                            PPM




                                                            Exalt
                                                            Bzip2




                                                            XWRT
                                                                                                                    Experimental Results : Detailed Compression Ratios of




                                                            Axechop
                                                            XMLPPM
                                                            XMillGzip
                                                            XMillPPM


                                                            SCMPPM
                                                            XMillBzip2




12 / 20
Compression Ratio




                                        0.00
                                               0.05
                                                      0.10
                                                              0.15
                                                                     0.20
                                                                            0.25
                                                                                   0.30
                                                                                          0.35
                                                                                                 0.40
                                                                                                        0.45
                                                                                                               0.50
                            BaseBall
                                DBLP
                          EnWikiNew
                        EnWikiQuote




S. Sakr (CSE, UNSW)
                       EnWikiSource
                       EnWikiVersity
                       EnWikTionary
                           EXI-Array
                                                                                                                      Original Documents




                        EXI-factbook
                      EXI-GeogCoord
                          EXI-Invoice
                       EXI-Telecomp
                          EXI-weblog
                            Lineitem
                             Mondial
                                Nasa
                        Shakespeare




BenchmarX’09
                           SwissProt
                           Treebank
                           USHouse
                       DCSD-Normal
                        DCSD-Small
                       TCSD-Normal
                         TCSD-Small
                             XMark1
                             XMark2
                             XMark3
                                                             Gzip
                                                             PPM




                                                             Exalt
                                                             Bzip2




20 April 2009
                                                             XWRT
                                                                                                                      Experimental Results : Detailed Compression Ratios of




                                                             Axechop
                                                             XMLPPM
                                                             XMillGzip
                                                             XMillPPM


                                                             SCMPPM
                                                             XMillBzip2




13 / 20
Experimental Results : Average Compression Ratios

                                                                                                                                                                       0.24
                               0.07


                               0.06                                                                                                                                    0.22




                                                                                                                                           Avergae Compression Ratio
   Average Compression Ratio




                               0.05                                                                                                                                    0.20


                               0.04                                                                                                                                    0.18


                               0.03
                                                                                                                                                                       0.16

                               0.02
                                                                                                                                                                       0.14

                               0.01
                                                                                                                                                                       0.12
                               0.00
                                       XMillPPM


                                                  XMillGzip


                                                              XMillBzip2


                                                                           XMLPPM


                                                                                    Bzip2




                                                                                                    Axechop


                                                                                                              Gzip


                                                                                                                     SCMPPM


                                                                                                                              PPM
                                                                                            Exalt




                                                                                                                                    XWRT




                                                                                                                                                                              SCMPPM


                                                                                                                                                                                       XMLPPM




                                                                                                                                                                                                       XMillBzip2


                                                                                                                                                                                                                    XMillPPM


                                                                                                                                                                                                                               Bzip2


                                                                                                                                                                                                                                       PPM


                                                                                                                                                                                                                                             XMillGzip


                                                                                                                                                                                                                                                         Gzip
                                                                                                                                                                                                XWRT
                                      (a) Structural documents.                                                                                                               (b) Original documents.

            S. Sakr (CSE, UNSW)                                                                                                 BenchmarX’09                                                                                     20 April 2009                  14 / 20
Overall Performance of XML Compressors

  7


  6


  5                                                                       Bzip2
                                                                          Gzip
                                                                          PPM
  4                                                                       XMillBzip2
                                                                          XMillGzip
                                                                          XMillPPM
  3                                                                       XMLPPM
                                                                          XWRT
                                                                          SCMPPM
  2


  1


  0
        Compression Ratio   Compression Time     Decompression Time



   S. Sakr (CSE, UNSW)                BenchmarX’09                    20 April 2009    15 / 20
Proposed Ranking of XML Compressors

    The results of our experiments have not shown a clear winner.

    Different ranking methods and different weights for the factors could
    be used for this task. Deciding the weight of each metric is mainly
    dependant on the scenarios and requirements of the applications
    where these compression tools could be used.

    We used three ranking functions which give different weights for our
    performance metrics:
          WF1 = (1/3 * CR) + (1/3 * CT) + (1/3 * DCT).
          WF2 = (1/2 * CR) + (1/4 * CT) + (1/4 * DCT).
          WF3 = (3/5 * CR) + (1/5 * CT) + (1/5 * DCT).
    where CR represents the compression ratio metric, CT represents the
    compression time metric and DCT represents the decompression time
    metric.


   S. Sakr (CSE, UNSW)         BenchmarX’09               20 April 2009   16 / 20
Proposed Ranking of XML Compressors


  3.0



  2.5
                                                       Bzip2
                                                       Gzip
  2.0                                                  PPM
                                                       XMillBzip2
                                                       XMillGzip
                                                       XMillPPM
  1.5                                                  XMLPPM
                                                       XWRT
                                                       SCMPPM
  1.0



  0.5



  0.0
               WF1       WF2                  WF3



   S. Sakr (CSE, UNSW)         BenchmarX’09         20 April 2009   17 / 20
Conclusions

    The primary innovation in the XML compression mechanisms was
    introduced in XMill by separating the structural part of the XML
    document from the data part and then group the related data items
    into homogenous containers that can be compressed separably. Most
    of the following XML compressors have simulated this idea in
    different ways.
    The dominant practice in most of the XML compressors is to utilize
    the well-known structure of XML documents for applying a
    pre-processing encoding step and then forwarding the results of this
    step to general purpose compressors.
    There are no publicly available solid implementations for
    grammar-based XML compression techniques and queriable XML
    compressors.
    The authors of the XML compressors should provide more attention
    to provide the source code of their implementations available.
   S. Sakr (CSE, UNSW)          BenchmarX’09               20 April 2009   18 / 20
Conclusions

We believe that this paper could be valuable for both the developers of
new XML compression tools and interested users as well.
    For developers, they can use the results of this paper to effectively
    decide on the points which can be improved in order to make an
    effective contribution.
           We recommend tackling the area of developing stable efficient
           queriable XML compressors. Although there has been a lot of literature
           presented in this domain, we are still missing efficient, scalable and
           stable implementations in this domain.
     For users, this study could be helpful for making an effective decision
     to select the suitable compressor for their requirements.
           For users with highest compression ratio requirement, the results of our
           experiments recommend the usage of either the PPM compressor with
           the highest level of compression parameter or the XWRT compressor
           with the highest level of compression parameter.
           For users with fastest compression time and moderate compression ratio
           requirements, gzip and XMillGzip are considered to be the best choice.
    S. Sakr (CSE, UNSW)              BenchmarX’09                  20 April 2009   19 / 20
The End




                        Thank You




  S. Sakr (CSE, UNSW)     BenchmarX’09   20 April 2009   20 / 20

More Related Content

Viewers also liked

GANG Announcements, Sept 2009
GANG Announcements, Sept 2009GANG Announcements, Sept 2009
GANG Announcements, Sept 2009David Giard
 
Comparative study and non comparative study
Comparative study and non comparative studyComparative study and non comparative study
Comparative study and non comparative studyAbdullah al-kharusi
 
Legge 143/1949 Tabelle Tariffa Ingegneri Architetti
Legge 143/1949 Tabelle Tariffa Ingegneri ArchitettiLegge 143/1949 Tabelle Tariffa Ingegneri Architetti
Legge 143/1949 Tabelle Tariffa Ingegneri ArchitettiEugenio Agnello
 
Open Source Presentation
Open Source PresentationOpen Source Presentation
Open Source Presentationguest9a617
 
Workshop Weblog, Wiki & Twitter
Workshop Weblog, Wiki & TwitterWorkshop Weblog, Wiki & Twitter
Workshop Weblog, Wiki & TwitterGSlotboom
 
Μαθαίνοντας από τα λάθη μας - Μέρος 1ο
Μαθαίνοντας από τα λάθη μας  - Μέρος 1οΜαθαίνοντας από τα λάθη μας  - Μέρος 1ο
Μαθαίνοντας από τα λάθη μας - Μέρος 1οLiana Lignou
 
50 Words Powerpoint Brock
50 Words Powerpoint Brock50 Words Powerpoint Brock
50 Words Powerpoint Brockmrrobbo
 
Gang announcements 2011 06
Gang announcements 2011 06Gang announcements 2011 06
Gang announcements 2011 06David Giard
 
Pillars.io wake upstartup
Pillars.io wake upstartupPillars.io wake upstartup
Pillars.io wake upstartupBrian Link
 
City University Food Thinkers
City University Food ThinkersCity University Food Thinkers
City University Food Thinkersjackthur
 
Arc Ready Fy09 Q3 Live Mesh
Arc Ready Fy09 Q3   Live MeshArc Ready Fy09 Q3   Live Mesh
Arc Ready Fy09 Q3 Live MeshDavid Giard
 
Jacl Thurston talk at e-Democracy Conference
Jacl Thurston talk at e-Democracy ConferenceJacl Thurston talk at e-Democracy Conference
Jacl Thurston talk at e-Democracy Conferencejackthur
 
Gang announcements 2011 05
Gang announcements 2011 05Gang announcements 2011 05
Gang announcements 2011 05David Giard
 

Viewers also liked (20)

GANG Announcements, Sept 2009
GANG Announcements, Sept 2009GANG Announcements, Sept 2009
GANG Announcements, Sept 2009
 
Comparative study and non comparative study
Comparative study and non comparative studyComparative study and non comparative study
Comparative study and non comparative study
 
09 Vocab
09 Vocab09 Vocab
09 Vocab
 
Legge 143/1949 Tabelle Tariffa Ingegneri Architetti
Legge 143/1949 Tabelle Tariffa Ingegneri ArchitettiLegge 143/1949 Tabelle Tariffa Ingegneri Architetti
Legge 143/1949 Tabelle Tariffa Ingegneri Architetti
 
GrouperEye
GrouperEyeGrouperEye
GrouperEye
 
BDD
BDDBDD
BDD
 
Open Source Presentation
Open Source PresentationOpen Source Presentation
Open Source Presentation
 
Workshop Weblog, Wiki & Twitter
Workshop Weblog, Wiki & TwitterWorkshop Weblog, Wiki & Twitter
Workshop Weblog, Wiki & Twitter
 
Template-devil
Template-devilTemplate-devil
Template-devil
 
Μαθαίνοντας από τα λάθη μας - Μέρος 1ο
Μαθαίνοντας από τα λάθη μας  - Μέρος 1οΜαθαίνοντας από τα λάθη μας  - Μέρος 1ο
Μαθαίνοντας από τα λάθη μας - Μέρος 1ο
 
50 Words Powerpoint Brock
50 Words Powerpoint Brock50 Words Powerpoint Brock
50 Words Powerpoint Brock
 
Gang announcements 2011 06
Gang announcements 2011 06Gang announcements 2011 06
Gang announcements 2011 06
 
Koongo - De universele product export
Koongo - De universele product exportKoongo - De universele product export
Koongo - De universele product export
 
Pillars.io wake upstartup
Pillars.io wake upstartupPillars.io wake upstartup
Pillars.io wake upstartup
 
Obama's T-shirts
Obama's T-shirtsObama's T-shirts
Obama's T-shirts
 
City University Food Thinkers
City University Food ThinkersCity University Food Thinkers
City University Food Thinkers
 
Arc Ready Fy09 Q3 Live Mesh
Arc Ready Fy09 Q3   Live MeshArc Ready Fy09 Q3   Live Mesh
Arc Ready Fy09 Q3 Live Mesh
 
Forze
ForzeForze
Forze
 
Jacl Thurston talk at e-Democracy Conference
Jacl Thurston talk at e-Democracy ConferenceJacl Thurston talk at e-Democracy Conference
Jacl Thurston talk at e-Democracy Conference
 
Gang announcements 2011 05
Gang announcements 2011 05Gang announcements 2011 05
Gang announcements 2011 05
 

Similar to An Empirical Evaluation of XML Compression Tools: Why? How? Results

Performance analysis in a multitenant cloud environment Using Hadoop Cluster ...
Performance analysis in a multitenant cloud environment Using Hadoop Cluster ...Performance analysis in a multitenant cloud environment Using Hadoop Cluster ...
Performance analysis in a multitenant cloud environment Using Hadoop Cluster ...Orgad Kimchi
 
C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl YeksigianC* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl YeksigianDataStax Academy
 
OpenMAX_IL_with_GSstreamer.pdf
OpenMAX_IL_with_GSstreamer.pdfOpenMAX_IL_with_GSstreamer.pdf
OpenMAX_IL_with_GSstreamer.pdfPreowner
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network ProcessingRyousei Takano
 
Joget Workflow Clustering and Performance Testing on Amazon Web Services (AWS)
Joget Workflow Clustering and Performance Testing on Amazon Web Services (AWS)Joget Workflow Clustering and Performance Testing on Amazon Web Services (AWS)
Joget Workflow Clustering and Performance Testing on Amazon Web Services (AWS)Joget Workflow
 
The post release technologies of Crysis 3 (Slides Only) - Stewart Needham
The post release technologies of Crysis 3 (Slides Only) - Stewart NeedhamThe post release technologies of Crysis 3 (Slides Only) - Stewart Needham
The post release technologies of Crysis 3 (Slides Only) - Stewart NeedhamStewart Needham
 
Experiences in ELK with D3.js for Large Log Analysis and Visualization
Experiences in ELK with D3.js  for Large Log Analysis  and VisualizationExperiences in ELK with D3.js  for Large Log Analysis  and Visualization
Experiences in ELK with D3.js for Large Log Analysis and VisualizationSurasak Sanguanpong
 
Amis Query (02-09-2008): Reports From Oracle Open World - Database
Amis Query (02-09-2008): Reports From Oracle Open World - DatabaseAmis Query (02-09-2008): Reports From Oracle Open World - Database
Amis Query (02-09-2008): Reports From Oracle Open World - DatabaseMarco Gralike
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Tuning Solr for Logs: Presented by Radu Gheorghe, SematextTuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Tuning Solr for Logs: Presented by Radu Gheorghe, SematextLucidworks
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management ServiceSafe Software
 
Mazda Use of Third Generation Xml Tools
Mazda Use of Third Generation Xml ToolsMazda Use of Third Generation Xml Tools
Mazda Use of Third Generation Xml ToolsCardinaleWay Mazda
 
Oow2007 performance
Oow2007 performanceOow2007 performance
Oow2007 performanceRicky Zhu
 
Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for EveryoneGiovanna Roda
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfakAsfak Mahamud
 
Seismic Analysis Code (SAC)
Seismic Analysis Code (SAC)Seismic Analysis Code (SAC)
Seismic Analysis Code (SAC)ssuser8193e7
 

Similar to An Empirical Evaluation of XML Compression Tools: Why? How? Results (20)

Parsing XML in J2ME
Parsing XML in J2MEParsing XML in J2ME
Parsing XML in J2ME
 
Performance analysis in a multitenant cloud environment Using Hadoop Cluster ...
Performance analysis in a multitenant cloud environment Using Hadoop Cluster ...Performance analysis in a multitenant cloud environment Using Hadoop Cluster ...
Performance analysis in a multitenant cloud environment Using Hadoop Cluster ...
 
Glomosim scenarios
Glomosim scenariosGlomosim scenarios
Glomosim scenarios
 
C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl YeksigianC* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
 
OpenMAX_IL_with_GSstreamer.pdf
OpenMAX_IL_with_GSstreamer.pdfOpenMAX_IL_with_GSstreamer.pdf
OpenMAX_IL_with_GSstreamer.pdf
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
 
Joget Workflow Clustering and Performance Testing on Amazon Web Services (AWS)
Joget Workflow Clustering and Performance Testing on Amazon Web Services (AWS)Joget Workflow Clustering and Performance Testing on Amazon Web Services (AWS)
Joget Workflow Clustering and Performance Testing on Amazon Web Services (AWS)
 
The post release technologies of Crysis 3 (Slides Only) - Stewart Needham
The post release technologies of Crysis 3 (Slides Only) - Stewart NeedhamThe post release technologies of Crysis 3 (Slides Only) - Stewart Needham
The post release technologies of Crysis 3 (Slides Only) - Stewart Needham
 
Experiences in ELK with D3.js for Large Log Analysis and Visualization
Experiences in ELK with D3.js  for Large Log Analysis  and VisualizationExperiences in ELK with D3.js  for Large Log Analysis  and Visualization
Experiences in ELK with D3.js for Large Log Analysis and Visualization
 
Amis Query (02-09-2008): Reports From Oracle Open World - Database
Amis Query (02-09-2008): Reports From Oracle Open World - DatabaseAmis Query (02-09-2008): Reports From Oracle Open World - Database
Amis Query (02-09-2008): Reports From Oracle Open World - Database
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Tuning Solr for Logs: Presented by Radu Gheorghe, SematextTuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management Service
 
Parsing XML Data
Parsing XML DataParsing XML Data
Parsing XML Data
 
optimizing_ceph_flash
optimizing_ceph_flashoptimizing_ceph_flash
optimizing_ceph_flash
 
Mazda Use of Third Generation Xml Tools
Mazda Use of Third Generation Xml ToolsMazda Use of Third Generation Xml Tools
Mazda Use of Third Generation Xml Tools
 
Oow2007 performance
Oow2007 performanceOow2007 performance
Oow2007 performance
 
Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for Everyone
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfak
 
Seismic Analysis Code (SAC)
Seismic Analysis Code (SAC)Seismic Analysis Code (SAC)
Seismic Analysis Code (SAC)
 

More from University of New South Wales (11)

Declarative analysis of noisy information networks
Declarative analysis of noisy information networksDeclarative analysis of noisy information networks
Declarative analysis of noisy information networks
 
InfiniteGraph
InfiniteGraphInfiniteGraph
InfiniteGraph
 
Gremlin
Gremlin Gremlin
Gremlin
 
DHHT - Modeling beyond plain graphs
DHHT - Modeling beyond plain graphsDHHT - Modeling beyond plain graphs
DHHT - Modeling beyond plain graphs
 
Dex
DexDex
Dex
 
Ontological Conjunctive Query Answering over Large Knowledge Bases
Ontological Conjunctive Query Answering over Large Knowledge BasesOntological Conjunctive Query Answering over Large Knowledge Bases
Ontological Conjunctive Query Answering over Large Knowledge Bases
 
Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Key-Key-Value Stores for Efficiently Processing Graph Data in the CloudKey-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
 
Allegograph
AllegographAllegograph
Allegograph
 
Neo4j
Neo4jNeo4j
Neo4j
 
Dependable Cardinality Forecast for XQuery
Dependable Cardinality Forecast for XQueryDependable Cardinality Forecast for XQuery
Dependable Cardinality Forecast for XQuery
 
GraphREL: A Relational Graph Query Processor
GraphREL: A Relational Graph Query ProcessorGraphREL: A Relational Graph Query Processor
GraphREL: A Relational Graph Query Processor
 

Recently uploaded

Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxnelietumpap1
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 

Recently uploaded (20)

YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptx
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 

An Empirical Evaluation of XML Compression Tools: Why? How? Results

  • 1. An Empirical Evaluation of XML Compression Tools Sherif Sakr School of Computer Science and Engineering University of New South Wales 1st International Workshop on Benchmarking of XML and Semantic Web Applications (BenchmarX’09) 20 April 2009 S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 1 / 20
  • 2. XML Compression: Why? XML has become a popular standard with many useful applications. XML is often referred as self-describing data. On one hand, this self-describing feature grants the XML great flexibility. On the other hand, it introduces the main problem of verbosity. XML compression has many advantages such as: Reducing the network bandwidth required for data exchange. Reducing the disk space required for storage. Minimizing the main memory requirements of processing and querying XML documents. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 2 / 20
  • 3. XML Compressors: Classifications I With respect to the awareness of the structure of the XML documents: General Text Compressors: They are XML-Blind, treats XML documents as usual plain text documents and applies the traditional text compression techniques. XML Conscious Compressors: They are designed to take the advantage of the awareness of the XML document structure to achieve better compression ratios over the general text compressors. Schema dependent compressors: Both of the encoder and decoder must have access to the document schema information achieve the compression process. They are not commonly used in practice. Schema independent compressors: The availability of the schema information is not required to achieve the encoding and decoding processes. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 3 / 20
  • 4. XML Compressors: Classifications II With respect to the ability of supporting queries: Non-Queriable (Archival) XML Compressors: They do not allow any queries to be processed over the compressed format. They are mainly focusing to achieve the highest compression ratio. By default, general purpose text compressors belong to the non-queriable group of compressors. Queriable XML Compressors: They allow queries to be processed over their compressed formats. The compression ratio is usually worse than that of the archival XML compressors. The main focus is to avoid full document decompression during query execution. The ability to perform direct queries on compressed XML formats is important for many applications which are hosted on resource-limited computing devices such as: mobile devices and GPS systems . By default, all queriable compressors are XML conscious compressors as well. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 4 / 20
  • 5. Examination Criteria In our study we considered, to the best of our knowledge, all XML compressors which are fulfilling the following conditions: Is publicly and freely available either in the form of open source codes or binary versions. Is schema-independent. Be able to run under our Linux version of operating system. Ubuntu 7.10 (Linux 2.6.20 Kernel) Ubuntu 7.10 (Linux 2.6.22 Kernel) S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 5 / 20
  • 6. Examined Compressors XML Compressors List Compressor Features Code Compressor Features Code Available Available GZIP (1.3.12) GAI Y XGrind SQI Y BZIP2 (1.0.4) GAI Y XBzip SQI N PPM (j.1) GAI Y XQueC SQI N XMill (0.7) SAI Y XCQ SQI N XMLPPM (0.98.3) SAI Y XPress SQI N SCMPPM (0.93.3) SAI Y XQzip SQI N XWRT (3.2) SAI Y XSeq SQI N Exalt (0.1.0) SAI Y QXT SQI N AXECHOP SAI Y ISX SQI N DTDPPM SAD Y XAUST SAD Y rngzip SQD Y Millau SAD N Symbols list of XML compressors features Symbol Description Symbol Description G General Text Compressor S Specific XML Compressor D Schema dependent Compressor I Schema Independent Compressor A Archival XML Compressor Q Queriable XML Compressor S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 6 / 20
  • 7. Testing Corpus (Data Sets) Determining the XML files that should be used for evaluating the set of XML compression tools is not a simple task. The documents of our corpus are classified into four categories: Regular Documents: Regular document structure and short data contents. They reflect the XML view of relational data. The data ratio of these documents is in the range of between 40% and 60%. Irregular documents: Very deep, complex and irregular structure. More challenging in terms of compression efficiency. Textual documents: Simple structure and high ratio of the contents is preserved to the data values. The ratio of the data contents of these documents represent more than 70% of the document size. Structural documents: No data contents at all. 100% of each document size is preserved to its structure information. They are used to assess the claim of XML conscious compressors on using the well known structure of XML documents for achieving higher compression ratios on the structural parts of XML documents. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 7 / 20
  • 8. Testing Corpus (Data Sets) Data Set Name Document Name Size (MB) Tags Number of Nodes Depth Data Ratio Telecomp.xml 0.65 39 651398 7 0.48 Weblog.xml 2.60 12 178419 3 0.31 EXI Invoice.xml 0.93 52 78377 7 0.57 Array.xml 22.18 47 1168115 10 0.68 Factbook.xml 4.12 199 104117 5 0.53 Geographic Coordinates.xml 16.20 17 55 3 1 XMark1.xml 11.40 74 520546 12 0.74 XMark XMark2.xml 113.80 74 5167121 12 0.74 XMark3.xml 571.75 74 25900899 12 0.74 DCSD-Small.xml 10.60 50 6190628 8 0.45 DCSD-Normal.xml 105.60 50 6190628 8 0.45 XBench TCSD-Small.xml 10.95 24 831393 8 0.78 TCSD-Normal.xml 106.25 24 8085816 8 0.78 EnWikiNews.xml 71.09 20 2013778 5 0.91 EnWikiQuote.xml 127.25 20 2672870 5 0.97 Wikipedia EnWikiSource.xml 1036.66 20 13423014 5 0.98 EnWikiVersity.xml 83.35 20 3333622 5 0.91 EnWikTionary.xml 570.00 20 28656178 5 0.77 DBLP DBLP.xml 130.72 32 4718588 5 0.58 U.S House USHouse.xml 0.52 43 16963 16 0.77 SwissProt SwissProt.xml 112.13 85 13917441 5 0.60 NASA NASA.xml 24.45 61 2278447 8 0.66 Shakespeare Shakespeare.xml 7.47 22 574156 7 0.64 Lineitem Lineitem.xml 31.48 18 2045953 3 0.19 Mondial Mondial.xml 1.75 23 147207 5 0.77 BaseBall BaseBall.xml 0.65 46 57812 6 0.11 Treebank Treebank.xml 84.06 250 10795711 36 0.70 Random-R1.xml 14.20 100 1249997 28 0 Random Random-R2.xml 53.90 200 3750002 34 0 Random-R3.xml 97.85 300 7500017 30 0 S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 8 / 20
  • 9. Testing Environments To ensure the consistency of the performance behaviors of the evaluated XML compressors, we ran our experiments on two different environments. One environment with high computing resources and the other with considerably limited computing resources. High Resources Setup Limited Resources Setup OS Ubuntu 7.10 (Linux 2.6.22 Kernel) Ubuntu 7.10 (Linux 2.6.20 Kernel) CPU Intel Core 2 Duo E6850 Intel Pentium 4 3.00 GHz, FSB 1333MHz 2.66GHz, FSB 533MHz 4MB L2 Cache 512KB L2 Cache HD Seagate ST3250820AS - 250 GB Western Digital WD400BB - 40 GB RAM 4 GB 512 MB S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 9 / 20
  • 10. Performance Criteria We measure and compare the performance of the XML compression tools using the following metrics: Compression Ratio: represents the ratio between the sizes of compressed and uncompressed XML documents. Compression Ratio = (Compressed Size) / (Uncompressed Size) Compression Time: represents the elapsed time during the compression process. Decompression Time: represents the elapsed time during the decompression process. For all metrics: the lower the metric value, the better the compressor. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 10 / 20
  • 11. Experimental Framework We evaluated 11 XML compressors: 3 general purpose text compressors and 8 XML conscious compressors. Our corpus consists of 57 documents: 27 original documents, 27 structural copies and 3 randomly generated structural documents. We run the experiments on two different platforms. For each combination of an XML test document and an XML compressor, we run two different operations (compression - decompression). To ensure accuracy, all reported numbers for our time metrics are the average of five executions with the highest and the lowest values removed. We created our own mix of Unix shell and Perl scripts to run and collect the results of these huge number of runs. The web page of this study provides access to the test files, examined XML compressors and the detailed results of this study. http://xmlcompbench.sourceforge.net/. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 11 / 20
  • 12. Compression Ratio 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 BaseBall DBLP EnWikiNew EnWikiQuote S. Sakr (CSE, UNSW) EnWikiSource EnWikiVersity EnWikTionary EXI-Array EXI-factbook EXI-Invoice Structural Documents EXI-Telecomp EXI-weblog Lineitem Mondial Nasa Shakespeare SwissProt Treebank BenchmarX’09 USHouse DCSD-Normal DCSD-Small TCSD-Normal TCSD-Small XMark1 XMark2 XMark3 Random-R1 Random-R2 Random-R3 Gzip 20 April 2009 PPM Exalt Bzip2 XWRT Experimental Results : Detailed Compression Ratios of Axechop XMLPPM XMillGzip XMillPPM SCMPPM XMillBzip2 12 / 20
  • 13. Compression Ratio 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 BaseBall DBLP EnWikiNew EnWikiQuote S. Sakr (CSE, UNSW) EnWikiSource EnWikiVersity EnWikTionary EXI-Array Original Documents EXI-factbook EXI-GeogCoord EXI-Invoice EXI-Telecomp EXI-weblog Lineitem Mondial Nasa Shakespeare BenchmarX’09 SwissProt Treebank USHouse DCSD-Normal DCSD-Small TCSD-Normal TCSD-Small XMark1 XMark2 XMark3 Gzip PPM Exalt Bzip2 20 April 2009 XWRT Experimental Results : Detailed Compression Ratios of Axechop XMLPPM XMillGzip XMillPPM SCMPPM XMillBzip2 13 / 20
  • 14. Experimental Results : Average Compression Ratios 0.24 0.07 0.06 0.22 Avergae Compression Ratio Average Compression Ratio 0.05 0.20 0.04 0.18 0.03 0.16 0.02 0.14 0.01 0.12 0.00 XMillPPM XMillGzip XMillBzip2 XMLPPM Bzip2 Axechop Gzip SCMPPM PPM Exalt XWRT SCMPPM XMLPPM XMillBzip2 XMillPPM Bzip2 PPM XMillGzip Gzip XWRT (a) Structural documents. (b) Original documents. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 14 / 20
  • 15. Overall Performance of XML Compressors 7 6 5 Bzip2 Gzip PPM 4 XMillBzip2 XMillGzip XMillPPM 3 XMLPPM XWRT SCMPPM 2 1 0 Compression Ratio Compression Time Decompression Time S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 15 / 20
  • 16. Proposed Ranking of XML Compressors The results of our experiments have not shown a clear winner. Different ranking methods and different weights for the factors could be used for this task. Deciding the weight of each metric is mainly dependant on the scenarios and requirements of the applications where these compression tools could be used. We used three ranking functions which give different weights for our performance metrics: WF1 = (1/3 * CR) + (1/3 * CT) + (1/3 * DCT). WF2 = (1/2 * CR) + (1/4 * CT) + (1/4 * DCT). WF3 = (3/5 * CR) + (1/5 * CT) + (1/5 * DCT). where CR represents the compression ratio metric, CT represents the compression time metric and DCT represents the decompression time metric. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 16 / 20
  • 17. Proposed Ranking of XML Compressors 3.0 2.5 Bzip2 Gzip 2.0 PPM XMillBzip2 XMillGzip XMillPPM 1.5 XMLPPM XWRT SCMPPM 1.0 0.5 0.0 WF1 WF2 WF3 S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 17 / 20
  • 18. Conclusions The primary innovation in the XML compression mechanisms was introduced in XMill by separating the structural part of the XML document from the data part and then group the related data items into homogenous containers that can be compressed separably. Most of the following XML compressors have simulated this idea in different ways. The dominant practice in most of the XML compressors is to utilize the well-known structure of XML documents for applying a pre-processing encoding step and then forwarding the results of this step to general purpose compressors. There are no publicly available solid implementations for grammar-based XML compression techniques and queriable XML compressors. The authors of the XML compressors should provide more attention to provide the source code of their implementations available. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 18 / 20
  • 19. Conclusions We believe that this paper could be valuable for both the developers of new XML compression tools and interested users as well. For developers, they can use the results of this paper to effectively decide on the points which can be improved in order to make an effective contribution. We recommend tackling the area of developing stable efficient queriable XML compressors. Although there has been a lot of literature presented in this domain, we are still missing efficient, scalable and stable implementations in this domain. For users, this study could be helpful for making an effective decision to select the suitable compressor for their requirements. For users with highest compression ratio requirement, the results of our experiments recommend the usage of either the PPM compressor with the highest level of compression parameter or the XWRT compressor with the highest level of compression parameter. For users with fastest compression time and moderate compression ratio requirements, gzip and XMillGzip are considered to be the best choice. S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 19 / 20
  • 20. The End Thank You S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 20 / 20