Data model for analysis of scholarly documents in the MapReduce paradigm

Adam Kawa   Lukasz Bolikowski   Artur Czeczko   Piotr Jan Dendek   Dominika Tkaczyk

                   Centre for Open Science (CeON), ICM UW


                          Warsaw, July 6, 2012




  (CeON ICM UW)           Apache Hadoop in CeON ICM UW      Warsaw, July 6 2012   1 / 19
Agenda




 1   Problem definition
 2   Requirements specification
 3   Exemplary solutions based on Apache Hadoop Ecosystem tools




The data that we possess



Vast collections of scholarly documents to store
     10 million full texts
     (PDF, plain text)

     17 million document metadata records
     (described in the XML-based BWMeta format)

     4 TB of data
     (10 TB including data archives)




The tasks that we perform


Big knowledge to extract and discover
    17 million document metadata records (XML)
          contain title, subtitles, abstract, keywords, references, contributors and
          their affiliations, publishing journal, . . .
    input for many state-of-the-art machine learning algorithms
          relatively simple ones: searching documents with a given title, finding
          scientific teams, . . .
          quite complex ones: author name disambiguation, bibliometrics,
          classification code assignment, . . .




The requirements that we have specified


Multiple demands regarding storage and processing of large amounts of data:

     scalability and parallelism — easily handle tens of terabytes of data and
     parallelize the computation effectively
     flexible data model — possibility to add or update data, and enhance its
     content with implicit information discovered by our algorithms
     latency requirements — support batch offline processing as well as
     random, realtime read/write requests
     availability of many clients — accessible by programmers and researchers
     with diverse language preferences and expertise
     reliability and cost-effectiveness — ideally open-source software which
     does not require expensive hardware




Document-related data as linked data
Information about document-related resources can be simply described as a directed
labeled graph

     entities (e.g. documents, contributors, references) are nodes in the graph
     relationships between entities are directed labeled edges in the graph




Linked graph as a collection of RDF triples
A directed labeled graph can be simply represented as a collection of RDF triples

     a triple consists of subject, predicate and object
     a triple represents a statement which denotes that a resource (subject) holds
     a value (object) for some attribute (predicate) of that resource
     a triple can represent any statements about any resource
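The triple model above can be sketched in a few lines of plain Python; the URIs and predicate names below are hypothetical, chosen only for illustration, not real CeON data:

```python
# RDF-style (subject, predicate, object) triples.
triples = [
    ("doc:123", "dc:title", "Data model for scholarly documents"),
    ("doc:123", "dc:creator", "person:kawa"),
    ("doc:123", "cites", "doc:456"),
    ("person:kawa", "foaf:name", "Adam Kawa"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(triples, s="doc:123"))  # every statement about doc:123
print(match(triples, p="cites"))    # every citation edge
```

Because every attribute of every resource is just another triple, queries reduce to pattern matching over one uniform collection.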




Hadoop as a solution for scalability/performance issues
Apache Hadoop is the most commonly used open-source solution for storing and
processing big data in a reliable, high-performance and cost-effective way.

    Scalable storage
    Parallel processing
    Subprojects and many Hadoop-related projects
          HDFS — distributed file system that provides high-throughput access
          to large data
          MapReduce — framework for distributed processing of large data sets
          (Java and e.g. JavaScript, Python, Perl, Ruby via Streaming)
          HBase — scalable, distributed data store with flexible schema, random
          read/write access and fast scans
          Pig/Hive — higher-level abstractions on top of MapReduce (simple
          data manipulation languages)
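The MapReduce contract listed above can be illustrated with a toy in-process simulation (plain Python, not the Hadoop API; the record shape and keyword-counting task are assumptions for illustration):

```python
from collections import defaultdict

def map_keywords(record):
    # map phase: emit one (keyword, 1) pair per keyword
    for kw in record.get("keywords", []):
        yield (kw, 1)

def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:               # map
        for key, value in mapper(record):
            groups[key].append(value)    # shuffle: group values by key
    return {k: reducer(k, vs) for k, vs in sorted(groups.items())}  # reduce

records = [{"keywords": ["hadoop", "rdf"]}, {"keywords": ["hadoop"]}]
counts = run_mapreduce(records, map_keywords, lambda k, vs: sum(vs))
print(counts)  # {'hadoop': 2, 'rdf': 1}
```

On a real cluster the grouping step is the distributed shuffle and sort; the programmer supplies only the mapper and reducer.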




Apache Hadoop Ecosystem tools as RDF triple stores

   SHARD [3] — a Hadoop-backed RDF triple store
        stores triples in flat files in HDFS
        data cannot be modified randomly
        less efficient for queries that require the inspection of only a small
        number of triples
   PigSPARQL [6] — translates SPARQL queries to Pig Latin programs and
   runs them on a Hadoop cluster
        stores RDF triples with the same predicate in separate, flat files in
        HDFS
   H2RDF [5] — an RDF store that combines MapReduce with HBase
        stores triples in HBase using three flat-wide tables
   Jena-HBase [4] — an HBase-backed RDF triple store
        provides six different pluggable HBase storage layouts


HBase as a storage layer for RDF triples

Storing RDF triples in Apache HBase has several advantages

     flexible data model — columns can be dynamically added and removed;
     multiple versions of data in a particular cell; data serialized to a byte array
     random read and write — more suitable for semi-structured RDF data
     than HDFS, where files cannot be modified randomly and usually the whole
     file must be read sequentially to find a subset of records
     availability of many clients
          interactive clients — native Java API, REST or Apache Thrift
          batch clients — MapReduce (Java), Pig (PigLatin) and Hive
          (HiveQL)
     automatically sorted records — quick lookups and partial scans; joins as
     fast (linear) merge-joins
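The last point — sorted records enabling linear merge-joins — can be sketched as follows (plain Python over hypothetical data, standing in for two sorted HBase scans):

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs, each already sorted by key,
    in a single linear pass over both inputs."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # gather the run of equal keys on each side, emit the cross product
            i2, j2 = i, j
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            while j2 < len(right) and right[j2][0] == lk:
                j2 += 1
            out.extend((lk, lv, rv)
                       for _, lv in left[i:i2] for _, rv in right[j:j2])
            i, j = i2, j2
    return out

titles = [("doc:1", "Title A"), ("doc:2", "Title B")]
authors = [("doc:1", "person:x"), ("doc:3", "person:y")]
print(merge_join(titles, authors))  # [('doc:1', 'Title A', 'person:x')]
```

Neither input is re-sorted or indexed: the sortedness HBase maintains for free is what makes the join linear.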



Exemplary HBase schema — Flat-wide layout

   Advantages
        no prior knowledge about data is required
        colocation of all information about a resource within a single row
        support of multi-valued properties
        support of reified statements (statements about statements)
   Disadvantages
        unlimited number of columns
        increase of storage space
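A minimal sketch of the flat-wide idea, with plain dicts standing in for HBase rows and column qualifiers (identifiers are hypothetical):

```python
from collections import defaultdict

def to_flat_wide(triples):
    # one row per subject; column qualifier = predicate; cell = value list
    table = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        table[s][p].append(o)
    return table

triples = [
    ("doc:1", "dc:title", "A"),
    ("doc:1", "dc:subject", "hadoop"),
    ("doc:1", "dc:subject", "rdf"),   # multi-valued property
]
row = to_flat_wide(triples)["doc:1"]
print(dict(row))  # {'dc:title': ['A'], 'dc:subject': ['hadoop', 'rdf']}
```

All statements about a subject land in one row, which is exactly the colocation advantage; the column set grows with the predicate vocabulary, which is the storage disadvantage.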




Exemplary HBase schema — Vertically Partitioned layout [1]

    Advantages
         support of multi-valued properties
         support of reified statements (statements about statements)
         storage space savings when compared to the previous layout
         first-step (predicate-bound) pairwise joins as fast merge-joins
    Disadvantages
         increased number of joins
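The vertically partitioned idea can be sketched the same way: one sorted table per predicate, keyed by subject, so predicate-bound joins reduce to merge-joins (plain Python, hypothetical data):

```python
from collections import defaultdict

def to_vertical(triples):
    # one table per predicate; each table holds (subject, object) rows,
    # kept sorted by subject
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}

triples = [
    ("doc:2", "cites", "doc:9"),
    ("doc:1", "dc:title", "A"),
    ("doc:1", "cites", "doc:9"),
]
tables = to_vertical(triples)
print(tables["cites"])  # [('doc:1', 'doc:9'), ('doc:2', 'doc:9')]
```

A query touching k predicates now reads k small tables and joins them, which is where the increased join count comes from.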




Exemplary HBase schema — Hexastore layout [2]
   Advantages
       support of multi-valued properties
       support of reified statements (statements about statements)
       first-step pairwise joins as fast merge-joins
   Disadvantages
        increased number of joins
        increase of storage space
        more complicated update operations
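A sketch of the hexastore idea: each triple is written under all six component orderings, trading storage and update cost for prefix-scannable access paths for every query pattern (plain Python, hypothetical identifiers):

```python
def hexastore_keys(s, p, o):
    # materialize one index key per ordering of (subject, predicate, object)
    parts = {"s": s, "p": p, "o": o}
    orders = ("spo", "sop", "pso", "pos", "osp", "ops")
    return [(order, tuple(parts[c] for c in order)) for order in orders]

for index, key in hexastore_keys("doc:1", "cites", "doc:9"):
    print(index, key)
# e.g. the "pos" index holds ('cites', 'doc:9', 'doc:1'), so a
# (?, cites, doc:9) pattern becomes a prefix scan; the price is six
# writes (and six deletes) per triple update and roughly 6x storage.
```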




HBase schema — other layouts




Some derivative and hybrid layouts exist that combine the advantages of the
original layouts
     a combination of the vertically partitioned and the hexastore layout [4]
     a combination of the flat-wide and the vertically partitioned layouts [4]




Challenges




   a large number of join operations
        relatively expensive
        and practically cannot be avoided (at least for more complex queries)
        but specialized join techniques can be used e.g. multi join, merge-sort
        join, replicated join, skewed join
    lack of native support for cross-row atomicity (e.g. in the form of
    transactions)




Possible performance optimization techniques



    property tables — properties often queried together are stored in the same
    record for quick access [8, 9]
    materialized path expressions — precalculation and materialization of
    the most commonly used paths through an RDF graph in advance [1, 2]
    graph-oriented partitioning scheme [7]
         take advantage of the spatial locality inherent in graph pattern
         matching
         higher replication of data that is on the border of any particular
         partition (however, problematic for a graph that is modified)
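The property-table idea from the first bullet can be sketched as clustering a chosen set of co-queried predicates into one record per subject (plain Python, hypothetical predicates; single-valued properties assumed for brevity):

```python
def build_property_table(triples, predicates):
    # cluster the chosen predicates into one record per subject, so a
    # query touching all of them needs one row read instead of several joins
    table = {}
    for s, p, o in triples:
        if p in predicates:
            table.setdefault(s, {})[p] = o
    return table

triples = [
    ("doc:1", "dc:title", "A"),
    ("doc:1", "dc:date", "2012"),
    ("doc:1", "cites", "doc:9"),   # not in the chosen property set
]
pt = build_property_table(triples, {"dc:title", "dc:date"})
print(pt["doc:1"])  # {'dc:title': 'A', 'dc:date': '2012'}
```

Predicates outside the chosen set would still live in the base layout; the table is purely a read optimization for a known workload.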




The ways of processing data from HBase

Many tools are integrated with HBase and can read data from and write
data to HBase tables
     Java MapReduce
          possibility to use our legacy Java code in map and reduce methods
          delivers better performance than Apache Pig
     Apache Pig
          provides common data operations (e.g. filters, unions, joins, ordering)
          and nested types (e.g. tuples, bags, maps)
          supports multiple specialized join implementations
          possibility to run MapReduce jobs directly from Pig Latin scripts
          can be embedded in Python code
     Interactive clients (e.g. Java API, REST or Apache Thrift)
          interactive access to a relatively small subset of our data by sending API
          calls on demand, e.g. from a web-based client


Case study: author name disambiguation algorithm




The most complex algorithm that we have run over Apache HBase so far is the
author name disambiguation algorithm.




Thanks!




                      More information about CeON:
                      http://ceon.pl/en/research




    © 2012 Adam Kawa. This document is available under the Creative Commons Attribution 3.0 Poland license.
              License text available at: http://creativecommons.org/licenses/by/3.0/pl/




[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable
Semantic Web Data Management Using Vertical Partitioning. In VLDB, pages
411–422, 2007.

[2] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for
Semantic Web Data Management. In VLDB, pages 1008–1019, 2008.

[3] K. Rohloff and R. Schantz. High-Performance, Massively Scalable Distributed
Systems Using the MapReduce Software Framework: The SHARD Triple-Store.
In International Workshop on Programming Support Innovations for Emerging
Distributed Applications, 2010.

[4] V. Khadilkar, M. Kantarcioglu, P. Castagna, and B. Thuraisingham.
Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. Technical
report, 2012. http://www.utdallas.edu/~vvk072000/Research/Jena-HBase-Ext/tech-report.pdf

[5] N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF:
Adaptive Query Processing on RDF Data in the Cloud. In Proceedings of the
21st International Conference on World Wide Web (WWW demo track),
Lyon, France, 2012.

[6] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen. PigSPARQL: Mapping
SPARQL to Pig Latin. In 3rd International Workshop on Semantic Web
Information Management (SWIM 2011), in conjunction with the 2011 ACM
International Conference on Management of Data (SIGMOD 2011), Athens,
Greece.

[7] J. Huang, D. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF
Graphs. VLDB Endowment, Volume 4 (VLDB 2011).

[8] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage
and Retrieval in Jena2. In SWDB, pages 131–150.

[9] K. Wilkinson. Jena Property Table Implementation. In SSWS, 2006.
