SlideShare a Scribd company logo
1 of 41
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas
and Friends
RDF and the Hadoop Ecosystem
Rob Vesse
Twitter: @RobVesse
Email: rvesse@gmail.com
C O M P U T E | S T O R E | A N A L Y Z E
About Me
● Software Engineer at Cray Inc
● Working on:
● RDF and SPARQL
● Big Data Analytics
● Active open source contributor
● Apache Jena
● dotNetRDF
● Minor contributions to other Apache projects
● Assorted other bits and pieces on my GitHub and BitBucket
● Primarily interested in intersection of RDF/SPARQL world
with rest of Big Data world
C O M P U T E | S T O R E | A N A L Y Z E
Talk Overview
● What's missing in the Hadoop ecosystem?
● What's already available?
● Apache Jena Elephas
● Intel Graph Builder
● Other interesting projects
● Getting Involved
● Questions
C O M P U T E | S T O R E | A N A L Y Z E
What's missing in the Hadoop
ecosystem?
Apache, the projects and their logo shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
C O M P U T E | S T O R E | A N A L Y Z E
Where's RDF?
● No first class projects
● Some very limited support in other projects
● Giraph can support RDF by bridging through the Tinkerpop 2
stack
● Few existing projects
● Mostly academic proofs of concept (POC)
● Some open source efforts but often task specific
● e.g. Infovore targeted at creating curated Freebase and DBPedia
datasets
C O M P U T E | S T O R E | A N A L Y Z E
What's needed for RDF?
● Minimum Viable Product
● Standard Writable implementations for primitives
● Input and Output support
● Would be nice to have:
● Tools for translating data to and from RDF
● Integration with the common analytic frameworks
● e.g. Spark, Giraph, Hive, Pig
C O M P U T E | S T O R E | A N A L Y Z E
What's already available?
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Background
● Started as a POC at Cray
● Donated to the Apache Jena
project 1st April 2014
● JENA-666
● Originally known as Hadoop
RDF Tools
● Renamed to Elephas in
December 2014
● Name was suggested by
Claude Warren
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - What is it?
● Set of modules part of the Apache Jena project
● Currently only developer SNAPSHOT builds available
● Will be included as part of upcoming Jena 2.13.0 release
● Aims to fulfill all the basic requirements for enabling RDF
on Hadoop
● Built against Hadoop 2.x APIs
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - How do I use it?
● Read the documentation
● http://jena.apache.org/documentation/hadoop/
● Add appropriate Maven dependencies to your code
● http://jena.apache.org/documentation/hadoop/artifacts.html
● Will also need to declare relevant Hadoop dependencies as
"provided"
● Use the APIs as-is for basic tasks or use as starting point
for more complex applications
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Common API
● Provides Writable types for the RDF primitives
● NodeWritable
● TripleWritable
● QuadWritable
● NodeTupleWritable
● An arbitrarily sized tuples of RDF terms
● Backed by RDF Thrift
● A compact binary serialization for RDF using Apache Thrift
● See http://afs.github.io/rdf-thrift/
● Extremely efficient to serialize and de-serialize
● Allows for efficient WritableComparator implementations that
perform comparisons directly on the binary forms
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - IO API
● Provides Hadoop InputFormat and OutputFormat
implementations for RDF
● Covers all RDF serializations Jena supports
● Easily extended with custom formats
● Splits and parallelizes processing of input where the RDF
serialization allows it
● Blank Nodes can be awkward
● Transparently handles compressed IO
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Blank Nodes
● Blank Nodes can be
problematic
● Need to consistently assign
IDs in parallel
● However you will typically
produce multiple
intermediate output files in
multi-job workflows
● Thus need to allow for
document versus globally
scoped IDs
● Configuration setting
controls this
● See documentation for
more information
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Map/Reduce API
● Various reusable basic Mapper and Reducer
implementations
● Covers common tasks:
● Counting
● Filtering
● Grouping
● Splitting
● Transformation
● Mostly intended for use as a starting point
● Some of these are bundled into a RDF stats demo
application
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Example Job
● Node Count (aka word count for RDF)
● All the classes referenced (bar Example.class) are provided by Elephas
Job job = Job.getInstance(config);
job.setJarByClass(Example.class);
job.setJobName("RDF Triples Node Usage Count");
// Map/Reduce classes
job.setMapperClass(TripleNodeCountMapper.class);
job.setMapOutputKeyClass(NodeWritable.class);
job.setMapOutputValueClass(LongWritable.class);
job.setReducerClass(NodeCountReducer.class);
// Input and Output
job.setInputFormatClass(NTriplesInputFormat.class);
job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
FileInputFormat.setInputPath(job, new Path("/inputs/rdf"));
FileOutputFormat.setOutputPath(job, new Path("/outputs/rdf"));
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Node Count
Demo
See end of slide deck for steps to run the demo
and screenshots
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Performance Notes
● For NTriples inputs we compared performance of a Text
based node count versus RDF based node count
● Performance typically as good (within 10%) and
sometimes significantly better
● Heavily dataset dependent
● Varies considerably with cluster setup
● Also depends on how the input is processed
● Be aware YMMV!
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - What is it?
● Tools for transforming/creating large graphs
● Developed by Intel
● Cray has some proposed improvements that are awaiting
merging at time of writing
● Open source under Apache License
● https://github.com/01org/graphbuilder/tree/2.0.alpha
● 2.0.alpha is the preferred branch
● See https://github.com/cray/graphbuilder for the version
discussed here
● Allows graphs to be created/transformed from arbitrary
data sources using Apache Pig
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - How do I use it?
● REGISTER the Graph Builder JAR in your Pig script
● May optionally want to IMPORT the pig/graphbuilder.pig
script which aliases some of the provided UDFs
● LOAD your data
● Use the provided UDFs to generate a graph
● Can create both property graphs and RDF
● Currently data must be mapped to a property graph and then
into RDF
● STORE the resulting graph
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - How it works?
● Uses a declarative mapping based on Pig primitives
● Has to be explicitly joined to the data
● Limitation of Pig UDFs
● RDF mappings operate on property graphs
● Must map data to a property graph first
● Direct mapping to RDF is a possible future enhancement
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - Pig Script Example
https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig
-- Rest of script omitted for brevity
-- Declare our mappings
propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*,
[ 'idBase' # 'http://example.org/instances/',
'base' # 'http://example.org/ontology/',
'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ],
'propertyMap' # [ 'type' # 'a',
'name' # 'foaf:name',
'age' # 'foaf:age' ],
'uriProperties' # ( 'type' ),
'idProperty' # 'id' ]);
-- Convert to NTriples
rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*));
-- Write out NTriples
STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - RDF
Generation Demo
See end of slide deck for steps to run the demo
and screenshots
C O M P U T E | S T O R E | A N A L Y Z E
Other Projects - Infovore
● Framework developed by Paul Houle
● Open source on GitHub
● https://github.com/paulhoule/infovore/wiki
● Apache License 2.0
● Produces a cleaned and curated Freebase dataset using
Hadoop for the processing
● Designed to be easily self-deployed on Amazon EC2
● Also some related projects for working with Wikipedia
● https://github.com/paulhoule/telepath
● Currently unclear what direction these projects will take
after the Freebase shutdown at end of March this year
C O M P U T E | S T O R E | A N A L Y Z E
Other Projects - CumulusRDF
● Academic project from Institute of Applied Informatics
and Formal Description Methods
● https://code.google.com/p/cumulusrdf/
● RDF store backed by Apache Cassandra
● Reasonable performance compared to native RDF stores
● See NoSQL Databases for RDF: An Empirical Evaluation
● Philippe Cudŕe-Mauroux et al
● http://exascale.info/sites/default/files/nosqlrdf.pdf
● Reasonably active development
C O M P U T E | S T O R E | A N A L Y Z E
Getting Involved
C O M P U T E | S T O R E | A N A L Y Z E
How to contribute
● Please download and try out these projects
● Interact with the communities and developers involved
● What works?
● What is broken?
● What is missing?
● How could the documentation be better?
● Contribute
● Open source ultimately lives or dies with community
participation
● If there's a missing feature then suggest it
● Or better still contribute it yourself!
C O M P U T E | S T O R E | A N A L Y Z E
Questions?
Personal Email: rvesse@gmail.com
Apache Jena User List: users@jena.apache.org
These slides will be posted to my SlideShare:
http://www.slideshare.net/RobVesse
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Node Count
Demo
C O M P U T E | S T O R E | A N A L Y Z E
Environment Pre-requisites
● Hadoop 2.x cluster
● Assumes hadoop command is on your PATH
● Download the latest JAR file
● Or build youself from source
● jena-hadoop-rdf-stats-VERSION-hadoop-job.jar
● Upload some RDF data to a HDFS folder
C O M P U T E | S T O R E | A N A L Y Z E
Run the Demo
● --node-count requests the Node Count statistics be calculated
● Assumes mixed quads and triples input if no --input-type specified
● Using this for triples only data can skew statistics
● e.g. can result in high node counts for default graph node
● Hence we explicitly specify input as triples
> hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar
org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output --
input-type triples /user/input
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - RDF
Generation Demo
C O M P U T E | S T O R E | A N A L Y Z E
Environment Pre-requisites
● Pig 0.12
● Should work with higher but not tested
● Assumes pig command is on your PATH
● Clone the Cray version of the Graph Builder code
● https://github.com/cray/graphbuilder
C O M P U T E | S T O R E | A N A L Y Z E
Run the Demo
● Running Pig in local mode for simplicity
● Output goes to /tmp/rdf_triples/
> pig -x local examples/property_graphs_and_rdf.pig
> cat /tmp/rdf_triples/part-m-00000
Apache Jena Elephas and Friends
Apache Jena Elephas and Friends

More Related Content

What's hot

Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldDean Wampler
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Databricks
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017Holden Karau
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Modelssparktc
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Holden Karau
 
SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)andyseaborne
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...Holden Karau
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Holden Karau
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Semantic web meetup – sparql tutorial
Semantic web meetup – sparql tutorialSemantic web meetup – sparql tutorial
Semantic web meetup – sparql tutorialAdonisDamian
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018Holden Karau
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark MLHolden Karau
 

What's hot (20)

Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
SPARQL Cheat Sheet
SPARQL Cheat SheetSPARQL Cheat Sheet
SPARQL Cheat Sheet
 
3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
 
SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Semantic web meetup – sparql tutorial
Semantic web meetup – sparql tutorialSemantic web meetup – sparql tutorial
Semantic web meetup – sparql tutorial
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
 

Similar to Apache Jena Elephas and Friends

New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...Holden Karau
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Holden Karau
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Challenges and patterns for semantics at scale
Challenges and patterns for semantics at scaleChallenges and patterns for semantics at scale
Challenges and patterns for semantics at scaleRob Vesse
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache BeamKnoldus Inc.
 
Slides semantic web and Drupal 7 NYCCamp 2012
Slides semantic web and Drupal 7 NYCCamp 2012Slides semantic web and Drupal 7 NYCCamp 2012
Slides semantic web and Drupal 7 NYCCamp 2012scorlosquet
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingAll Things Open
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...DataStax Academy
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Holden Karau
 
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Neo4j
 
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Holden Karau
 
The Semantic Web and Drupal 7 - Loja 2013
The Semantic Web and Drupal 7 - Loja 2013The Semantic Web and Drupal 7 - Loja 2013
The Semantic Web and Drupal 7 - Loja 2013scorlosquet
 

Similar to Apache Jena Elephas and Friends (20)

New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Apache pig
Apache pigApache pig
Apache pig
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Challenges and patterns for semantics at scale
Challenges and patterns for semantics at scaleChallenges and patterns for semantics at scale
Challenges and patterns for semantics at scale
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
Slides semantic web and Drupal 7 NYCCamp 2012
Slides semantic web and Drupal 7 NYCCamp 2012Slides semantic web and Drupal 7 NYCCamp 2012
Slides semantic web and Drupal 7 NYCCamp 2012
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
New Analytics Toolbox
New Analytics ToolboxNew Analytics Toolbox
New Analytics Toolbox
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
 
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
 
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)
 
The Semantic Web and Drupal 7 - Loja 2013
The Semantic Web and Drupal 7 - Loja 2013The Semantic Web and Drupal 7 - Loja 2013
The Semantic Web and Drupal 7 - Loja 2013
 

Recently uploaded

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 

Recently uploaded (20)

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Apache Jena Elephas and Friends

  • 1. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas and Friends RDF and the Hadoop Ecosystem Rob Vesse Twitter: @RobVesse Email: rvesse@gmail.com
  • 2. C O M P U T E | S T O R E | A N A L Y Z E About Me ● Software Engineer at Cray Inc ● Working on: ● RDF and SPARQL ● Big Data Analytics ● Active open source contributor ● Apache Jena ● dotNetRDF ● Minor contributions to other Apache projects ● Assorted other bits and pieces on my GitHub and BitBucket ● Primarily interested in intersection of RDF/SPARQL world with rest of Big Data world
  • 3. C O M P U T E | S T O R E | A N A L Y Z E Talk Overview ● What's missing in the Hadoop ecosystem? ● What's already available? ● Apache Jena Elephas ● Intel Graph Builder ● Other interesting projects ● Getting Involved ● Questions
  • 4. C O M P U T E | S T O R E | A N A L Y Z E What's missing in the Hadoop ecosystem?
  • 5. Apache, the projects and their logo shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
  • 6. C O M P U T E | S T O R E | A N A L Y Z E Where's RDF? ● No first class projects ● Some very limited support in other projects ● Giraph can support RDF by bridging through the Tinkerpop 2 stack ● Few existing projects ● Mostly academic proofs of concept (POC) ● Some open source efforts but often task specific ● e.g. Infovore targeted at creating curated Freebase and DBPedia datasets
  • 7. C O M P U T E | S T O R E | A N A L Y Z E What's needed for RDF? ● Minimum Viable Product ● Standard Writable implementations for primitives ● Input and Output support ● Would be nice to have: ● Tools for translating data to and from RDF ● Integration with the common analytic frameworks ● e.g. Spark, Giraph, Hive, Pig
  • 8. C O M P U T E | S T O R E | A N A L Y Z E What's already available?
  • 9. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Background ● Started as a POC at Cray ● Donated to the Apache Jena project 1st April 2014 ● JENA-666 ● Originally known as Hadoop RDF Tools ● Renamed to Elephas in December 2014 ● Name was suggested by Claude Warren
  • 10. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - What is it? ● Set of modules part of the Apache Jena project ● Currently only developer SNAPSHOT builds available ● Will be included as part of upcoming Jena 2.13.0 release ● Aims to fulfill all the basic requirements for enabling RDF on Hadoop ● Built against Hadoop 2.x APIs
  • 11. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - How do I use it? ● Read the documentation ● http://jena.apache.org/documentation/hadoop/ ● Add appropriate Maven dependencies to your code ● http://jena.apache.org/documentation/hadoop/artifacts.html ● Will also need to declare relevant Hadoop dependencies as "provided" ● Use the APIs as-is for basic tasks or use as starting point for more complex applications
  • 12. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Common API ● Provides Writable types for the RDF primitives ● NodeWritable ● TripleWritable ● QuadWritable ● NodeTupleWritable ● An arbitrarily sized tuples of RDF terms ● Backed by RDF Thrift ● A compact binary serialization for RDF using Apache Thrift ● See http://afs.github.io/rdf-thrift/ ● Extremely efficient to serialize and de-serialize ● Allows for efficient WritableComparator implementations that perform comparisons directly on the binary forms
  • 13. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - IO API ● Provides Hadoop InputFormat and OutputFormat implementations for RDF ● Covers all RDF serializations Jena supports ● Easily extended with custom formats ● Splits and parallelizes processing of input where the RDF serialization allows it ● Blank Nodes can be awkward ● Transparently handles compressed IO
  • 14. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Blank Nodes ● Blank Nodes can be problematic ● Need to consistently assign IDs in parallel ● However you will typically produce multiple intermediate output files in multi-job workflows ● Thus need to allow for document versus globally scoped IDs ● Configuration setting controls this ● See documentation for more information
  • 15. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Map/Reduce API ● Various reusable basic Mapper and Reducer implementations ● Covers common tasks: ● Counting ● Filtering ● Grouping ● Splitting ● Transformation ● Mostly intended for use as a starting point ● Some of these are bundled into a RDF stats demo application
  • 16. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Example Job ● Node Count (aka word count for RDF) ● All the classes referenced (bar Example.class) are provided by Elephas Job job = Job.getInstance(config); job.setJarByClass(Example.class); job.setJobName("RDF Triples Node Usage Count"); // Map/Reduce classes job.setMapperClass(TripleNodeCountMapper.class); job.setMapOutputKeyClass(NodeWritable.class); job.setMapOutputValueClass(LongWritable.class); job.setReducerClass(NodeCountReducer.class); // Input and Output job.setInputFormatClass(NTriplesInputFormat.class); job.setOutputFormatClass(NTriplesNodeOutputFormat.class); FileInputFormat.setInputPath(job, new Path("/inputs/rdf")); FileOutputFormat.setOutputPath(job, new Path("/outputs/rdf"));
  • 17. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Node Count Demo See end of slide deck for steps to run the demo and screenshots
  • 18. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Performance Notes ● For NTriples inputs we compared performance of a Text based node count versus RDF based node count ● Performance typically as good (within 10%) and sometimes significantly better ● Heavily dataset dependent ● Varies considerably with cluster setup ● Also depends on how the input is processed ● Be aware YMMV!
  • 19. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - What is it? ● Tools for transforming/creating large graphs ● Developed by Intel ● Cray has some proposed improvements that are awaiting merging at time of writing ● Open source under Apache License ● https://github.com/01org/graphbuilder/tree/2.0.alpha ● 2.0.alpha is the preferred branch ● See https://github.com/cray/graphbuilder for the version discussed here ● Allows graphs to be created/transformed from arbitrary data sources using Apache Pig
  • 20. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - How do I use it? ● REGISTER the Graph Builder JAR in your Pig script ● May optionally want to IMPORT the pig/graphbuilder.pig script which aliases some of the provided UDFs ● LOAD your data ● Use the provided UDFs to generate a graph ● Can create both property graphs and RDF ● Currently data must be mapped to a property graph and then into RDF ● STORE the resulting graph
  • 21. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - How it works? ● Uses a declarative mapping based on Pig primitives ● Has to be explicitly joined to the data ● Limitation of Pig UDFs ● RDF mappings operate on property graphs ● Must map data to a property graph first ● Direct mapping to RDF is a possible future enhancement
  • 22. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - Pig Script Example https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig -- Rest of script omitted for brevity -- Declare our mappings propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*, [ 'idBase' # 'http://example.org/instances/', 'base' # 'http://example.org/ontology/', 'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ], 'propertyMap' # [ 'type' # 'a', 'name' # 'foaf:name', 'age' # 'foaf:age' ], 'uriProperties' # ( 'type' ), 'idProperty' # 'id' ]); -- Convert to NTriples rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*)); -- Write out NTriples STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
  • 23. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - RDF Generation Demo See end of slide deck for steps to run the demo and screenshots
  • 24. C O M P U T E | S T O R E | A N A L Y Z E Other Projects - Infovore ● Framework developed by Paul Houle ● Open source on GitHub ● https://github.com/paulhoule/infovore/wiki ● Apache License 2.0 ● Produces a cleaned and curated Freebase dataset using Hadoop for the processing ● Designed to be easily self-deployed on Amazon EC2 ● Also some related projects for working with Wikipedia ● https://github.com/paulhoule/telepath ● Currently unclear what direction these projects will take after the Freebase shutdown at end of March this year
  • 25. C O M P U T E | S T O R E | A N A L Y Z E Other Projects - CumulusRDF ● Academic project from Institute of Applied Informatics and Formal Description Methods ● https://code.google.com/p/cumulusrdf/ ● RDF store backed by Apache Cassandra ● Reasonable performance compared to native RDF stores ● See NoSQL Databases for RDF: An Empirical Evaluation ● Philippe Cudŕe-Mauroux et al ● http://exascale.info/sites/default/files/nosqlrdf.pdf ● Reasonably active development
  • 26. C O M P U T E | S T O R E | A N A L Y Z E Getting Involved
  • 27. C O M P U T E | S T O R E | A N A L Y Z E How to contribute ● Please download and try out these projects ● Interact with the communities and developers involved ● What works? ● What is broken? ● What is missing? ● How could the documentation be better? ● Contribute ● Open source ultimately lives or dies with community participation ● If there's a missing feature then suggest it ● Or better still contribute it yourself!
  • 28. C O M P U T E | S T O R E | A N A L Y Z E Questions? Personal Email: rvesse@gmail.com Apache Jena User List: users@jena.apache.org These slides will be posted to my SlideShare: http://www.slideshare.net/RobVesse
  • 29. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Node Count Demo
  • 30. C O M P U T E | S T O R E | A N A L Y Z E Environment Pre-requisites ● Hadoop 2.x cluster ● Assumes hadoop command is on your PATH ● Download the latest JAR file ● Or build youself from source ● jena-hadoop-rdf-stats-VERSION-hadoop-job.jar ● Upload some RDF data to a HDFS folder
  • 31. C O M P U T E | S T O R E | A N A L Y Z E Run the Demo ● --node-count requests the Node Count statistics be calculated ● Assumes mixed quads and triples input if no --input-type specified ● Using this for triples only data can skew statistics ● e.g. can result in high node counts for default graph node ● Hence we explicitly specify input as triples > hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output -- input-type triples /user/input
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - RDF Generation Demo
  • 38. C O M P U T E | S T O R E | A N A L Y Z E Environment Pre-requisites ● Pig 0.12 ● Should work with higher but not tested ● Assumes pig command is on your PATH ● Clone the Cray version of the Graph Builder code ● https://github.com/cray/graphbuilder
  • 39. C O M P U T E | S T O R E | A N A L Y Z E Run the Demo ● Running Pig in local mode for simplicity ● Output goes to /tmp/rdf_triples/ > pig -x local examples/property_graphs_and_rdf.pig > cat /tmp/rdf_triples/part-m-00000