SlideShare a Scribd company logo
1 of 76
The other Apache
Technologies your Big
Data solution needs!
Nick Burch
CTO, Quanticate
Copyright © Quanticate 2013
Nick Burch
CTO, Quanticate
The Apache Software Foundation
Apache Technologies as in the ASF
117 Top Level Projects
37 Incubating Projects (> 100 past ones)
Y is the only letter we lack
A, C, O, S & T are popular letters, >= 10 each
Meritocratic, Community driven
Open Source
A Growing Foundation...
June 2013:
117 Top Level Projects
37 Incubating Projects
June 2013:
104 Top Level Projects
50 Incubator Projects
June 2011:
91 Top Level Projects
59 Incubating Projects
What we're not covering
Projects not being covered
Cassandra cassandra.apache.org
CouchDB couchdb.apache.org
Drill (Incubating) incubator.apache.org/drill/
Giraph giraph.apache.org
Hadoop hadoop.apache.org
Hive hive.apache.org
HBase hbase.apache.org
Projects not being covered, part 2
Lucene lucene.apache.org
Solr lucene.apache.org/solr/
Lucy lucy.apache.org
Mahout mahout.apache.org
Nutch nutch.apache.org
OODT oodt.apache.org
All covered in other sessions
What we are looking at
Talk Structure
Loading and querying Big Data
Building your MapReduce Jobs
Deploying and Building for the Cloud
Servers for Big Data
Building out your solution
Many projects to cover –
this is only an overview!
Audience Participation:
A show of hands...
Making the most of the talk
Slides will be available after the talk
– See @Gagravarr for the link
Lots of information and projects to cover
Just an overview, see project sites for more
Try to take note / remember the projects you
think will matter to you
Don't try to take notes on
everything – there's too much!
Loading and Querying
Pig – pig.apache.org
Originally from Yahoo, entered the Incubator
in 2007, graduated 2008, very widely used
Provides an easy way to query data, which is
compiled into Hadoop M/R (not SQL-like)
Typically 1/20th
of the lines of code, and 1/15th
of the development time
Optimising compiler – often only
slower, occasionally faster!
Pig – pig.apache.org
Shell, scripting and embedded Java
Local mode for development
Built-ins for loading, filtering, joining,
processing, sorting and saving
User Defined Functions too
Similar range of operations as SQL, but
quicker and easier to learn
Allows non coders to easily query
Gora – gora.apache.org
ORM Framework for (NoSQL) Column Stores
Grew out of the Nutch project
Supports HBase, Cassanda, Hypertable
K/V: Voldermort, Redis
HDFS Flat Files, plus basic SQL
Data is stored in Avro (more later)
Query with Pig, Lucene, Hive, Cascading,
Hadoop M/R, or native Store code
Gora – gora.apache.org
Example: Web Server Log
Avro data bean, JSON
{
"type": "record",
"name": "Pageview",
"namespace": "org.apache.gora.tutorial.log.generated",
"fields" : [
{"name": "url", "type": "string"},
{"name": "timestamp", "type": "long"},
{"name": "ip", "type": "string"},
{"name": "httpMethod", "type": "string"},
{"name": "httpStatusCode", "type": "int"},
{"name": "responseSize", "type": "int"},
{"name": "referrer", "type": "string"},
{"name": "userAgent", "type": "string"}
]
}
Gora
// ID is a long, Pageview is compiled Avro bean
dataStore = DataStoreFactory.getDataStore(Long.class, Pageview.class);
// Parse the log file, and store
while(going) {
Pageview page = parseLine(reader.readLine());
dataStore.put(logFileId, page);
}
DataStore.close();
private Pageview parseLine(String line) throws ParseException {
StringTokenizer matcher = new StringTokenizer(line);
//parse the log line
String ip = matcher.nextToken();
...
//construct and return pageview object
Pageview pageview = new Pageview();
pageview.setIp(new Utf8(ip));
pageview.setTimestamp(timestamp);
...
return pageview;
}
Accumulo – accumulo.apache.org
Distributed Key/Value store, built on top of
Hadoop, Zookeeper and Thrift
Inspired by BigTable, with some
improvements to the design
Cell level permissioning (access labels),
write constraints, and server side hooks to
tweak data as it's read/written
Supports v.large rows
Initial work mostly done by the NSA!
Accumulo – accumulo.apache.org
High availability (uses Zookeeper)
Distributed WAL via loggers, later to HDFS
Logical time – tablets can have wrong dates!
FATE – Fault Tolerant Executions
Testing Friendly!
Mocks, Functional Tests, Scaling Tests, and
random walk tests
Admin support OOTB
Sqoop – sqoop.apache.org (no u!)
Bulk data transfer tool for big data systems
Hadoop (HDFS), HBase and Hive on one side
SQL Databases on the other
Can be used to import existing data into your
big data cluster
Or, export the results of a big data
job out to your data wharehouse
Generates code to work with data
Chukwa (Incubating)
Log collection and analysis framework based
on Hadoop
Incubating since 2010
Collects and aggregates logs from many
different machines
Stores data in HDFS, in chunks that are both
HDFS and Hadoop friendly
Lets you dump, query and analyze
Chukwa (Incubating)
Chukwa agent runs on source nodes
Collects from Log4j, Syslog, plain text log
files etc
Agent sends to a Collector on the Hadoop
cluster
Collector can transform if needed
Data written to HDFS, and optionally
to HBase (needed for visualiser)
Chukwa (Incubating)
Map/Reduce and Pig query the HDFS files,
and/or the HBase store
Can do M/R anomaly detection
Can integrate with Hive
eg Netflix collect weblogs with Chukwa,
transform with Thrift, and store in HDFS
ready for Hive queries
Flume – flume.apache.org
Another Log collection framework
Concentrates on rapidly collecting,
aggregating and moving large amounts of
logs from many sources to a central place
Typically write to HDFS + Hive + FTS
Source, Channels and Sinks
Source consumes events, eg
log file entries, avro data
Flume – flume.apache.org
Source places into a 1 or more channels,
which can be persistent or in-memory
Sink pulls events from channel, then stores
Can chain together, fan-in or fan-out
Data and Control planes independent
Does more OOTB than Chukwa
Adding source+sink easy
Less scope to customise otherwise
Flume – flume.apache.org
Another Log collection framework
Concentrates on rapidly getting data to a
variety of sources
Typically write to HDFS + Hive + FTS
Joint Agent+Collector model
Data and Control planes independent
More OOTB, less scope to alter
Blur (Incubating)
Big Data search for structured data
Hadoop + HDFS + Lucene + Thrift +
Zookeeper
Map/Reduce to process into thrift data + index
Blur distributes index shards across cluster
Multiple controllers for fault tolerance
Automatic Shard Fallover
Web based query interface
Falcon (Incubating)
Data LifeCycle Management for Hadoop
Provides support to applications to implement
the key parts – import, export, cleaning,
duplication, recovery, retention etc
Declarative definitions, Falcon does deps etc
Controlled replication, for example only outputs
of certain stages in the flow
Lots of tests for the key parts
Hama – hama.apache.org
Bulk Synchronous Parallel (BSP) on HDFS
Massive scale scientific computions, eg Matrix,
Graph and Network algorithms
BSP can be better than M/R or MPI
Message Passing based applications
No deadlocks or collisions
Flexible, simple API
High performance
MRQL – mrql.apache.org
MRQL, pronounced as Miracle
Large scale data analytics engine
Works on top of Hadoop or Hama
SQL-like query language
Supports more complex types and constructs
than Pig or Hive, so you can avoid having to
write explicit M/R code
Page Rank, k-means, matrix etc
Tajo (Incubating)
Data warehouse system for Hadoop
Low-latency and scalable ad-hoc queries
HDFS for storage, custom query engine which
controls execution, optimisation and data flow
Runs SQL queries, inc filter, group, sort, join
Some ETL for data format translation
Java API to submit queries
Command line tools
Building MapReduce Jobs
Avro – avro.apache.org
Language neutral data serialization
Rich data structures (JSON based)
Compact and fast binary data format
Code generation optional for dynamic
languages
Supports RPC
Data includes schema details
Avro – avro.apache.org
Schema is always present – allows dynamic
typing and smaller sizes
Java, C, C++, C#, Python, Perl, Ruby, PHP
Different languages can transparently talk to
each other, and make RPC calls to each
other
Often faster than Thrift and ProtoBuf
No streaming support though
Thrift – thrift.apache.org
Java, C++, Python, PHP, Ruby, Erlang, Perl,
Haskell, C#, JS, Cocoa, OCaml and more!
From Facebook, at Apache since 2008
Rich data structure, compiled down into
suitable code
RPC support too
Streaming is available
Worth reading the White Paper!
HCatalog (Just joined Hive)
Provides a table like structure on top of HDFS
files, with friendly addressing
Allows Pig, Hadoop MR jobs etc to easily
read/write data structured data
Simpler, lighter weight than Avro or Thrift
based serialisation
Based on Hive's metastore format
Doesn't require an additional datastore
MRUnit – mrunit.apache.org
Built on top of JUnit
Checks Map, Reduce, then combined
Provides test drivers for Hadoop
Avoids you needing lots of boiler plate code
to start/stop Hadoop
Avoids brittle mock objects
Handles multiple input K/Vs
Counter checking
MRUnit
IdentityMapper – same input/output
public class TestExample extends TestCase {
private Mapper mapper;
private MapDriver driver;
@Before
public void setUp() {
mapper = new IdentityMapper();
driver = new MapDriver(mapper);
}
@Test
public void testIdentityMapper() {
// Pass in { “foo”, “bar” }, ensure it comes back again
driver.withInput(new Text("foo"), new Text("bar"))
.withOutput(new Text("foo"), new Text("bar"))
.runTest();
assertEquals(1, driver.getCounters().findCounter(“foo”,”bar”));
}
}
Hadoop Development Tools (Incbtr)
Set of Eclipse plugins for Hadoop
Work with multiple versions of Hadoop from
within the same Eclipse
Quite a new project, lot still “TODO”
Makes it easy to build Hadoop programs
Query the cluster
Submit jobs, then track them
Remote debug jobs
Oozie – oozie.apache.org
Workflow, scheduler and dependency
manager for Hadoop jobs (inc Pig etc)
Define a workflow to describe the data flow
for your desired output
Oozie handles running dependencies as
needed, and scheduled execution of steps
as requested
Builds up a data pipe, then
executes as required on a cloud scale
BigTop – bigtop.apache.org
Build, Package and Test code built on top of
Hadoop related projects
Allows you to check that a given mix of say
HDFS, Hadoop Core and ZooKeeper work
well together
Then package a tested bundle
Integration testing of your stack
Test upgrades, generate packages
Ambari (Incubating)
Monitoring, Admin and LifeCycle
management for Hadoop clusters
eg HBase, HDFS, Hive, Pig, ZooKeeper
Deploy+Configure stack to a cluster of
machines
Update software stack versions
Monitoring and Service Admin
REST APIs for cluster management
For the Cloud
Provider Independent Cloud APIs
Lets you provision, manage and query Cloud
services, without vendor lock-in
Translates general calls to the specific (often
proprietary) ones for a given cloud provider
Work with remote and local cloud providers
(almost) transparently
Provider Independent Cloud APIs
Create, stop, start, reboot and destroy
instances
Control what's run on new instances
List active instances
Fetch available and active profiles
EC2, Eycalyptos, Rackspace, RHEV,
vSphere, Linode, OpenStack
LibCloud – libcloud.apache.org
Python library (limited Java support)
Very wide range of providers
Script your cloud services
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver
EC2_ACCESS_ID = 'your access id'
EC2_SECRET_KEY = 'your secret key'
Driver = get_driver(Provider.EC2)
conn = Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)
nodes = conn.list_nodes()
# [<Node: uuid=..., state=3, public_ip=['1.1.1.1'],
provider=EC2 ...>, ...]
DeltaCloud – deltacloud.apache.org
REST API (xml) + web portal
Major Cloud Providers, RHEV-M, vSphere
<instances>
<instance href="http://fancycloudprovider.com/api/instances/inst1" id='inst1'>
<owner_id>larry</owner_id>
<name>Production JBoss Instance</name>
<image href="http://fancycloudprovider.com/api/images/img3"/>
<hardware_profile href="http://fancycloudprovider.com/api/hardware_profiles/m1-small"/>
<realm href="http://fancycloudprovider.com/api/realms/us"/>
<state>RUNNING</state>
<actions>
<link rel="reboot" href="http://fancycloudprovider.com/api/instances/inst1/reboot"/>
<link rel="stop" href="http://fancycloudprovider.com/api/instances/inst1/stop"/>
</actions>
<public_addresses>
<address>inst1.larry.fancycloudprovider.com</address>
</public_addresses>
<private_addresses>
<address>inst1.larry.internal</address>
</private_addresses>
</instance>
</instances>
Provisionr (Incubating)
Higher Level than DeltaCloud or LibCloud
Semi-Automated workflows to provision
“Build me a cluster on these assumptions”
Cloud provider independent
Builds on Puppet, lets you work higher level
Handles retries, throttling etc
Can work on multiple clouds
CloudStack – cloudstack.apache.org
Software to run an IaaS cloud computing
platform (on demand, elastic etc)
Can run public, private or hybrid clouds
Entire stack to run an IaaS
Hypervisors: VMware, KVM, XenServer, XCP
Web interface, CLI or RESTful to manage
API compatible with EC2 for hybrid
Servers for Big Data
ZooKeeper – zookeeper.apache.org
Centralised service for configuration, naming
and synchronisation
Provides Consensus, Group Mangement
and Presence tracking
Single co-ordination service across all the
different components
ZooKeeper is distributed and highly
reliable (avoid config being SPOF)
ZooKeeper – zookeeper.apache.org
“A central nervous system for distributed
applications and services”
Bindings for Java, C, Perl, Python, Scala,
.Net (C#), Node.js, Erlang
Applications can read nodes, send events,
and watch for them
eg fetch config, come up, perform leader
election, share active list
Curator (Incubating)
“A Zookeeper Keeper”
Provides implementations of common things
you want to do using Zookeeper
Improved API on top of Zookeeper, eg with
retries, error cases, cleaner API
Test support – in memory ZK server, cluster
Provides SOA service discovery
built using Zookeeper
Airavata – airavata.apache.org
Toolkit to build, manage, execute and monitor
large scale applications
Workflow driven processing
Historically aimed at Scientific Processing,
but now expanding
Facilities for developer to deploy
End user launches, workflow manages, then
able to monitor eg via widgets
Whirr - whirr.apache.org
Grew out of Hadoop
Aimed at running Hadoop, Cassandra,
HBase and ZooKeeper
Higher level – running services not running
machines
Java (jclouds) and Python versions
Can be run on the command line
ACE – ace.apache.org
Software Distribution Framework
Built on OSGi, but can work for anything
“Puppet for OSGi”
Tracks and installs components and deps
Group components, set targets to deploy to
Can do QA, staging, production flow
Web based management
River – river.apache.org
Toolkit for building distributed systems
Communicate over JERI, which works over
TCP, SSL, HTTP, HTTPS, Kerberos, JRMP
Can do dynamic discovery
Uses JINI for naming/addressing
Not as fully featured as ZooKeeper
Better for JINI types?
Helix (Incubating)
Cluster Management Framework
State Machine based, describe in terms of
transitions and states
For partitioned, replicated and distributed
resources across clusters of nodes
Sets up the nodes in the cluster
System tasks, eg maintenance, backups
“Global Brain for the system”
Knox (Incubating)
Secure Gateway to a Hadoop Cluster
Perimeter Security & Authentication
Single endpoint for users to connect to
Knox Gateway sits in DMZ, provides the
choke-point for cluster access
Hadoop sites behind, protected
Allows users access
Allows admin access
Mesos (Incubating)
Like virtualisation on top of your cluster
Resource Sharing and Isolation
Hadoop, MPI, Hypertable, Spark
Can run multiple copies of each, eg production,
ad-hoc and testing versions of Hadoop
Can manage long-running (eg HBase) at the
same time as batch
Zookeeper based
Building out your Solution
Tika – tika.apache.org
Text and Metadata extraction
Identify file type, language, encoding
Extracts text as structured XHTML
Consistent Metadata across formats
Java library, CLI and Network Server
SOLR integration
Handles format differences for you
UIMA – uima.apache.org
Unstructured Information analysis
Lets you build a tool to extract information
from unstructured data
Components in C++ and Java
Network enabled – can spread work out
across a cluster
Helped IBM to win Jeopardy!
OpenNLP – opennlp.apache.org
Natural Language Processing
Various tools for sentence detection,
tokenization, tagging, chunking, entity
detection etc
UIMA likely to be better if you want a whole-
solution
OpenNLP good when integrating
NLP into your own solution
cTakes – ctakes.apache.org
Clinical Text Analysis & Knowledge Extraction
Builds on top of UIMA and OpenNLP
Extracts information from free text in electronic
medical records (and OCR'd paper)
Identifies named entities from common or
custom dictionaries (eg UMLS)
For each, produce attributes, eg
subject, mapping, context
MINA – mina.apache.org
Framework for writing scalable, high
performance network apps in Java
TCP and UDP, Client and Server
Build non blocking, event driven networking
code in Java
MINA also provides pure Java SSH, XMPP,
Web and FTP servers
Etch – etch.apache.org
Framework for building, producing and
consuming network services.
Cross platform, language and transport
Java, C#, C, C++, Go, JS, Python
Produce a formal description of the exchange
1-way, 2-way and realtime communication
Scalable, high performance
Heterogeneous systems
ActiveMQ / Qpid / Synapse
Messaging, Queueing and Brokerage
solutions across most languages
Decide on your chosen message format,
endpoint languages and messaging needs,
one of these three will likely fit your needs!
Queues, Message Brokers, Enterprise
Service Buses, high performance
and yet also buzzword compliant!
Kafka (Incubating)
Distributed push-subscribe (pub-sub)
messaging system
Allows high throughput on one system,
partitionable for many
More general than log based systems such as
Flume
Supports persisted messages, clients can catch
up
Distributed, remote sources and sinks
S4 (Incubating)
Simple Scalable Streaming System
Platform for building tools to work on
continuous, unbounded data streams
Stream is broken into events, which are
routed to PEs and processed
Uses Actors model, highly concurrent
PEs are Java based, but event sources and
sinks can be in any language
Commons – commons.apache.org
Collection of libraries for Java projects
Some historic, many still useful!
Commons CLI – parameters / options
Commons Codec – Base64, Hex, Phonetic
Commons Compress – zip, tar, gz, bz2
Commons Daemon – OS friendly start/stop
Commons Pool – Object pools (db, conn etc)
JMeter – jmeter.apache.org
Loading testing tool
Performance test network services
Defined a series of tasks, execute in parallel
Talks to Web, SOAP, LDAP, JSM, JDBC etc
Handy for checking how external resources
and systems will hold up, once a
big data system start to make
heavy use of them!
Chemistry – chemistry.apache.org
Java, Python, .Net, PHP, JS, Android, iOS
interface to Content Management Systems
Implements the OASIS CMIS spec
Browse, read and write data in your content
repositories
Rich information and structure
Supported by Alfresco, Microsoft, SAP,
Adobe, EMC, OpenText and more
ManifoldCF – manifoldcf.apache.org
Framework for content (mostly text)
extraction from content repositories
Aimed at indexing solutions, eg SOLR
Connectors for reading and writing
Simpler than Chemistry, but also works for
CIFS, file systems, RSS etc
Extract from SharePoint, FileNet,
Documentum, LiveLink etc
Questions?
Thanks!
Twitter - @Gagravarr
Email – nick.burch@quanticate.com
The Apache Software Foundation:
http://www.apache.org/
Apache projects list:
http://projects.apache.org/

More Related Content

What's hot

Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoopAmbuj Kumar
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst TrainingCloudera, Inc.
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionDataWorks Summit/Hadoop Summit
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesMithun Radhakrishnan
 
Cassandra/Hadoop Integration
Cassandra/Hadoop IntegrationCassandra/Hadoop Integration
Cassandra/Hadoop IntegrationJeremy Hanna
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Vince Gonzalez
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 

What's hot (20)

Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst Training
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application Adoption
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
 
Cassandra/Hadoop Integration
Cassandra/Hadoop IntegrationCassandra/Hadoop Integration
Cassandra/Hadoop Integration
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 

Similar to The other Apache technologies your big data solution needs!

The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellKhalid Imran
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Yahoo Developer Network
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologieszahid-mian
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
A glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika AcharyA glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika AcharyQA or the Highway
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pigSudar Muthu
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceChris Nauroth
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 

Similar to The other Apache technologies your big data solution needs! (20)

The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
The ABC of Big Data
The ABC of Big DataThe ABC of Big Data
The ABC of Big Data
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
Hive and querying data
Hive and querying dataHive and querying data
Hive and querying data
 
A glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika AcharyA glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika Achary
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 

More from gagravarr

Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with GroovyTurning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with Groovygagravarr
 
But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?gagravarr
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?gagravarr
 
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...gagravarr
 
The Apache Way
The Apache WayThe Apache Way
The Apache Waygagravarr
 
How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?gagravarr
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!gagravarr
 
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?gagravarr
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...gagravarr
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...gagravarr
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologiesgagravarr
 

More from gagravarr (12)

Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with GroovyTurning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
 
But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
 
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
 
The Apache Way
The Apache WayThe Apache Way
The Apache Way
 
How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologies
 

Recently uploaded

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Recently uploaded (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

The other Apache technologies your big data solution needs!

  • 1. The other Apache Technologies your Big Data solution needs! Nick Burch CTO, Quanticate
  • 2. Copyright © Quanticate 2013 Nick Burch CTO, Quanticate
  • 3. The Apache Software Foundation Apache Technologies as in the ASF 117 Top Level Projects 37 Incubating Projects (> 100 past ones) Y is the only letter we lack A, C, O, S & T are popular letters, >= 10 each Meritocratic, Community driven Open Source
  • 4. A Growing Foundation... June 2013: 117 Top Level Projects 37 Incubating Projects June 2013: 104 Top Level Projects 50 Incubator Projects June 2011: 91 Top Level Projects 59 Incubating Projects
  • 5. What we're not covering
  • 6. Projects not being covered Cassandra cassandra.apache.org CouchDB couchdb.apache.org Drill (Incubating) incubator.apache.org/drill/ Giraph giraph.apache.org Hadoop hadoop.apache.org Hive hive.apache.org HBase hbase.apache.org
  • 7. Projects not being covered, part 2 Lucene lucene.apache.org Solr lucene.apache.org/solr/ Lucy lucy.apache.org Mahout mahout.apache.org Nutch nutch.apache.org OODT oodt.apache.org All covered in other sessions
  • 8. What we are looking at
  • 9. Talk Structure Loading and querying Big Data Building your MapReduce Jobs Deploying and Building for the Cloud Servers for Big Data Building out your solution Many projects to cover – this is only an overview!
  • 11. Making the most of the talk Slides will be available after the talk – See @Gagravarr for the link Lots of information and projects to cover Just an overview, see project sites for more Try to take note / remember the projects you think will matter to you Don't try to take notes on everything – there's too much!
  • 13. Pig – pig.apache.org Originally from Yahoo, entered the Incubator in 2007, graduated 2008, very widely used Provides an easy way to query data, which is compiled into Hadoop M/R (not SQL-like) Typically 1/20th of the lines of code, and 1/15th of the development time Optimising compiler – often only slower, occasionally faster!
  • 14. Pig – pig.apache.org Shell, scripting and embedded Java Local mode for development Built-ins for loading, filtering, joining, processing, sorting and saving User Defined Functions too Similar range of operations as SQL, but quicker and easier to learn Allows non coders to easily query
  • 15. Gora – gora.apache.org ORM Framework for (NoSQL) Column Stores Grew out of the Nutch project Supports HBase, Cassanda, Hypertable K/V: Voldermort, Redis HDFS Flat Files, plus basic SQL Data is stored in Avro (more later) Query with Pig, Lucene, Hive, Cascading, Hadoop M/R, or native Store code
  • 16. Gora – gora.apache.org Example: Web Server Log Avro data bean, JSON { "type": "record", "name": "Pageview", "namespace": "org.apache.gora.tutorial.log.generated", "fields" : [ {"name": "url", "type": "string"}, {"name": "timestamp", "type": "long"}, {"name": "ip", "type": "string"}, {"name": "httpMethod", "type": "string"}, {"name": "httpStatusCode", "type": "int"}, {"name": "responseSize", "type": "int"}, {"name": "referrer", "type": "string"}, {"name": "userAgent", "type": "string"} ] }
  • 17. Gora // ID is a long, Pageview is compiled Avro bean dataStore = DataStoreFactory.getDataStore(Long.class, Pageview.class); // Parse the log file, and store while(going) { Pageview page = parseLine(reader.readLine()); dataStore.put(logFileId, page); } DataStore.close(); private Pageview parseLine(String line) throws ParseException { StringTokenizer matcher = new StringTokenizer(line); //parse the log line String ip = matcher.nextToken(); ... //construct and return pageview object Pageview pageview = new Pageview(); pageview.setIp(new Utf8(ip)); pageview.setTimestamp(timestamp); ... return pageview; }
  • 18. Accumulo – accumulo.apache.org Distributed Key/Value store, built on top of Hadoop, Zookeeper and Thrift Inspired by BigTable, with some improvements to the design Cell level permissioning (access labels), write constraints, and server side hooks to tweak data as it's read/written Supports v.large rows Initial work mostly done by the NSA!
  • 19. Accumulo – accumulo.apache.org High availability (uses Zookeeper) Distributed WAL via loggers, later to HDFS Logical time – tablets can have wrong dates! FATE – Fault Tolerant Executions Testing Friendly! Mocks, Functional Tests, Scaling Tests, and random walk tests Admin support OOTB
  • 20. Sqoop – sqoop.apache.org (no u!) Bulk data transfer tool for big data systems Hadoop (HDFS), HBase and Hive on one side SQL Databases on the other Can be used to import existing data into your big data cluster Or, export the results of a big data job out to your data wharehouse Generates code to work with data
  • 21. Chukwa (Incubating) Log collection and analysis framework based on Hadoop Incubating since 2010 Collects and aggregates logs from many different machines Stores data in HDFS, in chunks that are both HDFS and Hadoop friendly Lets you dump, query and analyze
  • 22. Chukwa (Incubating) Chukwa agent runs on source nodes Collects from Log4j, Syslog, plain text log files etc Agent sends to a Collector on the Hadoop cluster Collector can transform if needed Data written to HDFS, and optionally to HBase (needed for visualiser)
  • 23. Chukwa (Incubating) Map/Reduce and Pig query the HDFS files, and/or the HBase store Can do M/R anomaly detection Can integrate with Hive eg Netflix collect weblogs with Chukwa, transform with Thrift, and store in HDFS ready for Hive queries
  • 24. Flume – flume.apache.org Another Log collection framework Concentrates on rapidly collecting, aggregating and moving large amounts of logs from many sources to a central place Typically write to HDFS + Hive + FTS Source, Channels and Sinks Source consumes events, eg log file entries, avro data
  • 25. Flume – flume.apache.org Source places into a 1 or more channels, which can be persistent or in-memory Sink pulls events from channel, then stores Can chain together, fan-in or fan-out Data and Control planes independent Does more OOTB than Chukwa Adding source+sink easy Less scope to customise otherwise
  • 26. Flume – flume.apache.org Another Log collection framework Concentrates on rapidly getting data to a variety of sources Typically write to HDFS + Hive + FTS Joint Agent+Collector model Data and Control planes independent More OOTB, less scope to alter
  • 27. Blur (Incubating) Big Data search for structured data Hadoop + HDFS + Lucene + Thrift + Zookeeper Map/Reduce to process into thrift data + index Blur distributes index shards across cluster Multiple controllers for fault tolerance Automatic Shard Fallover Web based query interface
  • 28. Falcon (Incubating) Data LifeCycle Management for Hadoop Provides support to applications to implement the key parts – import, export, cleaning, duplication, recovery, retention etc Declarative definitions, Falcon does deps etc Controlled replication, for example only outputs of certain stages in the flow Lots of tests for the key parts
  • 29. Hama – hama.apache.org Bulk Synchronous Parallel (BSP) on HDFS Massive scale scientific computions, eg Matrix, Graph and Network algorithms BSP can be better than M/R or MPI Message Passing based applications No deadlocks or collisions Flexible, simple API High performance
  • 30. MRQL – mrql.apache.org MRQL, pronounced as Miracle Large scale data analytics engine Works on top of Hadoop or Hama SQL-like query language Supports more complex types and constructs than Pig or Hive, so you can avoid having to write explicit M/R code Page Rank, k-means, matrix etc
  • 31. Tajo (Incubating) Data warehouse system for Hadoop Low-latency and scalable ad-hoc queries HDFS for storage, custom query engine which controls execution, optimisation and data flow Runs SQL queries, inc filter, group, sort, join Some ETL for data format translation Java API to submit queries Command line tools
  • 33. Avro – avro.apache.org Language neutral data serialization Rich data structures (JSON based) Compact and fast binary data format Code generation optional for dynamic languages Supports RPC Data includes schema details
  • 34. Avro – avro.apache.org Schema is always present – allows dynamic typing and smaller sizes Java, C, C++, C#, Python, Perl, Ruby, PHP Different languages can transparently talk to each other, and make RPC calls to each other Often faster than Thrift and ProtoBuf No streaming support though
  • 35. Thrift – thrift.apache.org Java, C++, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, JS, Cocoa, OCaml and more! From Facebook, at Apache since 2008 Rich data structure, compiled down into suitable code RPC support too Streaming is available Worth reading the White Paper!
  • 36. HCatalog (Just joined Hive) Provides a table like structure on top of HDFS files, with friendly addressing Allows Pig, Hadoop MR jobs etc to easily read/write data structured data Simpler, lighter weight than Avro or Thrift based serialisation Based on Hive's metastore format Doesn't require an additional datastore
  • 37. MRUnit – mrunit.apache.org Built on top of JUnit Checks Map, Reduce, then combined Provides test drivers for Hadoop Avoids you needing lots of boiler plate code to start/stop Hadoop Avoids brittle mock objects Handles multiple input K/Vs Counter checking
  • 38. MRUnit IdentityMapper – same input/output public class TestExample extends TestCase { private Mapper mapper; private MapDriver driver; @Before public void setUp() { mapper = new IdentityMapper(); driver = new MapDriver(mapper); } @Test public void testIdentityMapper() { // Pass in { “foo”, “bar” }, ensure it comes back again driver.withInput(new Text("foo"), new Text("bar")) .withOutput(new Text("foo"), new Text("bar")) .runTest(); assertEquals(1, driver.getCounters().findCounter(“foo”,”bar”)); } }
  • 39. Hadoop Development Tools (Incbtr) Set of Eclipse plugins for Hadoop Work with multiple versions of Hadoop from within the same Eclipse Quite a new project, lot still “TODO” Makes it easy to build Hadoop programs Query the cluster Submit jobs, then track them Remote debug jobs
  • 40. Oozie – oozie.apache.org Workflow, scheduler and dependency manager for Hadoop jobs (inc Pig etc) Define a workflow to describe the data flow for your desired output Oozie handles running dependencies as needed, and scheduled execution of steps as requested Builds up a data pipe, then executes as required on a cloud scale
  • 41. BigTop – bigtop.apache.org Build, Package and Test code built on top of Hadoop related projects Allows you to check that a given mix of say HDFS, Hadoop Core and ZooKeeper work well together Then package a tested bundle Integration testing of your stack Test upgrades, generate packages
  • 42. Ambari (Incubating) Monitoring, Admin and LifeCycle management for Hadoop clusters eg HBase, HDFS, Hive, Pig, ZooKeeper Deploy+Configure stack to a cluster of machines Update software stack versions Monitoring and Service Admin REST APIs for cluster management
  • 44. Provider Independent Cloud APIs Lets you provision, manage and query Cloud services, without vendor lock-in Translates general calls to the specific (often proprietary) ones for a given cloud provider Work with remote and local cloud providers (almost) transparently
  • 45. Provider Independent Cloud APIs Create, stop, start, reboot and destroy instances Control what's run on new instances List active instances Fetch available and active profiles EC2, Eycalyptos, Rackspace, RHEV, vSphere, Linode, OpenStack
  • 46. LibCloud – libcloud.apache.org Python library (limited Java support) Very wide range of providers Script your cloud services from libcloud.compute.types import Provider from libcloud.compute.providers import get_driver EC2_ACCESS_ID = 'your access id' EC2_SECRET_KEY = 'your secret key' Driver = get_driver(Provider.EC2) conn = Driver(EC2_ACCESS_ID, EC2_SECRET_KEY) nodes = conn.list_nodes() # [<Node: uuid=..., state=3, public_ip=['1.1.1.1'], provider=EC2 ...>, ...]
  • 47. DeltaCloud – deltacloud.apache.org REST API (xml) + web portal Major Cloud Providers, RHEV-M, vSphere <instances> <instance href="http://fancycloudprovider.com/api/instances/inst1" id='inst1'> <owner_id>larry</owner_id> <name>Production JBoss Instance</name> <image href="http://fancycloudprovider.com/api/images/img3"/> <hardware_profile href="http://fancycloudprovider.com/api/hardware_profiles/m1-small"/> <realm href="http://fancycloudprovider.com/api/realms/us"/> <state>RUNNING</state> <actions> <link rel="reboot" href="http://fancycloudprovider.com/api/instances/inst1/reboot"/> <link rel="stop" href="http://fancycloudprovider.com/api/instances/inst1/stop"/> </actions> <public_addresses> <address>inst1.larry.fancycloudprovider.com</address> </public_addresses> <private_addresses> <address>inst1.larry.internal</address> </private_addresses> </instance> </instances>
  • 48. Provisionr (Incubating) Higher Level than DeltaCloud or LibCloud Semi-Automated workflows to provision “Build me a cluster on these assumptions” Cloud provider independent Builds on Puppet, lets you work higher level Handles retries, throttling etc Can work on multiple clouds
  • 49. CloudStack – cloudstack.apache.org Software to run an IaaS cloud computing platform (on demand, elastic etc) Can run public, private or hybrid clouds Entire stack to run an IaaS Hypervisors: VMware, KVM, XenServer, XCP Web interface, CLI or RESTful to manage API compatible with EC2 for hybrid
  • 51. ZooKeeper – zookeeper.apache.org Centralised service for configuration, naming and synchronisation Provides Consensus, Group Mangement and Presence tracking Single co-ordination service across all the different components ZooKeeper is distributed and highly reliable (avoid config being SPOF)
  • 52. ZooKeeper – zookeeper.apache.org “A central nervous system for distributed applications and services” Bindings for Java, C, Perl, Python, Scala, .Net (C#), Node.js, Erlang Applications can read nodes, send events, and watch for them eg fetch config, come up, perform leader election, share active list
  • 53. Curator (Incubating) “A Zookeeper Keeper” Provides implementations of common things you want to do using Zookeeper Improved API on top of Zookeeper, eg with retries, error cases, cleaner API Test support – in memory ZK server, cluster Provides SOA service discovery built using Zookeeper
  • 54. Airavata – airavata.apache.org Toolkit to build, manage, execute and monitor large scale applications Workflow driven processing Historically aimed at Scientific Processing, but now expanding Facilities for developer to deploy End user launches, workflow manages, then able to monitor eg via widgets
  • 55. Whirr - whirr.apache.org Grew out of Hadoop Aimed at running Hadoop, Cassandra, HBase and ZooKeeper Higher level – running services not running machines Java (jclouds) and Python versions Can be run on the command line
  • 56. ACE – ace.apache.org Software Distribution Framework Built on OSGi, but can work for anything “Puppet for OSGi” Tracks and installs components and deps Group components, set targets to deploy to Can do QA, staging, production flow Web based management
  • 57. River – river.apache.org Toolkit for building distributed systems Communicate over JERI, which works over TCP, SSL, HTTP, HTTPS, Kerberos, JRMP Can do dynamic discovery Uses JINI for naming/addressing Not as fully featured as ZooKeeper Better for JINI types?
  • 58. Helix (Incubating) Cluster Management Framework State Machine based, describe in terms of transitions and states For partitioned, replicated and distributed resources across clusters of nodes Sets up the nodes in the cluster System tasks, eg maintenance, backups “Global Brain for the system”
  • 59. Knox (Incubating) Secure Gateway to a Hadoop Cluster Perimeter Security & Authentication Single endpoint for users to connect to Knox Gateway sits in DMZ, provides the choke-point for cluster access Hadoop sites behind, protected Allows users access Allows admin access
  • 60. Mesos (Incubating) Like virtualisation on top of your cluster Resource Sharing and Isolation Hadoop, MPI, Hypertable, Spark Can run multiple copies of each, eg production, ad-hoc and testing versions of Hadoop Can manage long-running (eg HBase) at the same time as batch Zookeeper based
  • 61. Building out your Solution
  • 62. Tika – tika.apache.org Text and Metadata extraction Identify file type, language, encoding Extracts text as structured XHTML Consistent Metadata across formats Java library, CLI and Network Server SOLR integration Handles format differences for you
  • 63. UIMA – uima.apache.org Unstructured Information analysis Lets you build a tool to extract information from unstructured data Components in C++ and Java Network enabled – can spread work out across a cluster Helped IBM to win Jeopardy!
  • 64. OpenNLP – opennlp.apache.org Natural Language Processing Various tools for sentence detection, tokenization, tagging, chunking, entity detection etc UIMA likely to be better if you want a whole- solution OpenNLP good when integrating NLP into your own solution
  • 65. cTakes – ctakes.apache.org Clinical Text Analysis & Knowledge Extraction Builds on top of UIMA and OpenNLP Extracts information from free text in electronic medical records (and OCR'd paper) Identifies named entities from common or custom dictionaries (eg UMLS) For each, produce attributes, eg subject, mapping, context
  • 66. MINA – mina.apache.org Framework for writing scalable, high performance network apps in Java TCP and UDP, Client and Server Build non blocking, event driven networking code in Java MINA also provides pure Java SSH, XMPP, Web and FTP servers
  • 67. Etch – etch.apache.org Framework for building, producing and consuming network services. Cross platform, language and transport Java, C#, C, C++, Go, JS, Python Produce a formal description of the exchange 1-way, 2-way and realtime communication Scalable, high performance Heterogeneous systems
  • 68. ActiveMQ / Qpid / Synapse Messaging, Queueing and Brokerage solutions across most languages Decide on your chosen message format, endpoint languages and messaging needs, one of these three will likely fit your needs! Queues, Message Brokers, Enterprise Service Buses, high performance and yet also buzzword compliant!
  • 69. Kafka (Incubating) Distributed push-subscribe (pub-sub) messaging system Allows high throughput on one system, partitionable for many More general than log based systems such as Flume Supports persisted messages, clients can catch up Distributed, remote sources and sinks
  • 70. S4 (Incubating) Simple Scalable Streaming System Platform for building tools to work on continuous, unbounded data streams Stream is broken into events, which are routed to PEs and processed Uses Actors model, highly concurrent PEs are Java based, but event sources and sinks can be in any language
  • 71. Commons – commons.apache.org Collection of libraries for Java projects Some historic, many still useful! Commons CLI – parameters / options Commons Codec – Base64, Hex, Phonetic Commons Compress – zip, tar, gz, bz2 Commons Daemon – OS friendly start/stop Commons Pool – Object pools (db, conn etc)
  • 72. JMeter – jmeter.apache.org Loading testing tool Performance test network services Defined a series of tasks, execute in parallel Talks to Web, SOAP, LDAP, JSM, JDBC etc Handy for checking how external resources and systems will hold up, once a big data system start to make heavy use of them!
  • 73. Chemistry – chemistry.apache.org Java, Python, .Net, PHP, JS, Android, iOS interface to Content Management Systems Implements the OASIS CMIS spec Browse, read and write data in your content repositories Rich information and structure Supported by Alfresco, Microsoft, SAP, Adobe, EMC, OpenText and more
  • 74. ManifoldCF – manifoldcf.apache.org Framework for content (mostly text) extraction from content repositories Aimed at indexing solutions, eg SOLR Connectors for reading and writing Simpler than Chemistry, but also works for CIFS, file systems, RSS etc Extract from SharePoint, FileNet, Documentum, LiveLink etc
  • 76. Thanks! Twitter - @Gagravarr Email – nick.burch@quanticate.com The Apache Software Foundation: http://www.apache.org/ Apache projects list: http://projects.apache.org/