Using Big Data techniques
with Open Street Map
Stephen Knox
Arup
Partly based on research for an MSc in Geographical Information Systems and Science
Kingston University 2015
Disclaimer
• I am in no way an expert on Hadoop!
• I am a Geographic Information Systems specialist who can program
(and is interested in big data)
• Hopefully I can tell you something you didn’t know about
OpenStreetMap and geographic big data processing
Outline
• Background to OpenStreetMap (OSM) and growth
• Background to Geographic Big Data
• Dissertation Research
• Aims & Objectives
• Methodology
• Results
• Conclusions
• My general experiences of using Hadoop/SpatialHadoop and related
tools
[Diagram: OSM components, 2006 and 2016: Input, Storage, Graphical output (maps), Data output]
OSM Size and Growth
• Current Data – c. 0.5 – 1 TB
• Current and Historical Data – 5.15TB
• Growing at 1TB per annum
[Chart: DB dump size (XML BZ2) by year, 2006–2015]
OSM DB server
• 2 processor cores, 8GB RAM, 6TB disk
• 4 processor cores, 64GB RAM, 6TB disk, 64GB SSD
• 8 processor cores, 256GB RAM, 24TB disk, 400GB SSD
• ?
Source: Planet OSM http://planet.openstreetmap.org
Source: OSM http://wiki.openstreetmap.org/wiki/Servers
Source: OSM http://munin.openstreetmap.org/openstreetmap/katla.openstreetmap/postgres_size_openstreetmap_9_1_main.html
OSM Potential Growth (1)
[Pie charts by region (Africa, Antarctica, Asia, Australia, Central America, Europe, N. America, S. America): Population, Land Area, Data in OSM]
[Growth labels: +38%, +29%, +22%, +27%, +16%, +10%, +21%]
Sources: Geofabrik individual region download pages for OSM size and growth (http://download.geofabrik.de/index.html); ArcGIS Continents (http://www.arcgis.com/home/item.html?id=3c4741e22e2e4af2bd4050511b9fc6ad); UN Department of Economic & Social Affairs, Total Population – Both Sexes (http://esa.un.org/unpd/wpp/Excel-Data/EXCEL_FILES/1_Population/WPP2012_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.XLS)
Scaling systems
Scale-up | Scale-out (parallel) | Scale-out (NoSQL)
Scale-up:
• More memory
• More cores
• More SSD
• More hard disk
Scale-out (parallel): Controlling Server
[Cost comparison table. Rows: Hardware costs; Software acquisition & development costs; Maintenance costs; Training costs. Ratings as extracted: $$$, $$$, $$, $, $, $$, $$, $$$, $$, $$$, $$$, $]
Sources: Scale-up vs Scale-out for Hadoop: Time to rethink?
http://www.msr-waypoint.com/pubs/204499/a20-appuswamy.pdf
Scaling Up vs. scaling Out: Hidden Costs:
http://blog.codinghorror.com/scaling-up-vs-scaling-out-hidden-costs/
It’s getting complicated…!
Source: The 451 Group https://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/
What is the right tool for the job?
[Diagram: tools placed on a data-size scale of 1MB, 1GB, 1TB, 1PB, 1EB; labels: Tool, Application / Data, Transaction Logs, ?]
NoSQL Spatial
• Key research topic is indexing across multiple nodes
Source: GeoWave Docs http://ngageoint.github.io/geowave/documentation.html#theory
• Implementations that add spatial capabilities to NoSQL databases
• SpatialHadoop, Hadoop-GIS, Esri GIS Tools for Hadoop
• SpatialSpark, GeoTrellis
• GeoMesa, GeoWave
• MongoDB (extension)
• GeoCouch
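One widely used approach to indexing spatial data across multiple nodes is to map each 2D coordinate onto a 1D space-filling curve, so that an ordinary sorted key-value store can range-partition the data. The block below is a minimal sketch of a Z-order (Morton) key, assuming 16 bits per axis; the function names are hypothetical and this is not the implementation used by any of the tools listed above.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x in the even positions)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def zorder_key(lon: float, lat: float, bits: int = 16) -> int:
    """Map a lon/lat pair to one sortable integer key (hypothetical helper)."""
    cells = (1 << bits) - 1
    # Scale each axis onto the integer grid 0..cells.
    x = int((lon + 180.0) / 360.0 * cells)
    y = int((lat + 90.0) / 180.0 * cells)
    return interleave_bits(x, y, bits)


if __name__ == "__main__":
    # Points that are close on the map tend to get numerically close keys,
    # so a sorted store keeps them on the same node most of the time.
    print(zorder_key(-0.1276, 51.5072))
    print(zorder_key(-0.1280, 51.5080))
```

Because the key is a plain integer, a bounding-box query becomes a small set of key-range scans, which is what makes this family of indexes attractive for stores that only offer sorted row keys.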
Dissertation - Aims
• Investigate whether a parallel non-relational solution could be used
to:
• Analyse data from OSM (read-only)?
• Become the main storage platform (reads & writes)?
In terms of performance, and practicality (whole life cost)
• Does the size and growth rate of OSM make it likely that a non-
relational parallel storage solution will become technically or
economically desirable in the future?
Dissertation - Methodology
• Compared common current OSM tasks to an equivalent task using Big
Data tools
• Chose technologies in the Hadoop ecosystem rather than parallel
databases. Used SpatialHadoop and HBase as principal platforms
• Started using a test Hadoop cluster at work, but ran into issues, so
switched to cloud platforms
• Kept processing power and cost constant, so performance could be
directly compared
1 x 16-core server (64GB RAM)
8 x 2-core servers (8GB RAM each) + master node
Broadly equivalent in cost and equivalent in nominal performance
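The "nominal performance" equivalence above is simple arithmetic on the two configurations. The sketch below just adds up cores and RAM, assuming the master node is excluded from the worker totals; the `Config` class is a hypothetical helper for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    machines: int
    cores_each: int
    ram_gb_each: int

    @property
    def total_cores(self) -> int:
        return self.machines * self.cores_each

    @property
    def total_ram_gb(self) -> int:
        return self.machines * self.ram_gb_each


# Figures taken from the slide; the master node is not counted (assumption).
standalone = Config("standalone", machines=1, cores_each=16, ram_gb_each=64)
cluster = Config("cluster workers", machines=8, cores_each=2, ram_gb_each=8)

if __name__ == "__main__":
    for cfg in (standalone, cluster):
        print(cfg.name, cfg.total_cores, "cores,", cfg.total_ram_gb, "GB RAM")
```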
SpatialHadoop
• University of Minnesota Open Source project
• Uses Pig as an execution engine
• Creates spatial indexes and operators for big geographic datasets
Methodology (continued)
• 3 stages:
• Data loading & preparation for data analysis
• Test whether reading the OSM binary format (PBF) with a dedicated data reader is quicker than reading the
XML format
• Data querying (read / analyse data)
• Spatial – give me the total features in this area [using spatial index]
• Non-spatial (e.g. count the total number of shops in the OSM database)
• Simulation of master database (reads and writes)
• downloading existing data to work on (by bounding box)
• uploading new data changes
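The two query types in the second stage can be illustrated on a tiny OSM XML fragment. This is a single-machine sketch using only the Python standard library, with made-up sample data; it does a linear scan with no spatial index, considers only nodes for the bounding-box count, and is not the SpatialHadoop job used in the research.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data in OSM XML layout.
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="51.50" lon="-0.12"><tag k="shop" v="bakery"/></node>
  <node id="2" lat="51.52" lon="-0.10"/>
  <node id="3" lat="48.85" lon="2.35"><tag k="shop" v="books"/></node>
  <node id="4" lat="51.51" lon="-0.13"><tag k="amenity" v="cafe"/></node>
</osm>"""


def count_nodes_in_bbox(root, min_lon, min_lat, max_lon, max_lat):
    """Spatial query: count nodes inside a bounding box (full scan)."""
    count = 0
    for node in root.iter("node"):
        lon = float(node.get("lon"))
        lat = float(node.get("lat"))
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            count += 1
    return count


def count_shops(root):
    """Non-spatial query: count top-level features carrying a shop=* tag."""
    return sum(
        1
        for element in root
        if any(tag.get("k") == "shop" for tag in element.findall("tag"))
    )


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_OSM)
    print(count_nodes_in_bbox(root, -0.2, 51.4, 0.0, 51.6))
    print(count_shops(root))
```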
Results – Loading Data
Dataset | Uncompressed XML | Compressed XML | PBF
UK OSM data | 17GB | 1.2GB | 765MB

File & size | Cluster | Time
UK PBF* (765MB) | 4 high memory nodes | 37m
UK XML (17GB) | 4 high memory nodes | 75.5m
UK XML BZ2+ (1.2GB) | 4 high memory nodes | 66m
Europe PBF (15.7GB) | 8 high memory nodes | 246m
Europe XML (345GB) | Not undertaken – too big to process |
Europe XML BZ2 (24GB) | 8 high memory nodes | Did not complete
Europe PBF (15.7GB) | 16 high memory nodes | 143m
Europe XML | Not undertaken – too big to process |
Europe XML BZ2 (24GB) | 16 high memory nodes | Did not complete
* Protocol Buffer Format – binary format
+ without taking into account decompression time – c. 7 minutes
File & size | Cluster | Time
UK XML BZ2 (1.2GB) | 1 x 8 core machine (52GB RAM) | 17m
Europe XML BZ2 (24GB) | 1 x 16 core machine (104GB RAM) | 578m
Overpass / Hadoop
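The loading times reported above can be turned into rough scaling figures. The sketch below uses only numbers from the two tables (Europe PBF on 8 and 16 nodes, and UK XML BZ2 on the 4-node cluster versus the single 8-core machine); the helper names are hypothetical.

```python
def speedup(time_before: float, time_after: float) -> float:
    """How many times faster the second run is than the first."""
    return time_before / time_after


def parallel_efficiency(t_small: float, n_small: int,
                        t_large: float, n_large: int) -> float:
    """Observed speedup divided by the ideal speedup from adding nodes."""
    return speedup(t_small, t_large) / (n_large / n_small)


if __name__ == "__main__":
    # Europe PBF: 246 minutes on 8 nodes, 143 minutes on 16 nodes.
    print(round(speedup(246, 143), 2))                     # 1.72
    print(round(parallel_efficiency(246, 8, 143, 16), 2))  # 0.86
    # UK XML BZ2: 66 minutes on the cluster, 17 minutes standalone.
    print(round(speedup(66, 17), 2))                       # 3.88
```

On these figures, doubling the cluster from 8 to 16 nodes gave roughly 1.7 times the loading speed, while the standalone machine loaded the UK extract nearly four times faster than the 4-node cluster.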
Results – Querying Data
Index type | Time taken
Grid | 75m
R-tree | 81m
Quad-tree | 56m
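For readers unfamiliar with the index types compared above, the following is a minimal in-memory point quad-tree with a bounding-box query. It is a sketch only, assuming points are (lon, lat) tuples and a hypothetical depth cap of 16; SpatialHadoop's own indexes partition files across a cluster and work differently in detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (lon, lat)
MAX_DEPTH = 16  # hypothetical cap so duplicate points cannot split forever


@dataclass
class QuadTree:
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    capacity: int = 4
    depth: int = 0
    points: List[Point] = field(default_factory=list)
    children: Optional[List["QuadTree"]] = None

    def contains(self, p: Point) -> bool:
        return self.min_x <= p[0] <= self.max_x and self.min_y <= p[1] <= self.max_y

    def insert(self, p: Point) -> bool:
        if not self.contains(p):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= MAX_DEPTH:
                self.points.append(p)
                return True
            self._split()
        # any() stops at the first child that accepts the point.
        return any(child.insert(p) for child in self.children)

    def _split(self) -> None:
        mid_x = (self.min_x + self.max_x) / 2
        mid_y = (self.min_y + self.max_y) / 2
        d, c = self.depth + 1, self.capacity
        self.children = [
            QuadTree(self.min_x, self.min_y, mid_x, mid_y, c, d),
            QuadTree(mid_x, self.min_y, self.max_x, mid_y, c, d),
            QuadTree(self.min_x, mid_y, mid_x, self.max_y, c, d),
            QuadTree(mid_x, mid_y, self.max_x, self.max_y, c, d),
        ]
        old_points, self.points = self.points, []
        for p in old_points:
            any(child.insert(p) for child in self.children)

    def query(self, min_x: float, min_y: float,
              max_x: float, max_y: float) -> List[Point]:
        # Prune whole quadrants that do not overlap the query box.
        if (max_x < self.min_x or min_x > self.max_x
                or max_y < self.min_y or min_y > self.max_y):
            return []
        found = [p for p in self.points
                 if min_x <= p[0] <= max_x and min_y <= p[1] <= max_y]
        for child in self.children or []:
            found.extend(child.query(min_x, min_y, max_x, max_y))
        return found
```

A query only descends into quadrants that overlap the requested box, which is why small bounding boxes can be answered without touching most of the data.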
Operation | Cluster config | Cluster time | Standalone config | Standalone time
Europe data small bounding box | 8 x 2-core high memory nodes (13GB RAM) | Grid: 50s; R-tree: 25s; Q-tree: 6s | 1 x 16 core machine (104GB RAM) | <1s
Europe data medium bounding box | 8 x 2-core high memory nodes (13GB RAM) | Grid: 85s; R-tree: 141s; Q-tree: 12s | 1 x 16 core machine (104GB RAM) | 4s
Europe data large bounding box (1°²) | 8 x 2-core high memory nodes (13GB RAM) | Grid: 91m; R-tree: 83s; Q-tree: 56s | 1 x 16 core machine (104GB RAM) | 39s
Europe data huge bounding box (3°²) | 8 x 2-core high memory nodes (13GB RAM) | Only attempted with Q-tree: 88s | 1 x 16 core machine (104GB RAM) | Out of memory
Shops query | 8 x 2-core high memory nodes (13GB RAM) | 729s | 1 x 16 core machine (104GB RAM) | 349s (but also got out of mem errors)
Shops query after indexing | 8 x 2-core high memory nodes (13GB RAM) | 40s BUT… indexing took 714 seconds! | |
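Whether the 714-second indexing step pays off depends on how many times the shops query is run. The sketch below works out the break-even point from the cluster timings in the table (729s unindexed, 40s indexed, 714s to build the index); the function names are hypothetical.

```python
import math


def total_time(queries: int, per_query_s: float, build_s: float = 0.0) -> float:
    """Total seconds for a one-off index build plus repeated queries."""
    return build_s + queries * per_query_s


def break_even_queries(build_s: float, indexed_s: float, unindexed_s: float) -> int:
    """Smallest number of queries at which indexing is strictly faster overall."""
    saving = unindexed_s - indexed_s
    if saving <= 0:
        raise ValueError("the indexed query must be faster than the unindexed one")
    # Need build_s + n * indexed_s < n * unindexed_s, i.e. n > build_s / saving.
    return math.floor(build_s / saving) + 1


if __name__ == "__main__":
    print(break_even_queries(714, 40, 729))           # 2
    print(total_time(1, 40, 714), total_time(1, 729))  # 754.0 729
    print(total_time(2, 40, 714), total_time(2, 729))  # 794.0 1458
```

For a single run the unindexed query is slightly quicker overall, but from the second run onwards the index has already paid for itself.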
Results – Reading & Writing Data
• Used HBase and Jython, but did not have time to implement spatial
indexes
Operation | Cluster configuration | Cluster time | Standalone configuration | Standalone time
Data loading England PBF (610MB) | 8 x 2-core high memory nodes (13GB RAM each) | 30m | 1 x 16 core machine (104GB RAM) | 527m
Data retrieval (small town) | 8 x 2-core high memory nodes (13GB RAM each) | | 1 x 16 core machine (104GB RAM) | 3s
Data retrieval (large town) | 8 x 2-core high memory nodes (13GB RAM each) | | 1 x 16 core machine (104GB RAM) | 113s
Data retrieval (city) | 8 x 2-core high memory nodes (13GB RAM each) | | 1 x 16 core machine (104GB RAM) | Did not complete (> 300s and 50,000 nodes)
Conclusions
• It’s possible to replicate much of what OSM requires in Hadoop
• OpenStreetMap is growing quickly, but it is a long way from requiring horizontal
sharding of databases
• In general, it is not quicker to run geographic queries in a cluster at the TB order
of magnitude (at least with current OSM tools)
• Indexes do significantly speed up geographic queries (Quad-tree seems to be the
best)
• There is a high barrier to entry (technical & cost) for Hadoop and its ecosystem that
will make it difficult for OSM to adopt the technology
• OSM should also consider parallel databases if it does have a requirement to
scale out, as there is less mismatch with its current system
• Spatial extensions to big data platforms are relatively immature, but there is a
huge potential there to do data analytics on massive datasets and gain new
insights
• I’ve learnt a lot personally!
Experiences with Azure
+ Easy to use – click to deploy
+ Good free trial program
+ Good integration with storage
- Less customisable
- It was impossible to deploy >= 8 node clusters (rate limits?) so I gave up
- Technical support was responsive but not especially helpful
Experiences with Google Cloud
+ Already had Hortonworks Hadoop distribution automated setup
+ Easy to customise – everything on GitHub
+ Uses a standard setup (Ambari)
- Not always reliable
- Free trial was quite limited
- More difficult to connect with Google Storage buckets
- A bit more work to deploy the solution, as it is code-based and requires downloading a 3rd party tool (gcloud)
General Hadoop experiences
• Choosing the correct tool can be a significant part of the problem
• Setting up Hadoop clusters is hard!
• Spatial Big Data is still a little niche (although I did get lots of help)
• Running Hadoop jobs (even with Pig) is hard!
• Trial and error to experiment with memory requirements
• Size of files is a real barrier (especially when you are paying!)
• Often jobs failed half way through
• Debugging is not easy
• Have to recompile Java whenever there is a change (and sometimes deploy to
nodes)

More Related Content

What's hot

Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLliuknag
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudZhenxiao Luo
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
 
Spark, spark streaming & tachyon
Spark, spark streaming & tachyonSpark, spark streaming & tachyon
Spark, spark streaming & tachyonJohan hong
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Databricks
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Spark Summit
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Joydeep Sen Sarma
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit
 
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Databricks
 
Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupSpark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupPhaneendra Chiruvella
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduJen Aman
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5SAP Concur
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack
 

What's hot (20)

October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQL
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
 
Spark, spark streaming & tachyon
Spark, spark streaming & tachyonSpark, spark streaming & tachyon
Spark, spark streaming & tachyon
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
 
Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupSpark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark Group
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 

Viewers also liked

The NoSQL Geospatial Landscape
The NoSQL Geospatial LandscapeThe NoSQL Geospatial Landscape
The NoSQL Geospatial LandscapeRaj Singh
 
GeoMesa LocationTech DC
GeoMesa LocationTech DCGeoMesa LocationTech DC
GeoMesa LocationTech DCCCRinc
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech ProjectsJody Garnett
 
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifactahuguk
 
Geospatial Temporal Open Standards for Big Data from Space (BiDS2014)
Geospatial Temporal Open Standards for Big Data from Space (BiDS2014)Geospatial Temporal Open Standards for Big Data from Space (BiDS2014)
Geospatial Temporal Open Standards for Big Data from Space (BiDS2014)George Percivall
 
C+P-and-partners_Working process
C+P-and-partners_Working processC+P-and-partners_Working process
C+P-and-partners_Working processQusi Alqarqaz
 
Supporting Geo-Ontology Engineering through Spatial Data Analytics
Supporting Geo-Ontology Engineering through Spatial Data AnalyticsSupporting Geo-Ontology Engineering through Spatial Data Analytics
Supporting Geo-Ontology Engineering through Spatial Data AnalyticsIrene Celino
 
BlueBRIDGE: Cloud infrastructure serving aquafarms and supporting models
BlueBRIDGE: Cloud infrastructure serving aquafarms and supporting modelsBlueBRIDGE: Cloud infrastructure serving aquafarms and supporting models
BlueBRIDGE: Cloud infrastructure serving aquafarms and supporting modelsBlue BRIDGE
 
Optimizing location-based apps with open data
Optimizing location-based apps with open dataOptimizing location-based apps with open data
Optimizing location-based apps with open dataRaj Singh
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...huguk
 
Geo-analytics Architecture - Technologies
Geo-analytics Architecture - TechnologiesGeo-analytics Architecture - Technologies
Geo-analytics Architecture - TechnologiesBlue BRIDGE
 
Automation of Glass fiber Deployments in The Netherlands
Automation of Glass fiber Deployments in The NetherlandsAutomation of Glass fiber Deployments in The Netherlands
Automation of Glass fiber Deployments in The NetherlandsKiran Solipuram. DEP, CFHP
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAtner Yegorov
 
Foundation Comparison
Foundation ComparisonFoundation Comparison
Foundation ComparisonJody Garnett
 
GeoMesa – Spatio-Temporal Indexing in Accumulo
GeoMesa – Spatio-Temporal Indexing in AccumuloGeoMesa – Spatio-Temporal Indexing in Accumulo
GeoMesa – Spatio-Temporal Indexing in AccumuloCvilleDataScience
 
Processing Geospatial Data At Scale @locationtech
Processing Geospatial Data At Scale @locationtechProcessing Geospatial Data At Scale @locationtech
Processing Geospatial Data At Scale @locationtechRob Emanuele
 
Using entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APIUsing entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APISpazioDati
 
Processing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTechProcessing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTechRob Emanuele
 
From Spark to Ignition: Fueling Your Business on Real-Time Analytics
From Spark to Ignition: Fueling Your Business on Real-Time AnalyticsFrom Spark to Ignition: Fueling Your Business on Real-Time Analytics
From Spark to Ignition: Fueling Your Business on Real-Time AnalyticsSingleStore
 
Big Geo Data: Open Source and Open Standards
Big Geo Data: Open Source and Open StandardsBig Geo Data: Open Source and Open Standards
Big Geo Data: Open Source and Open StandardsGeorge Percivall
 

Viewers also liked (20)

The NoSQL Geospatial Landscape
The NoSQL Geospatial LandscapeThe NoSQL Geospatial Landscape
The NoSQL Geospatial Landscape
 
GeoMesa LocationTech DC
GeoMesa LocationTech DCGeoMesa LocationTech DC
GeoMesa LocationTech DC
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
 
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
Geospatial Temporal Open Standards for Big Data from Space (BiDS2014)
Geospatial Temporal Open Standards for Big Data from Space (BiDS2014)Geospatial Temporal Open Standards for Big Data from Space (BiDS2014)
Geospatial Temporal Open Standards for Big Data from Space (BiDS2014)
 
C+P-and-partners_Working process
C+P-and-partners_Working processC+P-and-partners_Working process
C+P-and-partners_Working process
 
Supporting Geo-Ontology Engineering through Spatial Data Analytics
Supporting Geo-Ontology Engineering through Spatial Data AnalyticsSupporting Geo-Ontology Engineering through Spatial Data Analytics
Supporting Geo-Ontology Engineering through Spatial Data Analytics
 
BlueBRIDGE: Cloud infrastructure serving aquafarms and supporting models
BlueBRIDGE: Cloud infrastructure serving aquafarms and supporting modelsBlueBRIDGE: Cloud infrastructure serving aquafarms and supporting models
BlueBRIDGE: Cloud infrastructure serving aquafarms and supporting models
 
Optimizing location-based apps with open data
Optimizing location-based apps with open dataOptimizing location-based apps with open data
Optimizing location-based apps with open data
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
 
Geo-analytics Architecture - Technologies
Geo-analytics Architecture - TechnologiesGeo-analytics Architecture - Technologies
Geo-analytics Architecture - Technologies
 
Automation of Glass fiber Deployments in The Netherlands
Automation of Glass fiber Deployments in The NetherlandsAutomation of Glass fiber Deployments in The Netherlands
Automation of Glass fiber Deployments in The Netherlands
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysis
 
Foundation Comparison
Foundation ComparisonFoundation Comparison
Foundation Comparison
 
GeoMesa – Spatio-Temporal Indexing in Accumulo
GeoMesa – Spatio-Temporal Indexing in AccumuloGeoMesa – Spatio-Temporal Indexing in Accumulo
GeoMesa – Spatio-Temporal Indexing in Accumulo
 
Processing Geospatial Data At Scale @locationtech
Processing Geospatial Data At Scale @locationtechProcessing Geospatial Data At Scale @locationtech
Processing Geospatial Data At Scale @locationtech
 
Using entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APIUsing entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion API
 
Processing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTechProcessing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTech
 
From Spark to Ignition: Fueling Your Business on Real-Time Analytics
From Spark to Ignition: Fueling Your Business on Real-Time AnalyticsFrom Spark to Ignition: Fueling Your Business on Real-Time Analytics
From Spark to Ignition: Fueling Your Business on Real-Time Analytics
 
Big Geo Data: Open Source and Open Standards
Big Geo Data: Open Source and Open StandardsBig Geo Data: Open Source and Open Standards
Big Geo Data: Open Source and Open Standards
 

Similar to Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup

The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudNicolas Poggi
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQueryCsaba Toth
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheNicolas Poggi
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)Nicolas Poggi
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014Eli Singer
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisMike Pittaro
 
Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis PyData
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerIke Ellis
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.MaharajothiP
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 

Similar to Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup (20)

The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Accelerating HBase with NVMe and Bucket Cache





Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup

  • 1. Using Big Data techniques with Open Street Map Stephen Knox Arup Partly based on research for an MSc in Geographical Information Systems and Science Kingston University 2015
  • 2. Disclaimer
      • I am in no way an expert on Hadoop!
      • I am a Geographic Information Systems specialist who can program (and is interested in big data)
      • Hopefully I can tell you something you didn’t know about OpenStreetMap and geographic big data processing
  • 3. Outline
      • Background to OpenStreetMap (OSM) and growth
      • Background to Geographic Big Data
      • Dissertation Research: Aims & Objectives, Methodology, Results, Conclusions
      • My general experiences of using Hadoop/SpatialHadoop and related tools
  • 11. OSM Size and Growth
      • Current Data – c. 0.5 – 1 TB
      • Current and Historical Data – 5.15TB
      • Growing at 1TB per annum
      [Chart: DB dump size (XML BZ2), 2006 to 2015, vertical axis 0 to 50, annotated with successive OSM DB server specifications: 2 processor cores, 8GB RAM, 6TB disk; 4 processor cores, 64GB RAM, 6TB disk, 64GB SSD; 8 processor cores, 256GB RAM, 24TB disk, 400GB SSD; then "?"]
      Source: Planet OSM http://planet.openstreetmap.org
      Source: OSM http://wiki.openstreetmap.org/wiki/Servers
      Source: OSM http://munin.openstreetmap.org/openstreetmap/katla.openstreetmap/postgres_size_openstreetmap_9_1_main.html
  • 12. OSM Potential Growth (1)
      [Charts: Population, Land Area and Data in OSM, each broken down by Africa, Antarctica, Asia, Australia, Central America, Europe, N. America, S. America. Growth labels shown: +38% +29% +22% +27% +16% +10% +21%]
      Source: Geofabrik individual region download pages for OSM size and growth (http://download.geofabrik.de/index.html), ArcGIS Continents (http://www.arcgis.com/home/item.html?id=3c4741e22e2e4af2bd4050511b9fc6ad) and UN Department of Economic & Social Affairs Total Population – Both Sexes (http://esa.un.org/unpd/wpp/Excel-Data/EXCEL_FILES/1_Population/WPP2012_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.XLS)
  • 13. Scaling systems
      • Three approaches: Scale-up, Scale-out (parallel), Scale-out (NoSQL)
      • Scale-up: more memory, more cores, more SSD, more hard disk
      • Diagram label: Controlling Server
      • Cost ratings compared across the three approaches for Hardware costs, Software acquisition & development costs, Maintenance costs and Training costs (ratings in transcript order: $$$ $$$ $$ $ $ $$ $$ $$$ $$ $$$ $$$ $)
      Sources: Scale-up vs Scale-out for Hadoop: Time to rethink? http://www.msr-waypoint.com/pubs/204499/a20-appuswamy.pdf
      Scaling Up vs. scaling Out: Hidden Costs: http://blog.codinghorror.com/scaling-up-vs-scaling-out-hidden-costs/
  • 14. It’s getting complicated …. ! Source: The 451 Group https://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/
  • 15. What is the right tool for the job?
      [Diagram: tools and applications / data placed along a data-size scale of 1MB, 1GB, 1TB, 1PB, 1EB; labels include "Transaction Logs" and "?"]
  • 16. NoSQL Spatial
      • Key research topic is indexing across multiple nodes
      • Implementations that add spatial capabilities to NoSQL databases
          • SpatialHadoop, Hadoop GIS, ESRI tools for Hadoop
          • SpatialSpark, GeoTrellis
          • Geomesa, Geowave
          • MongoDB (extension)
          • Geocouch
      Source: Geowave Docs http://ngageoint.github.io/geowave/documentation.html#theory
  • 17. Dissertation - Aims
      • Investigate whether a parallel non-relational solution could be used to:
          • Analyse data from OSM (read-only)?
          • Become the main storage platform (reads & writes)?
        In terms of performance, and practicality (whole life cost)
      • Does the size and growth rate of OSM make it likely that a non-relational parallel storage solution will become technically or economically desirable in the future?
  • 18. Dissertation - Methodology
      • Compare common current OSM tasks to an equivalent task using Big Data tools
      • Chose technologies in the Hadoop ecosystem rather than parallel databases. Used SpatialHadoop and HBase as principal platforms
      • Started using a test Hadoop cluster @ work, but ran into issues, so used cloud platforms
      • Keep processing power and cost constant, so performance could be directly compared
      [Diagram: 1 x 16-core server (64GB RAM) compared with 8 x 2-core servers (8GB RAM each) plus a master node. Broadly equivalent in cost and equivalent in nominal performance]
  • 19. SpatialHadoop
      • University of Minnesota Open Source project
      • Uses Pig as an execution engine
      • Creates spatial indexes and operators for big geographic datasets
  • 20. Methodology (continued)
      • 3 stages:
      • Data loading & preparation for data analysis
          • Test whether a data reader to read the OSM binary format was quicker than using the XML format
      • Data querying (read / analyse data)
          • Spatial – give me the total features in this area [using spatial index]
          • Non-spatial (e.g. count the total number of shops in the OSM database)
      • Simulation of master database (reads and writes)
          • downloading existing data to work on (by bounding box)
          • uploading new data changes
  • 21. UK OSM data: Uncompressed XML 17GB; Compressed XML 1.2GB; PBF 765MB
  • 22. Results – Loading Data
      Hadoop (File & size; Cluster; Time):
      • UK PBF* (765MB); 4 high memory nodes; 37m
      • UK XML (17GB); 4 high memory nodes; 75.5m
      • UK XML BZ2+ (1.2GB); 4 high memory nodes; 66m
      • Europe PBF (15.7GB); 8 high memory nodes; 246m
      • Europe XML (345GB); Not undertaken – too big to process
      • Europe XML BZ2 (24GB); 8 high memory nodes; Did not complete
      • Europe PBF (15.7GB); 16 high memory nodes; 143m
      • Europe XML; Not undertaken – too big to process
      • Europe XML BZ2 (24GB); 16 high memory nodes; Did not complete
      * Protocol Buffer Format – binary format
      + without taking into account decompression time – c. 7 minutes
      Overpass (File & size; Cluster; Time):
      • UK XML BZ2 (1.2GB); 1 x 8 core machine (52GB RAM); 17m
      • Europe XML BZ2 (24GB); 1 x 16 core machine (104GB RAM); 578m
  • 23. Results – Querying Data
      Index type; Time Taken:
      • Grid; 75m
      • R-tree; 81m
      • Quad-tree; 56m
      Cluster config: 8 x 2-core high memory nodes (13GB RAM). Standalone config (where a standalone time is given): 1 x 16 core machine (104GB RAM).
      Operation; Cluster Time; Standalone time:
      • Europe data small bounding box; Grid: 50s, R-tree: 25s, Q-tree: 6s; <1s
      • Europe data medium bounding box; Grid: 85s, R-tree: 141s, Q-tree: 12s; 4s
      • Europe data large bounding box (1°²); Grid: 91m, R-tree: 83s, Q-tree: 56s; 39s
      • Europe data huge bounding box (3°²); Only attempted with Q-tree: 88s; Out of memory
      • Shops query; 729s; 349s (but also got out of mem errors)
      • Shops query after indexing; 40s BUT… indexing took 714 seconds!
  • 24. Results – Reading & Writing Data
      • Used HBase and Jython, but did not have time to implement spatial indexes
      Cluster configuration: 8 x 2-core high memory nodes (13GB RAM each). Standalone configuration: 1 x 16 core machine (104GB RAM).
      Operation; Cluster Time; Standalone time (only one time is given for each retrieval row):
      • Data loading England PBF (610MB); 30m; 527m
      • Data retrieval (small town); 3s
      • Data retrieval (large town); 113s
      • Data retrieval (city); Did not complete (> 300s and 50,000 nodes)
  • 25. Conclusions
      • It’s possible to replicate much of what OSM requires in Hadoop
      • OpenStreetMap is growing quickly, but it is a long way from requiring horizontal sharding of databases
      • In general, it is not quicker to run geographic queries in a cluster at the TB order of magnitude (at least with current OSM tools)
      • Indexes do significantly speed up geographic queries (Quad-tree seems to be the best)
      • There is a high barrier to entry (technical & cost) for Hadoop and its ecosystem that will make it difficult for OSM to adopt the technology
      • OSM should also consider parallel databases if they do have a requirement to scale out, as there is less mismatch with their current system
      • Spatial extensions to big data platforms are relatively immature, but there is huge potential there to do data analytics on massive datasets and gain new insights
      • I’ve learnt a lot personally!
  • 26. Experiences with Azure
      + Easy to use – click to deploy
      + Good free trial program
      + Good integration with storage
      - Less customisable
      - It was impossible to deploy >= 8 node clusters (rate limits?) so I gave up
      - Technical support was responsive but not especially helpful
  • 27. Experiences with Google Cloud
      + Already had Hortonworks Hadoop distribution automated setup
      + Easy to customise – everything on GitHub
      + Uses a standard setup (Ambari)
      - Not always reliable
      - Free trial was quite limited
      - More difficult to connect with Google Storage buckets
      - Bit more work to deploy solution as code-based and have to download 3rd party tool (gcloud)
  • 28. General Hadoop experiences
      • Choosing the correct tool can be a significant part of the problem
      • Setting up Hadoop clusters is hard!
      • Spatial Big Data is still a little niche (although I did get lots of help)
      • Running Hadoop jobs (even with Pig) is hard!
      • Trial and error to experiment with memory requirements
      • Size of files is a real barrier (especially when you are paying!)
      • Often jobs failed half way through
      • Debugging is not easy
      • Have to recompile Java whenever there is a change (and sometimes deploy to nodes)
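
The two query types named on slide 20 (a spatial bounding-box count and a non-spatial "count the shops" query) can be illustrated on a single machine. What follows is only a minimal sketch using the Python standard library; it is not the SpatialHadoop/Pig jobs used in the research, and it does no spatial indexing. The sample data, the bounding box values and the names `in_bbox` and `scan_nodes` are assumptions made up for illustration. It relies only on the OSM XML convention that a `node` element carries `lat`/`lon` attributes and child `tag` elements with `k`/`v` attributes.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written OSM XML sample (hypothetical data, for illustration only).
SAMPLE_OSM = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6">
  <node id="1" lat="51.5400" lon="-0.1400"><tag k="shop" v="bakery"/></node>
  <node id="2" lat="51.5450" lon="-0.1000"><tag k="amenity" v="pub"/></node>
  <node id="3" lat="53.4800" lon="-2.2400"><tag k="shop" v="books"/></node>
  <node id="4" lat="51.5300" lon="-0.1200"/>
</osm>
"""


def in_bbox(lat, lon, bbox):
    """Return True if (lat, lon) lies inside bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def scan_nodes(source, bbox):
    """Stream through OSM XML once and return (shop_nodes, nodes_in_bbox).

    This is a full scan: every node is examined, which is exactly the cost
    that a spatial index is meant to avoid for the bounding-box query.
    """
    shops = 0
    inside = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "node":
            continue  # child <tag> elements stay attached until their node ends
        if any(t.get("k") == "shop" for t in elem.findall("tag")):
            shops += 1
        if in_bbox(float(elem.get("lat")), float(elem.get("lon")), bbox):
            inside += 1
        elem.clear()  # release the node's children to keep memory use low
    return shops, inside


if __name__ == "__main__":
    london_box = (-0.2, 51.5, -0.05, 51.6)  # assumed example box
    print(scan_nodes(io.BytesIO(SAMPLE_OSM), london_box))
```

A rectangle test needs only four comparisons per point, which is why bounding-box retrieval is the basic spatial operation benchmarked in the results, rather than containment in an irregular polygon.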

Editor's Notes

  1. Work @ Arup
  2. What is OpenStreetMap? The Wikipedia of maps. Started by a Brit, Steve Coast, who was frustrated with the fact that most data held by government is not open; Ordnance Survey is notoriously bad at being open with their data. So he started a project to get GPS data and trace over that. Now Bing donates their imagery for use in the project and there is a huge community. 2006 – area around university, Oxford Road, city centre, motorways. 2016 – all streets, railways, parks, many buildings, points of interest. The cartography of the map and the data have massively improved.
  3. Kibera, a slum in Nairobi. Shout out for Missing Maps – they have a monthly session in London where you can use your time to map genuinely unmapped areas of the world and help first responders and development organisations reach them more effectively.
  4. Entire stack is open source. Input. Storage – critical bit of infrastructure, PostgreSQL 9.1. Everything in or out.
  5. OSM Historical – every version ever of everything in the database, including now deleted items. Hardware growth.
  6. Also things like 3d data
  7. What happens when my system needs more power? 3 ways: scale up; scale out (parallel) – traditional relational database systems; scale out (NoSQL) – Hadoop. Total Cost of Ownership approach.
  8. Backlash against NoSQL. NewSQL moment – in memory databases. Fair to say that scale out in parallel is a neglected area. Not generally open source but highly capable, although Greenplum (parallel Postgres) has just been open sourced by Pivotal.
  9. OSM is an interesting case. Not a clear cut question. Relational DBs taking on more NoSQL characteristics and vice-versa.
  10. Geographic data is problematic for databases – it is minimum 2 dimensional – X and Y. Imagine a list of names or numbers – it is easy to order them because the key is one-dimensional. So if I have a list between 1 and 1 billion, I know that 1-10 million are on computer A, 10-20 million on B etc., and I can get 1-500,000 easily as they are on the same computer. But I can’t do that for 2-dimensional space, so I need to convert it into 1d space for efficiency. Space-filling curves try to map 2d onto 1d so that we don’t need to query every node in the cluster if we want to query a geographic area. Z-order curve on left, Hilbert curve on right. Geohashing is a form of Z-order curve.
  11. Mainly used Azure / Google Cloud because of the free tier! But AWS is probably cheaper! SpatialHadoop uses Pig as an execution engine.
  12. Simple things we do in GIS: tell me what is within a square box (bounding box). Much easier to calculate than "tell me if this tree is within Camden or Islington", as they are irregular shapes.
  13. XML is verbose – takes up a lot of space. There is an alternative format called PBF which stores data in binary and makes optimisations to keep file size down. There was a reader into Hadoop for OSM XML, but not for the binary format. I created a reader to read in OSM binary data to Hadoop.
  14. Reading OSM binary format is definitely more efficient and quicker than reading XML. XML is a very cumbersome format – OSM Europe is 345GB of XML, which is extremely difficult and time consuming to process – difficult and expensive to get disks this big to process this then read it into Hadoop (and to store in Hadoop). Overpass is a standalone solution for reading OSM data. It was quicker to load for a small area (UK) but slower to load than Hadoop for a larger area (Europe). Loading data in is one thing, but what about maintaining data? With Hadoop queries, currently have to r
  15. Cluster times cannot beat standalone times, even with indexing. However, memory was becoming a serious issue when querying large bounding boxes.
  16. Even for reading from the main OSM database, indexes are crucial – a key operation is to get all the data for a bounding box area, which requires a spatial index to operate efficiently. Should have used either GeoWave or GeoMesa as these would have produced spatial indexes for Accumulo (and coming soon, HBase).
  17. Could write my own scripts to install certain files on startup
  18. Still an emerging area. In GIS we take for granted a set of standards in terms of file formats, format conversions, common spatial operations, indexes, storage backends. None of these have really been defined for spatial big data. Also the tools are not yet fully optimised and could be improved by further development. Interesting to see how the big players such as ESRI handle this and how they cope with building on a totally non-proprietary technology. Massive scope not just in GIS, but in CAD, BIM, IoT etc.
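
Note 10 describes mapping 2d coordinates onto a 1d key with a space-filling curve so that a range-partitioned cluster can keep nearby features together. A minimal sketch of the Z-order (Morton) variant is given below, assuming a fixed 16 bits per axis and a simple linear scaling of longitude and latitude onto an integer grid. The function names (`interleave_bits`, `z_order_key`, `partition_for`) and the partitioning scheme are hypothetical illustrations, not the indexing used by SpatialHadoop, GeoWave or any other tool named in the slides.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) code.

    Bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def z_order_key(lon, lat, bits=16):
    """Map a lon/lat pair to a single integer key on a 2**bits x 2**bits grid."""
    cells = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * cells)  # column index, 0..cells
    y = int((lat + 90.0) / 180.0 * cells)   # row index, 0..cells
    return interleave_bits(x, y, bits)


def partition_for(key, n_partitions, bits=16):
    """Assign a key to one of n_partitions equal key ranges (one per machine)."""
    return key * n_partitions // (1 << (2 * bits))


if __name__ == "__main__":
    # Assumed example coordinates, for illustration only.
    points = {
        "Camden": (-0.14, 51.54),
        "Islington": (-0.10, 51.54),
        "Nairobi": (36.82, -1.29),
    }
    for name, (lon, lat) in points.items():
        key = z_order_key(lon, lat)
        print(name, key, partition_for(key, n_partitions=8))
```

Because the key is one-dimensional, it can be sorted and split into ranges in the same way as the "1-10 million on computer A" example in note 10. Points that are close on the ground usually receive nearby keys, although a Z-order curve has jumps at cell boundaries, so the locality is approximate rather than guaranteed.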