This talk describes Stephen Knox's research into using Hadoop to query and manage big geographic datasets, specifically OpenStreetMap (OSM). OSM is an “open-source” map of the world that is growing rapidly and currently holds around 5TB of data. The talk introduces OSM, details some aspects of the research, and discusses his experiences of using the SpatialHadoop stack on Azure and Google Cloud.
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup
1. Using Big Data techniques with OpenStreetMap
Stephen Knox
Arup
Partly based on research for an MSc in Geographical Information Systems and Science
Kingston University 2015
2. Disclaimer
• I am in no way an expert on Hadoop!
• I am a Geographic Information Systems specialist who can program
(and is interested in big data)
• Hopefully I can tell you something you didn’t know about
OpenStreetMap and geographic big data processing
3. Outline
• Background to OpenStreetMap (OSM) and growth
• Background to Geographic Big Data
• Dissertation Research
• Aims & Objectives
• Methodology
• Results
• Conclusions
• My general experiences of using Hadoop/SpatialHadoop and related
tools
11. OSM Size and Growth
• Current Data – c. 0.5 – 1 TB
• Current and Historical Data – 5.15TB
• Growing at 1TB per annum
Chart: DB dump size (XML BZ2) by year, 2006–2015 (vertical axis 0–50)
OSM DB server hardware over time:
• 2 processor cores, 8GB RAM, 6TB disk
• 4 processor cores, 64GB RAM, 6TB disk, 64GB SSD
• 8 processor cores, 256GB RAM, 24TB disk, 400GB SSD
• ?
Source: Planet OSM http://planet.openstreetmap.org
Source: OSM http://wiki.openstreetmap.org/wiki/Servers
Source: OSM http://munin.openstreetmap.org/openstreetmap/katla.openstreetmap/postgres_size_openstreetmap_9_1_main.html
12. OSM Potential Growth (1)
Three pie charts, each split by continent (Africa, Antarctica, Asia, Australia, Central America, Europe, N. America, S. America):
• Population
• Land Area
• Data in OSM
Growth labels shown on the charts: +38%, +29%, +22%, +27%, +16%, +10%, +21%
Source: Geofabrik individual region download pages for OSM size and growth (http://download.geofabrik.de/index.html), ArcGIS Continents (http://www.arcgis.com/home/item.html?id=3c4741e22e2e4af2bd4050511b9fc6ad) and UN Department of Economic & Social Affairs Total Population – Both Sexes (http://esa.un.org/unpd/wpp/Excel-Data/EXCEL_FILES/1_Population/WPP2012_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.XLS)
13. Scaling systems
Three approaches: Scale-up, Scale-out (parallel), Scale-out (NoSQL)
• More memory
• More cores
• More SSD
• More hard disk
Diagram label: Controlling Server
$$$
$$$
$$
$
$
$$
$$
$$$
Hardware costs
Software acquisition & development costs
Maintenance costs
Training costs
$$
$$$
$$$
$
Sources: Scale-up vs Scale-out for Hadoop: Time to rethink?
http://www.msr-waypoint.com/pubs/204499/a20-appuswamy.pdf
Scaling Up vs. scaling Out: Hidden Costs:
http://blog.codinghorror.com/scaling-up-vs-scaling-out-hidden-costs/
14. It’s getting complicated…!
Source: The 451 Group https://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/
15. What is the right tool for the job?
Figure labels: Tool; data size axis 1MB, 1GB, 1TB, 1PB, 1EB; Application / Data; Transaction Logs; ?
16. NoSQL Spatial
• Key research topic is indexing across multiple nodes
Source: GeoWave Docs http://ngageoint.github.io/geowave/documentation.html#theory
• Implementations that add spatial capabilities to NoSQL databases
• SpatialHadoop, Hadoop GIS, ESRI tools for Hadoop
• SpatialSpark, GeoTrellis
• GeoMesa, GeoWave
• MongoDB (extension)
• GeoCouch
17. Dissertation - Aims
• Investigate whether a parallel non-relational solution could be used
to:
• Analyse data from OSM (read-only)?
• Become the main storage platform (reads & writes)?
In terms of performance and practicality (whole-life cost)
• Does the size and growth rate of OSM make it likely that a non-relational parallel storage solution will become technically or economically desirable in the future?
18. Dissertation - Methodology
• Compare common current OSM tasks to an equivalent task using Big
Data tools
• Chose technologies in the Hadoop ecosystem rather than parallel databases. Used SpatialHadoop and HBase as principal platforms
• Started using a test Hadoop cluster at work, but ran into issues, so used cloud platforms
• Kept processing power and cost constant, so that performance could be directly compared
Standalone: 1 x 16-core server, 64GB RAM
Cluster: 8 x 2-core servers, 8GB RAM each, plus master node
Broadly equivalent in cost and equivalent in nominal performance
19. SpatialHadoop
• University of Minnesota Open Source project
• Uses Pig as an execution engine
• Creates spatial indexes and operators for big geographic datasets
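As a rough illustration of what a grid spatial index does, points can be bucketed by grid cell so that a bounding-box query only inspects the cells it overlaps. This is not SpatialHadoop's implementation: the class name `GridIndex`, the one-degree cell size and the sample points are all assumptions made for the example.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """Toy uniform-grid index over (lon, lat) points; illustrative only."""

    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size        # cell width/height in degrees (assumed)
        self.cells = defaultdict(list)    # (col, row) -> list of points

    def _cell(self, lon, lat):
        return (floor(lon / self.cell_size), floor(lat / self.cell_size))

    def insert(self, lon, lat):
        self.cells[self._cell(lon, lat)].append((lon, lat))

    def query(self, min_lon, min_lat, max_lon, max_lat):
        """Return points inside the box, scanning only overlapping cells."""
        c0, r0 = self._cell(min_lon, min_lat)
        c1, r1 = self._cell(max_lon, max_lat)
        hits = []
        for col in range(c0, c1 + 1):
            for row in range(r0, r1 + 1):
                for lon, lat in self.cells.get((col, row), []):
                    if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
                        hits.append((lon, lat))
        return hits


index = GridIndex(cell_size=1.0)
for point in [(-0.13, 51.51), (-2.24, 53.48), (2.35, 48.86)]:
    index.insert(*point)

# Only the two cells overlapping this box are scanned, not every point.
london_box = index.query(-0.5, 51.3, 0.3, 51.7)
```

In a cluster the same idea decides which partitions a query has to touch; R-trees and quad-trees refine it by adapting cell boundaries to where the data is dense.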
20. Methodology (continued)
• 3 stages:
• Data loading & preparation for data analysis
• Test whether a data reader to read the OSM binary format was quicker than using the
XML format
• Data querying (read / analyse data)
• Spatial – give me the total features in this area [using spatial index]
• Non-spatial (e.g. count the total number of shops in the OSM database)
• Simulation of master database (reads and writes)
• downloading existing data to work on (by bounding box)
• uploading new data changes
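The two query types in the data querying stage can be sketched on a tiny in-memory OSM XML extract. This uses Python's standard library rather than SpatialHadoop, it scans every node instead of using an index, and the sample data and function names are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

# Invented three-node extract in OSM XML layout (nodes with nested tag elements).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="51.51" lon="-0.13"><tag k="shop" v="bakery"/></node>
  <node id="2" lat="51.52" lon="-0.10"><tag k="amenity" v="pub"/></node>
  <node id="3" lat="53.48" lon="-2.24"><tag k="shop" v="books"/></node>
</osm>"""


def iter_nodes(xml_text):
    """Yield (lon, lat, tags) for each node, streaming through the XML."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            yield float(elem.get("lon")), float(elem.get("lat")), tags
            elem.clear()  # release the element once it has been consumed


def count_in_bbox(xml_text, min_lon, min_lat, max_lon, max_lat):
    """Spatial query: number of nodes inside a bounding box (full scan)."""
    return sum(
        1
        for lon, lat, _tags in iter_nodes(xml_text)
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
    )


def count_shops(xml_text):
    """Non-spatial query: number of nodes carrying a 'shop' tag."""
    return sum(1 for _lon, _lat, tags in iter_nodes(xml_text) if "shop" in tags)


nodes_in_london_box = count_in_bbox(SAMPLE_OSM, -0.5, 51.3, 0.3, 51.7)
shop_total = count_shops(SAMPLE_OSM)
```

Both functions read the whole extract, which is the cost a spatial index is meant to avoid for the bounding-box case.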
22. Results – Loading Data
Hadoop
File & size | Cluster | Time
UK PBF* (765MB) | 4 high memory nodes | 37m
UK XML (17GB) | 4 high memory nodes | 75.5m
UK XML BZ2+ (1.2GB) | 4 high memory nodes | 66m
Europe PBF (15.7GB) | 8 high memory nodes | 246m
Europe XML (345GB) | Not undertaken – too big to process |
Europe XML BZ2 (24GB) | 8 high memory nodes | Did not complete
Europe PBF (15.7GB) | 16 high memory nodes | 143m
Europe XML | Not undertaken – too big to process |
Europe XML BZ2 (24GB) | 16 high memory nodes | Did not complete
* Protocol Buffer Format – binary format
+ without taking into account decompression time – c. 7 minutes
Overpass
File & size | Cluster | Time
UK XML BZ2 (1.2GB) | 1 x 8 core machine (52GB RAM) | 17m
Europe XML BZ2 (24GB) | 1 x 16 core machine (104GB RAM) | 578m
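One way to read the two Europe PBF results is to compute the speedup gained by doubling the cluster. The timings come from the loading results above; the efficiency measure (speedup divided by the increase in node count) is a common convention and not something stated in the talk.

```python
# Europe PBF (15.7GB) load times from the results above, in minutes.
load_minutes = {8: 246, 16: 143}   # nodes -> minutes

speedup = load_minutes[8] / load_minutes[16]   # how much faster 16 nodes were
node_ratio = 16 / 8                            # how much more hardware was used
efficiency = speedup / node_ratio              # 1.0 would be perfect linear scaling

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
```

Doubling the nodes gave roughly a 1.7x speedup, so loading scaled reasonably well but not linearly.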
23. Results – Querying Data
Index type | Time Taken
Grid | 75m
R-tree | 81m
Quad-tree | 56m

Operation | Cluster config | Cluster time | Standalone config | Standalone time
Europe data small bounding box | 8 x 2-core high memory nodes (13GB RAM) | Grid: 50s; R-tree: 25s; Q-tree: 6s | 1 x 16 core machine (104GB RAM) | <1s
Europe data medium bounding box | 8 x 2-core high memory nodes (13GB RAM) | Grid: 85s; R-tree: 141s; Q-tree: 12s | 1 x 16 core machine (104GB RAM) | 4s
Europe data large bounding box (1°²) | 8 x 2 core high memory nodes (13GB RAM) | Grid: 91m; R-tree: 83s; Q-tree: 56s | 1 x 16 core machine (104GB RAM) | 39s
Europe data huge bounding box (3°²) | 8 x 2 core high memory nodes (13GB RAM) | Only attempted with Q-tree: 88s | 1 x 16 core machine (104GB RAM) | Out of memory
Shops query | 8 x 2 core high memory nodes (13GB RAM) | 729s | 1 x 16 core machine (104GB RAM) | 349s (but also got out of mem errors)
Shops query after indexing | 8 x 2 core high memory nodes (13GB RAM) | 40s BUT… indexing took 714 seconds! | |
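Whether indexing pays off for the shops query depends on how often the query is repeated. The back-of-envelope calculation below uses the cluster timings above and assumes every repeat of the query costs the same, which the talk does not state.

```python
import math

unindexed_s = 729      # shops query on the cluster without an index
indexed_s = 40         # same query after indexing
index_build_s = 714    # one-off cost of building the index

saving_per_query = unindexed_s - indexed_s   # seconds saved on each indexed run
break_even_queries = math.ceil(index_build_s / saving_per_query)


def total_time(queries, use_index):
    """Total seconds spent running the shops query a given number of times."""
    if use_index:
        return index_build_s + queries * indexed_s
    return queries * unindexed_s
```

Under that assumption a single query is faster without the index (729s against 754s), but from the second query onwards the indexed route is ahead.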
24. Results – Reading & Writing Data
• Used HBase and Jython, but did not have time to implement spatial indexes
Operation | Cluster configuration | Cluster time | Standalone configuration | Standalone time
Data loading England PBF (610MB) | 8 x 2-core high memory nodes (13GB RAM each) | 30m | 1 x 16 core machine (104GB RAM) | 527m
Data retrieval (small town) | 8 x 2-core high memory nodes (13GB RAM each) | | 1 x 16 core machine (104GB RAM) | 3s
Data retrieval (large town) | 8 x 2-core high memory nodes (13GB RAM each) | | 1 x 16 core machine (104GB RAM) | 113s
Data retrieval (city) | 8 x 2-core high memory nodes (13GB RAM each) | | 1 x 16 core machine (104GB RAM) | Did not complete (> 300s and 50,000 nodes)
25. Conclusions
• It’s possible to replicate much of what OSM requires in Hadoop
• OpenStreetMap is growing quickly, but it is a long way from requiring horizontal sharding of databases
• In general, it is not quicker to run geographic queries in a cluster at the TB order of magnitude (at least with current OSM tools)
• Indexes do significantly speed up geographic queries (Quad-tree seems to be the best)
• There is a high barrier to entry (technical & cost) for Hadoop and its ecosystem that will make it difficult for OSM to adopt the technology
• OSM should also consider parallel databases if they do have a requirement to scale out, as there is less mismatch with their current system
• Spatial extensions to big data platforms are relatively immature, but there is huge potential to do data analytics on massive datasets and gain new insights
• I’ve learnt a lot personally!
26. Experiences with Azure
+ Easy to use – click to deploy
+ Good free trial program
+ Good integration with storage
- Less customisable
- It was impossible to deploy >= 8 node clusters (rate limits?) so I gave up
- Technical support was responsive but not especially helpful
27. Experiences with Google Cloud
+ Already had automated setup for the Hortonworks Hadoop distribution
+ Easy to customise – everything on GitHub
+ Uses a standard setup (Ambari)
- Not always reliable
- Free trial was quite limited
- More difficult to connect with Google Storage buckets
- A bit more work to deploy the solution, as it is code-based and you have to download a 3rd party tool (gcloud)
28. General Hadoop experiences
• Choosing the correct tool can be a significant part of the problem
• Setting up Hadoop clusters is hard!
• Spatial Big Data is still a little niche (although I did get lots of help)
• Running Hadoop jobs (even with Pig) is hard!
• Trial and error to experiment with memory requirements
• Size of files is a real barrier (especially when you are paying!)
• Often jobs failed half way through
• Debugging is not easy
• Have to recompile Java whenever there is a change (and sometimes deploy to
nodes)
Editor's Notes
Work @ Arup
What is OpenStreetMap
Wikipedia of Maps
Started by a Brit, Steve Coast, who was frustrated with the fact that most data held by government is not open
Ordnance Survey is notoriously bad at being open with their data. So he started a project to get GPS data and trace over that. Now Bing donates their imagery for use in the project and there is a huge community.
2006 – area around university, Oxford Road, city Centre, motorways
2016 – all streets, railways, parks, many buildings, points of interest
The cartography of the map and the data have massively improved
Kibera, a slum in Nairobi
Shout out for Missing Maps – have a monthly session in London where you can use your time to map genuinely unmapped areas of the world and help first responders and development organisations reach them more effectively
Entire stack is open source
Input
Storage – critical bit of infrastructure
PostgreSQL 9.1
Everything in or out
OSM Historical – every version ever of everything in the database, including now deleted items
Hardware growth
Also things like 3d data
What happens when my system needs more power:
3 ways
Scale up
Scale out (parallel) – traditional relational database systems
Scale out (NoSql) – Hadoop
Total Cost of Ownership approach
Backlash against NoSQL
NewSQL moment – in memory databases
Fair to say that scale out in parallel is a neglected area
Not generally open source but highly capable. Although Greenplum (parallel Postgres) has just been open sourced by Pivotal
OSM is an interesting case
Not a clear cut question
Relational DBs taking on more NoSQL characteristics and vice-versa
Geographic data is problematic for databases – it has a minimum of 2 dimensions – X and Y
Imagine a list of names or numbers – it is easy to order them because the key is one-dimensional. So if I have a list from 1 to 1 billion, I know that 1–10 million are on computer A, 10–20 million on B etc., and I can get 1–500,000 easily as they are on the same computer. But I can’t do that directly for 2-dimensional space, so I need to convert it into 1-dimensional space for efficiency
Space-filling curves try to map 2D onto 1D so that we don’t need to query every node in the cluster if we want to query a geographic area
Z-order curve on left
Hilbert Curve – on right
Geohashing is a form of Z-order curve
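A minimal sketch of the Z-order idea, assuming 16 bits per axis and a simple linear scaling of longitude and latitude onto an integer grid (both are choices made for the example, and the function names are hypothetical): the bits of the two coordinates are interleaved to give a single sortable key.

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) code: interleave the bits of x and y into one integer."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return code


def z_order_key(lon, lat, bits=16):
    """Scale lon/lat to integer grid cells, then interleave them."""
    cells = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * cells)
    y = int((lat + 90.0) / 180.0 * cells)
    return interleave_bits(x, y, bits)


# Walking the four cells of a 2x2 grid in key order traces the "Z" shape:
# (0, 0) -> (1, 0) -> (0, 1) -> (1, 1)
z_walk = sorted(((x, y) for x in range(2) for y in range(2)),
                key=lambda cell: interleave_bits(*cell, bits=1))
```

Because the key is one-dimensional, it can be range-partitioned across machines like the list of numbers in the earlier example. Geohash uses the same bit interleaving and then encodes the result as base-32 text.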
Mainly used Azure / Google Cloud because of the free tier!
But AWS is probably cheaper!
SpatialHadoop uses PIG! as an execution engine
Simple things we do in GIS
Tell me what is within a square box – (bounding box)
Much easier to calculate than tell me if this tree is within Camden or Islington as they are irregular shapes
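To show why the bounding-box test is so much cheaper, the sketch below compares it with a standard ray-casting point-in-polygon test. The L-shaped "borough" is a made-up shape, not real Camden or Islington geometry, and the function names are hypothetical.

```python
def in_bbox(lon, lat, min_lon, min_lat, max_lon, max_lat):
    """Bounding-box test: four comparisons, whatever the shape of the area."""
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def in_polygon(lon, lat, ring):
    """Ray casting: count the polygon edges crossed by a ray heading right."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > lat) != (y2 > lat):                 # edge spans the ray's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:                        # crossing lies to the right
                inside = not inside
    return inside


# Hypothetical L-shaped borough; its bounding box is (0, 0) to (4, 4).
borough = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
tree_a = (1, 3)   # inside the L
tree_b = (3, 3)   # inside the bounding box, but in the notch outside the L

a_in_box = in_bbox(*tree_a, 0, 0, 4, 4)
b_in_box = in_bbox(*tree_b, 0, 0, 4, 4)
a_in_borough = in_polygon(*tree_a, borough)
b_in_borough = in_polygon(*tree_b, borough)
```

The polygon test has to visit every edge of the boundary, so its cost grows with the detail of the shape, while the box test stays at four comparisons. A common pattern is to use the box as a cheap first filter and run the polygon test only on the survivors.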
XML is verbose – takes up a lot of space
Alternative format called PBF which stores data in binary and makes optimisations to keep file size down
There was a reader into Hadoop for OSM XML, but not for the binary format
I created a reader to read in OSM binary data to Hadoop
Reading OSM binary format is definitely more efficient and quicker than reading XML
XML is a very cumbersome format – OSM Europe is 345GB of XML, which is extremely difficult and time-consuming to process. It is difficult and expensive to get disks big enough to process this and then read it into Hadoop (and to store it in Hadoop)
Overpass is a standalone solution for reading OSM data. It was quicker to load for a small area (UK) but slower to load than Hadoop for a larger area (Europe)
Loading data in is one thing, but what about maintaining data? With Hadoop queries, currently have to r
Cluster times cannot beat standalone times, even with indexing. However memory was becoming a serious issue when querying large bounding boxes
Even for reading from the main osm database, indexes are crucial – a key operation is to get all the data for a bounding box area, which requires a spatial index to operate efficiently
Should have used either GeoWave or GeoMesa as these would have produced spatial indexes for Accumulo (and coming soon, HBase)
Could write my own scripts to install certain files on startup
Still an emerging area
In GIS we take for granted a set of standards in terms of file formats, format conversions, common spatial operations, indexes and storage backends. None of these have really been defined for spatial big data. Also, the tools are not yet fully optimised and could be improved by further development.
Interesting to see how the big players such as ESRI handle this and how they cope with building on a totally non-proprietary technology.
Massive scope not just in GIS, but in CAD, BIM, IoT etc.