Several criteria are used to evaluate the value of a dataset. This presentation illustrates how some data is easy to value, while other data is much more complex. For the latter (which might be a large part of open data), the value discovery process might be long but worthwhile.
2. Some context
• Canadian nonprofit that builds websites and tools to help
governments and citizens engage with each other
• Follows two main strategies:
– Improve access to government information via open data
– Make participation easy and meaningful
3. Ongoing projects
• Citizen Budget: an online consultative budget simulator for
municipalities and civil society organizations
• Represent: the largest database and open API of elected
Canadian officials with two Drupal modules for easy website
integration
• MaMairie/MyCityHall: an online portal for tracking and
interacting with your city hall
• Open511: an open data standard for traffic data and basic
related tools
4. Data = Natural Resources?
[Images: a diamond, labeled “Value!”, and a lump of ore, labeled “Meh?” (hint: this is bauxite). Sources: USGS; James St-Jones (cc-by)]
5. Value extraction
• Diamond: Extract ⇒ Cut ⇒ Tada!
• Aluminum: Discover it’s valuable ⇒ Elaborate process ⇒ Industrialize process ⇒ … ⇒ Cans, car parts, etc.
6. Traffic and Transit data
• Sort of case study
– Region of San Francisco: 2 leading organizations
• Bay Area Rapid Transit (BART): 80+ apps
• Metropolitan Transportation Commission (MTC): handful
of apps
– Same (full of geeks and startups) region
– Same “type” of data (transportation)
– Both organizations are innovative
Let’s look at “intrinsic” data value
7. 1. Standardization
• Transit data
– GTFS & SIRI: open data-oriented standards
– Used by 250 transit/transportation agencies
• Traffic
– Several standards (TMDD, TPEG, etc.), but difficult to use
in an open data context
⇒ Standard = low barrier to entry
⇒ Tools/apps built for these standards can reach lots of
customers
8. 2. Self-sufficient
• Transit data
– Data can be interpreted on its own. No need for external
data
• Traffic
– Several subsets of related data (accidents, construction,
road data, etc.)
– Data managed by several jurisdictions (local, regional,
provincial, federal)
⇒ Managing several sources and several datasets is always…
complex
9. 3. Complexity
• Transit
– (Quite) simple: some schedules, some fares, some spatial
data
• Traffic
– Complex: networks are wide, intertwined, with lots of rules,
lots of “free” actors
⇒ Modeling complex data is… complex and more prone to
discrepancies
10. 4. Reliability
• Transit
– Usually buses and trains follow their schedule
– Adding a GPS to every bus is simple and gives
almost 100% reliability of the data
• Traffic
– Impossible to monitor every single road segment
⇒ Lack of reliability has a strong, negative impact on data value
11. Techno-utopian dream
Your iPhone 8S
Dear smartphone,
I need to pick up
the kids at school
as fast as possible,
what’s the best
choice?
12. A wealth of data
• Road events
• Gas price
• Road data
• Parking data
• Crowdsourced data
• Realtime traffic sensors (gov)
• Planned trip
• Car efficiency
• Realtime traffic (business)
• Personal data: car, location, habits
13. Multiplicative effect
• “Diamond” data is self-sufficient: a strength for adoption
• For all data: real value is in cross-use with other datasets
• Some datasets will find their value because of the existence of
other datasets
• Adding new datasets has a multiplier effect on existing
related datasets
14. Not only gov data
• Usually open data = open government data
• But open data can be much more
[Venn diagram of three overlapping sets]
– Open gov data: road events, road data, traffic data, parking data
– Open personal data: car, transit pass, bike share; transportation habits; planned trip; crowdsourced data
– Open (?) data from companies: bike share, gas price, traffic data, parking data, vehicle efficiency
15. Some innovation theory
Gartner’s hype cycle of innovation (but it is not only about hype)
[Figure: hype-cycle curve. Phases: Innovation trigger ⇒ Peak of inflated expectations ⇒ Trough of disillusionment ⇒ Slope of enlightenment ⇒ Plateau of productivity. Annotations: “You might be here…”, “…or here”, “Stairway to heaven (internet-style)”, “Abyssal crash”]
16. Conclusion
• Assess your datasets: diamond vs bauxite analogy or any
other analysis framework
• Not all datasets are born equal, some might take more time to
show their value
• Help the discovery and value extraction process
• Follow “open” standards when they exist or participate in their
development
• Improve reliability of data where possible
• Be patient… but active!
Open North:
• Founded in 2011
• Based in Montréal but with a Canadian scope
• More in execution than in advocacy
----
• Crunch open data to make it compelling for citizens
• Link open data with open gov, provide a feedback loop
• Fill the gap between governments and citizens
Presentation of the framework: no rocket science, just some (irrelevant) analogies. Many are saying data is a natural resource. But it’s not natural, nor really a resource… Anyway, let’s take this analogy for a moment: what kinds of resources do we have on earth? Some look obviously interesting: diamond, beautiful and incredibly hard. Some look… less interesting, like bauxite ore… which gives us aluminum.
How do we get value from these two different resources?
Diamond: quite easy (in theory). Find it, cut it and you have something extremely valuable. Humans assigned high value to diamond 4000 years ago and it’s still valuable.
Bauxite/aluminum: at first, you have to discover there is something interesting there. That only happened in the 1800s for bauxite. Then you need a hell of a lot of steps to extract the valuable part of the ore: you process large quantities of ore to get a decent amount of aluminum. And when you have pure aluminum, you are not done: either you produce 100% aluminum products (and you need to find processes to produce them) or, most of the time, aluminum only serves as a part of a larger product, like a car. In this case, aluminum has value only because of the existence of other products that need it. Even if we had found out how to extract and use aluminum 1000 years earlier, it would have largely remained unused.
Let’s come back to our topic: open data! More precisely, transportation data, and even more precisely, traffic and transit data. Recently, I’ve been working with some data from the SF Bay Area. The number of apps cannot be considered a definitive measure of the success and value of a dataset. But such a discrepancy between traffic and transit shows something. Let’s try to look at the value of the data by itself. Let’s forget about licenses, community management and outreach. They are important, but as the SF case illustrates, transit data is very successful while traffic data is not. Why is that? Is public transportation so much cooler than cars that it drags the whole market along? Or is it that traffic data seems less valuable? We can’t be sure, but let’s have a look at some criteria that differentiate the two types of data.
Open data-oriented standards = specs available to anybody, for free, with no problematic license and no barrier or mechanism that makes access difficult or expensive (e.g. access to a hub, need for an incredibly huge infrastructure or use of proprietary technologies). For traffic, TMDD and DATEX are probably the closest to open data standards (except that TMDD is not free), but they are too complex for open data, as they are mainly designed for center-to-center communication and not for travelers.
Why does standardization matter?
• Clear documentation to build tools
• When an app is based on a standard, each new place using this standard is a new market ⇒ more chance that a development investment will be amortized
• Building an application on a custom format, with little hope that other places will develop similar datasets, is not the best way to develop a product…
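The amortization argument can be made concrete with a bit of arithmetic. This is only a sketch: the development and adaptation costs are hypothetical figures chosen for illustration, and `cost_per_market` is a made-up helper name; only the figure of 250 agencies comes from the standardization slide.

```python
def cost_per_market(dev_cost, adaptation_cost, markets):
    """Average cost borne per city served: the one-time development cost
    spread over all markets, plus a per-city adaptation cost."""
    if markets < 1:
        raise ValueError("markets must be at least 1")
    return dev_cost / markets + adaptation_cost


# Hypothetical figures, for illustration only.
DEV_COST = 100_000    # one-time cost to build the app
ADAPTATION = 1_000    # cost to plug in one more city that uses the same standard

# App built on a standard used by 250 agencies (figure from the slide).
standard_based = cost_per_market(DEV_COST, ADAPTATION, markets=250)

# App built on a custom format that only one place publishes.
custom_format = cost_per_market(DEV_COST, 0, markets=1)

print(f"standard-based: {standard_based:.0f} per city")
print(f"custom format:  {custom_format:.0f} per city")
```

Under these assumed numbers the standard-based app costs 1,400 per city against 100,000 for the custom-format app; the exact values do not matter, only the way the fixed cost shrinks as the number of markets grows.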
But BART was able to get a large number of apps based on its real-time API, which is not standard. So why? Even if non-standard, the data is easy to use. More precisely, once you have grabbed the data, you can provide very interesting information without anything else. Transit data is by definition self-sufficient. Each transit network can be isolated easily, and each network is well known. Buses use roads, but you don’t really need road data. Schedules, fares, positions: you have more or less everything you need. Traffic data is not self-sufficient at all.
Complexity and self-sufficiency are close but different. Transit is self-sufficient and simple to model. GTFS is a handful of fixed tables and fields that give a good representation of real life. Modeling traffic data in general, or even some specific subsets like events, is more complex. In order to keep models simple, they have to deviate from reality more than transit models do. The question is: what is the right balance between simpler data and deviation from reality? In any case, more complex data is more difficult to integrate into tools.
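To illustrate how little is needed to get something useful out of transit data, here is a minimal sketch that lists departures at a stop using only three GTFS tables. The file and field names (`stops.txt`, `trips.txt`, `stop_times.txt` and their columns) are real GTFS names, but the rows are made up for the example, and `read_table` and `departures` are hypothetical helper names.

```python
import csv
import io

# Tiny made-up feed; a real one would be read from the files in a GTFS zip.
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Main St,37.7900,-122.4000
S2,Market St,37.7800,-122.4100
"""
TRIPS_TXT = """route_id,service_id,trip_id
R1,WEEKDAY,T1
R1,WEEKDAY,T2
"""
STOP_TIMES_TXT = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:00:00,S1,1
T1,08:10:00,08:10:00,S2,2
T2,08:30:00,08:30:00,S1,1
T2,08:40:00,08:40:00,S2,2
"""


def read_table(text):
    """Parse one GTFS table (CSV with a header row) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def departures(stop_name, stops, trips, stop_times):
    """Return sorted (departure_time, route_id) pairs for a named stop.

    Sorting the time strings works here because they are zero-padded
    HH:MM:SS; a real feed may need proper time parsing.
    """
    stop_ids = {s["stop_id"] for s in stops if s["stop_name"] == stop_name}
    route_of = {t["trip_id"]: t["route_id"] for t in trips}
    return sorted(
        (st["departure_time"], route_of[st["trip_id"]])
        for st in stop_times
        if st["stop_id"] in stop_ids
    )


stops = read_table(STOPS_TXT)
trips = read_table(TRIPS_TXT)
stop_times = read_table(STOP_TIMES_TXT)

for time, route in departures("Main St", stops, trips, stop_times):
    print(time, route)
```

No road network, no second jurisdiction and no external dataset is involved: two joins over fixed tables are enough to answer a question a rider actually asks.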
Reliability of the data is partly linked with complexity. When the data model is a little farther from reality, it becomes less reliable: it cannot take into account all the possible cases of real life. On top of that, the ability to monitor the subject is key. Real-life transit usually fits the data well. Yes, buses might be stuck in traffic, but at least they should pass where they are supposed to. And getting realtime data is simple (if not inexpensive): GPS. Monitoring traffic is much more difficult. You can’t put sensors everywhere, and maintaining a sensor network is expensive. You can’t put cameras everywhere. Overall, it is much more difficult to get reliable data on traffic.
Remember the diamond/aluminum analogy? Transit is the diamond of transportation data, and traffic data is its aluminum. So traffic data should be valuable? What does it look like? What is this value and how do we extract it? Let’s jump into a techno-utopian world, a few years from now. I ask my smartphone … It proposes the fastest way.
• Transit: we know it
• Bike share: data usually available
• Walk: need road data
By the way, this is not so futuristic. Startups and Google Now are more or less doing this. But large chunks of data are still missing, data that should be open.
In order to get the car option right, we need a lot of data
This is where the analogy becomes clear: the value of diamond data can be extracted easily, while aluminum data needs a large framework. The case for opening aluminum data might seem less obvious: it takes more knowledge and more infrastructure to use this data. The perennial geek who demonstrates the value of a dataset during a hackathon will have a harder time doing the job. Incumbent private companies seem to be able to buy or build the data they need. But in the end, this is the real place where open data is crucial: to develop new uses that are not obvious, to help serious newcomers overcome market barriers, and to analyze data to provide useful insights. Processing complex things is more and more “open”. Think of open source plans to build cars or buildings. They are not accessible to everybody, but this is a major change in business practice.
What would a presentation be without a Venn diagram? This one is more or less stolen from David Eaves. Aluminum data frequently needs other data, as in the car trip example. But as you can see, it cannot be restricted to gov data. There is a case for more than gov data. Through existing initiatives, there is currently a focus (mainly in the US) on opening personal data: Blue Button, Green Button. Give people back their data to be used and integrated; use data provided by people. There is also more and more of a case for open data from private companies. “Open” can be contentious here because there is not always a real open license on the data, but things will probably evolve. Why would private companies open their data? To let people know about their product (e.g. parking), to show market superiority (vehicle efficiency) or to create a platform.
Gartner’s hype cycle: we frequently have inflated expectations, which bring disillusionment. Not all govs follow the cycle at the same pace. Diamond data is the kind that tends to create inflated expectations. Everything looks simple… but it is not. Most of the value of open data lies in “aluminum” data, where value extraction takes longer ⇒ this is the steady growth. The Internet lived through the same cycle. At the end of the 90s, the Internet would change the world. But in 2000, the Internet sounded like massive vaporware. Now, in 2013, the Internet is bigger and more important than what most people were expecting at the peak of expectations in the 90s (e.g. the “stairway to heaven”). Open data could follow the same path as the Internet… if all actors continue to push the development of open data and build on “aluminum” data.