MIT lecture - Socrata Open Data Architecture

Socrata and Open Data
Architecture and Technology
Evan Chan
Principal Engineer

Agenda
• Who is Socrata?
• What’s Open Data?
• The state of government IT
• How Socrata Enables Open Data
• The Socrata Architecture
• Scaling our Architecture

Who is Socrata?
We are a Seattle-based software startup.
We make data useful to everyone.
Open, Public Data
Consumers
Apps

Socrata is…
The most widely adopted Open Data platform

21st Century Government
Improved use of government data can:
• Lower the cost of healthcare
• Improve education systems
• Fight climate change
• Improve city safety
• Reduce the occurrences of crime
• Reduce bureaucratic inefficiencies
• Spur local innovation

Governments Want Their Data to be Open
• Fundamental belief that transparent government is
better
• Push to modernize government through APIs
• Belief that government data can be useful (think
health inspection data and Yelp, or 911 data and
Zillow)

Most Compelling Datasets
1. Geospatial Data
2. Public Safety Data (Traffic, Crime, Environmental, Complaints)
3. Salary Data
4. Health Data
5. Expenditure Data
6. Education Data
7. Census Data
8. Parcel Property Data
9. Business Data
10. Locations of Government Services

What is Socrata?
• Catalog to find datasets
• Tools for easily importing and updating datasets
• Simple data visualizations for exploring and showing
data
• Reporting and application building environment

Who uses Socrata?
External Data Consumers
Laura – Local Resident
“How safe is my neighborhood?”
Aaron – Community Advocate
“I want to see trends in social housing.”
Dave – App Developer
“I need real-time API access to 911 data.”
Government Data Publishers
Dora – The Chief Data Officer
“How do we connect our data to the web?”
Pam – Mayor’s Office
“How do we share data to make better decisions?”
Sammy – Department Head
“I need to shift to self-service digital channels.”

Visualizations
Analysis
Discovery/SEO
Dashboards
Government
Multilateral/NGO
Data Benchmarking / Prediction
Syndication to
Consumer Web
Apps

Information important to you
• Run our own datacenters (SEA/ORD)
• JavaScript/Ruby on the frontend
• Java/Scala on the backend
• Postgres, Cassandra, Kafka, Chef
• Hard and novel problems to solve and new backends to explore...

Increase the flow of data
Drive mass consumption

ingress
[Diagram labels: browser; datasync client/API; customer data management system; file (CSV, .xls); API; dataset additions/updates; socrata load balancer]
1. Data is brought into the system, either by direct file upload or through the DataSync client or API, which provides an efficient and robust transfer of the data.
2. Dataset additions, updates, or deletes are communicated to the Socrata back-end system across the Internet.
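The slide does not say how the datasync client keeps transfers efficient. One plausible approach, shown here purely as an assumption, is to send only the differences between the previous and current snapshot of a file. The function name `compute_changes`, the `id` key column, and the sample rows are all hypothetical.

```python
import csv
import io


def compute_changes(old_csv: str, new_csv: str, key: str):
    """Return (additions, updates, deletes) between two CSV snapshots.

    Assumption: every row carries a unique identifier in column `key`.
    """
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_csv))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_csv))}
    # Rows whose key is new to the dataset.
    additions = [new[k] for k in new if k not in old]
    # Rows whose key exists in both snapshots but whose values changed.
    updates = [new[k] for k in new if k in old and new[k] != old[k]]
    # Keys that disappeared from the new snapshot.
    deletes = [k for k in old if k not in new]
    return additions, updates, deletes


# Hypothetical restaurant-inspection snapshots.
old_snapshot = "id,name,grade\n1,Cafe A,A\n2,Diner B,B\n3,Grill C,A\n"
new_snapshot = "id,name,grade\n1,Cafe A,A\n2,Diner B,A\n4,Bistro D,B\n"

additions, updates, deletes = compute_changes(old_snapshot, new_snapshot, key="id")
```

Only the three changed rows would need to cross the Internet in this sketch; the unchanged row for `Cafe A` is never resent.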

ingress
1. The dataset update is routed to a request dispatcher.
2. The request dispatcher forwards the request to the data coordinator.
3. The data coordinator performs the dataset addition or update.
4. The truth service adds or alters the dataset. All primitive data types are addressed. The truth system gets annotations to the dataset from the annotation service and applies them as appropriate.
5. The data coordinator informs all appropriate query services, including information on the specific data needed.
6. Impacted query services retrieve the data.
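The six steps above can be sketched as a small in-memory pipeline. This is only an illustration of the message flow under simplifying assumptions; the class names, method names, and request shape are hypothetical and do not reflect Socrata's actual services.

```python
class AnnotationService:
    """Holds annotations per dataset (hypothetical in-memory stand-in)."""

    def __init__(self):
        self.annotations = {}

    def annotations_for(self, dataset_id):
        return self.annotations.get(dataset_id, {})


class TruthService:
    """Step 4: stores the authoritative copy and applies annotations."""

    def __init__(self, annotation_service):
        self.annotation_service = annotation_service
        self.datasets = {}

    def apply(self, dataset_id, rows):
        stored = self.datasets.setdefault(dataset_id, {"rows": [], "annotations": {}})
        stored["rows"].extend(rows)
        stored["annotations"] = dict(self.annotation_service.annotations_for(dataset_id))


class QueryService:
    """Step 6: a secondary store that pulls data once it is notified."""

    def __init__(self, name, truth):
        self.name = name
        self.truth = truth
        self.copies = {}

    def notify(self, dataset_id):
        self.copies[dataset_id] = list(self.truth.datasets[dataset_id]["rows"])


class DataCoordinator:
    """Steps 3 and 5: performs the update, then informs query services."""

    def __init__(self, truth, query_services):
        self.truth = truth
        self.query_services = query_services

    def update(self, dataset_id, rows):
        self.truth.apply(dataset_id, rows)
        for service in self.query_services:
            service.notify(dataset_id)


class RequestDispatcher:
    """Steps 1 and 2: receives the request and forwards it."""

    def __init__(self, coordinator):
        self.coordinator = coordinator

    def handle(self, request):
        self.coordinator.update(request["dataset_id"], request["rows"])


annotation_service = AnnotationService()
annotation_service.annotations["inspections"] = {"category": "health/restaurant/inspection"}
truth = TruthService(annotation_service)
search = QueryService("search", truth)
dispatcher = RequestDispatcher(DataCoordinator(truth, [search]))

dispatcher.handle({"dataset_id": "inspections", "rows": [{"name": "Cafe A", "grade": "A"}]})
```

In this sketch the query service pulls its copy from the truth store after being notified, which mirrors the split between step 5 (inform) and step 6 (retrieve).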

understanding
• data set level
– high level: health
– more detailed:
• health/restaurant/inspection
• health/disease/infectious 
• columnar data types
– e.g. location/name/city,
demographic/gender 
• columnar schematic categories
– e.g. crime_type, crimes from
Boston and Chicago datasets
• columnar schematic category
classifications, e.g. (assault, assault
and battery, violent assault) >
assault
• pivot points
– e.g. neighborhood, city, business
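The category classification idea above, where several raw values collapse into one canonical value, can be illustrated with a simple lookup. The mapping uses the three assault variants from the slide; the sample city values, the `classify` helper, and the `unclassified` fallback are hypothetical.

```python
from collections import Counter

# Raw crime_type values mapped to a canonical category (from the slide's example).
CANONICAL = {
    "assault": "assault",
    "assault and battery": "assault",
    "violent assault": "assault",
}


def classify(value: str) -> str:
    """Normalize a raw value and look up its canonical category."""
    return CANONICAL.get(value.strip().lower(), "unclassified")


# Hypothetical crime_type values as two cities might publish them.
boston = ["Assault and Battery", "Violent Assault", "Larceny"]
chicago = ["ASSAULT", "assault and battery "]

counts = Counter(classify(v) for v in boston + chicago)
```

Once both cities' values share one vocabulary, their datasets can be compared along the same `crime_type` column.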

[Diagram labels: gold GT; crowdsourcer; GT; models; annotations; truth]
1. A trusted curator prepares a gold ground truth (GT) by manually labeling datasets.
2. The Gold GT is used by a CrowdSourcer system, which coordinates jobs across untrusted, distributed Mechanical Turk workers to annotate the Gold GT set. Annotation quality is assessed via the Gold GT.
3. The CrowdSourcer leverages the distributed humans to annotate much larger sets of datasets.
4. Machine learning models are trained against the GT and applied against a larger set of datasets. In addition, trained models are applied in the synchronous workflow described above.
5. Model-based and crowdsourced annotations are stored in an annotation service.
6. The Truth system periodically queries the Knowledge system for the latest annotation mappings and applies them to its datasets.
7. Secondary services like Search are notified of changes. They pick up the changes, which are then available for query.
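Step 2 says annotation quality is assessed via the Gold GT but does not say how. A minimal sketch, assuming quality is each worker's accuracy on gold items and that labels for new items are chosen by accuracy-weighted voting, might look like this. Both the scoring rule and all names and sample labels are assumptions.

```python
from collections import defaultdict


def worker_accuracy(gold, answers):
    """Score each worker by the fraction of gold items they labeled correctly.

    gold:    {item: correct_label}
    answers: {worker: {item: label}}
    """
    scores = {}
    for worker, labels in answers.items():
        graded = [item for item in labels if item in gold]
        correct = sum(1 for item in graded if labels[item] == gold[item])
        scores[worker] = correct / len(graded) if graded else 0.0
    return scores


def weighted_vote(item, answers, accuracy):
    """Pick the label with the highest total accuracy-weighted support."""
    totals = defaultdict(float)
    for worker, labels in answers.items():
        if item in labels:
            totals[labels[item]] += accuracy[worker]
    return max(totals, key=totals.get) if totals else None


# Hypothetical column annotations: col_a and col_b are in the gold set, col_c is not.
gold = {"col_a": "location", "col_b": "gender"}
answers = {
    "w1": {"col_a": "location", "col_b": "gender", "col_c": "crime_type"},
    "w2": {"col_a": "location", "col_b": "location", "col_c": "crime_type"},
    "w3": {"col_a": "gender", "col_b": "location", "col_c": "location"},
}

accuracy = worker_accuracy(gold, answers)
label_for_col_c = weighted_vote("col_c", answers, accuracy)
```

A worker who fails every gold item contributes no weight in this sketch, so unreliable annotations are discounted without any manual review.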

query
[Diagram labels: socrata load balancer; app service platform; govstat; budget; core ux; browser; queries; api app]
1. Citizens, reporters, and other users access our core UX or apps via a browser.
2. Apps run on our app service platform and generate queries to our back-end services as needed.
3. Third-party developers build apps using our API, which leverages back-end services.

1. The query is routed to a request dispatcher.
2. The request dispatcher’s query coordinator first checks whether the query is cached; if so, the cached copy is returned.
3. Otherwise, the request dispatcher routes the query to the appropriate specialty subsystem to perform the query.
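The cache-then-route behavior in the steps above can be sketched as follows. The `QueryCoordinator` class, the subsystem names, and the in-memory dictionary cache are hypothetical; the slides do not describe the real cache or routing rules.

```python
class QueryCoordinator:
    """Checks a cache first, then routes to a specialty subsystem on a miss."""

    def __init__(self, subsystems):
        self.subsystems = subsystems  # maps a query kind to a handler function
        self.cache = {}

    def run(self, kind, query):
        key = (kind, query)
        if key in self.cache:
            # Step 2: cached copy is returned without touching a subsystem.
            return self.cache[key], True
        # Step 3: route to the appropriate subsystem and remember the result.
        result = self.subsystems[kind](query)
        self.cache[key] = result
        return result, False


calls = []  # records which queries actually reached a subsystem


def search_subsystem(query):
    calls.append(("search", query))
    return [f"dataset matching {query!r}"]


def geo_subsystem(query):
    calls.append(("geo", query))
    return [f"features within {query!r}"]


coordinator = QueryCoordinator({"search": search_subsystem, "geo": geo_subsystem})

first, first_was_cached = coordinator.run("search", "restaurant inspections")
second, second_was_cached = coordinator.run("search", "restaurant inspections")
```

The second identical query is served from the cache, so the search subsystem is invoked only once.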

technologies
• db
– Postgres, including PostGIS
– Lucene/Elasticsearch
– Spark (SQL, Streaming)
– Cassandra (back end of govstat, dataset metrics)
• languages
– Scala, Ruby, JavaScript, Python
• platforms:
– logging: Sumo
– build: Jenkins
– test: Cucumber
– cloud: (AWS, Azure, own) -> AWS
– machine learning: sklearn/SciPy

scaling strategy
1. The broad SODA v.x API is made more efficient via techniques like rollup tables.
1. The big gulp API is serviced through specialized secondary services. The API is highly controlled; additions are made with a clear understanding of scale cost.
2. Big gulp API query breadth is expanded over time, as SODA data and user throughput sizes are increased.
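A rollup table, mentioned above as one efficiency technique, stores a precomputed aggregate so that common grouped queries avoid scanning the base rows. The sketch below uses an in-memory SQLite database only for illustration; the slides list Postgres as the actual store, and the table names, columns, and data here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crimes (city TEXT, crime_type TEXT)")
conn.executemany(
    "INSERT INTO crimes VALUES (?, ?)",
    [
        ("Boston", "assault"),
        ("Boston", "assault"),
        ("Boston", "theft"),
        ("Chicago", "assault"),
    ],
)

# Build the rollup once: one row per (city, crime_type) with a precomputed count.
conn.execute(
    "CREATE TABLE crimes_rollup AS "
    "SELECT city, crime_type, COUNT(*) AS n FROM crimes GROUP BY city, crime_type"
)


def count_by_type(connection, city):
    """Answer a grouped count from the rollup instead of the base table."""
    rows = connection.execute(
        "SELECT crime_type, n FROM crimes_rollup WHERE city = ? ORDER BY crime_type",
        (city,),
    )
    return dict(rows.fetchall())


boston_counts = count_by_type(conn, "Boston")

# The same answer computed the slow way, directly from the base rows.
base_counts = dict(
    conn.execute(
        "SELECT crime_type, COUNT(*) FROM crimes WHERE city = ? GROUP BY crime_type",
        ("Boston",),
    ).fetchall()
)
```

The trade-off is that the rollup must be refreshed when the base data changes, which this sketch does not attempt.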

The SODA API
• Recently introduced a new version of
our API
• Expressive, SQL-like language
• Provides the base for all other
functionality
• Provides the base for 3rd parties to
access data hosted by Socrata

SoQL
A REST-like SQL inspired API for accessing and querying
datasets
DEMO!
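As a rough illustration of what a SoQL request looks like over HTTP, the sketch below builds a query URL from `$select`, `$where`, `$group`, and `$limit` parameters against a dataset's `/resource/` endpoint. The domain `data.example.gov`, the dataset identifier `abcd-1234`, the column names, and the `soql_url` helper are hypothetical, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def soql_url(domain: str, dataset_id: str, **clauses) -> str:
    """Build a SODA resource URL; each keyword becomes a `$`-prefixed SoQL parameter."""
    params = {f"${name}": value for name, value in clauses.items()}
    # Keep `$` unescaped so the parameter names stay readable in the URL.
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


url = soql_url(
    "data.example.gov",
    "abcd-1234",
    select="crime_type, count(*)",
    where="year = 2014",
    group="crime_type",
    limit=10,
)

# Decode the query string again to show what the server would receive.
decoded = parse_qs(urlsplit(url).query)
```

Because the whole query lives in URL parameters, the same request can be pasted into a browser or issued from any HTTP client.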

How I feel using SoQL over HTTP

The Scala SODA Client
• Abstract away JSON parsing and types
• Scala-like query syntax
• Returns results as a Future
• Internally uses Iteratees to stream and parse results
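The pattern described above, a query that returns a Future whose result is parsed incrementally from a stream, can be approximated in Python. This is not the Scala client's API: `parse_stream`, `query_async`, and `fake_response` are hypothetical names, a generator stands in for Iteratees, and newline-delimited JSON is an assumed wire format chosen to keep the sketch short.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_stream(chunks):
    """Incrementally parse newline-delimited JSON from an iterable of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line; keep any partial record in the buffer.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)


def query_async(executor, fetch_chunks):
    """Run the fetch and parse on a worker thread and return a Future of rows."""
    return executor.submit(lambda: list(parse_stream(fetch_chunks())))


def fake_response():
    """Stand-in for an HTTP body arriving in arbitrary chunks."""
    yield '{"crime_type": "assault", "n'
    yield '": 2}\n{"crime_type": "th'
    yield 'eft", "n": 1}\n'


with ThreadPoolExecutor(max_workers=1) as executor:
    future = query_async(executor, fake_response)
    rows = future.result()
```

The caller receives the Future immediately and blocks only at `result()`, while records that span chunk boundaries are reassembled by the parser.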

WE ARE HIRING!
evan.chan@socrata.com
