SlideShare a Scribd company logo
1 of 31
Download to read offline
Socrata and Open Data
Architecture and
Technology
Evan Chan
Principal Engineer
• Who is Socrata?
• What’s Open Data?
• The state of government IT
• How Socrata Enables Open Data
• The Socrata Architecture
• Scaling our Architecture
Agenda
Who is Socrata?
!
We are a Seattle-based software startup. 
!
We make data useful to everyone.
Open, Public Data
Consumers
Apps
Socrata is…
The most widely adopted Open Data platform
What is Open Data?
21st Century Government
• Lower the cost of healthcare
• Improve education systems
• Fight climate change
• Improve city safety
• Reduce the occurrences of crime
• Reduce bureaucratic inefficiencies
• Spur local innovation
Improved use of government data can:
Governments Want Their Data to be Open
• Fundamental belief that transparent government is
better
• Push to modernize government through APIs
• Belief that government data can be useful (think
health inspection data and Yelp, or 911 data and
Zillow)
1. Geospatial Data
2. Public Safety Data
Traffic, Crime, Environmental, Complaints
3. Salary Data
4. Health Data
5. Expenditure Data
6. Education Data
7. Census Data
8. Parcel Property Data
9. Business Data
10.Locations of Government Services
Most Compelling Datasets
What is Socrata?
• Catalog to find datasets
• Tools for easily importing and updating datasets 
• Simple data visualizations for exploring and showing
data
• Reporting and application building environment
Who uses Socrata?
Laura – Local Resident
“How safe is my neighborhood?”
Aaron – Community Advocate

“I want to see trends in social
housing.”
Dave – App Developer 

“I need real-time API access to 911
data.”
Dora – The Chief Data Officer
”How do we connect our data to the web?”
Pam – Mayor’s Office

“How do we share data to make better
decisions?”
Sammy – Department Head 

“I need to shift to self-service digital
channels.”
External Data Consumers Government Data Publishers
Visualizations
Analysis
Discovery/SEO
Dashboards
Government
Multilateral/NGO
Data Benchmarking / Prediction
Syndication to
Consumer Web
Apps
Our Architecture
Information important to you
•Run our own datacenters (SEA/ORD)
•Javascript/Ruby on the frontend 
•Java/Scala on the backend
•Postgres, Cassandra, Kafka, Chef
•Hard and novel problems to solve
and new backends to explore...
Increase
the flow
of data
Drive mass
consumptio
n
browser datasync Client/API
customer data management system
file (CSV,.xls) API
dataset additions/updates
socrata load balancer
1
2
3
1. Data is brought into the system. Data

may be brought in via direct file upload

or by using the datasync client or api which 

maintains an efficient and robust transfer

of the data.

2. Dataset additions, updates or deletes are 

communicated to the Socrata back end 

system across the Internet.
ingress
1. Data set update is routed to a request
dispatcher.
!
2. The request dispatcher forwards the
request to the data coordinator.

3. Data coordinator performs the data set
addition or update.

4. The truth service adds or alters the data
set. All primitive data types are addressed.
The truth system gets annotations to the
data set from the annotation service and
applies them as appropriate.

5. Data coordinator informs all appropriate
query services including information on the
specific data needed.

6. Impacted query services retrieve the
data.
ingress
understanding
• data set level
– high level: health
– more detailed:
• health/restaurant/inspection
• health/disease/infectious

• columnar data types
– e.g. location/name/city,
demographic/gender

• columnar schematic categories
– e.g. crime_type, crimes from
Boston and Chicago datsets

• columnar schematic category
classifications, e.g. (assault, assault
and battery, violent assault) >
assault

• pivot points
– e.g. neighborhood, city, business
gold GT 1
crowdsourcer
GT
models
annotations truth
1. A trusted curator prepares a gold ground
truth (GT) by manually labeling datasets.

2. The Gold GT is used by a CrowdSourcer
system which coordinates jobs across untrusted
distributed mechanical turk workers to
annotate the Gold GT set. Annotation quality is
assessed via the Gold GT.

3. The CrowdSourcer leverages the distributed
humans to annotate much larger sets of
datasets.

4. Machine learning models are trained against
the GT and applied against a larger set of
datasets. In addition, trained models are
applied in the synchronous workflow described
above.

5. Model based and crowd sourced annotations
are stored in an annotation service.

6. The Truth system periodically queries the
Knowledge system for the latest annotation
mappings and applies them to its datasets.

7. Secondary services like Search are notified
of changes. They pick them up and are now
available for query.
3
4
2
5
search
6
7
annotations curator
socrata load balancer
app service platform
govstat budget core ux
browser
queries
api app
1. Citizens, reporters, and other users
access our core ux or apps via a
browser.
2. Apps run on our app service
platform and generate queries to
our back end services as needed.
3. 3rd party developers build apps
using our API which leverage back
end services.
1
2
3
query
1. query is routed to a request
dispatcher
!
2. the request dispatcher’s query
Coordinator first checks if the
query is cached; if so, the cached
copy is returned.
!
3. request dispatcher routes the
query to the appropriate
specialty subsytem to perform
the query.
technologies
• db
– postgres including postGIS
– lucene/elastic search
– spark:(sql, streaming)
– cassandra (back end of govstat, dataset metrics)
• languages
– scala, ruby, javascript, python
• platforms:
– logging: sumo
– build: jenkins
– test: cucumber
– cloud: (aws, azure, own) -> aws
– machine learning: sklearn/scipy
1. Broad soda v.x api is made
more efficient via
techniques like rollup tables.
!
1. Big gulp api is serviced
through specialized
secondary Services. API is
highly controlled; additions
are made w/ a clear
understanding of scale cost. 

2. Big gulp api query breadth is
expanded over time, as soda
data and user throughput
sizes are increased
!
!
scaling strategy
ingress tp
1e11~100G
The SODA API
• Recently introduced a new version of
our API
• Expressive, SQL-like language
• Provides the base for all other
functionality
• Provides the base for 3rd parties to
access data hosted by Socrata
SoQL
A REST-like SQL inspired API for accessing and querying
datasets
DEMO!
How I feel using SoQL over HTTP
The Scala SODA Client
•Abstract away JSON parsing and
types
•Scala-like query syntax
•Returns results as a Future
•Internally uses Iteratees to stream
and parse results
WE ARE HIRING!
evan.chan@socrata.com

More Related Content

What's hot

Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Rediscovering the Value of Apache Kafka® in Modern Data Architecture
Rediscovering the Value of Apache Kafka® in Modern Data ArchitectureRediscovering the Value of Apache Kafka® in Modern Data Architecture
Rediscovering the Value of Apache Kafka® in Modern Data Architectureconfluent
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Khai Tran
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIOJozo Kovac
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/HudiVinoth Chandar
 
Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015Stratio
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaSlim Baltagi
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Guido Schmutz
 
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Paul Brebner
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & KafkaSelf-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & KafkaGuido Schmutz
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientistsJenn Rawlins
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksData Con LA
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and SparkODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and SparkCarolyn Duby
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidOpen Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidDataWorks Summit
 

What's hot (20)

Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Rediscovering the Value of Apache Kafka® in Modern Data Architecture
Rediscovering the Value of Apache Kafka® in Modern Data ArchitectureRediscovering the Value of Apache Kafka® in Modern Data Architecture
Rediscovering the Value of Apache Kafka® in Modern Data Architecture
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIO
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
 
Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache Kafka
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
 
LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & KafkaSelf-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientists
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and SparkODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidOpen Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
 

Similar to MIT lecture - Socrata Open Data Architecture

The Rising Floor of Platform - MIT Platform Summit 2014
The Rising Floor of Platform - MIT Platform Summit 2014The Rising Floor of Platform - MIT Platform Summit 2014
The Rising Floor of Platform - MIT Platform Summit 2014Peter Coffee
 
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageGeospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageSteven Ramage
 
EVOLVING PATTERNS IN BIG DATA - NEIL AVERY
EVOLVING PATTERNS IN BIG DATA - NEIL AVERYEVOLVING PATTERNS IN BIG DATA - NEIL AVERY
EVOLVING PATTERNS IN BIG DATA - NEIL AVERYBig Data Week
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction SystemBigDataCloud
 
RajeshGautamResearchResume-Apr-2016
RajeshGautamResearchResume-Apr-2016RajeshGautamResearchResume-Apr-2016
RajeshGautamResearchResume-Apr-2016Rajesh Gautam
 
Harnessing Big Data_UCLA
Harnessing Big Data_UCLAHarnessing Big Data_UCLA
Harnessing Big Data_UCLAPaul Barsch
 
Analytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformAnalytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformVMware Tanzu
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
Systemof insight
Systemof insightSystemof insight
Systemof insightsuresh sood
 
Barga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 KeynoteBarga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 KeynoteRoger Barga
 
Wikibon #IoT #HyperConvergence Presentation via @theCUBE
Wikibon #IoT #HyperConvergence Presentation via @theCUBE Wikibon #IoT #HyperConvergence Presentation via @theCUBE
Wikibon #IoT #HyperConvergence Presentation via @theCUBE John Furrier
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningIIRindia
 
Smart city - Steven furst fis
Smart city - Steven furst fisSmart city - Steven furst fis
Smart city - Steven furst fisChuong Nguyen
 
Nyc web perf-final-july-23
Nyc web perf-final-july-23Nyc web perf-final-july-23
Nyc web perf-final-july-23Dan Boutin
 
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...Big Data Week
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkDatabricks
 

Similar to MIT lecture - Socrata Open Data Architecture (20)

The Rising Floor of Platform - MIT Platform Summit 2014
The Rising Floor of Platform - MIT Platform Summit 2014The Rising Floor of Platform - MIT Platform Summit 2014
The Rising Floor of Platform - MIT Platform Summit 2014
 
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageGeospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
 
EVOLVING PATTERNS IN BIG DATA - NEIL AVERY
EVOLVING PATTERNS IN BIG DATA - NEIL AVERYEVOLVING PATTERNS IN BIG DATA - NEIL AVERY
EVOLVING PATTERNS IN BIG DATA - NEIL AVERY
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction System
 
RajeshGautamResearchResume-Apr-2016
RajeshGautamResearchResume-Apr-2016RajeshGautamResearchResume-Apr-2016
RajeshGautamResearchResume-Apr-2016
 
Harnessing Big Data_UCLA
Harnessing Big Data_UCLAHarnessing Big Data_UCLA
Harnessing Big Data_UCLA
 
Mesoscon 2015
Mesoscon 2015Mesoscon 2015
Mesoscon 2015
 
Analytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformAnalytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data Platform
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
Systemof insight
Systemof insightSystemof insight
Systemof insight
 
Barga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 KeynoteBarga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 Keynote
 
Wikibon #IoT #HyperConvergence Presentation via @theCUBE
Wikibon #IoT #HyperConvergence Presentation via @theCUBE Wikibon #IoT #HyperConvergence Presentation via @theCUBE
Wikibon #IoT #HyperConvergence Presentation via @theCUBE
 
Hyper-Convergence CrowdChat
Hyper-Convergence CrowdChatHyper-Convergence CrowdChat
Hyper-Convergence CrowdChat
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data Mining
 
Ibm big data
Ibm big dataIbm big data
Ibm big data
 
IBM Big Data
IBM Big Data IBM Big Data
IBM Big Data
 
Smart city - Steven furst fis
Smart city - Steven furst fisSmart city - Steven furst fis
Smart city - Steven furst fis
 
Nyc web perf-final-july-23
Nyc web perf-final-july-23Nyc web perf-final-july-23
Nyc web perf-final-july-23
 
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
 

More from Evan Chan

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustEvan Chan
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesEvan Chan
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Evan Chan
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkEvan Chan
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and SparkEvan Chan
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server TalkEvan Chan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkEvan Chan
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 

More from Evan Chan (16)

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 

Recently uploaded

welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate productionChinnuNinan
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptNarmatha D
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptbibisarnayak0
 

Recently uploaded (20)

welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate production
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdf
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.ppt
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.ppt
 

MIT lecture - Socrata Open Data Architecture

  • 1. Socrata and Open Data Architecture and Technology Evan Chan Principal Engineer
  • 2. • Who is Socrata? • What’s Open Data? • The state of government IT • How Socrata Enables Open Data • The Socrata Architecture • Scaling our Architecture Agenda
  • 3. Who is Socrata? ! We are a Seattle-based software startup. ! We make data useful to everyone. Open, Public Data Consumers Apps
  • 4. Socrata is… The most widely adopted Open Data platform
  • 5. What is Open Data?
  • 6. 21st Century Government • Lower the cost of healthcare • Improve education systems • Fight climate change • Improve city safety • Reduce the occurrences of crime • Reduce bureaucratic inefficiencies • Spur local innovation Improved use of government data can:
  • 7. Governments Want Their Data to be Open • Fundamental belief that transparent government is better • Push to modernize government through APIs • Belief that government data can be useful (think health inspection data and Yelp, or 911 data and Zillow)
  • 8. 1. Geospatial Data 2. Public Safety Data Traffic, Crime, Environmental, Complaints 3. Salary Data 4. Health Data 5. Expenditure Data 6. Education Data 7. Census Data 8. Parcel Property Data 9. Business Data 10.Locations of Government Services Most Compelling Datasets
  • 9. What is Socrata? • Catalog to find datasets • Tools for easily importing and updating datasets • Simple data visualizations for exploring and showing data • Reporting and application building environment
  • 10. Who uses Socrata? Laura – Local Resident “How safe is my neighborhood?” Aaron – Community Advocate
 “I want to see trends in social housing.” Dave – App Developer 
 “I need real-time API access to 911 data.” Dora – The Chief Data Officer ”How do we connect our data to the web?” Pam – Mayor’s Office
 “How do we share data to make better decisions?” Sammy – Department Head 
 “I need to shift to self-service digital channels.” External Data Consumers Government Data Publishers
  • 13. Information important to you •Run our own datacenters (SEA/ORD) •Javascript/Ruby on the frontend •Java/Scala on the backend •Postgres, Cassandra, Kafka, Chef •Hard and novel problems to solve and new backends to explore...
  • 14. Increase the flow of data Drive mass consumptio n
  • 15. browser datasync Client/API customer data management system file (CSV,.xls) API dataset additions/updates socrata load balancer 1 2 3 1. Data is brought into the system. Data
 may be brought in via direct file upload
 or by using the datasync client or api which 
 maintains an efficient and robust transfer
 of the data.
 2. Dataset additions, updates or deletes are 
 communicated to the Socrata back end 
 system across the Internet. ingress
  • 16. 1. Data set update is routed to a request dispatcher. ! 2. The request dispatcher forwards the request to the data coordinator.
 3. Data coordinator performs the data set addition or update.
 4. The truth service adds or alters the data set. All primitive data types are addressed. The truth system gets annotations to the data set from the annotation service and applies them as appropriate.
 5. Data coordinator informs all appropriate query services including information on the specific data needed.
 6. Impacted query services retrieve the data. ingress
  • 17. understanding • data set level – high level: health – more detailed: • health/restaurant/inspection • health/disease/infectious
 • columnar data types – e.g. location/name/city, demographic/gender
 • columnar schematic categories – e.g. crime_type, crimes from Boston and Chicago datsets
 • columnar schematic category classifications, e.g. (assault, assault and battery, violent assault) > assault
 • pivot points – e.g. neighborhood, city, business
  • 18. gold GT 1 crowdsourcer GT models annotations truth 1. A trusted curator prepares a gold ground truth (GT) by manually labeling datasets.
 2. The Gold GT is used by a CrowdSourcer system which coordinates jobs across untrusted distributed mechanical turk workers to annotate the Gold GT set. Annotation quality is assessed via the Gold GT.
 3. The CrowdSourcer leverages the distributed humans to annotate much larger sets of datasets.
 4. Machine learning models are trained against the GT and applied against a larger set of datasets. In addition, trained models are applied in the synchronous workflow described above.
 5. Model based and crowd sourced annotations are stored in an annotation service.
 6. The Truth system periodically queries the Knowledge system for the latest annotation mappings and applies them to its datasets.
 7. Secondary services like Search are notified of changes. They pick them up and are now available for query. 3 4 2 5 search 6 7 annotations curator
  • 19. socrata load balancer app service platform govstat budget core ux browser queries api app 1. Citizens, reporters, and other users access our core ux or apps via a browser. 2. Apps run on our app service platform and generate queries to our back end services as needed. 3. 3rd party developers build apps using our API which leverage back end services. 1 2 3 query
  • 20. 1. query is routed to a request dispatcher ! 2. the request dispatcher’s query Coordinator first checks if the query is cached; if so, the cached copy is returned. ! 3. request dispatcher routes the query to the appropriate specialty subsytem to perform the query.
  • 21. technologies • db – postgres including postGIS – lucene/elastic search – spark:(sql, streaming) – cassandra (back end of govstat, dataset metrics) • languages – scala, ruby, javascript, python • platforms: – logging: sumo – build: jenkins – test: cucumber – cloud: (aws, azure, own) -> aws – machine learning: sklearn/scipy
  • 22. 1. Broad soda v.x api is made more efficient via techniques like rollup tables. ! 1. Big gulp api is serviced through specialized secondary Services. API is highly controlled; additions are made w/ a clear understanding of scale cost. 
 2. Big gulp api query breadth is expanded over time, as soda data and user throughput sizes are increased ! ! scaling strategy
  • 23.
  • 25.
  • 26.
  • 27. The SODA API • Recently introduced a new version of our API • Expressive, SQL-like language • Provides the base for all other functionality • Provides the base for 3rd parties to access data hosted by Socrata
  • 28. SoQL A REST-like SQL inspired API for accessing and querying datasets DEMO!
  • 29. How I feel using SoQL over HTTP
  • 30. The Scala SODA Client •Abstract away JSON parsing and types •Scala-like query syntax •Returns results as a Future •Internally uses Iteratees to stream and parse results