SlideShare a Scribd company logo
1 of 62
Download to read offline
Using Cascalog
to build an app with
City of Palo Alto
Open Data
Paco Nathan
http://liber118.com/pxn/
1Sunday, 28 July 13
GitHub repo for the open source project:
github.com/Cascading/CoPA/wiki
This project began as a Big Data workshop
for a graduate seminar at CMU West
Many thanks to:
Stuart Evans
CMU Distinguished Service Professor
Jonathan Reichental
City of Palo Alto CIO
Peter Pirnejad
City of Palo Alto Dev Center Director
Diego May
Junar CEO & Co-founder
2Sunday, 28 July 13
 Cascading, a workflow abstraction
 Cascalog ➟ 2.0
 Palo Alto case study
 Open Data insights
3Sunday, 28 July 13
Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.
4Sunday, 28 July 13
Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
•leverages JVM and Java-based tools without any
need to create new languages
•allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
5Sunday, 28 July 13
Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly based
on domain-specific languages (DSLs) in JVM languages
which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-
practices-will-improve-your-return-from-technology/
6Sunday, 28 July 13
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – integrations
• partners: Microsoft Azure, Hortonworks,
Amazon AWS, MapR, EMC, SpringSource,
Cloudera
• taps: Memcached, Cassandra, MongoDB,
HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo,
JSON, etc.
• topologies: Apache Hadoop,
tuple spaces, local mode
7Sunday, 28 July 13
Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
8Sunday, 28 July 13
Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development
9Sunday, 28 July 13
Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java
to define workflows out of familiar elements:
Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Data is represented as flows of tuples. Operations
in the flows bring functional programming aspects
into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
10Sunday, 28 July 13
Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Literate Programming
Don Knuth
literateprogramming.com
11Sunday, 28 July 13
Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
12Sunday, 28 July 13
 Cascading, a workflow abstraction
 Cascalog ➟ 2.0
 Palo Alto case study
 Open Data insights
13Sunday, 28 July 13
14Sunday, 28 July 13
For the process used with this Open Data app,
we chose to use Cascalog
github.com/nathanmarz/cascalog/wiki
by Nathan Marz, Sam Ritchie, et al., 2010
a DSL in Clojure which implements
Datalog, backed by Cascading
Some aspects of CS theory:
• Functional Relational Programming
• mitigates Accidental Complexity
• has been compared with Codd 1969
15Sunday, 28 July 13
Accidental Complexity:
Not O(N) complexity, but the costs of software
engineering at scale over time
What happens when you build recommenders,
then go work on other projects for six months?
What does it cost others to maintain your apps?
“Out of theTar Pit”, Moseley & Marks, 2006
goo.gl/SKspn
Cascalog allows for leveraging the same framework,
same code base, from ad-hoc queries… to modeling…
to unit tests… to checkpoints in production use
This focuses on the process of structuring data:
specify what you require, not how it must be achieved
Huge implications for software engineering
16Sunday, 28 July 13
pros:
• most of the largest use cases for Cascading
• 10:1 reduction in code volume compared to SQL
• Leiningen build: simple, no surprises, in Clojure itself
• test-driven development (TDD) for Big Data
• fault-tolerant workflows which are simple to follow
• machine learning, map-reduce, etc., started in LISP
years ago anywho...
cons:
• learning curve, limited number of Clojure developers
• aggregators are the magic, those take effort to learn
17Sunday, 28 July 13
Q:
Who uses Cascalog, other than Twitter?
A:
• Climate Corp
• Factual
• Nokia
• Telefonica
• Harvard School of Public Health
• YieldBot
• uSwitch
• etc.
18Sunday, 28 July 13
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
1 map
1 reduce
18 lines code gist.github.com/3900702
WordCount – conceptual flow diagram
cascading.org/category/impatient
19Sunday, 28 July 13
WordCount – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
20Sunday, 28 July 13
mapreduce
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
GroupBy('wc')[by:['token']]
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
[head]
[tail]
[{2}:'token', 'count']
[{1}:'token']
[{2}:'doc_id', 'text']
[{2}:'doc_id', 'text']
wc[{1}:'token']
[{1}:'token']
[{2}:'token', 'count']
[{2}:'token', 'count']
[{1}:'token']
[{1}:'token']
WordCount – generated flow diagram
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
21Sunday, 28 July 13
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[[](),.)s]+"))
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient
WordCount – Cascalog / Clojure
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
22Sunday, 28 July 13
 Cascading, a workflow abstraction
 Cascalog ➟ 2.0
 Palo Alto case study
 Open Data insights
23Sunday, 28 July 13
Palo Alto is quite a pleasant place
•temperate weather
•lots of parks, enormous trees
•great coffeehouses
•walkable downtown
•not particularly crowded
On a nice summer day, who wants to be stuck
indoors on a phone call?
Instead, take it outside – go for a walk
24Sunday, 28 July 13
1. Open Data about municipal infrastructure
(GIS data: trees, roads, parks)
✚
2. Big Data about where people like to walk
(smartphone GPS logs)
✚
3. some curated metadata
(which surfaces the value)
4. personalized recommendations:
“Find a shady spot on a summer day in which to walk
near downtown Palo Alto.While on a long conference call.
Sipping a latte or enjoying some fro-yo.”
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
25Sunday, 28 July 13
The City of Palo Alto recently began to support Open Data
to give the local community greater visibility into how
their city government operates
This effort is intended to encourage students, entrepreneurs,
local organizations, etc., to build new apps which contribute
to the public good
paloalto.opendata.junar.com/dashboards/7576/geographic-information/
discovery
26Sunday, 28 July 13
GIS about trees in Palo Alto:
discovery
27Sunday, 28 July 13
Geographic_Information,,,
"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29
Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis
Source: davey tree Protected: Designated: Heritage: Appraised Value:
Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872
Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
"Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way
From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie
Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950
Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential
Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320
Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base
Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness:
Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1
Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure
Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity:
none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and
Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none
Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0
Rutting Severity: none Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane:
0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority:
Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009
User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0
-122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0
-122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0
-122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
discovery
(unstructured data…)
28Sunday, 28 July 13
(defn parse-gis [line]
"leverages parse-csv for complex CSV format in GIS export"
(first (csv/parse-csv line))
)
 
 
(defn etl-gis [gis trap]
"subquery to parse data sets from the GIS source tap"
(<- [?blurb ?misc ?geo ?kind]
(gis ?line)
(parse-gis ?line :> ?blurb ?misc ?geo ?kind)
(:trap (hfs-textline trap))
))
discovery
(specify what you require,
not how to achieve it…
80:20 cost of data prep)
29Sunday, 28 July 13
discovery
(ad-hoc queries get refined into
composable predicates)
Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
37.446001565119,-122.167713417554,0.0
Point
30Sunday, 28 July 13
discovery
(curate valuable metadata)
31Sunday, 28 July 13
(defn get-trees [src trap tree_meta]
"subquery to parse/filter the tree data"
(<- [?blurb ?tree_id ?situs ?tree_site
?species ?wikipedia ?calflora ?avg_height
?tree_lat ?tree_lng ?tree_alt ?geohash
]
(src ?blurb ?misc ?geo ?kind)
(re-matches #"^s+Private.*Tree ID.*" ?misc)
(parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
((c/comp s/trim s/lower-case) ?raw_species :> ?species)
(tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
(avg ?min_height ?max_height :> ?avg_height)
(geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
(read-string ?tree_lat :> ?lat)
(read-string ?tree_lng :> ?lng)
(geohash ?lat ?lng :> ?geohash)
(:trap (hfs-textline trap))
))
discovery
?blurb! ! Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
?tree_id! " 412
?situs" " 115
?tree_site" 1
?species" " liquidambar styraciflua
?wikipedia" http://en.wikipedia.org/wiki/Liquidambar_styraciflua
?calflora http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
?avg_height"27.5
?tree_lat" 37.446001565119
?tree_lng" -122.167713417554
?tree_alt" 0.0
?geohash" " 9q9jh0
32Sunday, 28 July 13
// run analysis and visualization in R
library(ggplot2)
dat_folder <- '~/src/concur/CoPA/out/tree'
data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
sep="t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8")
 
summary(data)
t <- head(sort(table(data$V5), decreasing=TRUE)
trees <- as.data.frame.table(t, n=20))
colnames(trees) <- c("species", "count")
 
m <- ggplot(data, aes(x=V8))
m <- m + ggtitle("Estimated Tree Height (meters)")
m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()
 
par(mar = c(7, 4, 4, 2) + 0.1)
plot(trees, xaxt="n", xlab="")
axis(1, labels=FALSE)
text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
labels=trees$species, xpd=TRUE)
grid(nx=nrow(trees))
discovery
33Sunday, 28 July 13
discovery
sweetgum
analysis of the tree data:
34Sunday, 28 July 13
M
tree
GIS
export
Regex
parse-gis
src
Scrub
species
Geohash
Regex
parse-tree
tree
Tree
Metadata
Join
Failure
Traps
Estimate
height
M
discovery
(flow diagram, gis tree)
35Sunday, 28 July 13
(defn get-roads [src trap road_meta]
"subquery to parse/filter the road data"
(<- [?blurb ?bike_lane ?bus_route ?truck_route ?albedo
?min_lat ?min_lng ?min_alt ?geohash
?traffic_count ?traffic_index ?traffic_class
?paving_length ?paving_width ?paving_area ?surface_type
]
(src ?blurb ?misc ?geo ?kind)
(re-matches #"^s+Sequence.*Traffic Count.*" ?misc)
(parse-road ?misc :> _
?traffic_count ?traffic_index ?traffic_class
?paving_length ?paving_width ?paving_area ?surface_type
?overlay_year ?bike_lane ?bus_route ?truck_route)
(road_meta ?surface_type ?albedo_new ?albedo_worn)
(estimate-albedo
?overlay_year ?albedo_new ?albedo_worn :> ?albedo)
(bigram ?geo :> ?pt0 ?pt1)
(midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt)
;; why filter for min? because there are geo duplicates..
(c/min ?lat :> ?min_lat)
(c/min ?lng :> ?min_lng)
(c/min ?alt :> ?min_alt)
(geohash ?min_lat ?min_lng :> ?geohash)
(:trap (hfs-textline trap))
))
discovery
36Sunday, 28 July 13
?blurb"" " " Hawthorne Avenue from Alma Street to High Street
?traffic_count"3110
?traffic_class"local residential
?surface_type" asphalt concrete
?albedo" " " 0.12
?min_lat"" " 37.446140860599854"
?min_lng "" " -122.1674652295435
?min_alt "" " 0.0
?geohash"" " 9q9jh0
(another data product)
discovery
37Sunday, 28 July 13
The road data provides:
•traffic class (arterial, truck route, residential, etc.)
•traffic counts distribution
•surface type (asphalt, cement; age)
This leads to estimators for noise, reflection, etc.
discovery
38Sunday, 28 July 13
GIS
export
Regex
parse-gis
src
Failure
Traps
M
M
road
Road
Metadata
Join
Estimate
Albedo
Geohash
road
Regex
parse-road
Road
Segments
R
(flow diagram, gis road)
discovery
39Sunday, 28 July 13
GIS data from Palo Alto provides us with geolocation about each
item in the export: latitude, longitude, altitude
Geo data is great for managing municipal infrastructure as well as
for mobile apps
Predictive modeling in our Open Data
example focuses on leveraging geolocation
We use spatial indexing by creating
a grid of geohash values, for efficient
parallel processing
Cascalog queries collect items with the
same geohash values – using them as keys
for large-scale joins (Hadoop)
modeling
40Sunday, 28 July 13
9q9jh0
geohash with 6-digit resolution
approximates a 5-block square
centered lat: 37.445, lng: -122.162
modeling
41Sunday, 28 July 13
Each road in the GIS export is listed as a block between two
cross roads, and each may have multiple road segments to
represent turns:
" -122.161776959558,37.4518836690781,0.0
" -122.161390381489,37.4516410983794,0.0
" -122.160786011735,37.4512589903357,0.0
" -122.160531178368,37.4510977281699,0.0
modeling
( lat0, lng0, alt0 )
( lat1, lng1, alt1 )
( lat2, lng2, alt2 )
( lat3, lng3, alt3 )
NB: segments in the raw GIS have the order of geo coordinates
scrambled: (lng, lat, alt)
42Sunday, 28 July 13
9q9jh0
X X
X
Filter trees which are too far away to provide shade. Calculate a sum
of moments for tree height × distance, as an estimator for shade:
modeling
43Sunday, 28 July 13
(defn get-shade [trees roads]
"subquery to join tree and road estimates, maximize for shade"
(<- [?road_name ?geohash ?road_lat ?road_lng
?road_alt ?road_metric ?tree_metric]
(roads ?road_name _ _ _
?albedo ?road_lat ?road_lng ?road_alt ?geohash
?traffic_count _ ?traffic_class _ _ _ _)
(road-metric
?traffic_class ?traffic_count ?albedo :> ?road_metric)
(trees _ _ _ _ _ _ _
?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
(read-string ?avg_height :> ?height)
;; limit to trees which are higher than people
(> ?height 2.0)
(tree-distance
?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
;; limit to trees within a one-block radius (not meters)
(<= ?distance 25.0)
(/ ?height ?distance :> ?tree_moment)
(c/sum ?tree_moment :> ?sum_tree_moment)
;; magic number 200000.0 used to scale tree moment
;; based on median
(/ ?sum_tree_moment 200000.0 :> ?tree_metric)
))
modeling
44Sunday, 28 July 13
M
tree
Join
Calculate
distance
shade
Filter
height
Sum
moment
REstimate
traffic
R
road
Filter
distance
M M
Filter
sum_moment
(flow diagram, shade)
modeling
45Sunday, 28 July 13
modeling
46Sunday, 28 July 13
(defn get-gps [gps_logs trap]
"subquery to aggregate and rank GPS tracks per user"
(<- [?uuid ?geohash ?gps_count ?recent_visit]
(gps_logs
?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading
?elapsed ?distance)
(read-string ?gps_lat :> ?lat)
(read-string ?gps_lng :> ?lng)
(geohash ?lat ?lng :> ?geohash)
(c/count :> ?gps_count)
(date-num ?date :> ?visit)
(c/max ?visit :> ?recent_visit)
))
modeling
?uuid ?geohash ?gps_count ?recent_visit
cf660e041e994929b37cc5645209c8ae 9q8yym 7 1972376866448
342ac6fd3f5f44c6b97724d618d587cf 9q9htz 4 1972376690969
32cc09e69bc042f1ad22fc16ee275e21 9q9hv3 3 1972376670935
342ac6fd3f5f44c6b97724d618d587cf 9q9hv3 3 1972376691356
342ac6fd3f5f44c6b97724d618d587cf 9q9hwn 13 1972376690782
342ac6fd3f5f44c6b97724d618d587cf 9q9hwp 58 1972376690965
482dc171ef0342b79134d77de0f31c4f 9q9jh0 15 1972376952532
b1b4d653f5d9468a8dd18a77edcc5143 9q9jh0 18 1972376945348
47Sunday, 28 July 13
Recommenders often combine multiple signals, via weighted
averages, to rank personalized results:
•GPS of person ∩ road segment
•frequency and recency of visit
•traffic class and rate
•road albedo (sunlight reflection)
•tree shade estimator
Adjusting the mix allows for further personalization at the end use
(defn get-reco [tracks shades]
"subquery to recommend road segments based on GPS tracks"
(<- [?uuid ?road ?geohash ?lat ?lng ?alt
?gps_count ?recent_visit ?road_metric ?tree_metric]
(tracks ?uuid ?geohash ?gps_count ?recent_visit)
(shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)
))
apps
48Sunday, 28 July 13
(defproject cascading-copa "0.1.0-SNAPSHOT"
:description "City of Palo Alto Open Data recommender in Cascalog"
:url "https://github.com/Cascading/CoPA"
:license {:name "Apache License, Version 2.0"
:url "http://www.apache.org/licenses/LICENSE-2.0"
:distribution :repo
}
:uberjar-name "copa.jar"
:aot [copa.core]
:main copa.core
:source-paths ["src/main/clj"]
:dependencies [[org.clojure/clojure "1.4.0"]
[cascalog "1.10.0"]
[cascalog-more-taps "0.3.1-SNAPSHOT"]
[clojure-csv/clojure-csv "1.3.2"]
[org.clojars.sunng/geohash "1.0.1"]
[org.clojure/clojure-contrib "1.2.0"]
[date-clj "1.0.1"]
]
:profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]}
:provided {:dependencies [
[org.apache.hadoop/hadoop-core "0.20.2-dev"]
]}}
)
apps
49Sunday, 28 July 13
‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ est. height: 23 m
‣ shade metric: 4.363
‣ traffic: local residential, light traffic
‣ recent visit: 1972376952532
‣ a short walk from my train stop ✔
apps
50Sunday, 28 July 13
Could combine this with a variety of data APIs:
• Trulia neighborhood data, housing prices
• Factual local business (FB Places, etc.)
• CommonCrawl open source full web crawl
• Wunderground local weather data
• WalkScore neighborhood data, walkability
• Data.gov US federal open data
• Data.NASA.gov NASA open data
• DBpedia datasets derived fromWikipedia
• GeoWordNet semantic knowledge base
• Geolytics demographics, GIS, etc.
• Foursquare,Yelp, CityGrid, Localeze,YP
• various photo sharing
apps
walkscore.com/CA/Palo_Alto
51Sunday, 28 July 13
 Cascading, a workflow abstraction
 Cascalog ➟ 2.0
 Palo Alto case study
 Open Data insights
52Sunday, 28 July 13
Trends in Public Administration
late 1880s – late 1920s (Woodrow Wilson)
as hierarchy, bureaucracy → only for the most educated, elite
late 1920s – late 1930s
as a business, relying on “Scientific Method”, gov as a process
late 1930s – late 1940s (Robert Dale)
relationships, behavioral-based → policy not separate from politics
late 1940s – 1980s
yet another form of management → less “command and control”
1980s – 1990s (David Osborne,Ted Gaebler)
New Public Management → service efficiency, more private sector
1990s – present (Janet & Robert Denhardt)
Digital Age → transparency, citizen-based “debugging”, bankruptcies
The Roles,Actors, and Norms Necessary to
Institutionalize Sustainable Collaborative Governance
Peter Pirnejad
USC Price School of Policy
2013-05-02
53Sunday, 28 July 13
Trends in Public Administration
late 1880s – late 1920s (Woodrow Wilson)
as hierarchy, bureaucracy → only for the most educated, elite
late 1920s – late 1930s
as a business, relying on “Scientific Method”, gov as a process
late 1930s – late 1940s (Robert Dale)
relationships, behavioral-based → policy not separate from politics
late 1940s – 1980s
yet another form of management → less “command and control”
1980s – 1990s (David Osborne,Ted Gaebler)
New Public Management → service efficiency, more private sector
1990s – present (Janet & Robert Denhardt)
Digital Age → transparency, citizen-based “debugging”, bankruptcies
The Roles,Actors, and Norms Necessary to
Institutionalize Sustainable Collaborative Governance
Peter Pirnejad
USC Price School of Policy
2013-05-02
Drivers, circa 2013
• governments running out of money,
cannot increase staff and services
• better data infra at scale (cloud, OSS, etc.)
• machine learning techniques to monetize
• viable ecosystem for data products,APIs
• mobile devices enabling use cases
54Sunday, 28 July 13
Open Data notes
Successful apps incorporate three components:
•Big Data (consumer interest, personalization)
•Open Data (monetizing public data)
•Curated Metadata
Most of the largest Cascading deployments leverage some
Open Data components: Climate Corp, Factual, Nokia, etc.
Notes about Open Data use cases: goo.gl/cd995T
Consider buildingeye.com, aggregate building permits:
•pricing data for home owners looking to remodel
•sales data for contractors
•imagine joining data with building inspection history,
for better insights about properties for sale…
55Sunday, 28 July 13
Open Data ecosystem
municipal
departments
publishing
platforms
aggregators
data product
vendors
end use
cases
e.g., Palo Alto, San Francisco, etc.
e.g., Junar, Socrata, etc.
e.g., OpenStreetMap,WalkScore, etc.
e.g., Factual, Marinexplore, etc.
e.g., Facebook
Data feeds structured for
public private partnerships
56Sunday, 28 July 13
Open Data ecosystem – caveats
municipal
departments
publishing
platforms
aggregators
data product
vendors
end use
cases
e.g., Palo Alto, San Francisco, etc.
e.g., Junar, Socrata, etc.
e.g., OpenStreetMap,WalkScore, etc.
e.g., Factual, Marinexplore, etc.
e.g., Facebook
Required Focus
• respond to viable use cases
• not budgeting hackathons
57Sunday, 28 July 13
Open Data ecosystem – caveats
municipal
departments
publishing
platforms
aggregators
data product
vendors
end use
cases
e.g., Palo Alto, San Francisco, etc.
e.g., Junar, Socrata, etc.
e.g., OpenStreetMap,WalkScore, etc.
e.g., Factual, Marinexplore, etc.
e.g., Facebook
Required Focus
• surface the metadata
• curate, allowing for joins/aggregation
• not scans as PDFs
58Sunday, 28 July 13
Open Data ecosystem – caveats
municipal
departments
publishing
platforms
aggregators
data product
vendors
end use
cases
e.g., Palo Alto, San Francisco, etc.
e.g., Junar, Socrata, etc.
e.g., OpenStreetMap,WalkScore, etc.
e.g., Factual, Marinexplore, etc.
e.g., Facebook
Required Focus
• make APIs consumable by automation
• allow for probabilistic usage
• not OSS licensing for data
59Sunday, 28 July 13
Open Data ecosystem – caveats
municipal
departments
publishing
platforms
aggregators
data product
vendors
end use
cases
e.g., Palo Alto, San Francisco, etc.
e.g., Junar, Socrata, etc.
e.g., OpenStreetMap,WalkScore, etc.
e.g., Factual, Marinexplore, etc.
e.g., Facebook
Required Focus
• supply actionable data
• track data provenance carefully
• provide feedback upstream,
i.e., cleaned data at source
• focus on core verticals
60Sunday, 28 July 13
Open Data ecosystem – caveats
municipal
departments
publishing
platforms
aggregators
data product
vendors
end use
cases
e.g., Palo Alto, San Francisco, etc.
e.g., Junar, Socrata, etc.
e.g., OpenStreetMap,WalkScore, etc.
e.g., Factual, Marinexplore, etc.
e.g., Facebook
Required Focus
• address consumer needs
• identify community benefits
of the data
61Sunday, 28 July 13
newsletter for updates:
http://liber118.com/pxn/
shop.oreilly.com/product/0636920028536.do
62Sunday, 28 July 13

More Related Content

What's hot

Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Pattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
jlettvin.resume.20160922.STAR
jlettvin.resume.20160922.STARjlettvin.resume.20160922.STAR
jlettvin.resume.20160922.STARJonathan Lettvin
 
GraphTech Ecosystem - part 2: Graph Analytics
 GraphTech Ecosystem - part 2: Graph Analytics GraphTech Ecosystem - part 2: Graph Analytics
GraphTech Ecosystem - part 2: Graph AnalyticsLinkurious
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxTrayan Iliev
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneJo-fai Chow
 

What's hot (8)

Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Pattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and Hadoop
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
jlettvin.resume.20160922.STAR
jlettvin.resume.20160922.STARjlettvin.resume.20160922.STAR
jlettvin.resume.20160922.STAR
 
GraphTech Ecosystem - part 2: Graph Analytics
 GraphTech Ecosystem - part 2: Graph Analytics GraphTech Ecosystem - part 2: Graph Analytics
GraphTech Ecosystem - part 2: Graph Analytics
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFlux
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
 

Similar to OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and MesosPDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and MesosPaco Nathan
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big DataPaco Nathan
 
July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"
July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"
July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"Paco Nathan
 
Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingCascading
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 
Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsCascading
 
Implementation of a SaaS based simulation platform using open standards and o...
Implementation of a SaaS based simulation platform using open standards and o...Implementation of a SaaS based simulation platform using open standards and o...
Implementation of a SaaS based simulation platform using open standards and o...Thomas Paviot
 
Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearnCascading
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
The Cascading (big) data application framework
The Cascading (big) data application frameworkThe Cascading (big) data application framework
The Cascading (big) data application frameworkModern Data Stack France
 
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...Cascading
 
OCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, Smile
OCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, SmileOCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, Smile
OCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, SmileOCCIware
 
Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017
Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017
Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017Marc Dutoo
 
Spring Boot & Spring Cloud on PAS- Nate Schutta (1/2)
Spring Boot & Spring Cloud on PAS- Nate Schutta (1/2)Spring Boot & Spring Cloud on PAS- Nate Schutta (1/2)
Spring Boot & Spring Cloud on PAS- Nate Schutta (1/2)VMware Tanzu
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopJosh Patterson
 

Similar to OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data (20)

PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and MesosPDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"
July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"
July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"
 
Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with Cascading
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Pattern -A scoring engine
Pattern -A scoring enginePattern -A scoring engine
Pattern -A scoring engine
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop Applications
 
Implementation of a SaaS based simulation platform using open standards and o...
Implementation of a SaaS based simulation platform using open standards and o...Implementation of a SaaS based simulation platform using open standards and o...
Implementation of a SaaS based simulation platform using open standards and o...
 
Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearn
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Sparkflows.io
Sparkflows.ioSparkflows.io
Sparkflows.io
 
The Cascading (big) data application framework
The Cascading (big) data application frameworkThe Cascading (big) data application framework
The Cascading (big) data application framework
 
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
 
OCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, Smile
OCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, SmileOCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, Smile
OCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, Smile
 
Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017
Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017
Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017
 
Spring Boot & Spring Cloud on PAS- Nate Schutta (1/2)
Spring Boot & Spring Cloud on PAS- Nate Schutta (1/2)Spring Boot & Spring Cloud on PAS- Nate Schutta (1/2)
Spring Boot & Spring Cloud on PAS- Nate Schutta (1/2)
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 

More from Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningPaco Nathan
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 

More from Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 

Recently uploaded

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data

  • 1. Using Cascalog to build an app with City of Palo Alto Open Data Paco Nathan http://liber118.com/pxn/ 1Sunday, 28 July 13
  • 2. GitHub repo for the open source project: github.com/Cascading/CoPA/wiki This project began as a Big Data workshop for a graduate seminar at CMU West Many thanks to: Stuart Evans CMU Distinguished Service Professor Jonathan Reichental City of Palo Alto CIO Peter Pirnejad City of Palo Alto Dev Center Director Diego May Junar CEO & Co-founder 2Sunday, 28 July 13
  • 3.  Cascading, a workflow abstraction  Cascalog ➟ 2.0  Palo Alto case study  Open Data insights 3Sunday, 28 July 13
  • 4. Cascading – origins API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products. Wensel was following the Nutch open source project – where Hadoop started. Observation: would be difficult to find Java developers to write complex Enterprise apps in MapReduce – potential blocker for leveraging new open source technology. 4Sunday, 28 July 13
  • 5. Cascading – functional programming Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows: •leverages JVM and Java-based tools without any need to create new languages •allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters 5Sunday, 28 July 13
  • 6. Cascading – functional programming • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology Dan Woods, 2013-04-17 Forbes forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming- practices-will-improve-your-return-from-technology/ 6Sunday, 28 July 13
  • 7. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Cascading – integrations • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc. • serialization: Avro, Thrift, Kryo, JSON, etc. • topologies: Apache Hadoop, tuple spaces, local mode 7Sunday, 28 July 13
  • 8. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. 8Sunday, 28 July 13
  • 9. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. workflow abstraction addresses: • staffing bottleneck; • system integration; • operational complexity; • test-driven development 9Sunday, 28 July 13
  • 10. Workflow Abstraction – pattern language Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Data is represented as flows of tuples. Operations in the flows bring functional programming aspects into Java A Pattern Language Christopher Alexander, et al. amazon.com/dp/0195019199 10Sunday, 28 July 13
  • 11. Workflow Abstraction – literate programming Cascading workflows generate their own visual documentation: flow diagrams in formal terms, flow diagrams leverage a methodology called literate programming provides intuitive, visual representations for apps – great for cross-team collaboration Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Literate Programming Don Knuth literateprogramming.com 11Sunday, 28 July 13
  • 12. Workflow Abstraction – business process following the essence of literate programming, Cascading workflows provide statements of business process this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) this is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale 12Sunday, 28 July 13
  • 13.  Cascading, a workflow abstraction  Cascalog ➟ 2.0  Palo Alto case study  Open Data insights 13Sunday, 28 July 13
  • 15. For the process used with this Open Data app, we chose to use Cascalog github.com/nathanmarz/cascalog/wiki by Nathan Marz, Sam Ritchie, et al., 2010 a DSL in Clojure which implements Datalog, backed by Cascading Some aspects of CS theory: • Functional Relational Programming • mitigates Accidental Complexity • has been compared with Codd 1969 15Sunday, 28 July 13
  • 16. Accidental Complexity: Not O(N) complexity, but the costs of software engineering at scale over time What happens when you build recommenders, then go work on other projects for six months? What does it cost others to maintain your apps? “Out of theTar Pit”, Moseley & Marks, 2006 goo.gl/SKspn Cascalog allows for leveraging the same framework, same code base, from ad-hoc queries… to modeling… to unit tests… to checkpoints in production use This focuses on the process of structuring data: specify what you require, not how it must be achieved Huge implications for software engineering 16Sunday, 28 July 13
  • 17. pros: • most of the largest use cases for Cascading • 10:1 reduction in code volume compared to SQL • Leiningen build: simple, no surprises, in Clojure itself • test-driven development (TDD) for Big Data • fault-tolerant workflows which are simple to follow • machine learning, map-reduce, etc., started in LISP years ago anywho... cons: • learning curve, limited number of Clojure developers • aggregators are the magic, those take effort to learn 17Sunday, 28 July 13
  • 18. Q: Who uses Cascalog, other than Twitter? A: • Climate Corp • Factual • Nokia • Telefonica • Harvard School of Public Health • YieldBot • uSwitch • etc. 18Sunday, 28 July 13
  • 19. Document Collection Word Count Tokenize GroupBy token Count R M 1 map 1 reduce 18 lines code gist.github.com/3900702 WordCount – conceptual flow diagram cascading.org/category/impatient 19Sunday, 28 July 13
  • 20. WordCount – Cascading app in Java String docPath = args[ 0 ]; String wcPath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap )  .addTailSink( wcPipe, wcTap ); // write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); Document Collection Word Count Tokenize GroupBy token Count R M 20Sunday, 28 July 13
  • 21. mapreduce Every('wc')[Count[decl:'count']] Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] GroupBy('wc')[by:['token']] Each('token')[RegexSplitGenerator[decl:'token'][args:1]] Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] [head] [tail] [{2}:'token', 'count'] [{1}:'token'] [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text'] wc[{1}:'token'] [{1}:'token'] [{2}:'token', 'count'] [{2}:'token', 'count'] [{1}:'token'] [{1}:'token'] WordCount – generated flow diagram Document Collection Word Count Tokenize GroupBy token Count R M 21Sunday, 28 July 13
  • 22. (ns impatient.core   (:use [cascalog.api]         [cascalog.more-taps :only (hfs-delimited)])   (:require [clojure.string :as s]             [cascalog.ops :as c])   (:gen-class)) (defmapcatop split [line]   "reads in a line of string and splits it by regex"   (s/split line #"[[](),.)s]+")) (defn -main [in out & args]   (?<- (hfs-delimited out)        [?word ?count]        ((hfs-delimited in :skip-header? true) _ ?line)        (split ?line :> ?word)        (c/count ?count))) ; Paul Lam ; github.com/Quantisan/Impatient WordCount – Cascalog / Clojure Document Collection Word Count Tokenize GroupBy token Count R M 22Sunday, 28 July 13
  • 23.  Cascading, a workflow abstraction  Cascalog ➟ 2.0  Palo Alto case study  Open Data insights 23Sunday, 28 July 13
  • 24. Palo Alto is quite a pleasant place •temperate weather •lots of parks, enormous trees •great coffeehouses •walkable downtown •not particularly crowded On a nice summer day, who wants to be stuck indoors on a phone call? Instead, take it outside – go for a walk 24Sunday, 28 July 13
  • 25. 1. Open Data about municipal infrastructure (GIS data: trees, roads, parks) ✚ 2. Big Data about where people like to walk (smartphone GPS logs) ✚ 3. some curated metadata (which surfaces the value) 4. personalized recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto.While on a long conference call. Sipping a latte or enjoying some fro-yo.” Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R 25Sunday, 28 July 13
  • 26. The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good paloalto.opendata.junar.com/dashboards/7576/geographic-information/ discovery 26Sunday, 28 July 13
  • 27. GIS about trees in Palo Alto: discovery 27Sunday, 28 July 13
  • 28. Geographic_Information,,, "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point" "Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line" discovery (unstructured data…) 28Sunday, 28 July 13
  • 29. (defn parse-gis [line] "leverages parse-csv for complex CSV format in GIS export" (first (csv/parse-csv line)) )     (defn etl-gis [gis trap] "subquery to parse data sets from the GIS source tap" (<- [?blurb ?misc ?geo ?kind] (gis ?line) (parse-gis ?line :> ?blurb ?misc ?geo ?kind) (:trap (hfs-textline trap)) )) discovery (specify what you require, not how to achieve it… 80:20 cost of data prep) 29Sunday, 28 July 13
  • 30. discovery (ad-hoc queries get refined into composable predicates) Identifier: 474 Tree ID: 412 Tree: 412 site 1 at 115 HAWTHORNE AV Tree Site: 1 Street_Name: HAWTHORNE AV Situs Number: 115 Private: -1 Species: Liquidambar styraciflua Source: davey tree Hardscape: None 37.446001565119,-122.167713417554,0.0 Point 30Sunday, 28 July 13
  • 32. (defn get-trees [src trap tree_meta] "subquery to parse/filter the tree data" (<- [?blurb ?tree_id ?situs ?tree_site ?species ?wikipedia ?calflora ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash ] (src ?blurb ?misc ?geo ?kind) (re-matches #"^s+Private.*Tree ID.*" ?misc) (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species) ((c/comp s/trim s/lower-case) ?raw_species :> ?species) (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height) (avg ?min_height ?max_height :> ?avg_height) (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt) (read-string ?tree_lat :> ?lat) (read-string ?tree_lng :> ?lng) (geohash ?lat ?lng :> ?geohash) (:trap (hfs-textline trap)) )) discovery ?blurb! ! Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl ?tree_id! " 412 ?situs" " 115 ?tree_site" 1 ?species" " liquidambar styraciflua ?wikipedia" http://en.wikipedia.org/wiki/Liquidambar_styraciflua ?calflora http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598 ?avg_height"27.5 ?tree_lat" 37.446001565119 ?tree_lng" -122.167713417554 ?tree_alt" 0.0 ?geohash" " 9q9jh0 32Sunday, 28 July 13
  • 33. // run analysis and visualization in R library(ggplot2) dat_folder <- '~/src/concur/CoPA/out/tree' data <- read.table(file=paste(dat_folder, "part-00000", sep="/"), sep="t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8")   summary(data) t <- head(sort(table(data$V5), decreasing=TRUE) trees <- as.data.frame.table(t, n=20)) colnames(trees) <- c("species", "count")   m <- ggplot(data, aes(x=V8)) m <- m + ggtitle("Estimated Tree Height (meters)") m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()   par(mar = c(7, 4, 4, 2) + 0.1) plot(trees, xaxt="n", xlab="") axis(1, labels=FALSE) text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1, labels=trees$species, xpd=TRUE) grid(nx=nrow(trees)) discovery 33Sunday, 28 July 13
  • 34. discovery sweetgum analysis of the tree data: 34Sunday, 28 July 13
  • 36. (defn get-roads [src trap road_meta] "subquery to parse/filter the road data" (<- [?blurb ?bike_lane ?bus_route ?truck_route ?albedo ?min_lat ?min_lng ?min_alt ?geohash ?traffic_count ?traffic_index ?traffic_class ?paving_length ?paving_width ?paving_area ?surface_type ] (src ?blurb ?misc ?geo ?kind) (re-matches #"^s+Sequence.*Traffic Count.*" ?misc) (parse-road ?misc :> _ ?traffic_count ?traffic_index ?traffic_class ?paving_length ?paving_width ?paving_area ?surface_type ?overlay_year ?bike_lane ?bus_route ?truck_route) (road_meta ?surface_type ?albedo_new ?albedo_worn) (estimate-albedo ?overlay_year ?albedo_new ?albedo_worn :> ?albedo) (bigram ?geo :> ?pt0 ?pt1) (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt) ;; why filter for min? because there are geo duplicates.. (c/min ?lat :> ?min_lat) (c/min ?lng :> ?min_lng) (c/min ?alt :> ?min_alt) (geohash ?min_lat ?min_lng :> ?geohash) (:trap (hfs-textline trap)) )) discovery 36Sunday, 28 July 13
  • 37. ?blurb"" " " Hawthorne Avenue from Alma Street to High Street ?traffic_count"3110 ?traffic_class"local residential ?surface_type" asphalt concrete ?albedo" " " 0.12 ?min_lat"" " 37.446140860599854" ?min_lng "" " -122.1674652295435 ?min_alt "" " 0.0 ?geohash"" " 9q9jh0 (another data product) discovery 37Sunday, 28 July 13
  • 38. The road data provides: •traffic class (arterial, truck route, residential, etc.) •traffic counts distribution •surface type (asphalt, cement; age) This leads to estimators for noise, reflection, etc. discovery 38Sunday, 28 July 13
  • 40. GIS data from Palo Alto provides us with geolocation about each item in the export: latitude, longitude, altitude Geo data is great for managing municipal infrastructure as well as for mobile apps Predictive modeling in our Open Data example focuses on leveraging geolocation We use spatial indexing by creating a grid of geohash values, for efficient parallel processing Cascalog queries collect items with the same geohash values – using them as keys for large-scale joins (Hadoop) modeling 40Sunday, 28 July 13
  • 41. 9q9jh0 geohash with 6-digit resolution approximates a 5-block square centered lat: 37.445, lng: -122.162 modeling 41Sunday, 28 July 13
  • 42. Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns: " -122.161776959558,37.4518836690781,0.0 " -122.161390381489,37.4516410983794,0.0 " -122.160786011735,37.4512589903357,0.0 " -122.160531178368,37.4510977281699,0.0 modeling ( lat0, lng0, alt0 ) ( lat1, lng1, alt1 ) ( lat2, lng2, alt2 ) ( lat3, lng3, alt3 ) NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt) 42Sunday, 28 July 13
  • 43. 9q9jh0 X X X Filter trees which are too far away to provide shade. Calculate a sum of moments for tree height × distance, as an estimator for shade: modeling 43Sunday, 28 July 13
  • 44. (defn get-shade [trees roads] "subquery to join tree and road estimates, maximize for shade" (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt ?road_metric ?tree_metric] (roads ?road_name _ _ _ ?albedo ?road_lat ?road_lng ?road_alt ?geohash ?traffic_count _ ?traffic_class _ _ _ _) (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric) (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash) (read-string ?avg_height :> ?height) ;; limit to trees which are higher than people (> ?height 2.0) (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance) ;; limit to trees within a one-block radius (not meters) (<= ?distance 25.0) (/ ?height ?distance :> ?tree_moment) (c/sum ?tree_moment :> ?sum_tree_moment) ;; magic number 200000.0 used to scale tree moment ;; based on median (/ ?sum_tree_moment 200000.0 :> ?tree_metric) )) modeling 44Sunday, 28 July 13
  • 47. (defn get-gps [gps_logs trap] "subquery to aggregate and rank GPS tracks per user" (<- [?uuid ?geohash ?gps_count ?recent_visit] (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading ?elapsed ?distance) (read-string ?gps_lat :> ?lat) (read-string ?gps_lng :> ?lng) (geohash ?lat ?lng :> ?geohash) (c/count :> ?gps_count) (date-num ?date :> ?visit) (c/max ?visit :> ?recent_visit) )) modeling ?uuid ?geohash ?gps_count ?recent_visit cf660e041e994929b37cc5645209c8ae 9q8yym 7 1972376866448 342ac6fd3f5f44c6b97724d618d587cf 9q9htz 4 1972376690969 32cc09e69bc042f1ad22fc16ee275e21 9q9hv3 3 1972376670935 342ac6fd3f5f44c6b97724d618d587cf 9q9hv3 3 1972376691356 342ac6fd3f5f44c6b97724d618d587cf 9q9hwn 13 1972376690782 342ac6fd3f5f44c6b97724d618d587cf 9q9hwp 58 1972376690965 482dc171ef0342b79134d77de0f31c4f 9q9jh0 15 1972376952532 b1b4d653f5d9468a8dd18a77edcc5143 9q9jh0 18 1972376945348 47Sunday, 28 July 13
  • 48. Recommenders often combine multiple signals, via weighted averages, to rank personalized results: •GPS of person ∩ road segment •frequency and recency of visit •traffic class and rate •road albedo (sunlight reflection) •tree shade estimator Adjusting the mix allows for further personalization at the end use (defn get-reco [tracks shades] "subquery to recommend road segments based on GPS tracks" (<- [?uuid ?road ?geohash ?lat ?lng ?alt ?gps_count ?recent_visit ?road_metric ?tree_metric] (tracks ?uuid ?geohash ?gps_count ?recent_visit) (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric) )) apps 48Sunday, 28 July 13
  • 49. (defproject cascading-copa "0.1.0-SNAPSHOT" :description "City of Palo Alto Open Data recommender in Cascalog" :url "https://github.com/Cascading/CoPA" :license {:name "Apache License, Version 2.0" :url "http://www.apache.org/licenses/LICENSE-2.0" :distribution :repo } :uberjar-name "copa.jar" :aot [copa.core] :main copa.core :source-paths ["src/main/clj"] :dependencies [[org.clojure/clojure "1.4.0"] [cascalog "1.10.0"] [cascalog-more-taps "0.3.1-SNAPSHOT"] [clojure-csv/clojure-csv "1.3.2"] [org.clojars.sunng/geohash "1.0.1"] [org.clojure/clojure-contrib "1.2.0"] [date-clj "1.0.1"] ] :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]} :provided {:dependencies [ [org.apache.hadoop/hadoop-core "0.20.2-dev"] ]}} ) apps 49Sunday, 28 July 13
  • 50. ‣ addr: 115 HAWTHORNE AVE ‣ lat/lng: 37.446, -122.168 ‣ geohash: 9q9jh0 ‣ tree: 413 site 2 ‣ species: Liquidambar styraciflua ‣ est. height: 23 m ‣ shade metric: 4.363 ‣ traffic: local residential, light traffic ‣ recent visit: 1972376952532 ‣ a short walk from my train stop ✔ apps 50Sunday, 28 July 13
  • 51. Could combine this with a variety of data APIs: • Trulia neighborhood data, housing prices • Factual local business (FB Places, etc.) • CommonCrawl open source full web crawl • Wunderground local weather data • WalkScore neighborhood data, walkability • Data.gov US federal open data • Data.NASA.gov NASA open data • DBpedia datasets derived fromWikipedia • GeoWordNet semantic knowledge base • Geolytics demographics, GIS, etc. • Foursquare,Yelp, CityGrid, Localeze,YP • various photo sharing apps walkscore.com/CA/Palo_Alto 51Sunday, 28 July 13
  • 52.  Cascading, a workflow abstraction  Cascalog ➟ 2.0  Palo Alto case study  Open Data insights 52Sunday, 28 July 13
  • 53. Trends in Public Administration late 1880s – late 1920s (Woodrow Wilson) as hierarchy, bureaucracy → only for the most educated, elite late 1920s – late 1930s as a business, relying on “Scientific Method”, gov as a process late 1930s – late 1940s (Robert Dale) relationships, behavioral-based → policy not separate from politics late 1940s – 1980s yet another form of management → less “command and control” 1980s – 1990s (David Osborne,Ted Gaebler) New Public Management → service efficiency, more private sector 1990s – present (Janet & Robert Denhardt) Digital Age → transparency, citizen-based “debugging”, bankruptcies The Roles,Actors, and Norms Necessary to Institutionalize Sustainable Collaborative Governance Peter Pirnejad USC Price School of Policy 2013-05-02 53Sunday, 28 July 13
  • 54. Trends in Public Administration late 1880s – late 1920s (Woodrow Wilson) as hierarchy, bureaucracy → only for the most educated, elite late 1920s – late 1930s as a business, relying on “Scientific Method”, gov as a process late 1930s – late 1940s (Robert Dale) relationships, behavioral-based → policy not separate from politics late 1940s – 1980s yet another form of management → less “command and control” 1980s – 1990s (David Osborne,Ted Gaebler) New Public Management → service efficiency, more private sector 1990s – present (Janet & Robert Denhardt) Digital Age → transparency, citizen-based “debugging”, bankruptcies The Roles,Actors, and Norms Necessary to Institutionalize Sustainable Collaborative Governance Peter Pirnejad USC Price School of Policy 2013-05-02 Drivers, circa 2013 • governments running out of money, cannot increase staff and services • better data infra at scale (cloud, OSS, etc.) • machine learning techniques to monetize • viable ecosystem for data products,APIs • mobile devices enabling use cases 54Sunday, 28 July 13
  • 55. Open Data notes Successful apps incorporate three components: •Big Data (consumer interest, personalization) •Open Data (monetizing public data) •Curated Metadata Most of the largest Cascading deployments leverage some Open Data components: Climate Corp, Factual, Nokia, etc. Notes about Open Data use cases: goo.gl/cd995T Consider buildingeye.com, aggregate building permits: •pricing data for home owners looking to remodel •sales data for contractors •imagine joining data with building inspection history, for better insights about properties for sale… 55Sunday, 28 July 13
  • 56. Open Data ecosystem municipal departments publishing platforms aggregators data product vendors end use cases e.g., Palo Alto, San Francisco, etc. e.g., Junar, Socrata, etc. e.g., OpenStreetMap,WalkScore, etc. e.g., Factual, Marinexplore, etc. e.g., Facebook Data feeds structured for public private partnerships 56Sunday, 28 July 13
  • 57. Open Data ecosystem – caveats municipal departments publishing platforms aggregators data product vendors end use cases e.g., Palo Alto, San Francisco, etc. e.g., Junar, Socrata, etc. e.g., OpenStreetMap,WalkScore, etc. e.g., Factual, Marinexplore, etc. e.g., Facebook Required Focus • respond to viable use cases • not budgeting hackathons 57Sunday, 28 July 13
  • 58. Open Data ecosystem – caveats municipal departments publishing platforms aggregators data product vendors end use cases e.g., Palo Alto, San Francisco, etc. e.g., Junar, Socrata, etc. e.g., OpenStreetMap,WalkScore, etc. e.g., Factual, Marinexplore, etc. e.g., Facebook Required Focus • surface the metadata • curate, allowing for joins/aggregation • not scans as PDFs 58Sunday, 28 July 13
  • 59. Open Data ecosystem – caveats municipal departments publishing platforms aggregators data product vendors end use cases e.g., Palo Alto, San Francisco, etc. e.g., Junar, Socrata, etc. e.g., OpenStreetMap,WalkScore, etc. e.g., Factual, Marinexplore, etc. e.g., Facebook Required Focus • make APIs consumable by automation • allow for probabilistic usage • not OSS licensing for data 59Sunday, 28 July 13
  • 60. Open Data ecosystem – caveats municipal departments publishing platforms aggregators data product vendors end use cases e.g., Palo Alto, San Francisco, etc. e.g., Junar, Socrata, etc. e.g., OpenStreetMap,WalkScore, etc. e.g., Factual, Marinexplore, etc. e.g., Facebook Required Focus • supply actionable data • track data provenance carefully • provide feedback upstream, i.e., cleaned data at source • focus on core verticals 60Sunday, 28 July 13
  • 61. Open Data ecosystem – caveats municipal departments publishing platforms aggregators data product vendors end use cases e.g., Palo Alto, San Francisco, etc. e.g., Junar, Socrata, etc. e.g., OpenStreetMap,WalkScore, etc. e.g., Factual, Marinexplore, etc. e.g., Facebook Required Focus • address consumer needs • identify community benefits of the data 61Sunday, 28 July 13