1
Big Social Data:
The Spatial Turn in Big Data
Rich Heimann, UMBC Adjunct Faculty
Abe Usher, HumanGeo Group
May 9, 2013
2
Agenda
 Major Trends; Foundational Definitions. [Abe]
 Long Tail of Big Social Data [Rich]
 Laws of the Spatial Sciences [Rich]
– Big Data; Small Theory [Rich]
 Important Big Data Concepts [Abe]
– The Kitchen Model [Abe]
 Vignettes [Rich & Abe]
 So, what?
 Additional Resources
3
Major Trends
 Location Explosion 2004- present
4
 Location Explosion 2004- present
 Proliferation of mobile computing
Major Trends
7 billion devices in 2014
5
 Location Explosion 2004- present
 Proliferation of mobile computing
 Social networking
Major Trends
> 700 million comments daily
> 144 million connections daily
6
 Location Explosion 2004- present
 Proliferation of mobile computing
 Social networking
 Gamification of geo
Major Trends
7
 Location Explosion 2004- present
 Proliferation of mobile computing
 Social networking
 Gamification of geo
Impact:
Continuous, global geo-located observations,
shared across the Internet.
Major Trends
8
Definitions
 Volunteered Geographic Information* (VGI)
“harnessing of tools to create, assemble, and
disseminate geographic data provided voluntarily
by individuals”
* http://en.wikipedia.org/wiki/Volunteered_geographic_information
9
 Volunteered Geographic Information (VGI)
 Social Media
"a group of Internet-based applications that build
on the ideological and technological foundations
of Web 2.0, and that allow the creation and
exchange of user-generated content”
* http://goo.gl/oSrIS
Definitions
10
 Volunteered Geographic Information (VGI)
 Social Media
 Big Data
“is high-volume, velocity and variety information
assets that demand cost-effective, innovative
forms of information processing for enhanced
insight and decision making”
* http://goo.gl/DFFbr
Definitions
11
Long Tail: Traditional social science data
Head: Big Data; nontraditional social science data
Head: Big Data – large, continuous datasets, coincident over time and space; ideal for multivariate analysis.
Tail {power-law distribution}: Data in the tail is often individually curated and unmaintained beyond its initially designed use case. As a result, it is discontiguous from other research efforts and discontinuous over space and time.
Dark data is data that is suspected to exist, or ought to exist, but is difficult or impossible to find. The problem of dark data is real and prevalent in the tail; the long tail is an intractably large management problem.
Long Tail of Big Social Data
12
Power law               80%                20%
Number of Grants        7,478              1,869
Dollar Amount           $938,548,595       $1,199,088,125
Total Grants (NSF07):   9,347 (Count)      $2,137,636,716 (Amount)
Long Tail of NSF Data
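The head/tail split above can be sketched directly; a minimal example with hypothetical grant amounts (not the NSF07 figures), showing how the top 20% of grants by count can hold the majority of the dollars:

```python
def head_tail_split(amounts, head_frac=0.2):
    # Sort items descending; the "head" is the top head_frac by count,
    # the long tail is everything else.
    ordered = sorted(amounts, reverse=True)
    k = int(len(ordered) * head_frac)
    return sum(ordered[:k]), sum(ordered[k:])

# Hypothetical grant sizes (NOT the NSF07 figures above).
amounts = [100, 90, 10, 8, 6, 5, 4, 3, 2, 1]
head_total, tail_total = head_tail_split(amounts)
# Two of ten grants (the head) hold 190 of the 229 total dollars.
```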
13
Tobler’s First Law of Geography (TFLG) [Tobler, 1970]
TFLG: “Everything is related to everything else, but near things are more related than distant things.”
Spatial Heterogeneity
Proposed as a “second law of geography” [Goodchild, 2003].
Spatial Simpson’s Paradox
A global model will always compete with, and may be inconsistent with, local models.
Anyon (1982): social science should be empirically grounded, theoretically explanatory, and socially critical.
Laws of Spatial Science
http://www.bigdatarepublic.com/author.asp?section_id=2948
14
Spatial Simpson’s Paradox
Global standards will always compete with local social phenomena.
[Figure: maps contrasting violence in the north with violence in the south]
Global models average regionally variant phenomena. Local models account for regional variation.
Big Data; Small Theory
15
Important Big Data Concepts
 Aggregation
 Association
 Correlation
16
Important Big Data Concepts
 Aggregation
 Quantitative methods for creating descriptive statistics
 Association
 Methods of identifying relationships of one data
element to another
 Correlation
 The process of quantifying a correspondence between
two comparable entities
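The three concepts can be sketched on toy data; the posts and store counts below are hypothetical, not drawn from the talk's actual Twitter sample:

```python
from collections import Counter

# Notional geolocated posts: (region, mentions_coffee).
posts = [("north", 1), ("north", 1), ("north", 0),
         ("south", 1), ("south", 0), ("south", 0)]

# Aggregation: descriptive statistics per region.
counts = Counter(region for region, _ in posts)

# Association: relate one data element (region) to another (coffee mentions).
coffee_by_region = Counter(region for region, hit in posts if hit)

# Correlation: quantify correspondence between two comparable series.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

mentions = [coffee_by_region[reg] for reg in ("north", "south")]
stores   = [2, 1]   # hypothetical store counts per region
r = pearson(mentions, stores)
```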
17
Two Vignettes
1. Spatial patterning of Tweet composition from the 2012 Presidential election.
2. Pattern-of-life analysis of a major US city.
18
Kitchen Model
Chef Ingredients Utensils Recipes
19
Kitchen Model
Chef Ingredients Utensils Recipes
20
Practice: Recommended Tools
• Python
• R
• Quantum GIS
• Google Earth
21
Vignette 1:
The Flesch-Kincaid Reading Algorithm
22
RE = 206.835 – (1.015 x ASL) – (84.6 x ASW)
RE = Readability Ease; ASL = Average Sentence Length (i.e., the number of words
divided by the number of sentences); ASW = Average number of syllables per word (i.e.,
the number of syllables divided by the number of words)
The output, RE, is a number generally ranging from 0 to 100. The higher the number,
the easier the text is to read.
• Scores between 90.0 and 100.0 are considered easily understandable by an average
5th grader.
• Scores between 60.0 and 70.0 are considered easily understood by 8th and 9th
graders.
• Scores between 0.0 and 30.0 are considered easily understood by college graduates.
The Flesch-Kincaid Reading Algorithm
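A minimal Python sketch of the RE formula above; the syllable counter is a naive vowel-group heuristic, not the dictionary-based counting a production readability tool would use:

```python
import re

def count_syllables(word):
    # Naive heuristic: count runs of consecutive vowels. Real readability
    # tools use dictionaries or smarter rules; this is only a sketch.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)                          # ASL
    asw = sum(count_syllables(w) for w in words) / len(words)  # ASW
    return 206.835 - 1.015 * asl - 84.6 * asw

score = flesch_reading_ease("The cat sat on the mat.")
```

Very simple text can score above 100, as this example does.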
23
Clean Text “this gas situation is absolutely ridiculous.”
Language english
Latitude 41.0862
Longitude -74.1520
USERID “ ”
Kincaid 14.3
Flesch 3.3
Flesch-Kincaid (Mean Centered) -76.273849
Leesbaarheid Score 56
Leesbaarheid Grade 11
The Flesch-Kincaid Reading Algorithm
24
Clean Text “down here in beach bout to shut this down
wit & feeling the vibe s.”
Language english
Latitude 33.68709
Longitude -78.88915
USERID “ ”
Kincaid 3.5
Flesch 100
Flesch-Kincaid (Mean Centered) 20.42615
Leesbaarheid Score 22.9
Leesbaarheid Grade 4
The Flesch-Kincaid Reading Algorithm
25
Time Span: 2012-10-23 to 2012-11-06 (1 temporal bin, 2 weeks);
Spatial Area: Data Clipped to US;
Original Sample: 110,737 obs; 418,085 words & 1,446,494 characters without stop words
(519,974 & 2,326,500 with stop words);
Data processing: Removal of hashtags, @{users}, URLs, thresholding and mean centering;
Pruned Sample: 47,690 observations;
Method: Local Indicator of Spatial Autocorrelation (Moran’s I) with LISA Classifications of
High-High (HH), Low-Low (LL), High-Low (HL), Low-High (LH);
Spatial Weights: knn40;
Data Reduction: pseudo p-values 0.05, 0.01, 0.001.
By the numbers...
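The global statistic underlying the LISA classifications above can be sketched as follows; the toy adjacency weights are a stand-in assumption, not the talk's knn40 weights over 3-digit ZIP codes:

```python
def morans_i(x, w):
    # Global Moran's I:
    #   I = (n / S0) * sum_ij w_ij*(x_i - m)*(x_j - m) / sum_i (x_i - m)^2
    n = len(x)
    m = sum(x) / n
    dev = [v - m for v in x]
    s0 = sum(sum(row) for row in w)          # total weight
    num = sum(w[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den

# Four cells on a line with adjacent-neighbour weights.
x = [1.0, 2.0, 3.0, 4.0]
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
I = morans_i(x, w)   # positive: similar values sit next to each other
```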
26
Region               mean      SD      0%       25%    50%   75%    100%    data:n
East North Central   0.6193    16.514  -76.274  -5.77  4.93  11.92  20.426   7579
East South Central   0.6314    16.576  -74.673  -5.27  4.93  12.23  20.426   3028
Mid-Atlantic        -0.1988    16.590  -76.273  -6.47  3.73  11.43  20.426   6278
Mountain            -0.1212    16.586  -73.174  -7.00  4.32  11.43  20.426   2452
New England         -0.1837    16.864  -73.174  -7.00  4.32  11.43  20.426   2392
Pacific             -0.8560    17.276  -78.274  -7.78  3.72  11.43  20.426   5390
Southeast            0.1469    16.730  -79.373  -5.78  4.32  11.43  20.426  10022
West North Central   0.6010    16.385  -78.274  -5.78  5.22  12.23  20.426   2781
West South Central   0.8323    16.386  -79.273  -4.77  5.33  12.12  20.426   5572
The Flesch-Kincaid Reading Algorithm
27
The Flesch-Kincaid Reading Algorithm
library(ggplot2)
ggplot(Twitter, aes(x=regiontxt, y=flecMC)) +
  geom_point(colour="lightblue", alpha=0.1, position="jitter") +
  geom_boxplot(outlier.size=1, alpha=0.1) +
  labs(x="Region", y="Flesch Kincaid Index")
boxplot(flecMC~regiontxt, ylab="flecMC", xlab="regiontxt", data=Twitter)
https://gist.github.com/rheimann/5525909
29
https://github.com/rheimann
The Flesch-Kincaid Reading Algorithm
Raw Data: data:n 47,690
30
High, High [n=77] = El Paso, Oklahoma City, Omaha, Detroit, Memphis
Low, Low   [n=74] = NYC & San Jose #nerds
Low, High  [n=53] = Sacramento
High, Low  [n=55] = Wichita, Kansas City, Tulsa, Nashville
pseudo p-value < 0.05
data:n 862 (3-digit Zip Codes)
[Map callouts: Gassaway, WV; Watertown, NY; Ithaca, NY; Columbus, OH; Fresno, CA]
https://github.com/rheimann
The Flesch-Kincaid Reading Algorithm
31
Rank  ZIP code, City, State        Median Home Price ($)   Flesch-Kincaid Index (Mean Centered)   Leesbaarheid School Index
100 Zip Code                                               -3.2266                                5.44
6     10014, New York, NY          4,116,506
8     10021, New York, NY          3,980,829
1     10065, New York, NY          6,534,430
10    10075, New York, NY          3,885,409
076 Zip Code                                               -3.761                                 5.5
2     07620, Alpine, NJ            5,745,038
119 Zip Code                                               -0.0538                                5.2
4     11962, Sagaponack, NY        4,180,385
940 Zip Code
3     94027, Atherton, CA          4,897,864
5     94010, Hillsborough, CA      4,127,250
7     94022, Los Altos Hills, CA   4,016,050               -3.596                                 5.87
The Flesch Reading Ease Algorithm
32
Green Eggs and Ham by Dr. Seuss averages 5.7 words per sentence
and 1.02 syllables per word, with a grade level of −1.3. (Most of the 50
words used are monosyllabic; "anywhere", which occurs 8 times, is
the only exception.) The 50-dimensional word space is small.
Even in this fairly small Twitter sample, after extensive data processing to
remove words of count 1 and words of fewer than three characters, the
space is still 12,603-dimensional.
Data processing includes removing stop words and stemming.
110,737 obs; 418,085 words & 1,446,494 characters without stop words
(519,974 & 2,326,500 with stop words);
Top 50 words include: [romney, obama, election, vote, hope]
Green Eggs and Ham: N - Dimensional Problems
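The vocabulary pruning described above (drop stop words, count-1 words, and words shorter than three characters) can be sketched as follows; the documents and stop list are hypothetical, and stemming is omitted for brevity:

```python
import re
from collections import Counter

def vocabulary(docs, stop_words=frozenset()):
    # Tokenize, then drop stop words, words shorter than three characters,
    # and words occurring only once.
    counts = Counter(w for d in docs
                     for w in re.findall(r"[a-z']+", d.lower())
                     if w not in stop_words and len(w) >= 3)
    return {w for w, c in counts.items() if c > 1}

# Hypothetical documents; each surviving word is one dimension of the space.
docs = ["vote romney vote", "vote obama", "obama hope", "go vote"]
vocab = vocabulary(docs, stop_words={"go"})
```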
33
Vignette 2:
Spatial Patterns of Activity
34
Spatial Patterns of Activity:
Geolocated Social Media
New forms of aggregation unlock
new insights in your data.
 Useful for coarse
pattern analysis
 Looks interesting
 Difficult to analyze
directly
35
Chef: Rich & Abe
Ingredients: Geolocated Social Media
Utensils: Python
Recipes: Geohash Algorithm (Code on Github)
Spatial Patterns of Activity:
Applying the Kitchen Sink
36
 States, Counties, and Census tracts
 All different sizes
 Sometimes change
 This is a problem: the Modifiable Areal Unit Problem (MAUP)
http://goo.gl/wQLTW
Spatial Patterns of Activity:
Let’s use Political Boundaries
37
 States, Counties, and Census tracts
 All different sizes
 Sometimes change
 This is a problem: the Modifiable Areal Unit Problem (MAUP)
http://goo.gl/wQLTW
Spatial Patterns of Activity:
Let’s NOT use Political Boundaries
38
 Invented in 2008 by Gustavo Niemeyer
 Similar to a quadtree; breaks the world into rectangles
 Based on a Z-order (z-curve) algorithm
 Useful for 2-D binning
Spatial Patterns of Activity:
Geohash
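A minimal pure-Python sketch of geohash encoding as described above: alternately bisect the longitude and latitude intervals, emitting one bit per step, five bits per base-32 character.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, nbits, even, out = 0, 0, True, []
    while len(out) < precision:
        if even:   # longitude bit
            mid = (lon_lo + lon_hi) / 2
            bits = (bits << 1) | (lon >= mid)
            if lon >= mid:
                lon_lo = mid
            else:
                lon_hi = mid
        else:      # latitude bit
            mid = (lat_lo + lat_hi) / 2
            bits = (bits << 1) | (lat >= mid)
            if lat >= mid:
                lat_lo = mid
            else:
                lat_hi = mid
        even = not even
        nbits += 1
        if nbits == 5:                  # five bits -> one base-32 character
            out.append(BASE32[bits])
            bits, nbits = 0, 0
    return "".join(out)
```

2-D binning then reduces to counting points per geohash prefix, e.g. `Counter(geohash_encode(lat, lon, 5) for lat, lon in points)`.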
39
[Figure: 2-D grid of notional geohash cell counts]
Spatial Patterns of Activity:
Geohash Math
Notional example:
Occurrence of geolocated tweets
related to coffee.
40
[Figure: 2-D grid of notional geohash cell counts]
Spatial Patterns of Activity:
Geohash Math
41
Spatial Patterns of Activity:
Geohash Math
42
Activity near Washington DC
Spatial Patterns of Activity:
3-d Google Earth
43
Activity near Washington DC
Spatial Patterns of Activity:
3-d Google Earth
44
Spatial Patterns of Activity:
Avoid the Classic Blunders
http://xkcd.com/1138/
45
Night activity near Washington DC
Spatial Patterns of Activity:
Isolating a Time Series
46
Spatial Patterns of Activity:
Isolating a Time Series
School Event
Tourists
School Event
Spatial Patterns of Activity:
A Caffeinated Example
 Aggregation
 Where is the most commentary
about coffee and Starbucks?
 Association
 Is commentary about coffee and
Starbucks associated with the
location of Starbucks stores?
(Yes)
 Correlation
 What is the numeric relationship
between geo-located coffee
commentary and actual stores?
Where is Starbucks?
81 spatial regions identified with textual references
to the words ‘coffee’ and/or ‘Starbucks.’
8 of the 81 regions are boxes that include both
references to ‘coffee’ and ‘Starbucks’ within a
narrow window of time.
7 of 8 (88%) accurately classify a region as
containing a Starbucks by using simple text
analysis alone.
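The tally above amounts to simple classification accuracy; a sketch with the slide's counts (variable names are illustrative):

```python
# 81 regions mention 'coffee' and/or 'Starbucks'; 8 mention both within a
# narrow time window; 7 of those truly contain a Starbucks store.
flagged = 8
truly_contain_starbucks = 7
accuracy = truly_contain_starbucks / flagged   # text analysis alone
```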
49
• Putting data in geospatial context unlocks
insight.
• Location teaches us more about what we
are analyzing.
• Adhere to statistical assumptions and
avoid misspecification in our models.
• The “Big Data” aspects of social media
mean that the faucet is always running,
enabling experimentation.
So, What?
50
Eugene Wigner (1963 Nobel Laureate)
“The Unreasonable Effectiveness of Mathematics in the Natural Sciences” (1960)
Peter Norvig, Director of Research at Google Inc.
“The Unreasonable Effectiveness of Data”
Academic Works; Embracing Complexity
51
Additional resources; Code and
stuff...
Rich Heimann
Code and Data: https://github.com/rheimann
Slides: http://www.slideshare.net/rheimann04
Twitter: @rheimann
UMBC: rheimann@umbc.edu
Company: Data Tactics Corporation: http://goo.gl/8QWty
Abe Usher
Code and Data: https://github.com/abeusher
Twitter: @abeusher
Company: HumanGeo Group: http://goo.gl/uDbZP
52
Thank you!!
http://www.umbc.edu/shadygrove/gis/gis.php
53
Recommended resources: Books
54
Foundational data:
1. Geonames.org: http://www.geonames.org/
2. GADM.org: http://gadm.org/
Streaming data:
1. Twitter API: https://dev.twitter.com/
2. Datasift: http://datasift.com/
3. GNIP: http://gnip.com/
Recommended resources: Data
Human Terrain Analysis at George Mason University (DAY 1)Rich Heimann
 
Human Terrain Analysis at George Mason University (DAY 1)
Human Terrain Analysis at George Mason University (DAY 1)Human Terrain Analysis at George Mason University (DAY 1)
Human Terrain Analysis at George Mason University (DAY 1)Rich Heimann
 
Data Tactics Analytics Brown Bag (November 2013)
Data Tactics Analytics Brown Bag (November 2013)Data Tactics Analytics Brown Bag (November 2013)
Data Tactics Analytics Brown Bag (November 2013)Rich Heimann
 
Spatial Analysis; The Primitives at UMBC
Spatial Analysis; The Primitives at UMBCSpatial Analysis; The Primitives at UMBC
Spatial Analysis; The Primitives at UMBCRich Heimann
 
Spatial Analysis and Geomatics
Spatial Analysis and GeomaticsSpatial Analysis and Geomatics
Spatial Analysis and GeomaticsRich Heimann
 
Week 1 Lecture @ UMBC
Week 1 Lecture @ UMBCWeek 1 Lecture @ UMBC
Week 1 Lecture @ UMBCRich Heimann
 

More from Rich Heimann (6)

Human Terrain Analysis at George Mason University (DAY 1)
Human Terrain Analysis at George Mason University (DAY 1)Human Terrain Analysis at George Mason University (DAY 1)
Human Terrain Analysis at George Mason University (DAY 1)
 
Human Terrain Analysis at George Mason University (DAY 1)
Human Terrain Analysis at George Mason University (DAY 1)Human Terrain Analysis at George Mason University (DAY 1)
Human Terrain Analysis at George Mason University (DAY 1)
 
Data Tactics Analytics Brown Bag (November 2013)
Data Tactics Analytics Brown Bag (November 2013)Data Tactics Analytics Brown Bag (November 2013)
Data Tactics Analytics Brown Bag (November 2013)
 
Spatial Analysis; The Primitives at UMBC
Spatial Analysis; The Primitives at UMBCSpatial Analysis; The Primitives at UMBC
Spatial Analysis; The Primitives at UMBC
 
Spatial Analysis and Geomatics
Spatial Analysis and GeomaticsSpatial Analysis and Geomatics
Spatial Analysis and Geomatics
 
Week 1 Lecture @ UMBC
Week 1 Lecture @ UMBCWeek 1 Lecture @ UMBC
Week 1 Lecture @ UMBC
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Big Social Data: The Spatial Turn in Big Data

  • 1. 1 Big Social Data: The Spatial Turn in Big Data Rich Heimann, UMBC Adjunct Faculty Abe Usher, HumanGeo Group May 9, 2013
  • 2. 2 Agenda  Major Trends; Foundational Definitions. [Abe]  Long Tail of Big Social Data [Rich]  Laws of the Spatial Sciences [Rich] – Big Data; Small Theory [Rich]  Important Big Data Concepts [Abe] – The Kitchen Model [Abe]  Vignettes [Rich & Abe]  So, what?  Additional Resources 2
  • 3. 3 Major Trends  Location Explosion 2004- present
  • 4. 4  Location Explosion 2004- present  Proliferation of mobile computing Major Trends 7 billion devices in 2014
  • 5. 5  Location Explosion 2004- present  Proliferation of mobile computing  Social networking Major Trends > 700 million comments daily > 144 million connections daily
  • 6. 6  Location Explosion 2004- present  Proliferation of mobile computing  Social networking  Gamification of geo Major Trends
  • 7. 7  Location Explosion 2004- present  Proliferation of mobile computing  Social networking  Gamification of geo Impact: Continuous, global geo-located observations, shared across the Internet. Impact: Continuous, global geo-located observations, shared across the Internet. Major Trends
  • 8. 8 Definitions  Volunteered Geographic Information* (VGI) “harnessing of tools to create, assemble, and disseminate geographic data provided voluntarily by individuals” * http://en.wikipedia.org/wiki/Volunteered_geographic_information 8
  • 9. 9  Volunteered Geographic Information (VGI)  Social Media "a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content” * http://goo.gl/oSrIS 9 Definitions
  • 10. 10  Volunteered Geographic Information (VGI)  Social Media  Big Data “is high-volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” * http://goo.gl/DFFbr 10 Definitions
  • 11. 11 Long Tail of Big Social Data. Tail: traditional social science data. Head: Big Data; nontraditional social science data – large, continuous datasets coincident over time and space, ideal for multivariate analysis. Tail {power-law distribution}: data in the tail is often unmaintained beyond its initially designed use case and individually curated; as a result, it is discontiguous from other research efforts and discontinuous over space and time. Dark data is suspected to exist, or ought to exist, but is difficult or impossible to find; the problem of dark data is real and prevalent in the tail. The long tail is an intractably large management problem.
  • 12. 12 Long Tail of NSF Data. Power law: 80% of grants (7,478) account for $938,548,595; 20% of grants (1,869) account for $1,199,088,125. Total grants (NSF07): 9,347 (count), $2,137,636,716 (amount).
  • 13. 13 Laws of Spatial Science. Tobler’s [Tobler, 1970] First Law of Geography (TFLG): “All things are related, but nearby things are more related than distant things.” Spatial heterogeneity: the “second law of geography” [Goodchild, 2003]. Spatial Simpson’s Paradox: a global model will always compete, and may be inconsistent, with local models. Anyon (1982): social science should be empirically grounded, theoretically explanatory and socially critical. http://www.bigdatarepublic.com/author.asp?section_id=2948
  • 14. 14 Big Data; Small Theory. Spatial Simpson’s Paradox: global standards will always compete with local social phenomena. Example: violence in the south vs. violence in the north. Global models average regionally variant phenomena; local models account for regional variation. 14
  • 15. 15 Important Big Data Concepts  Aggregation  Association  Correlation 15
  • 16. 16 Important Big Data Concepts  Aggregation  Quantitative methods for creating descriptive statistics  Association  Methods of identifying relationships of one data element to another  Correlation  The process of quantifying a correspondence between two comparable entities
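The three concepts above can be sketched in a few lines of Python. The tweet and store-visit counts below are made-up toy numbers, not data from the talk; the toy visits are constructed to be exactly linear in the tweet counts.

```python
from statistics import mean

# Hypothetical daily counts: coffee-related tweets vs. store visits (toy data)
tweets = [12, 15, 9, 20, 18, 7, 25]
visits = [30, 36, 24, 46, 42, 20, 56]

# Aggregation: a descriptive statistic over the raw observations
mean_tweets = mean(tweets)

# Correlation: quantify the correspondence between two comparable series
def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(tweets, visits)  # exactly linear toy data, so r = 1.0
```

Association, the middle concept, is the qualitative version of the same idea: noticing that the two series move together before quantifying it.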
  • 17. 17 Two Vignettes 1. Spatial patterning of tweet composition from the presidential elections of 2012. 2. Pattern-of-life analysis of a major US city.
  • 20. 20 Practice: Recommended Tools • Python • R • Quantum GIS • Google Earth 20
  • 22. 22 RE = 206.835 – (1.015 x ASL) – (84.6 x ASW) RE = Readability Ease; ASL = Average Sentence Length (i.e., the number of words divided by the number of sentences); ASW = Average number of syllables per word (i.e., the number of syllables divided by the number of words) The output, i.e., RE is a number generally ranging from 0 to 100. The higher the number, the easier the text is to read. • Scores between 90.0 and 100.0 are considered easily understandable by an average 5th grader. • Scores between 60.0 and 70.0 are considered easily understood by 8th and 9th graders. • Scores between 0.0 and 30.0 are considered easily understood by college graduates. The Flesch-Kincaid Reading Algorithm
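The formula above is straightforward to sketch in Python. The syllable counter below is a naive vowel-group heuristic, not the dictionary-based counting a production readability tool would use, so its scores are approximate.

```python
import re

def count_syllables(word):
    # Naive heuristic: count runs of vowels; every word gets at least one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def reading_ease(text):
    # RE = 206.835 - (1.015 x ASL) - (84.6 x ASW)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)                          # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)  # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

score = reading_ease("The cat sat.")  # ASL=3, ASW=1 -> 119.19
```

Long sentences and polysyllabic words drive the score down, which is why the tweet examples on the following slides spread so widely.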
  • 23. 23 Clean Text “this gas situation is absolutely ridiculous.” Language english Latitude 41.0862 Longitude -74.1520 USERID “ ” Kincaid 14.3 Flesch 3.3 Flesch-Kincaid (Mean Centered) -76.273849 Leesbaarheid Score 56 Leesbaarheid Grade 11 The Flesch-Kincaid Reading Algorithm
  • 24. 24 Clean Text “down here in beach bout to shut this down wit & feeling the vibe s.” Language english Latitude 33.68709 Longitude -78.88915 USERID “ ” Kincaid 3.5 Flesch 100 Flesch-Kincaid (Mean Centered) 20.42615 Leesbaarheid Score 22.9 Leesbaarheid Grade 4 The Flesch-Kincaid Reading Algorithm
  • 25. 25 Time Span: 2012-10-23 to 2012-11-06 (1 temporal bin, 2 weeks); Spatial Area: Data Clipped to US; Original Sample: 110,737 obs; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words); Data processing: Removal of hashtags, @{users}, URLs, thresholding and mean centering; Pruned Sample: 47,690 observations; Method: Local Indicator of Spatial Autocorrelation (Moran’s I) with LISA Classifications of High-High (HH), Low-Low (LL), High-Low (HL), Low-High (LH); Spatial Weights: knn40; Data Reduction: pseudo p-values 0.05, 0.01, 0.001. By the numbers...
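Moran’s I, the statistic behind the LISA classifications listed above, can be computed directly from a value vector and a spatial-weights structure. This is a minimal global Moran’s I sketch; the slide’s analysis used local indicators with knn40 weights, and the weights dictionary here is a toy four-node chain, not the real weights.

```python
def morans_i(x, w):
    """Global Moran's I. x: list of values; w: dict {(i, j): weight}."""
    n = len(x)
    xbar = sum(x) / n
    z = [v - xbar for v in x]
    s0 = sum(w.values())  # sum of all weights
    num = sum(wij * z[i] * z[j] for (i, j), wij in w.items())
    den = sum(zi * zi for zi in z)
    return (n / s0) * (num / den)

# Toy example: four locations on a line, adjacent neighbors weighted 1 (symmetric)
w = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}
x = [0, 0, 1, 1]          # clustered values -> positive spatial autocorrelation
i_stat = morans_i(x, w)   # 1/3 for this configuration
```

In practice a library such as PySAL handles the weights construction and the pseudo p-values via permutation.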
  • 26. 26 The Flesch-Kincaid Reading Algorithm. Regional summary of the mean-centered index (mean, SD, min, 25%, 50%, 75%, max, n): East North Central: 0.6193, 16.514, -76.274, -5.77, 4.93, 11.92, 20.426, 7,579. East South Central: 0.6314, 16.576, -74.673, -5.27, 4.93, 12.23, 20.426, 3,028. Mid-Atlantic: -0.1988, 16.590, -76.273, -6.47, 3.73, 11.43, 20.426, 6,278. Mountain: -0.1212, 16.586, -73.174, -7.00, 4.32, 11.43, 20.426, 2,452. New England: -0.1837, 16.864, -73.174, -7.00, 4.32, 11.43, 20.426, 2,392. Pacific: -0.8560, 17.276, -78.274, -7.78, 3.72, 11.43, 20.426, 5,390. Southeast: 0.1469, 16.730, -79.373, -5.78, 4.32, 11.43, 20.426, 10,022. West North Central: 0.6010, 16.385, -78.274, -5.78, 5.22, 12.23, 20.426, 2,781. West South Central: 0.8323, 16.386, -79.273, -4.77, 5.33, 12.12, 20.426, 5,572.
  • 27. 27 The Flesch-Kincaid Reading Algorithm (ggplot2): library(ggplot2); ggplot(Twitter, aes(x=regiontxt, y=flecMC)) + geom_point(colour="lightblue", alpha=0.1, position="jitter") + geom_boxplot(outlier.size=1, alpha=0.1) + labs(x="Region", y="Flesch Kincaid Index"). Base R equivalent: boxplot(flecMC~regiontxt, ylab="flecMC", xlab="regiontxt", data=Twitter). https://gist.github.com/rheimann/5525909
  • 28. 29 The Flesch-Kincaid Reading Algorithm. Raw data: n = 47,690. https://github.com/rheimann
  • 29. 30 High, High [n=77] Low, Low   [n=74] Low, High  [n=53] High, Low  [n=55] = El Paso, Oklahoma City, Omaha, Detroit, Memphis = NYC & San Jose #nerds = Sacramento = Wichita, Kansas City, Tulsa, Nashville pseudo p-value < 0.05 data:n 862 (3-digit Zip Codes) Gassaway, WV Watertown NY Ithaca NY Columbus OH Fresno CA https://github.com/rheimann The Flesch-Kincaid Reading Algorithm
  • 30. 31 The Flesch Reading Ease Algorithm. By 3-digit ZIP prefix (mean-centered Flesch-Kincaid index; Leesbaarheid school index), with ranked ZIP codes and median home prices: 100 ZIP code (-3.2266; 5.44) – rank 6: 10014, New York, NY, $4,116,506; rank 8: 10021, New York, NY, $3,980,829; rank 1: 10065, New York, NY, $6,534,430; rank 10: 10075, New York, NY, $3,885,409. 076 ZIP code (-3.761; 5.5) – rank 2: 07620, Alpine, NJ, $5,745,038. 119 ZIP code (-0.0538; 5.2) – rank 4: 11962, Sagaponack, NY, $4,180,385. 940 ZIP code (-3.596; 5.87) – rank 3: 94027, Atherton, CA, $4,897,864; rank 5: 94010, Hillsborough, CA, $4,127,250; rank 7: 94022, Los Altos Hills, CA, $4,016,050.
  • 31. 32 Green Eggs and Ham by Dr. Seuss averages 5.7 words per sentence and 1.02 syllables per word, with a grade level of -1.3. (Most of the 50 words used are monosyllabic; "anywhere", which occurs 8 times, is the only exception.) The 50-dimensional space is small. Even this fairly small Twitter sample, after extensive data processing to remove words with a count of 1 and words of fewer than three characters, yields an N = 12,603-dimensional space. Data processing includes removing stop words and stemming. 110,737 obs; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words). Top 50 words include: [romney, obama, election, vote, hope] Green Eggs and Ham: N-Dimensional Problems
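The pruning described above (drop stop words, short words, and words occurring only once) can be sketched with a Counter. The stop-word list and toy tweets below are illustrative only, and the real pipeline also stemmed.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "i"}

def pruned_vocabulary(texts):
    # Tokenize, drop stop words and words under three characters,
    # then remove hapax legomena (count == 1), as described on the slide.
    counts = Counter(
        w for t in texts for w in t.lower().split()
        if w not in STOPWORDS and len(w) >= 3
    )
    return {w for w, c in counts.items() if c > 1}

# Toy tweets built from the slide's top-50 words
vocab = pruned_vocabulary([
    "vote obama vote",
    "obama romney hope",
    "romney hope election",
])  # 'election' occurs once and is pruned
```

Each surviving word becomes one dimension of the document-term space, which is how the sample still lands at 12,603 dimensions after pruning.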
  • 33. 34 Spatial Patterns of Activity: Geolocated Social Media New forms of aggregation unlock new insights in your data.  Useful for coarse pattern analysis  Looks interesting  Difficult to analyze directly
  • 34. 35 Rich & Abe Geolocated Social Media Python Geohash Algorithm Code on Github Spatial Patterns of Activity: Applying the Kitchen Sink
  • 35. 36  States, Counties, and Census tracts  All different sizes  Sometimes change  This is a problem: MAUP http://goo.gl/wQLTW Spatial Patterns of Activity: Let’s use Political Boundaries
  • 36. 37  States, Counties, and Census tracts  All different sizes  Sometimes change  This is a problem: MAUP http://goo.gl/wQLTW Spatial Patterns of Activity: Let’s NOT use Political Boundaries
  • 37. 38  Invented in 2008 by Gustavo Niemeyer  Similar to quadtree; breaks the world into rectangles  Based on a z-curve algorithm  Useful for 2-d binning Spatial Patterns of Activity: Geohash
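Niemeyer’s geohash interleaves longitude and latitude bits (longitude first) and emits the result in a base-32 alphabet; truncating a hash yields the enclosing rectangle, which is what makes it useful for 2-d binning. A minimal encoder:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base-32 alphabet

def geohash_encode(lat, lon, precision=11):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count, even = [], 0, 0, True
    while len(chars) < precision:
        if even:                         # even bits refine longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:                            # odd bits refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:               # every 5 bits -> one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

For example, geohash_encode(57.64911, 10.40744) yields "u4pruydqqvj", the canonical worked example for the algorithm, and a shorter precision gives the enclosing cell's prefix.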
  • 38. 39 Spatial Patterns of Activity: Geohash Math. [Grid of notional per-cell counts.] Notional example: Occurrence of geolocated tweets related to coffee.
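The per-cell counts above amount to 2-d binning: map each point to a cell key, then count. The sketch below uses simple fixed-size rectangles as a stand-in for geohash prefixes, and the coordinates are made up for illustration.

```python
import math
from collections import Counter

def bin_points(points, cells_per_degree=10):
    """Count (lat, lon) points per rectangular cell - a stand-in for geohash binning."""
    def cell(lat, lon):
        return (math.floor(lat * cells_per_degree),
                math.floor(lon * cells_per_degree))
    return Counter(cell(lat, lon) for lat, lon in points)

# Hypothetical geolocated 'coffee' tweets: two near each other, one far away
counts = bin_points([(38.93, -77.04), (38.96, -77.02), (40.75, -73.99)])
```

With real geohashes the cell key would simply be a truncated hash string, and coarser truncation gives coarser bins.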
  • 40. 41 Spatial Patterns of Activity: Geohash Math
  • 41. 42 Activity near Washington DC Spatial Patterns of Activity: 3-d Google Earth
  • 42. 43 Activity near Washington DC Spatial Patterns of Activity: 3-d Google Earth
  • 43. 44 Spatial Patterns of Activity: Avoid the Classic Blunders http://xkcd.com/1138/
  • 44. 45 Night activity near Washington DC Spatial Patterns of Activity: Isolating a Time Series
  • 45. 46 Spatial Patterns of Activity: Isolating a Time Series School Event Tourists School Event
  • 46. Spatial Patterns of Activity: A Caffeinated Example  Aggregation  Where is the most commentary about coffee and Starbucks?  Association  Is commentary about coffee and Starbucks associated with the location of Starbucks stores? (Yes)  Correlation  What is the numeric relationship between geo-located coffee commentary and actual stores?
  • 47. Where is Starbucks? 81 spatial regions identified with textual references to the words ‘coffee’ and/or ‘Starbucks.’ 8 of the 81 regions are boxes that include both references to ‘coffee’ and ‘Starbucks’ within a narrow window of time. 7 of 8 (88%) accurately classify a region as containing a Starbucks by using simple text analysis alone. [Chart values: .09, .52, .88]
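The 7-of-8 figure above is a precision calculation over co-occurrence candidates. A notional reconstruction follows; the region flags below are fabricated to match the slide’s tallies, not the actual data.

```python
def cooccurrence_precision(regions):
    """regions: list of (mentions_coffee, mentions_starbucks, has_store) flags.
    Candidates are regions mentioning both terms; precision is the share
    of candidates that actually contain a store."""
    candidates = [r for r in regions if r[0] and r[1]]
    hits = sum(1 for _, _, store in candidates if store)
    return hits / len(candidates)

# 81 notional regions: 8 candidates mention both terms, 7 of those
# contain a real Starbucks (as on the slide)
regions = ([(True, True, True)] * 7 + [(True, True, False)]
           + [(True, False, False)] * 40 + [(False, True, True)] * 33)
precision = cooccurrence_precision(regions)  # 7/8 = 0.875
```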
  • 48. 49 • Putting data in geospatial context unlocks insight. • Location teaches us more about what we are analyzing. • Adhere to statistical assumptions and avoid misspecification in our models. • The “Big Data” aspects of social media mean that the faucet is always running, enabling experimentation. So, What?
  • 49. 50 Academic Works; Embracing Complexity. Eugene Wigner (Nobel Laureate): “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” (1960). Peter Norvig, Director of Research at Google Inc.: “The Unreasonable Effectiveness of Data”.
  • 50. 51 Additional resources; Code and stuff... Rich Heimann Code and Data: https://github.com/rheimann Slides: http://www.slideshare.net/rheimann04 Twitter: @rheimann UMBC: rheimann@umbc.edu Company: Data Tactics Corporation: http://goo.gl/8QWty Abe Usher Code and Data; https://github.com/abeusher Twitter: @abeusher Company: HumanGeo Group: http://goo.gl/uDbZP
  • 53. 54 Recommended resources: Data. Foundational data: 1. Geonames.org: http://www.geonames.org/ 2. GADM.org: http://gadm.org/ Streaming data: 1. Twitter API: https://dev.twitter.com/ – Datasift: http://datasift.com/ 2. GNIP: http://gnip.com/

Editor's Notes

  1. So, over the next hour Abe and I are going to cover some elements of the Big Data movement, most notably the access to social media data, in our case Flickr and Twitter, and some general pathways for analysis. We expect to deliver content over the next 45–50 minutes and leave 10 minutes or so for questions. I would add that about half of the time will be spent on two vignettes designed specifically for this webinar. Abe will explore pattern-of-life analysis and I will explore the spatial patterning of tweet composition from the presidential elections of 2012, though I will admit it has little to do with anything political. We will cover major trends within the Big Data movement, complementary trends in location-based data and services, and foundational definitions. This is a nice backdrop and draws a contrast with traditional social scientific research, which I will attempt to explain in the Long Tail of Big Social Data. I will also cover briefly some laws of the spatial sciences, and Abe will discuss important Big Data concepts. Then we will cover our vignettes and quickly wrap up with the “So, what?” and direction to additional resources. To that point, I am excited to share with you that all of the data and material, as well as code, will be available following this webinar. So, without further delay, I pass things off to Abe Usher.
  2. In 2004, Google purchased a startup technology called ‘Keyhole’ that allowed Internet users to explore satellite images of the Earth. This product quickly became the application “Google Earth” – one of the most downloaded computer applications of all time, with more than 500 million downloads. To understand how Big Data technology and social media relate to geography, we must first understand four major trends that are shaping how people create and interact with data. The first major trend is an explosion in location-based information, which started in mid-2004. Google’s decision to make Google Earth free to everyone on the Internet was a pivotal moment in history. This application opened people’s minds to the possibility of exploring the world from the comfort of home. It also presented a standard data format, named ‘Keyhole Markup Language’, or KML for short. Google Earth enables users to annotate the globe with their own observations – for example, marking the location of their home or school with a map push-pin, tracing a road with a simple line tool, or outlining a plot of land with a polygon tool. By 2007, there were more than 300 million Google KML files describing the elements of geography freely floating around the Internet. Google Earth significantly lowered the bar for exploring satellite images of the world, and for sharing geospatial facts with others. We’ll examine this activity further when we discuss volunteered geographic information later in our talk.
  3. The second major trend shaping how people create and interact with information is the proliferation of mobile computing technology. In 1999, Research In Motion released the ‘Blackberry’ – an innovative mobile phone device that allows people to make phone calls and send and receive emails on the same device. In January 2007, Steve Jobs released the Apple iPhone. This was a major improvement over the Blackberry – it was a combination phone, iPod music device, and mobile computer with the ability to install apps from a central app store. In September 2008, Google released a competing mobile computing device named Android. In April 2010, Apple released a GSM-enabled computing tablet called the iPad. In February 2013, Google released an augmented-reality head-mounted display computing device called “Google Glass.” The evolution of mobile computing elements continues to accelerate. The International Telecommunications Union estimates that the number of mobile phones and computing devices will exceed the population of the world in 2014. That’s more than 7 billion mobile devices.
  4. The third major trend shaping how people create and interact with information is the use of social networking websites, and the creation of social media content. The number of Internet users in the world is currently approximately 2.2 billion. People with access to the Internet increasingly are interested in using it as a platform for interacting with others, and maintaining relationships through social networking sites such as Facebook, LinkedIn, and Twitter. Why is this interesting? The amount of raw content generated by these sites daily is staggering. For example, every 20 minutes on Facebook more than 10.2 million comments are posted to the site, and two million friend requests are accepted. That’s more than 700 million comments and 144 million friend connects – per day! Although most of this is informal, unstructured content – it is a rich set of observations that are constantly being generated. Never before in the history of humankind have scientists and researchers had so much potential data to work with.
  5. The fourth major trend shaping how people create and interact with information is what I call the gamification of geo. With the three requisite preceding trends in place, there is an emerging behavior where Internet users now share location information as part of games and transparency-oriented interaction. Photo-sharing sites Flickr & Panoramio encourage Internet users to share geolocated photos with textual descriptions. Twitter allows users to associate GPS-specified coordinates with their tweets. The logical extreme of this type of activity is FourSquare, where users are provided with point- and merit-badge-based incentives to share their own location information and observations about locations they visit.
  6. The impact of these trends is a system of continuous, global geo-located observations, shared across the Internet.
  7. Volunteered geographic information, or VGI for short, is the harnessing of tools to create, assemble, and disseminate geographic data provided voluntarily by individuals. Sites that contribute to this phenomenon include: OpenStreetMap Wikimapia Google Map Maker Flickr Panoramio Twitter Instagram Locr.com Just to name a few. VGI is interesting, because it is a way that billions of Internet users are informally collaborating to create an aggregate understanding of the world around them. In a way that eclipses the capability of any single corporation or nation state, the loosely coupled community of geospatially savvy Internet users are creating the greatest foundational database of geospatial content ever produced.
  8. Social media as defined by Andreas Kaplan is "a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content." http://goo.gl/oSrIS Less formally, you could say that social media is interactive content on the Internet that is fundamentally generated and managed by the user community. This includes major web destinations such as Facebook, Twitter, and Wikipedia, as well as homegrown wikis, blogs, forums, and online bulletin boards. Social media is interesting from a research perspective because it generates such a high volume of artifacts for analysis.
  9. There are many definitions of big data, each with interesting nuances. A highly practical definition is the one put forth by the IT consultancy Gartner: big data "is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
  10. In an effort to explain big data I have decided to compare it to what we have historically thought of as social science research. This is more of a thought experiment than anything rigorous and empirical, though I do have some empirical evidence to support the notion. The distribution in this slide is a power law distribution, where observations are farther from the mean than they would otherwise be under a normal distribution. A signature quality of a power law is the long tail and the large number of occurrences far from the "head," or central part, of the distribution. The tail is where I place traditional social science data. These data are often collected for small projects and are often forgotten and not maintained. The poor curation of these data leads to their inevitable misplacement; notionally, "dark data" is data suspected to exist, or that ought to exist, but that is difficult or impossible to find. The problem of dark data is real and prevalent in the tail. The utter lack of central management of data in the tail invariably leads these data to be forgotten. The long tail is an intractably large management problem, and an analytical one as well. The central curation of data in the head ensures maintenance, unlike data in the tail. The head of the distribution is where Big Data resides, and perhaps where the greatest impact is to human understanding and the advancement of human welfare. The head contains data that are large and homogeneous. The volume produces coincident datasets in time and space, unintentionally producing binding research across social science disciplines, and even between the natural and social sciences: what Edward Wilson called Consilience in his 1998 book of the same name on the unity of knowledge.
The datasets in the head are limited in number but have broad utility and appeal to many, many users, whereas resources in the tail, despite numbering in the thousands, appeal to only a few scientists, analysts, and decision makers. The coincident nature of data in the head makes them ideal for cross-correlation and multivariate analysis. Open Innovation initiatives hold particular promise for shared innovation and risk in research initiatives, developing what is in effect a shared virtual laboratory where we all can work on the same data.
  11. So, this could be empirical support for the notion communicated in the last slide. The National Science Foundation's grants, in dollar amounts, have been shown to follow a power law. Data in the tail are small, individually curated, and often unmaintained beyond their initially designed use case. As a result, these data are discontinuous from other research efforts and discontinuous over space and time.
  12. Moving on... Some important laws of the spatial sciences are as follows. Laws are important for a number of reasons. First, laws allow any subsequent data analysis to be constructed from first principles, and they provide the basis for predicting performance and making analytical design choices; not to mention that they are an asset of a strong and robust discipline. These laws specifically aid in pattern discovery and recognition. The first is commonly known as the First Law of Geography and states that "everything is related to everything else, but near things are more related than distant things." This is spatial dependency, and it is often measured with spatial autocorrelation, the relationship that a variable has with itself over space. The notion of "near" is what requires operationalizing with spatial weights. The second law is what Michael Goodchild called the "Second Law of Geography," or spatial heterogeneity. It is non-constant variance over space and is effectively a breakdown of the First Law. Variance in this case can technically be thought of relative to a standardized variable (mean 0, SD 1). It suggests non-stationarity, or multiple processes operating within our study area. A flavor of spatial heterogeneity is the spatial Simpson's paradox, where global models compete with local models. The final law isn't one of the spatial sciences but may be thought of as one of Big Social Data: tackling socially critical problems with Big Data. Examples of socially critical Big Data include, but are certainly not limited to, Google Flu Trends, The Billion Prices Project at MIT, the Global Pulse at the United Nations, as well as personal efforts on behalf of socially conscious data scientists.
  13. Here is an example of the spatial Simpson's paradox, where crime in the north and crime in the south, when collapsed over space, produce a "best fit" that is representative of neither. If this were a policy issue, the subsequent policy would be no good for anyone. David Kilcullen (2009) explains that today's conflicts are a complex hybrid of contrasting trends that counterinsurgencies continue to conflate, blurring the distinction between local and global struggles and thereby enormously complicating the challenges faced. Kilcullen steps through local and global struggles and outlines the importance of commensurate policy. This process can be characterized roughly as follows: statistically significant global variables that exhibit strong regional variation can inform local policy, while statistically significant global variables that exhibit little regional variation can inform region-wide policy. Now, Abe will discuss complementary mechanisms for approaching Big Data problems.
  14. There is an infinite variety of ways of approaching Big Data analysis. Three methods that are well documented by Google are aggregation, association, and correlation. Aggregation relates to quantitative methods for creating descriptive statistics. A simple example of this is the creation of counts, statistical means and medians, and standard deviations for an observed data set. We'll explain practical applications of this in both of our vignettes. Association relates to methods of identifying relationships of one data element to another. A very simple spatial example could be comparing geolocated tweets about coffee to the location of Starbucks and other coffee franchises. In many cases, a concentration of geolocated tweets is associated with the physical location of a coffee shop. Correlation is a special case of association: it is the process of quantifying a correspondence between two comparable entities. Rather than merely stating that there is a loose relationship between geolocated coffee tweets and coffee shops, with correlation we would actually attempt to create a numeric model that could be used for predicting the presence of a coffee shop based on a number of geolocated coffee tweets. Correlation is a special case because it can enable the creation of predictive models.
  16. Before we dive into our specific examples of data analysis, I'd like to introduce you to a metaphor for dealing with Big Data. I call it the Kitchen Model of Big Data analysis. A kitchen is a great metaphor for understanding value creation. Raw materials go in, and they are transformed into more valuable (and hopefully delicious) outputs. In the physical world, there are a number of factors that contribute to the output of a kitchen. The most important include: the skill level of the chef; the quality, quantity, and variety of ingredients; the utensils that are available for work; and recipes that are known to work well. In the quantitative data sciences, the chefs are the people, the ingredients are the data sets you have at your disposal, the utensils are the technical tools you choose to use, and the recipes are the repeatable methodology that you create for addressing a particular analytic question. Any time that you frame an analytic question, it is a useful exercise to consider if you have the right people, the right data, the right tools, and the right methodology. In good faith with the community of geospatial practitioners, we are happy to share our data, tools, and methodology with you in the form of two vignettes.
  18. At the end of the presentation, we have links to the source code we used to process data in our vignettes. This code is in the form of scripts that can be run without requiring any commercial licensed software. For aspiring data scientists and geospatial researchers, we recommend four no-cost tools: The python programming language, the R programming language, Quantum GIS, and Google Earth.
  19. Vignette one should bring some levity to Big Social Data, but it is all the same driven by a social aspect and ultimately analyzes data that could serve as a proxy for other, more substantive variables. My vignette analyzes Twitter data using the Flesch-Kincaid index, which you may all be familiar with as a consequence of using MS Word, which has for some time provided a readability index for documents. The Guardian in February 2013 used the Flesch-Kincaid index to track the reading level of every State of the Union address and noted how the linguistic standard of the presidential address has declined: "The state of our union is ... dumber: How the linguistic standard of the presidential address has declined. Using the Flesch-Kincaid readability test the Guardian has tracked the reading level of every state of the union." http://www.guardian.co.uk/world/interactive/2013/feb/12/state-of-the-union-reading-level
  20. For my analysis I used the Reading Ease index, which weights the average sentence length and the average number of syllables per word and subtracts both from a constant. The output generally ranges from 0 to 100. To provide examples: Reader's Digest magazine has a readability index of about 65, Time magazine scores about 52, an average 6th-grade student (age 11) has written assignments with a readability score of 60-70, and the Harvard Law Review has a general readability score in the low 30s. The highest (easiest) readability score possible is around 120 (meaning every sentence consists of only two one-syllable words). The score does not have a theoretical lower bound; it is possible to make the score as low as you want by arbitrarily including words with many syllables. In Twitter this happens easily, and I discovered tweets where "LOL" was repeated to the maximum character limit of 140, which drives the subsequent indices well below 0. These values were clipped by using a threshold of 0-100. This sentence, for example, taken as a reading passage unto itself, has a readability score of about thirty-three. The sentence "The Australian platypus is seemingly a hybrid of a mammal and reptilian creature" scores 24.4, as it has 26 syllables and 13 words. One particularly long sentence about sharks in chapter 64 of Moby-Dick has a readability score of -146.77.
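The standard Flesch Reading Ease formula is 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word). Here is a minimal Python sketch, using a crude vowel-group heuristic for syllable counting (a stand-in for the dictionary-based counters that real tools use), with the 0-100 clipping applied as described above:

```python
import re

def count_syllables(word):
    # Crude heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def reading_ease(text):
    # Flesch Reading Ease:
    #   206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    score = (206.835
             - 1.015 * (len(words) / len(sentences))
             - 84.6 * (syllables / len(words)))
    return max(0.0, min(100.0, score))  # clip to 0-100, as in the talk

print(reading_ease("The cat sat on the mat."))  # very simple text clips to 100.0
```

Note that the syllable heuristic is approximate: the platypus sentence above scores near, but not exactly, the 24.4 a dictionary-based counter gives.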
  21. Again, the index is inversely related to sophistication: a high score is easier to read or, put differently, more poorly written. This is an example of a low score, a tweet written with high sophistication. It is parsimonious and denser in syllables, on average, than other tweets. The data have been mean centered, so keep that in mind. The tweet is as follows: "this gas situation is absolutely ridiculous." It is written at an 11th-grade level and has a mean-centered value well below zero. The location of the tweet is Mahwah, NJ, about 20 miles outside of NYC.
  22. This is an example of a high score, a tweet written with low sophistication. It has but one multisyllable word. The tweet is as follows: "down here in beach bout to shut this down wit & feeling the vibes." It is written at a 4th-grade level and has a mean-centered value well above zero. The location of the tweet is Myrtle Beach, SC.
  23. Explain thresholding (alpha = 0.05).
  24. This table shows centrality and spread. By mean centering the data, that is, subtracting the global mean from each region's mean, we can quickly identify deviation from the global mean. The Mid-Atlantic, Mountain, New England, and Pacific regions are all below the global mean, whereas East North Central, East South Central, Southeast, West North Central, and West South Central are all above it. You can also quickly see that the Pacific and West South Central regions deviate most in their respective directions from the global mean.
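Mean centering by region is a one-liner once the scores are grouped. A sketch with hypothetical regional scores (the real analysis used per-tweet readability indices grouped by census region):

```python
from statistics import mean

# Hypothetical readability scores per tweet, grouped by census region
scores_by_region = {
    "New England": [48.0, 52.0, 45.0],
    "Pacific": [44.0, 41.0, 47.0],
    "Southeast": [68.0, 72.0, 65.0],
}

global_mean = mean(s for scores in scores_by_region.values() for s in scores)

# Mean-centered regional means: the sign shows the direction of deviation
centered = {region: round(mean(scores) - global_mean, 2)
            for region, scores in scores_by_region.items()}
```

With these made-up numbers, the Southeast sits above the global mean and the Pacific below it, mirroring the pattern in the table.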
  25. Another way of exploring the data is box plots by region with underlying scatter plots. Adding jitter allows us to get a sense of the distribution, and, cognitively speaking, green is easier to interpret than other colors. An important point about these last two visualization methods is that they are both locationally invariant beyond the coded region variable. As Michael Goodchild said, a fundamental property of spatial analysis is the lack of locational invariance; lacking locational invariance means that results change when locations change. In other words, if the values within each region were shuffled, neither of these two techniques would change.
  26. This is merely a map of the post-processed data, that is, after thresholding. You can see that even with just 48,000 observations, pattern recognition is difficult, due in part to coincident points in space; this is perhaps support for quantitative methods of pattern recognition and discovery.
  27. Using LISA we can analyze both the first and second laws of geography. Here you can see certain spatial clusters representative of spatial dependency. Notice that the First Law can be seen in the HH clusters in the middle of the country: high indices surrounded by other high indices. Also noticeable are the low indices surrounded by other low indices in the north, centered around Montana, and on the coasts, namely the NYC metropolitan area and the San Jose/SF area. The regional variation noted by different spatial regimes is the second law of geography at work; it is the non-stationarity of writing ability in the US. There are also numerous more localized relationships not clear from this map. However, in addition to the smooth quality of the analysis, noted by high values surrounded by high values and low values surrounded by low values, there are also some interesting rough qualities characteristic of spatial outliers: high values surrounded by low values and low values surrounded by high values. For example, Columbus OH, Ithaca NY, and Gassaway WV are all low values surrounded by high values, meaning they write at a more sophisticated level than their neighbors and meet statistical significance. By performing a spatial inner join of major cities, in this case cities with more than 300,000 people, with the LISA classifications, we can identify large cities and their sophistication in crafting tweets. The following are the only cities that meet those criteria: El Paso, Oklahoma City, Omaha, Detroit, and Memphis all have statistically significant HH values. NYC and San Jose are low values surrounded by low values. Sacramento is a low value surrounded by otherwise high values, and Wichita, Kansas City, Tulsa, and Nashville are all high Flesch-Kincaid indices surrounded by low Flesch-Kincaid indices. Remember, these indices are inversely related to writing ability: high values mean low writing ability and, vice versa, low values mean high writing ability.
So, you might conclude, among other things, that NYC and San Jose are filled with nerds! Sorry, DC. The LISA categories are statistically significant with a pseudo p-value < 0.05. Pseudo p-values are a computational approach to inference and prove to be a nice data reduction technique. Our original dataset of 3-digit zip codes is reduced from 862 observations to just 259, or 30% of the original dataset, with all other observations not statistically significant in the patterning of the Kincaid index. This analysis could certainly benefit from more data, and in fact I am currently analyzing the same index with nearly one million tweets after data processing.
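The local Moran statistic behind a LISA map can be sketched in a few lines of NumPy. This is a bare-bones illustration on a toy four-area example with made-up values; significance would be assessed separately via conditional permutation, which is what produces the pseudo p-values:

```python
import numpy as np

def local_morans_i(values, w):
    # Local Moran's I: I_i = z_i * sum_j w_ij * z_j, with values standardized
    # and the weight matrix row-standardized. A positive I_i marks HH or LL
    # clusters; a negative I_i marks HL or LH spatial outliers.
    z = (values - values.mean()) / values.std()
    w = w / w.sum(axis=1, keepdims=True)  # row-standardize the weights
    return z * (w @ z)

# Toy example: four areas in a line, binary contiguity weights
vals = np.array([10.0, 11.0, 1.0, 2.0])
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
lisa = local_morans_i(vals, w)
# Areas 0 and 3 fall in HH/LL clusters (positive I); 1 and 2 are outliers.
```

In practice a library such as PySAL would handle the weights construction and the permutation inference; the sketch above only shows the statistic itself.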
  28. I thought it would be interesting to see the intersection of the computed index with some of the more prestigious, or at least expensive, zip codes in the US. With the exception of the exclusive 902 zip code, all mean-centered values fall below the global mean, meaning higher writing levels. It might suggest that the 902 zip code is not as smart as a fifth grader. :)
  29. In typical text-mining algorithms, a document is represented as a vector whose dimension is the number of distinct keywords, which can be very large. Not long after "The Cat in the Hat" was published at 225 words, Bennett Cerf challenged Seuss to see if he could write a book using even fewer words. Seuss was able to deliver and win the bet: "Green Eggs and Ham" uses exactly 50 words! Even our rather small Twitter dataset has a large N-dimensional space of over 12,600 (12,603) unique words. The Flesch-Kincaid index is an effective computational effort to add structure to this unstructured data. Also notice that [romney, obama, election, vote, and hope] all appeared within the top fifty words, as determined by overall count. And now Abe will discuss the second vignette on pattern of life using geo-social media data.
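The vector dimension is simply the size of the vocabulary, and the top-words ranking is a frequency count. A tiny illustration with Python's Counter (the three sample "tweets" here are invented):

```python
import re
from collections import Counter

tweets = [
    "Green eggs and ham",
    "I do not like green eggs and ham",
    "I would not like them here or there",
]

# Tokenize, lowercase, and flatten all tweets into one word list
words = [w for t in tweets for w in re.findall(r"[a-z']+", t.lower())]

vocab = sorted(set(words))            # one vector dimension per distinct word
top_words = Counter(words).most_common(5)  # overall-count ranking
```

Even this toy corpus already spans 13 dimensions; the real dataset's 12,603 unique words arise the same way.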
  30. Our next vignette will explore how we can use geolocated social media data to understand spatial patterns.
  31. For social media content that has explicit geolocations, it is straightforward to plot it on a map. Dropping point markers is useful for a very coarse analysis of what is happening in an area. In this map display, we're examining one day's worth of geolocated tweets in and around the Washington DC area. There is an interesting pattern to where the observations are. In general, you can see that the volume of content is much denser in Northwest DC than in, say, Burke, Virginia. However, just looking at these markers, it is difficult to make sense of what this might mean, and it is impossible to effectively contrast a view like this with a view depicting another day. Using one of the three data science concepts from earlier, we need to apply a form of spatial aggregation to better understand what we're looking at.
  32. To put this into context, here’s what our analytic approach looks like. Rich and I are taking ingredients in the form of geolocated tweets, applying some aggregation algorithms with python code, and generating new visualizations and quantitative understanding of the space. If you want to repeat this data exploration, you can get the code from Github.
  33. One way that we could aggregate the data to understand activity is to aggregate things by political boundaries. For many purposes, this is exactly what we might want to do. In the preceding example, using political boundaries is an effective way of being able to compare named places one to another. However, if we are interested in computing a kernel density or “heatmap” of activity, political boundaries are problematic. All of the 50 states within the US are different sizes. The same goes for counties. And in the case of zipcodes and census tracts, the areas actually change over time. All of these factors make it difficult to apply quantitative statistics. This general class of problem is referred to as the Modifiable Areal Unit Problem.
  35. Instead we'll use the geohash algorithm. Invented in 2008 by software engineer Gustavo Niemeyer, geohash is a data encoding that combines elements of longitude and latitude into a single variable. A computed geohash references a rectangular box located on the earth. If you are interested in the implementation details, geohash is based on the same encoding concepts as the classical quadtree data structure in computer science. It is called a "Z-curve" representation because points with a similar geohash prefix tend to be near each other spatially (but not always). For our purposes, geohash is very useful for two-dimensional binning: putting geolocated content into boxes for quantification.
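The encoding can be written from first principles: repeatedly bisect the longitude and latitude ranges, interleave the resulting bits (longitude first), and emit one base-32 character per five bits. A compact sketch, checked against the well-known test point (57.64911, 10.40744), which encodes to "u4pruydqqvj":

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash's base-32 alphabet

def geohash_encode(lat, lon, precision=6):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, use_lon = [], True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    # Pack each group of 5 bits into one base-32 character
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, len(bits), 5))

print(geohash_encode(57.64911, 10.40744, 11))  # -> u4pruydqqvj
```

Truncating a geohash widens the box, which is exactly what makes prefix matching useful for coarse-to-fine binning.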
  36. In this notional example, we've taken geolocated tweets with language related to coffee and binned them into boxes. The numbers that you see in each box refer to the number of coffee mentions during some unit of time, like a day. With this simple, primitive binning mechanism, it is possible to make more advanced visualizations to depict spatial patterns.
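With geohashes as box keys, the binning itself is a one-line count with Counter. The geohash strings and tweet texts below are hypothetical stand-ins:

```python
from collections import Counter

# Hypothetical (geohash, text) pairs for geolocated tweets
tweets = [
    ("dqcjr1", "need coffee now"),
    ("dqcjr1", "best coffee in town"),
    ("dqcjr1", "coffee break"),
    ("dqcjq8", "coffee with friends"),
    ("dqcjqx", "no coffee here"),
    ("dqcjqx", "stuck in traffic"),
]

# Count coffee mentions per geohash box
counts = Counter(gh for gh, text in tweets if "coffee" in text)
```

Truncating each key (e.g. `gh[:4]`) would merge adjacent boxes into coarser bins, which is the geohash-prefix trick at work.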
  37. For example, we could analyze our data and determine that when there are eight or more geolocated references to coffee within a geohash box on a given day, 95% of the time there is a coffee shop co-located in the same box. To create a visual element around this, we can create a thematic map – based on a simple count of coffee references to visually depict the likelihood of a coffee shop being present.
  38. We can further simplify this information by merely displaying informal probabilities as colors on the map. Two dimensions are useful, but there is nothing preventing us from using additional dimensions to simplify the visual understanding of what the data are trying to tell us. Imagine, for example, adding another dimension: a height for each box to depict the count of some factor that we are observing. So, how does this work in practice?
  39. This next image is a Google Earth KML representation of geolocated social media content in and around Washington DC. Each vertically extruded polygon depicts the amount of activity based on color (where green is low activity, red is high activity) and height where low boxes are low activity and tall boxes are high activity zones. You might say that it is intuitively obvious to the casual observer – the tall red columns are where there is a lot of activity taking place.
  40. An alternate view from above shows a clear cluster of activity occurring near NW and central DC. There is also a fair amount of content on the periphery of DC, generally following a line around the beltway. Hopefully these are not tweets from drivers! Now, if we stopped our data analysis at this point we'd be in for trouble, because we'd be falling prey to one of the classic blunders of geography.
  41. This blunder is well depicted by an XKCD cartoon: many heatmaps are basically just population maps. If we take a barefoot-empiricist approach of naively counting observations, we could mistakenly find a correlation between unrelated groups, like people who subscribe to Martha Stewart Living and people who like UMBC webinars. There are many advanced techniques to avoid this trap. One of the simplest is to filter your data based on some other attribute, such as time.
  42. In this next example, you can see geolocated social media activity between 10pm and midnight. The observations are much more sparse.
  43. As you examine the clusters of activity, two of the hot spots to the extreme west and extreme east are evening activities at area schools. The tallest hotspot in the center is tourist activity near the national mall in downtown DC.
  44. Because I love caffeinated beverages, I decided to look at real data and apply aggregation, association, and correlation To tweets discussing coffee and the word Starbucks. I examined approximately 30,000 geolocated tweets in the DC area, and wrote code to answer three questions: Where is the most commentary about coffee and Starbucks? Is commentary about coffee and Starbucks associated with the location of Starbucks stores? What is the numeric relationship between geo-located coffee commentary and actual stores?
  45. Using the geohash algorithm as a mechanism for counting things, I found 81 spatial regions with textual references to the words coffee and/or Starbucks. 8 of the 81 regions are geohash boxes that include references to both coffee and Starbucks in a narrow window of time. 7 of these 8 boxes (88%) accurately classify a region as containing a Starbucks, just by using simple text analysis alone. This is a very exciting finding. What it tells us as geographers and data scientists is that we can use aggregated geolocated observations as a form of automatic crowdsourcing to learn facts about our environment. We can also use these techniques to build highly accurate predictive models.
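The classification step can be reproduced in a few lines. The geohash keys and ground truth below are hypothetical stand-ins for the real DC data, just to show the shape of the computation:

```python
# Hypothetical boxes: the terms seen in each geohash box, and whether a
# Starbucks store actually lies inside it (invented ground truth)
boxes = {
    "dqcjr1": ({"coffee", "starbucks"}, True),
    "dqcjq8": ({"coffee", "starbucks"}, True),
    "dqcjqx": ({"coffee", "starbucks"}, False),
    "dqcjqc": ({"coffee"}, False),
    "dqcjq9": ({"starbucks"}, True),
}

# Predict "contains a Starbucks" when a box mentions BOTH terms
predicted = [has_store for terms, has_store in boxes.values()
             if {"coffee", "starbucks"} <= terms]
accuracy = sum(predicted) / len(predicted)  # fraction of predictions that hit
```

In the talk's real data the analogous numbers were 8 predicted boxes and 7 hits, giving the 88% figure quoted above.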
  46. So, hopefully we have shown some examples of how geospatial context unlocks insights into our data, and how location teaches us more about what we are analyzing. By explicitly accounting for space we adhere to statistical assumptions and avoid misspecification of our models. But I think the most important element is that Big Data means the faucet, so to speak, is always running, which enables unique opportunities for experimentation.
  47. There are a number of seminal works already in this space. In an attempt to be fair I have chosen two related works that provide unique insight. Eugene Wigner wrote "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" in 1960. Since I cannot improve on Wigner's presentation for the natural sciences, I hope to offer some reflection for the social sciences. There is only one thing more unreasonable than the unreasonable effectiveness of mathematics in physics, and that is the unreasonable ineffectiveness of mathematics in social science. Peter Norvig et al. in 2009 wrote a paper titled "The Unreasonable Effectiveness of Data." In the opening sentence they draw a direct comparison to Wigner's work, saying that sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Social scientists have suffered from a so-called physics envy over their inability to neatly model human behavior. Norvig continues by saying, "We should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data." The effectiveness of data directly feeds its ultimate utility. Wigner provides examples of the unreasonable effectiveness of mathematics and notes how Galileo's experiment is true everywhere on the Earth, was always true, and will always be true. It is valid no matter whether it rains or not, whether the experiment is carried out in the Middle East or Northeast DC, no matter whether the person is a man or a woman, rich or poor, Muslim or Catholic. This invariance property of physics is well recognized, and without invariance principles physics would not be possible. But, as Ernest Rutherford pointed out, the only law of the social sciences is "some do, some don't." So, does one counterexample defeat a law?
Social phenomena are not invariant over time or space. While serial and spatial autocorrelation exist, so do temporal and spatial heterogeneity, and ultimately uncontrolled variance. Exploiting the complexity of data in the head of the power law holds promise for the social sciences. Integration of these data is key. I think we have shown some examples today of how to integrate, visualize, and analyze these data, ultimately exploiting this complexity.
  48. As promised, here are links to further material. The links are paths to the data and code used for this webinar. Abe and I can also be reached on Twitter - though I won’t tell you who writes with a lower kincaid index! :) I can also be reached at my UMBC email should you have any questions.
  49. So, in conclusion I would like to thank you on behalf of myself and Abe; we both really enjoyed getting this material together. I would like to echo a couple of key points discussed today. Abe mentioned some key elements of the analysis of geo-social media data: aggregation, association, and correlation, with examples of each and their ultimate utility. I provided some key spatial laws to help govern analysis and pattern discovery and recognition. The chosen method, Moran's I LISA, used spatial weights files to construct a conceptualization of space, as well as groupings by 3-digit zip codes, both of which effectively operationalized the notion of "near" in Tobler's first law. The areal units and the weights file construct the interaction, or association as termed in Abe's slides, and effectively aggregate the data into neighborhoods, ultimately allowing a measure of correlation; in the example today it was autocorrelation, or the relationship that a variable has with itself over space. In other words, both vignettes followed these methodological pathways. If you are interested in learning more I would suggest exploring some of the previously promoted material. Alternately, you can explore the MPS Program in GIS at UMBC Shady Grove, where nontraditional datasets are explored. Thank you!