A 2015 update to the 2012 "Data Big and Broad" talk (http://www.slideshare.net/jahendler/data-big-and-broad-oxford-2012) that extends its coverage and places more of it in the context of recent "big data" work.
Broad Data (India 2015)
1. Tetherless World Constellation
Broad Data
Jim Hendler
Tetherless World Professor of Computer and Cognitive Science
Director, The Rensselaer Institute for
Data Exploration and Applications (IDEA)
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~hendler
@jahendler (twitter)
2. Tetherless World Constellation
This talk
• What I’m not going to talk about much
– The Semantic Web (per se)
• http://www.slideshare.net/jahendler/semantic-web-the-inside-story
– Social Machines
• http://www.slideshare.net/jahendler/social-machines-oxford-hendler
– My work with Watson and Cognitive Computing
• http://www.slideshare.net/jahendler/watson-an-academics-perspective
• http://www.slideshare.net/jahendler/watson-summer-review82013final
• What I am going to present
– The rest of the big data story…
3. Tetherless World Constellation
Data is important!
• Roughly every 50 years a new power
source for the human race is found.
Once upon a time it was chemical,
then it was electrical, then nuclear,
etc.
• Information – so not just data, but
data being used – is the new power
source for our generation.
http://www.slideshare.net/jahendler/the-science-of-data-science
4. The Rensselaer Institute for Data Exploration and Applications
• Business Systems
• Built and Natural Environments
• Cyber-Resiliency
• Policy, Ethics and Stewardship
• Materials Informatics
• Data-driven Physical/Life Sciences
• Healthcare Analytics and Mobile Health
• Social Network Analytics
• Agents and Augmented Reality
5. Office of Research
Developing a “Data Science” Research Agenda
Multiscale
Sparsity
Abductive Agent-oriented
6. Tetherless World Constellation
BIG Data
• The term “Big Data” is widely used
nowadays to refer to a whole bunch of
machine-readable data in one accessible
(to the researcher) place
– 3 main contexts
• The large data collections of “big science” projects
– in traditional data warehouse or database formats
• The enterprise data of large, non-Web-based
companies (IBM, TATA, etc.)
– Generally in multiple data formats, stores, warehouses, etc.
• The data holdings of a Google, Facebook or other
large Web company
– Include large “unstructured” holdings
– Include “graph” data
7. Tetherless World Constellation
But wait, there’s more!
• 4th context: Broad Data
– The huge amount of freely available, but widely varied,
Open Data on the World Wide Web (Structured and
Semi-structured)
• Example: The extended Facebook OGP graph (the
part outside Facebook’s datasets)
• Example: dbpedia, yago, wikidata, and other sources
of indexed information sources
• Example: The growing linked open data cloud of
freely available linked data from many domains
• Example: millions of datasets freely available on
the Web from governments around the world
9. Tetherless World Constellation
BROAD data challenges
• For broad data the new challenges
that emerge include
– (Web-scale) data search
– “Crowd-sourced” modeling and user testing
– rapid (and potentially ad hoc) integration of
datasets
– visualization and analysis of only-partially
modeled datasets
– policies for data use, reuse and combination.
• Which are an overlooked but critical
part of the KDD world
11. Tetherless World Constellation
KDD Pipeline – in the real world
• Data is increasingly being
brought in from external
sources, with mixed
provenance, and
increasingly outside the
analyzers’ control.
• At increasing rates and scales
[Diagram: many data stores and data sources; labels: Sensors … apps, Social Media, Customer Behaviors, Web Partners, Open data; Formatting, standards use, data cleansing, data bias analysis, …]
14. IDEA
Discovery needs semantics
How do you find the Data you need?
The answer isn’t:
Middle Eastern Terrorists for $800 …
17. IDEA
Integration challenge: need to understand the data
Person: RIN 660125137; Address # 1118; Address St Pinehurst; Address zip 12203; Course topic CSCI; Course # 4961
Campus Personnel: RPI ID 660125137; Name Hendler
Campus Classes: CRN 1118; Name Intro to Physics
YES
NO!!!!
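One reading of this slide is that the RIN and the RPI ID really are the same identifier (the YES link), while the street number 1118 and the CRN 1118 merely share a value (the NO link). A minimal Python sketch of that distinction follows. The `SEMANTIC_TYPE` annotations and both helper functions are hypothetical, introduced only for illustration; they are not from the talk.

```python
# Records mirroring the values shown on the slide.
person = {"RIN": "660125137", "Address #": "1118", "Address St": "Pinehurst",
          "Address zip": "12203", "Course topic": "CSCI", "Course #": "4961"}
campus_personnel = {"RPI ID": "660125137", "Name": "Hendler"}
campus_classes = {"CRN": "1118", "Name": "Intro to Physics"}

# Assumed annotations stating what each field means (not from the slide).
SEMANTIC_TYPE = {
    "RIN": "person-identifier",
    "RPI ID": "person-identifier",
    "Address #": "street-number",
    "CRN": "course-registration-number",
}

def value_matches(left, right):
    """Naive integration: pair up fields whose values are equal."""
    return [(lf, rf) for lf, lv in left.items()
            for rf, rv in right.items() if lv == rv]

def semantic_matches(left, right):
    """Keep only value matches whose fields carry the same known meaning."""
    return [(lf, rf) for lf, rf in value_matches(left, right)
            if SEMANTIC_TYPE.get(lf) is not None
            and SEMANTIC_TYPE.get(lf) == SEMANTIC_TYPE.get(rf)]

naive = (value_matches(person, campus_personnel)
         + value_matches(person, campus_classes))
checked = (semantic_matches(person, campus_personnel)
           + semantic_matches(person, campus_classes))
print("value-only links:", naive)
print("links that survive the meaning check:", checked)
```

Matching on values alone links the street number to the course registration number; requiring the fields to share a meaning keeps only the identifier link.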
18. IDEA
Semantic Web and Linked Data (UK)
County Council
Ordnance Survey
Royal Mail
IOGDC Open Data Tutorial
21. IDEA
But very hard for machines without people (or knowledge)
Head to head comparison shows that burglaries in Avon
and Somerset (UK) far exceed those in Los Angeles,
California
* one of the most dangerous places in the US
vs. one of the safest in the UK
* fails the “smell test”
22. IDEA
Data + everything else you know
Same or different?
Do the terms mean the same? Are they collected in the same way? Are
they processed differently? …
23. Office of Research
Exploration challenge: develop/test earlier in pipeline
[Diagram: many data stores; labels: Sensors and apps, Social Media, Customer Behaviors, Web Partners, Open data; Formatting, standards use, data cleansing, data bias analysis, …; with an “Explore” label]
Can we develop mechanisms
to rapidly develop/test
hypotheses prior to entering
the full analytics pipeline?
Can human perceptual
apparatus help?
26. Tetherless World Constellation
Traditional Metadata
• Traditionally metadata tries to be
comprehensive
– Example: ISO 19115 (GIS standard)
• >400 elements
• 14 “packages”
• Dozens of UML models
(not all consistent w/
each other)
• After 50 years this still doesn’t work!
27. Tetherless World Constellation
The alternative: Not your “father’s metadata”
• Big Data on the web
– is moving away from
traditional relational
models (cf. NoSQL)
– Moving towards third
party application and
extension (cf. JSON)
– Focus on interoperability
and exchange with
“lightweight” semantics
• Using ideas from the Semantic Web
– Search: Schema.org
– Social Networking: OGP
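As a rough illustration of what “lightweight” semantics can look like in practice, the sketch below builds a Schema.org `Dataset` description as JSON-LD and wraps it in the script tag used for embedding in a web page. The dataset name, description, URL and keywords are placeholders invented for the example; only the Schema.org type and property names (`Dataset`, `name`, `description`, `url`, `keywords`, `license`) are real vocabulary terms.

```python
import json

# Placeholder values: this dataset does not exist; only the vocabulary is real.
dataset_description = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example open government dataset",
    "description": "A placeholder description of what the rows contain.",
    "url": "https://example.org/datasets/example",
    "keywords": ["open data", "government"],
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
}

def to_json_ld_script(description):
    """Serialize a description as a JSON-LD script element for a web page."""
    body = json.dumps(description, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"

snippet = to_json_ld_script(dataset_description)
print(snippet)
```

The description is a handful of key/value pairs rather than a comprehensive metadata model, which is the contrast the slide draws with traditional metadata.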
44. Tetherless World Constellation
Next steps
Smith James June 4
Jones Fred May 17
O’Connell Frank April 3
Chang Wu February 21
Hoffman Bernd December 9
Person
Date
It’s not enough just to describe the data elements…
45. Tetherless World Constellation
Describing a dataset … requires a context
Smith James June 4
Jones Fred May 17
O’Connell Frank April 3
Chang Wu February 21
Hoffman Bernd December 9
Person
Date
1976 Dates
of Birth
46. Tetherless World Constellation
Describing a dataset … requires a context
How do we capture more of this information?
Smith James June 4
Jones Fred May 17
O’Connell Frank April 3
Chang Wu February 21
Hoffman Bernd December 9
Person
Date
1976 Cancer
Mortality
dates
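The point of the last three slides can be sketched in a few lines of Python: the same rows and the same element-level labels (Person, Date) read very differently once dataset-level context is attached. The `DatasetContext` class and the `describe` helper are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass

# Rows and column labels as shown on the slides.
COLUMNS = ("Person", "Date")
ROWS = [
    ("Smith", "James", "June 4"),
    ("Jones", "Fred", "May 17"),
    ("O’Connell", "Frank", "April 3"),
    ("Chang", "Wu", "February 21"),
    ("Hoffman", "Bernd", "December 9"),
]

@dataclass(frozen=True)
class DatasetContext:
    """Dataset-level context that the element labels alone do not carry."""
    year: int
    date_meaning: str

births = DatasetContext(year=1976, date_meaning="Dates of Birth")
deaths = DatasetContext(year=1976, date_meaning="Cancer Mortality dates")

def describe(row, context):
    """Render one row together with the context it should be read in."""
    last, first, date = row
    return f"{first} {last}: {date}, {context.year} ({context.date_meaning})"

as_birth = describe(ROWS[0], births)
as_death = describe(ROWS[0], deaths)
print(as_birth)
print(as_death)
```

Both descriptions come from identical data and identical column labels; only the context object differs.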
49. IDEA
ARL Network-Science CTA
[Figure: two sets of panels A to D, each plotting Count against Time interval (# of days). In the first set both axes run from 1 to 1000 on a logarithmic scale, with series “Mentorship first” and “Housing trust first” (one panel labels it “Housing first”). In the second set the time interval runs from -300 to 300 days against a linear Count axis.]
Algorithms designed in Y3 were tested against 220GB of
data from the Everquest II game, looking for proxy
measures of trust. Performance results on real data
showed good correspondence with theoretical results
(but 220GB = 1 month of our 2 yrs of data).
50. IDEA
Scaling inference for discovery, integration & validation
AI “rules on graphs” bring (limited) KR
languages to supercomputing models
Weaver (PhD 2013) showed power of BlueGene/Q for AI
computations
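To make “rules on graphs” concrete, here is a deliberately naive, single-machine sketch of forward-chaining inference over a set of triples. It is not the method used in the supercomputing work cited on this slide; the two RDFS-style rules, the `forward_chain` function and the example triples are assumptions chosen for illustration.

```python
def forward_chain(triples):
    """Apply two RDFS-style rules repeatedly until no new triples appear."""
    facts = set(triples)
    while True:
        new = set()
        for (a, p, b) in facts:
            for (c, q, d) in facts:
                if b != c:
                    continue  # the two triples do not join on a shared node
                # Rule 1: subClassOf is transitive.
                if p == "subClassOf" and q == "subClassOf":
                    new.add((a, "subClassOf", d))
                # Rule 2: an instance of a class is an instance of its superclass.
                if p == "type" and q == "subClassOf":
                    new.add((a, "type", d))
        if new <= facts:
            return facts  # fixed point reached
        facts |= new

# Hypothetical example graph.
graph = {
    ("Professor", "subClassOf", "Faculty"),
    ("Faculty", "subClassOf", "Person"),
    ("alice", "type", "Professor"),
}
closure = forward_chain(graph)
for triple in sorted(closure):
    print(triple)
```

Each pass compares every pair of triples, so the cost grows quickly with graph size, which hints at why scaling this kind of inference calls for parallel hardware.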
51. From visualization to exploration
… Unfortunately, visualization too often becomes an end product of scientific analysis,
rather than an exploration tool that scientists can use throughout the research life cycle.
However, new database technologies, coupled with emerging Web-based technologies,
may hold the key to lowering the cost of visualization generation and allow it to become
a more integral part of the scientific process.
52. Tetherless World Constellation
Conclusions
• Our data challenge is becoming “Broad Data”
– World Wide Web trend towards more and more varied
data
• In many domains
– E-commerce, Open Govt, many more (cf. Health/Medical care)
• Broad data requires
– Modern, Web-oriented metadata
– LINKING the metadata, not the data
• Broad data requires thinking outside the
“Database” box
– DIVE: discover, integrate, validate and
– especially: EXPLORE (early, often, rapidly)
This is the data science agenda. Basically, these are the hard problems in closing the loop: how to go from the correlation on one side to the causal on the other. I don’t love the term agent-oriented, but we mean a combination of unstructured, AI, etc. Abductive is usually where I talk about these being hard inverse problems where we don’t know a specific function, but rather are looking for an explanation.