From Web Data to Knowledge: on the Complementarity of Human and Artificial Intelligence

29/05/19 1Stefan Dietze
From (Web) Data to Knowledge: on the Complementarity
of Human and Artificial Intelligence
Prof. Dr. Stefan Dietze
Inaugural Lecture, 28 May 2019
Heinrich-Heine-Universität Düsseldorf

Finding “things” on the Web
• Resources
• Facts
• Claims
• Opinions

We'll try to use AI to "answer" that question at the end of the talk.

Finding social sciences research data on the Web

Artificial vs human intelligence: a simplistic Web search perspective
Human/Crowd Intelligence
"Supervising AI" with user-generated data & knowledge ("making machines smarter")
Artificial Intelligence
• Information retrieval (crawling, indexing, ranking etc.)
• Natural language processing
• (Hyperlink) graph analysis (e.g. PageRank et al.)
• Statistics and (deep) learning from user interactions
o Query interpretation & intent prediction
o Classification of users, documents, queries
o Reranking & personalisation
o …
Facilitating search, retrieval & knowledge gain of users ("making humans smarter")

Overview
Part I: Symbolic & subsymbolic AI on the Web, a brief introduction
Part II: Extracting machine-interpretable knowledge ("making machines smarter")
Part III: Facilitating search, retrieval & knowledge gain of users ("making humans smarter")

Symbols, data & knowledge on the Web
Unstructured data
e.g. web pages, user interactions/behavior, clickstreams, sensor data
Machine-interpretable knowledge
e.g. knowledge graphs, Web markup
[Figure: example knowledge graph around dbr:Tim_Berners-Lee. Node labels: dbo:Person, dbo:Scientist, yago:LegalActor, dbo:Organisation, dbr:MIT, dbr:WWW_Foundation, dbr:Washington_DC, dbr:Jakarta, "Tim Berners-Lee"@en, 1955-06-08^^xsd:date. Edge labels: rdf:type, rdfs:subClassOf, foaf:name, dbo:birthDate, dbo:workplaces, dbo:keyPersonOf, dbo:location]
DBpedia (Eng.): 200 million facts
Google KG: 18 billion facts
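A minimal sketch of how such a graph can be held as subject-predicate-object triples and queried with one simple inference rule (type inheritance via rdfs:subClassOf). The identifiers are taken from the slide, but which node connects to which via which edge is an assumption made for illustration, and the helper names `build_index` and `types_of` are hypothetical.

```python
from collections import defaultdict

# Illustrative triples only: the pairing of nodes and edges is an assumption,
# since the slide's figure layout is not recoverable from the text.
TRIPLES = [
    ("dbr:Tim_Berners-Lee", "rdf:type", "dbo:Scientist"),
    ("dbr:Tim_Berners-Lee", "foaf:name", '"Tim Berners-Lee"@en'),
    ("dbr:Tim_Berners-Lee", "dbo:birthDate", '"1955-06-08"^^xsd:date'),
    ("dbo:Scientist", "rdfs:subClassOf", "dbo:Person"),
]


def build_index(triples):
    """Index triples as subject -> predicate -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        index[subject][predicate].add(obj)
    return index


def types_of(index, entity):
    """Return asserted types plus types inferred by following rdfs:subClassOf."""
    seen = set()
    stack = list(index[entity]["rdf:type"])
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.add(cls)
            stack.extend(index[cls]["rdfs:subClassOf"])
    return seen


INDEX = build_index(TRIPLES)
print(sorted(types_of(INDEX, "dbr:Tim_Berners-Lee")))
```

The point of the sketch is the symbolic step: dbo:Person is never asserted for the entity, it follows from the class hierarchy.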

Symbolic vs subsymbolic AI
Symbolic AI
• AI = manipulation and interpretation of symbols (eventually: "knowledge")
• Top-down: knowledge representation, logics, inference, knowledge graphs
• "Strong AI hypothesis" or "Physical Symbol System Hypothesis" (Newell & Simon, 1976), "GOFAI"
Subsymbolic AI
• AI = emulating/engineering human intelligence, e.g. through cognitive computing ("perceptron", Frank Rosenblatt, 1957)
• Bottom-up: neural networks, machine/deep learning, distributional semantics
• Also called: "weak AI hypothesis" (Russell & Norvig, 1995)
[Figure: layers labelled Knowledge, Information, Data, Symbols]
Horse ⊓ ¬RockingHorse ⊑ Animal ⊓ ∀(=4)hasLegs
"Intelligence is ten million rules"
(Douglas Lenat, founder of Cyc)

Subsymbolic AI & deep learning for language understanding
Percentage of deep learning papers in major NLP conferences
(Source: Young et al., Recent Trends in Deep Learning Based Natural Language Processing)
• Distributional semantics & embeddings: predicting low-dimensional vector representations of words & text, e.g. Word2Vec [Mikolov et al., 2013]
• Efficient attention-based encoder/decoder architectures (Transformer) that dispense with recurrence and convolutions, e.g. for machine translation [Vaswani et al., 2017]
• Pretraining language models for task-specific transfer learning, e.g. BERT (Bidirectional Encoder Representations from Transformers) [Devlin et al., 2018]
T. Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, NIPS (2013)
A. Vaswani et al., Attention Is All You Need, NIPS (2017)
J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
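To make "distributional semantics" concrete, the following sketch compares words by the cosine similarity of their vectors. The three-dimensional vectors are invented toy values, not trained Word2Vec or BERT embeddings, and `most_similar` is a hypothetical helper name.

```python
import math

# Toy vectors invented for illustration; real embeddings have hundreds of
# dimensions and are learned from large corpora.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to `word` by cosine."""
    candidates = (w for w in embeddings if w != word)
    return max(candidates, key=lambda w: cosine(embeddings[word], embeddings[w]))


print(most_similar("king", EMBEDDINGS))
```

Similarity here is purely geometric: nothing in the vectors states what a king or an apple is, which is the contrast with the symbolic representation in Part I.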

Learning without semantics
• Biases in human interactions can be learned and amplified by ML models
• Meaning/semantics are crucial to facilitate interpretation by/of machines & ML models
Source: https://techcrunch.com/2016/03/24/microsoft-silences-its-new-a-i-bot-tay-after-twitter-users-teach-it-racism/

Semantics and knowledge: a brief (and incomplete) history
• Deductive reasoning, syllogism & categorisation
(Aristotle, 384 BC – 322 BC)
• Formal logic & calculus ratiocinator (reasoning, symbol manipulation)
(G.W. Leibniz, 1646 – 1716)
• "Begriffsschrift", technically: predicate logic
(Gottlob Frege, 1848 – 1925)
• Frames for representing stereotyped situations
(Marvin Minsky, 1974)
• Rules & expert systems
• Ontologies
(Leibniz, Kant, Gruber 1994)
• Description Logics
(Baader & Hollunder, 1991 et al.)
• Semantic Web
(Berners-Lee, Hendler, Lassila, 2001)
& Linked Data
& Knowledge Graphs

Symbolic & subsymbolic AI: e.g. linking Web documents & KGs
• Robust methods for named entity disambiguation (NED), e.g. Ambiverse [Hoffart et al., 2011], TagMe [Ferragina et al., 2010], Babelfy [Moro et al., 2014]
• Time- and corpus-specific entity relatedness: prior probabilities and meaning of entities change over time, e.g. "Deutschland" during the World Cup [DL4KGS 2018]
• Meta-EL: supervised ensemble learner exploiting results of different NED systems [SAC19, CIKM19]
o Considers features of terms, mentions/occurrences, dynamics/temporal drift etc.
o Outperforms individual NED systems across diverse documents/corpora
• Problem: "completeness" & coverage of KGs?
Fafalios, P., Joao, R.S., Dietze, S., Same but Different: Distant
Supervision for Predicting and Understanding Entity Linking
Difficulty, ACM SAC19
Mohapatra, N., Iosifidis, V., Ekbal, A., Dietze, S., Fafalios, P., Time-
Aware and Corpus-Specific Entity Relatedness, DL4KGS at ESWC2018.

Overview
Part I
Part II
Part III

Knowledge about: facts, claims, stances & opinions on the Web
Facts & claims Stances, opinions, interactions
<„Tim Berners-Lee“ s:founderOf „Solid“>

Mining (long-tail) facts from the Web?
<"Tim Berners-Lee" s:founderOf "Solid">
• Obtaining verified facts (or a knowledge graph) for a given entity?
• Application of NLP (e.g. NER, relation extraction) at Web scale (Google index: 50 trillion pages)?
• Exploiting entity-centric embedded Web page markup (schema.org), prevalent in roughly 40% of Web pages (44 bn "facts" in Common Crawl 2016 / 3.2 bn Web pages)
• Challenges
o Errors: factual errors, annotation errors (see also [Meusel et al., ESWC2015])
o Ambiguity & coreferences: e.g. 18,000 entity descriptions of "iPhone 6" in Common Crawl 2016 & ambiguous literals (e.g. "Apple")
o Redundancies & conflicts: vast amounts of equivalent or conflicting statements

KnowMore: data fusion on markup
• 0. Noise: data cleansing (node URIs, deduplication etc.)
• 1.a) Scale: blocking (BM25 entity retrieval) on markup index
• 1.b) Relevance: supervised coreference resolution
• 2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB) with a diverse feature set (authority, relevance etc.), considering source-level (e.g. PageRank), entity-level & fact-level features
Pipeline: Web page markup from Web crawl (Common Crawl, 44 bn facts) → 1. Blocking & coreference resolution → 2. Fusion / fact selection (supervised)
Query entity
Brideshead Revisited, type:(Book)
Candidate facts (approx. 5,000 facts for "Brideshead Revisited"; compare: 125,000 facts for "iPhone6")
node1   publisher         Chapman & Hall
node1   releaseDate       1945
node1   publishDate       1961
node2   country           UK
node2   publisher         Black Bay Books
node3   country           US
node3   copyrightHolder   Evelyn Waugh
…       …                 …
Entity description (20 correct/non-redundant facts for "Brideshead Rev.")
author            Evelyn Waugh
priorWork         Put Out More Flags
ISBN              978031874803074
copyrightHolder   Evelyn Waugh
releaseDate       1945
…                 …
New query entities
BBC Audio, type:(Organization)
Chapman & Hall, type:(Publisher)
Put Out More Flags, type:(Book)
Yu, R., [..], Dietze, S., KnowMore: Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2019 (SWJ2019)
Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conf. 2018 (WWW2018)
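The blocking step (1.a) can be sketched as BM25 retrieval of candidate entity descriptions for a query entity. This is a generic Okapi BM25 scorer over whitespace tokens, not the KnowMore implementation; the parameter values k1 = 1.5 and b = 0.75 are common defaults assumed here, and the three entity descriptions are invented toy data.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Returns (ranking, scores): document indices sorted by descending score,
    and the raw score per document. k1 and b are assumed default values.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranking, scores


# Toy markup descriptions, flattened to text (invented for illustration).
DOCS = [
    "brideshead revisited book evelyn waugh",
    "put out more flags book evelyn waugh",
    "iphone 6 smartphone apple",
]
RANKING, SCORES = bm25_rank("brideshead revisited", DOCS)
print(RANKING[0], SCORES)
```

In a pipeline like the one above, only the top-ranked candidates would be passed on to the (more expensive) supervised coreference and fusion steps.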
Fusion performance
• Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally et al., ACM SIGMOD 2014]; strong variance across types
Knowledge graph augmentation
• Experiments on books, movies, products
• New facts (w.r.t. DBpedia, Wikidata, Freebase):
o On average 60%-70% of all facts for books & movies are new (across KBs)
o 100% new facts for long-tail entities (e.g. products)
• Additional experiments on learning new categorical features (e.g. product categories or movie genres) [WWW2018]

Beyond facts: claims, opinions and misinformation on the Web
• Investigations into misinformation and opinion forming have received massive attention across a wide range of disciplines and industries (e.g. [Vosoughi et al., 2018])
• Insights, mostly from (computational) social sciences, e.g.
o Spreading of claims and misinformation
o Effect of biased and fake news on public opinions
o Reinforcement of biases and echo chambers
• Methods, mostly in computer science, e.g. for
o Claim/fact detection and verification ("fake news detection"), e.g. CLEF 2018 Fact Checking Lab (http://alt.qcri.org/clef2018-factcheck/)
o Stance detection, e.g. Fake News Challenge (FNC), http://www.fakenewschallenge.org/
• Some recent work
o Large-scale public research corpora for replicating/improving methods/insights: TweetsKB (9 bn annotated tweets), ClaimsKG (30 K annotated claims & truth ratings)
o ML models for stance detection of Web documents (towards given claims)

Stance detection of Web documents
Motivation
• Problem: detecting the stance of documents (Web pages) towards a given claim (unbalanced class distribution)
• Motivation: stance of documents (in particular disagreement) is useful (a) as a signal for fake news detection and (b) for Website classification
Approach
• Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step
• Features, e.g. textual similarity (Word2Vec etc.), sentiments, LIWC etc.
• Best-performing model per stage: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty
• Experiments on FNC-1 dataset (and FNC baselines)
Results
• Minor overall performance improvement
• Improvement on disagree class by 27% (but still far from robust)
A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Step-by-Step: A Three-Stage Pipeline for Stance Classification of Documents towards Claims, CIKM19 (under review).
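The control flow of a cascade of binary classifiers can be sketched as follows. Only the structure is the point: each stage makes one binary decision and either emits a label or hands over to the next stage. The split into unrelated / discuss / disagree / agree follows the FNC-1 label set, but the ordering of the stages and the keyword rules standing in for the trained SVM and CNN models are assumptions made for illustration.

```python
def cascade(stages, default_label):
    """Build a classifier from (predicate, label) stages applied in order.

    The first stage whose predicate fires emits its label; if none fires,
    the default label is returned.
    """
    def classify(claim, document):
        for fires, label in stages:
            if fires(claim, document):
                return label
        return default_label
    return classify


def word_overlap(claim, document):
    """Fraction of claim words that also occur in the document."""
    claim_words = set(claim.lower().split())
    doc_words = set(document.lower().split())
    return len(claim_words & doc_words) / len(claim_words) if claim_words else 0.0


# Toy stand-ins for trained models (hypothetical keyword rules).
STANCE_WORDS = ("true", "confirmed", "false", "hoax")
REFUTING_WORDS = ("false", "hoax")


def is_unrelated(claim, document):
    return word_overlap(claim, document) < 0.5


def is_neutral(claim, document):
    return not any(w in document.lower().split() for w in STANCE_WORDS)


def is_disagree(claim, document):
    return any(w in document.lower().split() for w in REFUTING_WORDS)


classify_stance = cascade(
    [(is_unrelated, "unrelated"), (is_neutral, "discuss"), (is_disagree, "disagree")],
    default_label="agree",
)

CLAIM = "the moon is made of cheese"
print(classify_stance(CLAIM, "report says the moon is made of cheese is a hoax"))
```

Because each stage is a separate binary problem, class imbalance and misclassification costs can be handled per stage, for example with a class-wise penalty only where the minority class (disagree) is decided.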

Mining opinions & interactions (the case of Twitter)
• Heterogeneity: multimodal, multilingual, informal, "noisy" language
• Context dependence: interpretation of tweets/posts (entities, sentiments) requires consideration of context (e.g. time, linked content), e.g. "Dusseldorf" => city or football team
• Dynamics & scale: e.g. 6,000 tweets per second, plus interactions (retweets etc.) and context (e.g. 25% of tweets contain URLs)
• Evolution and temporal aspects: evolution of interactions over time is crucial for many social sciences questions
• Representativity and bias: demographic distributions not known a priori in archived data collections
[Figure: example tweet annotations with entities http://dbpedia.org/resource/Tim_Berners-Lee and http://dbpedia.org/resource/Solid, emotions wna:positive-emotion (onyx:hasEmotionIntensity "0.75") and wna:negative-emotion]
P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.

Mining knowledge about opinions & interactions: TweetsKB
http://l3s.de/tweetsKB
• Harvesting & archiving of 9 bn tweets over 5 years (permanent collection from Twitter 1% sample since 2013)
• Information extraction pipeline (distributed via Hadoop Map/Reduce)
o Entity linking with knowledge graph/DBpedia (Yahoo's FEL [Blanco et al., 2015]), e.g. "president"/"potus"/"trump" => dbp:DonaldTrump, to disambiguate text and use background knowledge (e.g. US politicians? Republicans?); high precision (.85), low recall (.39)
o Sentiment analysis/annotation using SentiStrength [Thelwall et al., 2012], F1 approx. .80
o Extraction of metadata and lifting into established schemas (SIOC, schema.org), publication using W3C standards (RDF/SPARQL)
Use cases
• Aggregating sentiments towards topics/entities, e.g. about CDU vs SPD politicians in a particular time period
• Temporal analytics: evolution of popularity of entities/topics over time (e.g. for detecting events or trends, such as the rise of populist parties)
• Twitter archives as general corpus for understanding temporal entity relatedness (e.g. "austerity" & "Greece" 2010-2015)
Limitations
• Bias & representativity: demographic distributions of users (not known a priori and not representative)
• Cf. use case at the end of the talk
[Figure: chart comparing Cologne and Düsseldorf, value axis from -0.4 to 0.4]

Overview
Part I
Part II
Part III

Knowledge (gain) while searching the Web ("Search As Learning")?
Challenges & results
• Detecting coherent search missions?
• Detecting learning throughout search? Detecting "informational" search missions (as opposed to "transactional" or "navigational" missions [Broder, 2002])
o Search mission classification with average F1 score of 75%
• How competent is the user? Predict/understand the knowledge state of users based on in-session behavior/interactions
• How well does a user achieve their learning goal/information need? Predict knowledge gain throughout search missions
o Correlation of user behavior (queries, browsing, mouse traces etc.) & user knowledge gain/state in search [CHIIR18]
o Prediction of knowledge gain/state through supervised models [SIGIR18]

Understanding knowledge gain/state of user during search?
Data collection
• Crowdsourced collection of search session data
• 10 search topics (e.g. "Altitude sickness", "Tornados"), incl. pre- and post-tests
• Approx. 1,000 distinct crowd workers & 100 sessions per topic
• Tracking of user behavior through 76 features in 5 categories (session, query, SERP (search engine result page), browsing, mouse traces)
Some results
• 70% of users exhibited a knowledge gain (KG)
• Negative relationship between KG of users and topic popularity (avg. accuracy of workers in knowledge tests) (R = -.87)
• Amount of time users actively spent on web pages explains 7% of the variance in their KG
• Query complexity explains 25% of the variance in the KG of users
• Topic-dependent behavior: search behavior correlates more strongly with search topic than with KG or knowledge state (KS)
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
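For reading the numbers above: with a single predictor, the share of variance explained by a linear fit equals the squared Pearson correlation, so "explains 25% of the variance" corresponds to |R| = 0.5. A minimal sketch, with invented toy values rather than the study data; the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def variance_explained(xs, ys):
    """Share of variance in ys explained by a linear fit on xs (R squared)."""
    return pearson_r(xs, ys) ** 2


# Toy data (invented): knowledge gain drops as topic popularity rises.
TOPIC_POPULARITY = [0.2, 0.4, 0.6, 0.8]
KNOWLEDGE_GAIN = [0.8, 0.6, 0.4, 0.2]
print(pearson_r(TOPIC_POPULARITY, KNOWLEDGE_GAIN))
```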

Predicting knowledge gain/state of user during search?
• Stratification into classes: user knowledge state (KS) and knowledge gain (KG) are each split into {low, moderate, high}, with moderate = mean ± 0.5 SD, low below and high above that band
• Supervised multiclass classification (Naive Bayes, logistic regression, SVM, random forest, multilayer perceptron)
• KG prediction performance results (after 10-fold cross-validation)
• Feature importance (KG prediction)
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
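The stratification rule can be sketched as below. Whether the study used the population or the sample standard deviation, and how values exactly on a boundary were assigned, is not stated on the slide; the population SD and an inclusive moderate band are assumptions here, and `stratify` is a hypothetical name.

```python
import statistics


def stratify(scores, width=0.5):
    """Label each score low / moderate / high relative to mean ± width * SD."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population SD: an assumption
    lower, upper = mean - width * sd, mean + width * sd
    labels = []
    for score in scores:
        if score < lower:
            labels.append("low")
        elif score > upper:
            labels.append("high")
        else:
            labels.append("moderate")  # boundaries count as moderate (assumption)
    return labels


print(stratify([0, 5, 10]))
```

The resulting labels are the targets for the multiclass classifiers listed above.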

Shortcomings & future work
• Lab studies to obtain more reliable data (controlled environment, longer sessions) & additional features (eye-tracking)
• Resource features (complexity, analytic/emotional language, multimodality etc.) as additional signals [CIKM2019, under review]
• Improving ranking/retrieval in Web search or other archives (SALIENT project, Leibniz Cooperative Excellence)

Applications: social sciences research data on the Web
Improving findability of
(social science) research data
Mining novel (social science)
research data from the Web
https://data.gesis.org/claimskg

Finally: can we use AI & the Web to answer THE question?

Recap: "Web-mined opinions" in TweetsKB
[Figure: example tweet annotations with entities http://dbpedia.org/resource/Tim_Berners-Lee and http://dbpedia.org/resource/Solid, emotions wna:positive-emotion and wna:negative-emotion]
Total # tweets mentioning (K, D) in 1.5 bn tweets:
• # dbp:Cologne: 89,564
• # dbp:Dusseldorf: 4,723
• Opinions in terms of expressed sentiments?
• "Happiness(X) = mean of sentiment score delta (positive - negative) of all tweets mentioning X"
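The Happiness(X) definition can be written out as a short function. The record layout (a set of linked entities plus a positive and a negative score per tweet) and the sample values are assumptions for illustration; they are not the TweetsKB schema or data.

```python
from statistics import mean


def happiness(tweets, entity):
    """Mean of (positive - negative) sentiment over tweets mentioning `entity`.

    Returns None when no tweet mentions the entity.
    """
    deltas = [
        tweet["positive"] - tweet["negative"]
        for tweet in tweets
        if entity in tweet["entities"]
    ]
    return mean(deltas) if deltas else None


# Invented sample tweets; scores are assumed to be normalised to [0, 1].
TWEETS = [
    {"entities": {"dbp:Cologne"}, "positive": 0.5, "negative": 0.25},
    {"entities": {"dbp:Cologne", "dbp:Dusseldorf"}, "positive": 0.25, "negative": 0.25},
    {"entities": {"dbp:Dusseldorf"}, "positive": 0.0, "negative": 0.5},
]
print(happiness(TWEETS, "dbp:Cologne"), happiness(TWEETS, "dbp:Dusseldorf"))
```

A tweet mentioning both cities contributes to both means, which is one reason the two scores are not independent.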

Cologne vs Dusseldorf: a pseudoscientific "answer" using TweetsKB
[Figure: Happiness(dbp:Cologne) and Happiness(dbp:Dusseldorf) over time, value axis from -0.4 to 0.4. Annotated events: January 2016, Cologne NYE 2015/2016 aftermath; March 2017, axe attack in D?]
Mean sentiment scores (2013-2017):
• Happiness(Cologne) = 0.09281
• Happiness(Dusseldorf) = 0.04056
• Positive(Cologne) = 0.17297
• Positive(Dusseldorf) = 0.1245
• Negative(Cologne) = 0.07948
• Negative(Dusseldorf) = 0.09030
Key findings
• Cologne happier (no significance testing yet)
• Cologne & Dusseldorf happy overall (positive sentiments)
Limitations
• Bias: Twitter users not representative
• Bias: Cologne cathedral => distribution of tourists & residents among Twitter users likely different for both cities
Source: https://theculturetrip.com/europe/germany/articles/8-fascinating-things-didnt-know-colognes-cathedral/ © freedom100m

Acknowledgements
Co-authors
• Katarina Boland (GESIS, Germany)
• Elena Demidova (L3S, Germany)
• Asif Ekbal (IIT Patna, India)
• Pavlos Fafalios (L3S, Germany)
• Ujwal Gadiraju (L3S, Germany)
• Peter Holtz (IWM, Germany)
• Eirini Ntoutsi (LUH, Germany)
• Vasilis Iosifidis (L3S, Germany)
• Markus Rokicki (L3S, Germany)
• Arjun Roy (IIT Patna, India)
• Renato Stoffalette Joao (L3S, Germany)
• Davide Taibi (CNR, ITD, Italy)
• Nicolas Tempelmeier (L3S, Germany)
• Konstantin Todorov (LIRMM, France)
• Ran Yu (GESIS, Germany)
• Benjamin Zapilko (GESIS, Germany)

From (Web) Data to Knowledge: on the Complementarity
of Human and Artificial Intelligence
Prof. Dr. Stefan Dietze
Heinrich-Heine-Universität Düsseldorf
GESIS Leibniz Institute for the Social Sciences
