This document presents an approach to semantically enhance user-generated content from Facebook groups about ecological observations in Taiwan by linking extracted information to Linked Open Data resources. The approach involves extracting species names, geographic locations, and other information from Facebook posts and comments. It then formalizes the information using ontologies and links the extracted entities to external Linked Open Data sources like the Linked Open Data of Ecology and Linked Open Data Taiwan Geographic Names. Finally, the semantically enhanced user-generated content is published and can be reused by others.
1. Utilizing Linked Open Data (LOD) Resources for Semantic Enhancement of User-Generated Content
Dong-Po Deng1,2, Guan-Shuo Mai3, Cheng-Hsin Hsu3,
Chin-Lung Chang1,4, Tyng-Ruey Chuang1, and Kwang-Tsao Shao3
1ITC, University of Twente, Enschede, the Netherlands
2Institute of Information Science & 3Biodiversity Research Center,
Academia Sinica, Taipei, Taiwan
4Department of Computer Science and Information Engineering
National Taiwan University of Science and Technology
Taipei, Taiwan
Thursday, February 7, 2013
2. Outline
Background
Motivation
Data Collection
LOD resources - LODE and LOD TGN
An approach for processing UGC
Information Extraction
Information Formalization
Information Reuse
Concluding remarks
JIST2012 2012/12/3 2
4. Background
Web 2.0 technologies enable people to contribute their own content on the web, e.g. wikis, blogs, tagging
Social media use Web 2.0 technologies to support social interaction on the web, e.g. Twitter, Flickr, Facebook
Content contributed by people on the web and/or social media is called “User-Generated Content” (UGC)
UGC is mainly multimedia or textual data
UGC is considered a potential resource for scientific projects, e.g. citizen science
5. Background (cont.)
There are several problems in harvesting UGC for scientific purposes
Unstructured UGC is difficult to handle
The semantics of UGC are often ambiguous and/or poor
Social media are not designed for scientific purposes
Image courtesy of http://www.datenform.de/mapeng.html
7. Motivation
LOD datasets as resources
LOD aims to make data available on the Web and to interconnect data in order to increase its value for users
LOD projects comprise about 300 datasets with over 31 billion RDF triples
Each entry representing a fact in an LOD dataset has a Uniform Resource Identifier (URI) that is referenceable and linkable on the Web
The high interconnectivity between entries potentially increases the discoverability, reusability, and utility of information
8. Motivation (cont.)
Therefore, if named entities in UGC can be identified and connected to LOD entries, their semantics are disambiguated and the UGC becomes easier to process.
10. Data collection
Two Facebook interest groups for ecological
observations in Taiwan
http://www.facebook.com/groups/roadkilled/
http://www.facebook.com/groups/enjoymoths/
13. LOD Ecology
Linked Open Data of Ecology (LODE) is a validated dataset from an LOD project.
LODE integrates five previously separate databases:
TFRI: Taiwan Forestry Research Institute
14. LODE in Linked Open Data Cloud
16. LOD Taiwan Geographic Name (TGN)
LOD TGN is mainly converted from the Taiwan Gazetteer following LOD principles
LOD TGN has 159,241 geographic name entries, of which 17,442 are linked to geonames.org
18. An approach for processing UGC
Information Extraction → Information Formalization → Information Reuse
20. Problems with Chinese species names in Facebook ecological observations
(1) Species names are often shortened:
曙鳳蝶 (Atrophaneura horishana) → 曙鳳
玉帶鳳蝶 (Papilio polytes) → 玉帶
琉璃紋鳳蝶 (Papilio hermosanus) → 琉璃
(2) Names follow an adjective + noun pattern, and the adjective alone is ambiguous:
細紋 (pronounced Si-Wen, meaning “fine veined”)
細紋黃鉤蛾
細紋蠍蛉
細紋新蠍蛉
... 15 species names share the prefix “細紋”
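The ambiguity of shortened names can be illustrated with a small prefix index. This is only a sketch, not the authors' extraction tool: it assumes a shortened name is always a leading prefix of the full name, and the name list is limited to the examples on this slide (a real index would presumably be built from LODE). The function names `build_prefix_index` and `candidates` are hypothetical.

```python
from collections import defaultdict

# Full species names copied from the slide; a real list would come from LODE (assumption).
SPECIES_NAMES = ["曙鳳蝶", "玉帶鳳蝶", "琉璃紋鳳蝶", "細紋黃鉤蛾", "細紋蠍蛉", "細紋新蠍蛉"]


def build_prefix_index(names):
    """Map every proper prefix of each full name to the set of names it could stand for."""
    index = defaultdict(set)
    for name in names:
        for end in range(1, len(name)):
            index[name[:end]].add(name)
    return index


def candidates(short_name, index):
    """Return all full names that start with the shortened name, in sorted order."""
    return sorted(index.get(short_name, set()))


index = build_prefix_index(SPECIES_NAMES)
print(candidates("曙鳳", index))  # a single candidate: unambiguous
print(candidates("細紋", index))  # several candidates: needs further disambiguation
```

A shortened name with exactly one candidate can be resolved directly; one with many candidates, such as 細紋, needs additional evidence from the thread.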
21. Identifying shortened species names
Confidence value =
22. Determining a species name for a thread
What if several species names are mentioned in one thread? We used three criteria:
How many Likes does the post or comment get?
How prestigious are the people who post or comment?
How many times does a species name occur in the thread?
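One way to combine the three criteria is a weighted sum per species name. The slides do not say how the criteria are combined, so the weights, the `prestige` score in [0, 1], and the names `Mention` and `pick_species` below are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Mention:
    species: str     # species name extracted from a post or comment
    likes: int       # Likes received by that post or comment
    prestige: float  # hypothetical author score in [0, 1]


def pick_species(mentions, w_likes=1.0, w_prestige=1.0, w_count=1.0):
    """Return the species with the highest combined score.

    Each mention adds its weighted Likes, its weighted author prestige,
    and w_count (so that repeated mentions accumulate). Weights are assumptions.
    """
    scores = defaultdict(float)
    for m in mentions:
        scores[m.species] += w_likes * m.likes + w_prestige * m.prestige + w_count
    return max(scores, key=scores.get)


thread = [
    Mention("Papilio polytes", likes=1, prestige=0.2),
    Mention("Papilio hermosanus", likes=5, prestige=0.9),
    Mention("Papilio polytes", likes=0, prestige=0.3),
]
print(pick_species(thread))
```

With equal weights the well-liked comment from a prestigious author wins; setting `w_likes` and `w_prestige` to zero reduces the rule to a plain frequency count.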
24. The problems of geographic names in Facebook ecological observations
An example:
The Endemic Species Research Institute
特有生物研究保育中心
Te-You-Sheng-Wu-Yan-Jiou-Bao-Yu-Jhong-Sin
is shortened to
特生中心
Te-Sheng-Jhong-Sin
There are no rules for shortening long geographic names
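Because there is no fixed shortening rule, one possible heuristic is to treat a short form as an in-order subsequence of the full name, which holds for the example above. This is a sketch of that heuristic, not the method used in the study. The gazetteer is a made-up three-entry list, and `is_subsequence` and `expand_short_name` are hypothetical names.

```python
def is_subsequence(short, full):
    """True if the characters of `short` appear in `full` in the same order."""
    remaining = iter(full)
    # `ch in remaining` advances the iterator, so later characters must come later.
    return all(ch in remaining for ch in short)


def expand_short_name(short, gazetteer):
    """Return every gazetteer name that the short form could abbreviate."""
    return [name for name in gazetteer if is_subsequence(short, name)]


# Illustrative entries only; a real lookup would run against LOD TGN (assumption).
GAZETTEER = ["特有生物研究保育中心", "生物多樣性研究中心", "臺灣大學"]
print(expand_short_name("特生中心", GAZETTEER))
```

The heuristic can return several candidates for very short forms, so in practice it would need a ranking step or human confirmation.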
26. The ontology...
is based on the Facebook thread, an entity comprising social media content that involves people, places, time periods, photos, and links to other content
uses standard vocabularies:
Semantically-Interlinked Online Communities (SIOC) can be used to represent the structure of Facebook posts, comments, and threads
Friend of a Friend (FOAF) can be used to describe content creators
Dublin Core can be used to describe the interlinked content they created
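The use of the three vocabularies can be sketched by writing a thread, one post, and its author as N-Triples. The SIOC, FOAF, and Dublin Core terms are real, but the base URI `http://example.org/`, the identifiers, and the choice of properties are assumptions. The actual ontology appears only as a figure in the slides.

```python
SIOC = "http://rdfs.org/sioc/ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/"  # hypothetical base URI, not the project's namespace


def uri(value):
    """Wrap an IRI in angle brackets as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Encode a plain string literal, escaping backslashes, quotes, and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def thread_triples(thread_id, post_id, author_id, author_name, created):
    """Describe one thread, one post in it, and the post's creator."""
    thread = uri(BASE + "thread/" + thread_id)
    post = uri(BASE + "post/" + post_id)
    author = uri(BASE + "user/" + author_id)
    return [
        (thread, uri(RDF_TYPE), uri(SIOC + "Thread")),
        (post, uri(RDF_TYPE), uri(SIOC + "Post")),
        (post, uri(SIOC + "has_container"), thread),
        (post, uri(SIOC + "has_creator"), author),
        (author, uri(RDF_TYPE), uri(SIOC + "UserAccount")),
        (author, uri(FOAF + "name"), literal(author_name)),
        (post, uri(DCT + "created"), literal(created)),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(thread_triples("t1", "p1", "u1", "Alice", "2012-12-03")))
```

Extracted species and place names would then be attached to the post or thread with further triples that point at LODE and LOD TGN entries.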
27. An ontology for formalizing the extracted
information from Facebook threads
29. Transfer ecological observations in
Facebook to RDF
http://140.109.28.64:2020/page/thread/177883715557195_440860179259546
31. The extracted species name from the
Facebook thread is linked to LOD resources
36. The extracted species name is the taxon Theretra nessus
This entry is connected to LODE via owl:sameAs
37. The extracted place name from the
Facebook thread is linked to LOD resources
42. The entry of LOD TGN transferred from
Taiwan Gazetteer
It is linked to geonames.org via owl:sameAs
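Chains of owl:sameAs links, such as extracted place name → LOD TGN entry → geonames.org entry, can be collapsed into groups of equivalent URIs. The sketch below does this with a small union-find over plain (subject, predicate, object) tuples. All URIs in the example are placeholders under `http://example.org/`, not real LOD TGN or geonames.org identifiers, and `same_as_groups` is a hypothetical name.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_groups(triples):
    """Group URIs that are connected, directly or transitively, by owl:sameAs."""
    parent = {}

    def find(x):
        # Find the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)  # merge the two equivalence classes

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Placeholder URIs (assumptions) standing in for a thread's place, a TGN entry, and a geonames entry.
triples = [
    ("http://example.org/thread/place/1", OWL_SAME_AS, "http://example.org/tgn/42"),
    ("http://example.org/tgn/42", OWL_SAME_AS, "http://example.org/geonames/0"),
    ("http://example.org/thread/place/1", "http://purl.org/dc/terms/title", "some label"),
]
print(same_as_groups(triples))
```

The same grouping applies to species names linked to LODE. Triples with other predicates are ignored.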
43. Publish the processed Facebook
ecological observations
45. A semantic annotation plug-in for entering
geographic names in Facebook posts
50. Concluding remarks
This study reports our experience in converting Facebook ecological observations to RDF and interlinking them with LOD resources (LODE and LOD TGN).
With these information extraction tools and LOD resources, we developed a tool for semantic enhancement of user input.
The LOD TGN is an ongoing project.
In the future, we will consolidate the feature types of the geographic names, and we plan to make the LOD TGN a geospatial semantics reference resource.
51. Thank you for your attention
Questions?
deng@itc.nl