Community detection using citation relations and
textual similarities in a large set of PubMed
publications
Per Ahlgren, Yunwei Chen, Cristian Colliander, and
Nees Jan van Eck
Publication-level classification system
[Figure: publication-level classification system. Main fields shown: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering.]
Introduction
Purpose of the study:
To analyze whether clustering accuracy can be improved by
combining direct citations with indirect citation relations or text
relations.
Introduction
• We compare 6 publication clustering approaches.
• The main difference between them is how the relatedness of publications is defined.
• We build on, and were inspired by, two studies presented at the ISSI conference in Wuhan in 2017:
  – Chen et al. (2017). A weighted method for citation network community detection.
  – Waltman et al. (2017). A principled methodology for comparing relatedness measures for clustering publications.
Data and methods
• Five-year publication period, 2013-2017.
• About 4 million publications were retrieved from MEDLINE, the largest subset of PubMed.
• PubMed does not contain citation relations between publications. Therefore, we also used Web of Science (WoS) data.
  – Each publication was matched to a publication included in the in-house version of the WoS database available at the Centre for Science and Technology Studies (CWTS) at Leiden University.
Data and methods
  – About 3.5 million publications remained after matching.
  – From these publications, we selected each publication p that satisfies all of the following four conditions:
    1. p has a WoS publication year in the period 2013-2017.
    2. p is of WoS document type Article or Review.
    3. p has both an abstract and a title in its WoS record.
    4. p has a citation relation to at least one publication p' such that p' satisfies conditions 1-3.
  – About 3 million publications were finally obtained.
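The four selection conditions above can be sketched as a small filter. This is only an illustration: the `Publication` record and its field names (including `linked`, the set of publications it cites or is cited by) are hypothetical and do not reflect the WoS database schema, and condition 4 is applied in a single pass.

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    pub_id: str
    year: int        # WoS publication year
    doc_type: str    # WoS document type
    title: str = ""
    abstract: str = ""
    # Hypothetical field: ids of publications this one cites or is cited by.
    linked: set = field(default_factory=set)


def meets_conditions_1_to_3(p):
    """Conditions 1-3: year in 2013-2017, Article or Review, title and abstract present."""
    return (2013 <= p.year <= 2017
            and p.doc_type in {"Article", "Review"}
            and bool(p.title) and bool(p.abstract))


def select_publications(pubs):
    """Condition 4: keep eligible publications linked to at least one other eligible one."""
    eligible = {p.pub_id for p in pubs if meets_conditions_1_to_3(p)}
    return [p for p in pubs
            if p.pub_id in eligible and (p.linked & eligible) - {p.pub_id}]
```

A publication that meets conditions 1-3 but is linked only to publications outside the eligible set is dropped by the last step.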
Data and methods
• Investigated relatedness measures
  – Direct citations (DC). The relatedness of two publications i and j is 1 if there is a direct citation from i to j or from j to i; otherwise the relatedness is 0.
  – Bibliographic coupling (BC). The relatedness of i and j is defined as the number of shared cited references in i and j.
  – Co-citation (CC). The relatedness of i and j is defined as the number of publications that cite both i and j.
  – BM25. Terms (noun phrases) in the titles and abstracts of the publications are used to represent the publications. The approach involves the BM25 measure, a well-known query-publication similarity measure in information retrieval research.
    • The value of the measure for i with j is a sum across all unique terms in the dataset, where the number of occurrences of a term in i and j, the inverse document frequency of the term, and the length of j are taken into account.
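As a minimal sketch of the three citation-based measures defined above, the function below derives DC, BC and CC values from a list of (citing, cited) pairs. The function names and the toy input format are assumptions made for illustration. Pairs are stored as sorted tuples, and the quadratic pair enumeration is meant for small examples, not for millions of publications.

```python
from collections import Counter
from itertools import combinations


def pair(i, j):
    """Canonical key for an unordered pair of publication ids."""
    return (i, j) if i < j else (j, i)


def citation_relatedness(citations):
    """Return (dc, bc, cc) dicts from an iterable of (citing, cited) pairs."""
    citations = list(citations)
    refs = {}    # publication -> set of publications it cites
    citers = {}  # publication -> set of publications citing it
    for citing, cited in citations:
        refs.setdefault(citing, set()).add(cited)
        citers.setdefault(cited, set()).add(citing)

    # DC: 1 if either publication cites the other.
    dc = {pair(a, b): 1 for a, b in citations if a != b}

    # BC: every shared reference adds 1 to each pair of its citers.
    bc = Counter()
    for citing_set in citers.values():
        for a, b in combinations(sorted(citing_set), 2):
            bc[(a, b)] += 1

    # CC: every citing publication adds 1 to each pair it co-cites.
    cc = Counter()
    for ref_set in refs.values():
        for a, b in combinations(sorted(ref_set), 2):
            cc[(a, b)] += 1

    return dc, dict(bc), dict(cc)
```

For example, if A and B both cite C and D, then A and B have a BC relatedness of 2 and C and D have a CC relatedness of 2.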
Data and methods
  – DC-BC-CC. In this approach, direct citations are enhanced by the citation relations corresponding to the approaches BC and CC. We define the relatedness of i and j, $r_{ij}^{\text{DC-BC-CC}}$, as
$$r_{ij}^{\text{DC-BC-CC}} = \alpha r_{ij}^{\text{DC}} + r_{ij}^{\text{BC}} + r_{ij}^{\text{CC}}$$
where $\alpha$ is the weight of direct citations relative to BC and CC. In our analysis, we use 1 and 5 as values of $\alpha$.
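The DC-BC-CC combination can be illustrated with a few lines of Python. This is a sketch under the assumption that each measure is available as a dictionary keyed by the same publication-pair tuples. The function name and the data layout are hypothetical.

```python
def combine_dc_bc_cc(dc, bc, cc, alpha=1.0):
    """Weighted sum alpha * DC + BC + CC over every pair seen in any measure."""
    pairs = set(dc) | set(bc) | set(cc)
    return {p: alpha * dc.get(p, 0) + bc.get(p, 0) + cc.get(p, 0) for p in pairs}


# Toy values: A cites C, A and B share two references, C and D are co-cited twice.
dc = {("A", "C"): 1}
bc = {("A", "B"): 2, ("A", "C"): 1}
cc = {("C", "D"): 2}
combined = combine_dc_bc_cc(dc, bc, cc, alpha=5)
```

With `alpha=5`, the directly linked pair (A, C) gets 5 * 1 + 1 + 0 = 6, while pairs without a direct citation keep only their BC and CC counts.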
Data and methods
  – DC-BM25. In this approach, direct citations are enhanced by the text relations. We define the relatedness of i and j, $r_{ij}^{\text{DC-BM25}}$, as
$$r_{ij}^{\text{DC-BM25}} = \alpha r_{ij}^{\text{DC}} + r_{ij}^{\text{BM25}}$$
where $\alpha$ is the weight of direct citations relative to BM25. The average across all BM25 relatedness values greater than 0 was calculated and turned out to be equal to 50. By setting $\alpha$ to 50, the DC values are put on the same scale as the BM25 relatedness values, in an average sense. Setting $\alpha$ to 25 puts less emphasis on DC, and setting it to 100 puts more. We use all three of these $\alpha$ values in our analysis.
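The scaling argument for DC-BM25 can be sketched as follows: take the mean of the positive BM25 values as the baseline for the weight, then form the weighted sum. The function names are hypothetical, and the sketch assumes that DC and BM25 values are keyed by the same pair tuples; how asymmetric BM25 values are aligned with symmetric DC values is not specified here.

```python
from statistics import mean


def mean_positive(bm25):
    """Average of all BM25 relatedness values greater than 0."""
    return mean(v for v in bm25.values() if v > 0)


def combine_dc_bm25(dc, bm25, alpha):
    """Weighted sum alpha * DC + BM25 over every pair seen in either measure."""
    pairs = set(dc) | set(bm25)
    return {p: alpha * dc.get(p, 0) + bm25.get(p, 0.0) for p in pairs}


# Toy values chosen so that the positive BM25 values average to 50.
bm25 = {("A", "B"): 40.0, ("A", "C"): 60.0, ("B", "C"): 0.0}
dc = {("A", "C"): 1}
baseline = mean_positive(bm25)
combined = combine_dc_bm25(dc, bm25, alpha=baseline)
```

Halving or doubling `baseline` before passing it as `alpha` corresponds to the lower and higher emphasis on DC described above. Note that `mean_positive` raises an error if no BM25 value is positive.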
Data and methods
• Clustering of publications
  – In this study, we use the Leiden algorithm (Traag et al., 2018a, 2018b) to generate a series of clustering solutions for each of the relatedness measures. The Leiden algorithm is used to maximize the Constant Potts Model quality function (Traag et al., 2011; Waltman & Van Eck, 2012).
  – Using different values of the resolution parameter (0.000001, 0.000002, 0.000005, 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01), we obtain 13 clustering solutions for each relatedness measure.
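To illustrate why the resolution parameter controls granularity, the sketch below evaluates a Constant Potts Model style quality for a given partition. It assumes one common formulation, the sum over clusters of internal edge weight minus resolution times the number of node pairs in the cluster; other scalings exist and the exact variant used in the study is not stated here. It only scores a partition and does not implement the Leiden algorithm itself, for which a dedicated implementation would be needed.

```python
from collections import defaultdict


def cpm_quality(edges, membership, resolution):
    """CPM-style quality: sum over clusters of (internal weight - resolution * pairs).

    edges: dict mapping (i, j) node pairs to relatedness weights.
    membership: dict mapping each node to its cluster label.
    """
    sizes = defaultdict(int)
    for cluster in membership.values():
        sizes[cluster] += 1

    internal = defaultdict(float)
    for (i, j), weight in edges.items():
        if membership[i] == membership[j]:
            internal[membership[i]] += weight

    return sum(internal[c] - resolution * n * (n - 1) / 2 for c, n in sizes.items())


# Toy network: two triangles joined by a single edge.
edges = {(1, 2): 1, (1, 3): 1, (2, 3): 1, (4, 5): 1, (4, 6): 1, (5, 6): 1, (3, 4): 1}
split = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
merged = {n: "a" for n in range(1, 7)}
```

On this toy network the merged partition scores higher at a low resolution such as 0.05, and the split partition scores higher at a resolution such as 0.5, which mirrors how larger resolution values yield more and smaller clusters.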
Data and methods
• Evaluation of approach performance
  – We use the evaluation framework of Waltman et al. (2017, 2019).
  – A relatedness measure based on MeSH terms is used as an independent evaluation measure to compare the accuracy of the clustering solutions produced by all approaches.
  – MeSH is a detailed item-level subject classification scheme.
    • MeSH descriptors (more than 28 thousand) and subheadings are used to index publications in PubMed.
    • Approximately 80 subheadings (or qualifiers) can be used by the indexer to qualify a descriptor.
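As a rough illustration of a MeSH-based relatedness measure, the sketch below computes a cosine similarity between the MeSH term sets of two publications. It assumes binary term weights and treats each descriptor (or descriptor/subheading combination) as one string; the weighting and normalization actually used in the study are not given here, so this is a simplification. The function name and the example terms are hypothetical.

```python
import math


def mesh_cosine(terms_i, terms_j):
    """Cosine similarity of two publications' MeSH term sets with binary weights."""
    a, b = set(terms_i), set(terms_j)
    if not a or not b:
        return 0.0  # a publication without MeSH terms is related to nothing
    return len(a & b) / math.sqrt(len(a) * len(b))


similarity = mesh_cosine({"Neoplasms/therapy", "Humans"}, {"Humans", "Mice"})
```

Two publications sharing one of their two terms each get a similarity of 0.5, and identical term sets get 1.0.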
Data and methods
  – The accuracy of the kth ($1 \le k \le 13$) clustering solution for $X \in \{\text{DC}, \text{BC}, \text{CC}, \text{BM25}, \text{DC-BC-CC}, \text{DC-BM25}, \text{MeSH}\}$, where the accuracy is based on MeSH cosine similarity, symbolically $A^{X^k|\text{MeSH}}$, is defined as follows (Waltman et al., 2017, 2019):
$$A^{X^k|\text{MeSH}} = \frac{1}{N} \sum_{i,j} I\left(c_i^{X^k} = c_j^{X^k}\right) r_{ij}^{\text{MeSH}}$$
where N is the number of publications in the dataset, $c_i^{X^k}$ is a positive integer denoting the cluster to which publication i belongs with respect to the kth clustering solution for X, $I(c_i^{X^k} = c_j^{X^k})$ is 1 if its condition is true and 0 otherwise, and $r_{ij}^{\text{MeSH}}$ is the normalized MeSH cosine similarity of i with j.
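The accuracy definition can be sketched directly: sum the MeSH relatedness over all pairs of publications that share a cluster and divide by the number of publications. The function name and input layout are hypothetical. The sketch assumes relatedness values are stored per ordered pair (i, j) and that pairs with i equal to j are excluded, which the definition above does not state explicitly.

```python
def clustering_accuracy(clusters, r_mesh):
    """Accuracy of a clustering solution with respect to MeSH relatedness.

    clusters: dict mapping each publication id to its cluster label.
    r_mesh: dict mapping ordered pairs (i, j) to normalized MeSH relatedness.
    """
    n = len(clusters)
    total = sum(r for (i, j), r in r_mesh.items()
                if i != j and clusters[i] == clusters[j])
    return total / n


# Toy data: four publications in two clusters.
clusters = {"A": 1, "B": 1, "C": 2, "D": 2}
r_mesh = {("A", "B"): 0.5, ("B", "A"): 0.5, ("A", "C"): 0.9,
          ("C", "D"): 0.25, ("D", "C"): 0.25}
accuracy = clustering_accuracy(clusters, r_mesh)
```

Here the pair (A, C) contributes nothing because A and C sit in different clusters, so the accuracy is (0.5 + 0.5 + 0.25 + 0.25) / 4 = 0.375.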
Results
• We visualize the evaluation results by using granularity-accuracy (GA) plots (Waltman et al., 2017, 2019).
• We present three figures containing GA plots.
  – DC and the other citation-based approaches
  – DC and the text-based approaches
  – DC and the best performing approaches
Results
[Figures: three GA plots, one for each of the comparisons listed above.]
Conclusions and future research
• Enhancing direct citations with indirect citation relations (BC-CC) or text relations (BM25) gives rise to substantial performance gains relative to direct citations alone.
• The combination of direct citations and text (BM25) performs best.
• These results assume that MeSH terms serve as an appropriate evaluation measure.
Conclusions and future research
• An extended version of our paper has been submitted to the journal Quantitative Science Studies.
  – One more approach is added: extended direct citations (EDC).
  – EDC shows the best performance.
Conclusions and future research
• It does not follow that two clustering solutions with similar accuracy also have similar groupings of publications into clusters. In view of this, in future studies we aim to further compare the clustering solutions to gain deeper insight into how the solutions based on different relatedness measures differ from each other.
Thank you for your attention!
Results extended journal paper

Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
Β 

Recently uploaded (20)

9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
Β 
Hire πŸ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire πŸ’• 9907093804 Hooghly Call Girls Service Call Girls AgencyHire πŸ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire πŸ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Β 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
Β 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
Β 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
Β 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
Β 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
Β 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
Β 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
Β 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
Β 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Β 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Β 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
Β 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Β 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
Β 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
Β 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Β 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
Β 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
Β 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Β 

Community detection using citation relations and textual similarities in a large set of PubMed publications

  • 1. Community detection using citation relations and textual similarities in a large set of PubMed publications Per Ahlgren, Yunwei Chen, Cristian Colliander, and Nees Jan van Eck
  • 2. Publication-level classification system: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering
  • 3. Introduction Purpose of the study: To analyze whether clustering accuracy can be improved by combining direct citations with indirect citation relations or text relations.
  • 4. Introduction • We compare 6 publication clustering approaches. • The main difference between them is how the relatedness of publications is defined. • We build on, and were inspired by, two studies presented at the ISSI conference in Wuhan 2017: – Chen et al. (2017). A weighted method for citation network community detection. – Waltman et al. (2017). A principled methodology for comparing relatedness measures for clustering publications.
  • 5. Data and methods • Five-year publication period, 2013-2017. • About 4 million publications were retrieved from MEDLINE, the largest subset of PubMed. • PubMed does not contain citation relations between publications. Therefore, we also used Web of Science (WoS) data. – Each publication was matched to a publication included in the in-house version of the WoS database available at the Centre for Science and Technology Studies (CWTS) at Leiden University.
  • 6. Data and methods – About 3.5 million publications remained after matching. – From these publications, we selected each publication p such that p satisfies each of the following four conditions: 1. p has a WoS publication year in the period 2013-2017. 2. p is of WoS document type Article or Review. 3. p has both an abstract and a title with respect to its WoS record. 4. p has a citation relation to at least one publication p' such that p' satisfies points 1-3 in this list. – About 3 million publications were finally obtained.
  • 7. Data and methods • Investigated relatedness measures – Direct citations (DC). The relatedness of two publications i and j is 1 if there is a direct citation from i to j or such a relation from j to i, otherwise the relatedness is 0. – Bibliographic coupling (BC). The relatedness of i and j is defined as the number of shared cited references in i and j. – Co-citation (CC). The relatedness of i and j is defined as the number of publications that cite both i and j. – BM25. Terms (noun phrases) in the titles and abstracts of the publications are used to represent the publications. The approach involves the BM25 measure, a well-known query-publication similarity measure in information retrieval research. • The value of the measure for i with j is a sum across all unique terms in the dataset, where the number of occurrences of a term in i and j, the inverse document frequency of the term and the length of j are taken into account.
  • 8. Data and methods – DC-BC-CC. In this approach, direct citations are enhanced by the citation relations corresponding to the approaches BC and CC. We define relatedness of i and j, $r_{ij}^{\mathrm{DC-BC-CC}}$, as $r_{ij}^{\mathrm{DC-BC-CC}} = \alpha r_{ij}^{\mathrm{DC}} + r_{ij}^{\mathrm{BC}} + r_{ij}^{\mathrm{CC}}$, where $\alpha$ is a weight of direct citations relative to BC and CC. In our analysis, we use 1 and 5 as values of $\alpha$.
  • 9. Data and methods – DC-BM25. In this approach, direct citations are enhanced by the text relations. We define relatedness of i and j, $r_{ij}^{\mathrm{DC-BM25}}$, as $r_{ij}^{\mathrm{DC-BM25}} = \alpha r_{ij}^{\mathrm{DC}} + r_{ij}^{\mathrm{BM25}}$, where $\alpha$ is a weight of direct citations relative to BM25. The average across all BM25 relatedness values greater than 0 was calculated, an average that turned out to be equal to 50. By setting $\alpha$ to 50, the DC values are put on the same scale as the BM25 relatedness values, in an average sense. By setting $\alpha$ to 25 (100), less (more) emphasis would be put on DC. We use all three of these $\alpha$ values in our analysis.
  • 10. Data and methods • Clustering of publications – In this study, we use the Leiden algorithm (Traag et al., 2018a, 2018b) to generate a series of clustering solutions for each of the relatedness measures. The Leiden algorithm is used to maximize the Constant Potts Model as quality function (Traag et al., 2011; Waltman & Van Eck, 2012). – Using different values of the resolution parameter (0.000001, 0.000002, 0.000005, 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01), we obtain 13 clustering solutions for each relatedness measure.
  • 11. Data and methods • Evaluation of approach performance – We use the evaluation framework of Waltman et al. (2017, 2019). – A relatedness measure based on MeSH terms is used as an independent evaluation measure to compare the accuracy of clustering solutions produced by all approaches. – MeSH is a detailed item-level subject classification scheme. • MeSH descriptors (more than 28 thousand) and subheadings are used to index publications in PubMed. • Approximately 80 subheadings (or qualifiers) can be used by the indexer to qualify a descriptor.
  • 12. Data and methods – The accuracy of the kth ($1 \le k \le 13$) clustering solution for $X \in \{\text{DC, BC, CC, BM25, DC-BC-CC, DC-BM25, MeSH}\}$, where the accuracy is based on MeSH cosine similarity, symbolically $A^{X_k|\mathrm{MeSH}}$, is defined as follows (Waltman et al., 2017, 2019): $A^{X_k|\mathrm{MeSH}} = \frac{1}{N} \sum_{i,j} I(c_i^{X_k} = c_j^{X_k})\, r_{ij}^{\mathrm{MeSH(norm)}}$, where N is the number of publications in the dataset, $c_i^{X_k}$ is a positive integer denoting the cluster to which publication i belongs with respect to the kth clustering solution for X, $I(c_i^{X_k} = c_j^{X_k})$ is 1 if its condition is true, otherwise 0, and $r_{ij}^{\mathrm{MeSH(norm)}}$ is the normalized MeSH cosine similarity of i with j.
  • 13. Results • We visualize the evaluation results by using granularity-accuracy (GA) plots (Waltman et al., 2017, 2019). • We present three figures containing GA plots. – DC and the other citation-based approaches – DC and the text-based approaches – DC and best performing approaches
  • 17. Conclusions and future research • Enhancing direct citations with indirect citation relations (BC-CC) or text relations (BM25) gives rise to substantial performance gains relative to direct citations • Combination of direct citations and text (BM25) performs best • These results assume that MeSH terms serve as an appropriate evaluation measure
  • 18. Conclusions and future research • An extended version of our paper has been submitted to the journal Quantitative Science Studies. – One more approach is added: extended direct citations (EDC). – EDC shows the best performance.
  • 19. Conclusions and future research • It does not follow that two cluster solutions with similar accuracy also have similar groupings of publications into clusters. In view of this, in future studies we aim to further compare the clustering solutions to deepen the insight into how the clustering solutions based on different relatedness measures differ from each other.
  • 20. Thank you for your attention!
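The citation-based relatedness measures from slides 7 and 8 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `pair` and `dc_bc_cc`, the toy citation list, and the choice to treat every publication id as part of one closed set are assumptions made for illustration only. The study itself counts only cited references covered by the CWTS in-house version of WoS.

```python
from collections import defaultdict
from itertools import combinations


def pair(a, b):
    """Return an order-independent key for two publication ids."""
    return tuple(sorted((a, b)))


def dc_bc_cc(citations, alpha=1.0):
    """Combine DC, BC and CC as alpha * DC + BC + CC (slide 8).

    citations: iterable of (citing, cited) id pairs (hypothetical input format).
    Returns a dict mapping each unordered pair (i, j) to its relatedness.
    """
    # Assumption: self-citations carry no information and are dropped.
    citations = [(s, t) for s, t in citations if s != t]

    refs = defaultdict(set)    # publication -> publications it cites
    citers = defaultdict(set)  # publication -> publications citing it
    for citing, cited in citations:
        refs[citing].add(cited)
        citers[cited].add(citing)

    dc, bc, cc = {}, defaultdict(int), defaultdict(int)

    # DC: 1 if either publication cites the other.
    for citing, cited in citations:
        dc[pair(citing, cited)] = 1

    # BC: each reference shared by two citing publications adds 1.
    for citing_set in citers.values():
        for i, j in combinations(sorted(citing_set), 2):
            bc[(i, j)] += 1

    # CC: each publication citing both i and j adds 1.
    for ref_set in refs.values():
        for i, j in combinations(sorted(ref_set), 2):
            cc[(i, j)] += 1

    pairs = set(dc) | set(bc) | set(cc)
    return {p: alpha * dc.get(p, 0) + bc.get(p, 0) + cc.get(p, 0) for p in pairs}


# Toy example: A and B cite each other's neighbourhood in several ways.
toy = [("A", "C"), ("B", "C"), ("A", "B"), ("D", "A"), ("D", "B")]
print(dc_bc_cc(toy, alpha=5)[("A", "B")])  # 5*1 (DC) + 1 (BC via C) + 1 (CC via D) = 7
```

With alpha set to 5, a pair linked only by a direct citation scores 5, the same as five shared references would, which is the sense in which the weight boosts direct citations relative to BC and CC.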

Editor's Notes

  1. Compared to the study by Chen et al. (2017), a considerably larger publication set is used in our study, as well as a more sophisticated evaluation methodology. Moreover, in contrast to the earlier work, we use a different approach regarding the combination of direct citations and text relations. Waltman et al. (2017, 2019), in turn, did not evaluate hybrid relatedness approaches (approaches combining citation and text relations). Further, citation-only approaches were only compared to other such approaches in their analysis, and the same was the case for text-only approaches. In our study, however, comparisons across such approach groups are made, due to the use of MeSH as an independent evaluation criterion.
  2. Since direct citations are used in the study, we needed a sufficiently long publication period.
  3. BC: Only cited references pointing to publications covered by the CWTS in-house version of WoS are taken into account.
  4. With this weight, one has the possibility to boost direct citations, which might be considered as stronger signals of the relatedness of two publications compared to a bibliographic coupling or a co-citation relation (Waltman & van Eck, 2012).
  5. The accuracy measure quantifies how similar the publications belonging to the same clusters are with respect to MeSH as an evaluation criterion.
  6. In a GA plot, the horizontal axis represents granularity (as defined earlier), whereas the vertical axis represents accuracy. For a given approach, like DC, a point in the plot represents the accuracy and granularity of a clustering solution, obtained using a certain value of the resolution parameter gamma. Further, a line connects the points of the approach, where accuracy values for granularity values between points are estimated by interpolation. Based on the interpolations, the performance of the approaches can be compared at a given granularity level. CC exhibits the worst performance among the citation-based approaches. DC is outperformed by BC and the two DC-BC-CC variants, whereas BC performs slightly worse than the DC-BC-CC variants, which perform equally well.
  7. BM25 performs better than DC but is outperformed by all three DC-BM25 variants. Of these, the ones with alpha equal to 50 and 100 perform about equally well, and better than the variant that puts less emphasis on DC (alpha = 25).
  8. Enhancing DC by BM25 yields the best performance in our analysis, while DC-BC-CC, where DC is enhanced by the combination of BC and CC, has the second best performance, followed by BC.
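The accuracy measure from slide 12 (see also note 5) can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function name `accuracy` and the input format are hypothetical, the sum is assumed to run over ordered pairs with i different from j, and the normalization (dividing each similarity by the total similarity of publication i with all other publications) is an assumption about what "normalized" means here, since the slides do not spell it out.

```python
from collections import defaultdict


def accuracy(clusters, similarity):
    """Accuracy of one clustering solution against an evaluation similarity.

    clusters:   dict mapping publication id -> cluster id.
    similarity: dict mapping an unordered pair (i, j), i != j, to a symmetric
                similarity such as the MeSH cosine similarity.
    """
    # Assumed normalization: divide r_ij by the row sum of publication i.
    row_sum = defaultdict(float)
    for (i, j), s in similarity.items():
        row_sum[i] += s
        row_sum[j] += s

    total = 0.0
    for (i, j), s in similarity.items():
        if s > 0 and clusters[i] == clusters[j]:
            # Both ordered pairs (i, j) and (j, i) contribute to the sum.
            total += s / row_sum[i] + s / row_sum[j]

    return total / len(clusters)


# Toy data: A and B are similar to each other, C is weakly related to both.
sim = {("A", "B"): 0.8, ("A", "C"): 0.2, ("B", "C"): 0.2}
print(accuracy({"A": 1, "B": 1, "C": 2}, sim))  # A and B grouped, C apart
print(accuracy({"A": 1, "B": 1, "C": 1}, sim))  # everything in one cluster
```

Under this normalization, a single cluster containing every publication reaches the maximum accuracy, so accuracy alone always favours coarser solutions. That is one reason to compare approaches at a given granularity level, as the GA plots do.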