Community detection using citation relations and textual similarities in a large set of PubMed publications
1. Community detection using citation relations and
textual similarities in a large set of PubMed
publications
Per Ahlgren, Yunwei Chen, Cristian Colliander, and
Nees Jan van Eck
2. Publication-level classification system
[Figure: the five main fields of the classification system: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering.]
3. Introduction
Purpose of the study:
To analyze whether clustering accuracy can be improved by
combining direct citations with indirect citation relations or text
relations.
4. Introduction
• We compare 6 publication clustering approaches.
• The main difference between them is how the
relatedness of publications is defined.
• We build on, and were inspired by, two studies
presented at the ISSI conference in Wuhan in 2017:
– Chen et al. (2017). A weighted method for citation network
community detection.
– Waltman et al. (2017). A principled methodology for comparing
relatedness measures for clustering publications.
5. Data and methods
• Five-year publication period, 2013-2017.
• About 4 million publications were retrieved from MEDLINE, the
largest subset of PubMed.
• PubMed does not contain citation relations between
publications. Therefore, we also used Web of Science (WoS)
data.
– Each publication was matched to a publication included in the in-house version of the
WoS database available at the Centre for Science and Technology Studies (CWTS)
at Leiden University.
6. Data and methods
– About 3.5 million publications remained after matching.
– From these publications, we selected each publication p that satisfies all of
the following four conditions:
1. p has a WoS publication year in the period 2013-2017.
2. p is of WoS document type Article or Review.
3. p has both an abstract and a title in its WoS record.
4. p has a citation relation to at least one publication p′ such that p′ satisfies points 1-3 in this list.
– About 3 million publications were finally obtained.
7. Data and methods
• Investigated relatedness measures
– Direct citations (DC). The relatedness of two publications i and j is 1 if there is a
direct citation from i to j or such a relation from j to i, otherwise the relatedness is 0.
– Bibliographic coupling (BC). The relatedness of i and j is defined as the number of
shared cited references in i and j.
– Co-citation (CC). The relatedness of i and j is defined as the number of publications
that cite both i and j.
– BM25. Terms (noun phrases) in the titles and abstracts of the publications are used
to represent the publications. The approach involves the BM25 measure, a
well-known query-publication similarity measure in information retrieval research.
• The value of the measure for i with j is a sum across all unique terms in the dataset, where the
number of occurrences of a term in i and j, the inverse document frequency of the term and the
length of j are taken into account.
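The three citation-based measures above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `citation_relatedness` and the input format (a list of `(citing, cited)` pairs) are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def citation_relatedness(citations):
    """Compute DC, BC and CC relatedness from (citing, cited) pairs.

    Hypothetical helper for illustration. Each returned dict is keyed by an
    unordered publication pair stored as a sorted tuple (i, j).
    """
    citations = list(citations)      # allow any iterable, we loop over it twice
    refs = defaultdict(set)          # publication -> set of its cited references
    cited_by = defaultdict(set)      # publication -> set of publications citing it
    for citing, cited in citations:
        refs[citing].add(cited)
        cited_by[cited].add(citing)

    # DC: 1 if either publication cites the other (pairs without a link are absent).
    dc = {tuple(sorted(pair)): 1 for pair in citations}

    # BC: number of cited references shared by i and j.
    bc = defaultdict(int)
    for citers in cited_by.values():
        for pair in combinations(sorted(citers), 2):
            bc[pair] += 1

    # CC: number of publications that cite both i and j.
    cc = defaultdict(int)
    for cited_set in refs.values():
        for pair in combinations(sorted(cited_set), 2):
            cc[pair] += 1

    return dc, dict(bc), dict(cc)


# Toy citation network: A and B both cite C and D, and C cites D.
toy = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("C", "D")]
dc, bc, cc = citation_relatedness(toy)
```

In the toy network, A and B share two cited references (BC = 2) without citing each other, while C and D are co-cited by both A and B (CC = 2).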
8. Data and methods
– DC-BC-CC. In this approach, direct citations are enhanced by the citation relations
corresponding to the approaches BC and CC. We define the relatedness of i and j,
$r_{ij}^{\text{DC-BC-CC}}$, as
$$r_{ij}^{\text{DC-BC-CC}} = \alpha\, r_{ij}^{\text{DC}} + r_{ij}^{\text{BC}} + r_{ij}^{\text{CC}}$$
where $\alpha$ is the weight of direct citations relative to BC and CC. In our analysis, we use 1
and 5 as values of $\alpha$.
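A small sketch of the weighted sum defined above, assuming each measure is available as a dict keyed by publication pair. The function name `combine_dc_bc_cc` and the toy values are hypothetical.

```python
def combine_dc_bc_cc(dc, bc, cc, alpha=1.0):
    """Return alpha * DC + BC + CC for every pair present in any of the inputs.

    Hypothetical helper: dc, bc and cc are dicts {(i, j): relatedness};
    a pair missing from a dict has relatedness 0 for that measure.
    """
    pairs = set(dc) | set(bc) | set(cc)
    return {p: alpha * dc.get(p, 0) + bc.get(p, 0) + cc.get(p, 0) for p in pairs}


# Toy values, not taken from the study.
dc = {("A", "B"): 1}
bc = {("A", "B"): 2, ("A", "C"): 1}
cc = {("B", "C"): 3}

equal_weight = combine_dc_bc_cc(dc, bc, cc, alpha=1)   # ("A", "B") -> 3
boosted_dc = combine_dc_bc_cc(dc, bc, cc, alpha=5)     # ("A", "B") -> 7
```

Only pairs with a direct citation are affected by the choice of alpha; pairs related through BC or CC alone keep the same value.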
9. Data and methods
– DC-BM25. In this approach, direct citations are enhanced by the text relations. We
define the relatedness of i and j, $r_{ij}^{\text{DC-BM25}}$, as
$$r_{ij}^{\text{DC-BM25}} = \alpha\, r_{ij}^{\text{DC}} + r_{ij}^{\text{BM25}}$$
where $\alpha$ is the weight of direct citations relative to BM25. The average across all BM25
relatedness values greater than 0 was calculated, an average that turned out to be
equal to 50. By setting $\alpha$ to 50, the DC values are put on the same scale as the BM25
relatedness values, in an average sense. By setting $\alpha$ to 25 (100), less (more)
emphasis would be put on DC. We use all three of these $\alpha$ values in our analysis.
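The scaling argument above can be illustrated with a short sketch. It assumes the BM25 relatedness is already available per publication pair; the function names and the toy values are hypothetical and chosen so that the mean of the positive BM25 values is 50.

```python
def mean_positive(values):
    """Average of the values greater than 0 (used to choose the DC weight)."""
    positive = [v for v in values if v > 0]
    return sum(positive) / len(positive)


def combine_dc_bm25(dc, bm25, alpha):
    """Return alpha * DC + BM25 for every pair present in either input."""
    pairs = set(dc) | set(bm25)
    return {p: alpha * dc.get(p, 0) + bm25.get(p, 0) for p in pairs}


# Toy values, not taken from the study.
bm25 = {("A", "B"): 40.0, ("A", "C"): 60.0, ("B", "C"): 0.0}
dc = {("A", "B"): 1}

alpha = mean_positive(bm25.values())           # 50.0 for these toy values
combined = combine_dc_bm25(dc, bm25, alpha)    # ("A", "B") -> 50 * 1 + 40
```

With this choice of alpha, a single direct citation contributes as much as an average non-zero BM25 relation; halving or doubling alpha shifts the emphasis away from or toward DC.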
10. Data and methods
• Clustering of publications
– In this study, we use the Leiden algorithm (Traag et al., 2018a, 2018b) to generate a
series of clustering solutions for each of the relatedness measures. The Leiden
algorithm is used to maximize the Constant Potts Model as quality function (Traag et
al., 2011; Waltman & Van Eck, 2012).
– Using different values of the resolution parameter (0.000001, 0.000002, 0.000005,
0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01), we
obtain 13 clustering solutions for each relatedness measure.
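To show how the resolution parameter shapes the outcome, the sketch below evaluates one common formulation of the Constant Potts Model quality: the sum over clusters of the internal edge weight minus gamma times the number of node pairs in the cluster. This is an assumption about the exact form (conventions differ by constant factors and node weights are ignored), and it only scores a given partition; it is not the Leiden algorithm itself. The function name and the toy graph are hypothetical.

```python
from collections import Counter


def cpm_quality(edges, membership, gamma):
    """Score a partition with a Constant Potts Model style quality function.

    edges: dict {(i, j): weight} with one entry per undirected edge.
    membership: dict {node: cluster id}.
    gamma: resolution parameter.
    """
    internal = Counter()                      # cluster -> total internal edge weight
    for (i, j), weight in edges.items():
        if membership[i] == membership[j]:
            internal[membership[i]] += weight
    sizes = Counter(membership.values())      # cluster -> number of nodes
    return sum(internal[c] - gamma * n * (n - 1) / 2 for c, n in sizes.items())


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
edges = {("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 1, ("C", "D"): 1}
merged = {"A": 0, "B": 0, "C": 0, "D": 0}
split = {"A": 0, "B": 0, "C": 0, "D": 1}

low = (cpm_quality(edges, merged, 0.1), cpm_quality(edges, split, 0.1))
high = (cpm_quality(edges, merged, 0.5), cpm_quality(edges, split, 0.5))
```

At the lower resolution the single merged cluster scores higher, while at the higher resolution the split solution wins, which is why a range of resolution values yields solutions of different granularity.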
11. Data and methods
• Evaluation of approach performance
– We use the evaluation framework of Waltman et al. (2017,
2019).
– A relatedness measure based on MeSH terms is used as
an independent evaluation measure to compare the
accuracy of clustering solutions produced by all
approaches.
– MeSH is a detailed item-level subject classification scheme.
• MeSH descriptors (more than 28 thousand) and subheadings are
used to index publications in PubMed.
• Approximately 80 subheadings (or qualifiers) can be used by the
indexer to qualify a descriptor.
12. Data and methods
– The accuracy of the kth (1 ≤ k ≤ 13) clustering solution for X ∈ {DC, BC, CC, BM25,
DC-BC-CC, DC-BM25, MeSH}, where the accuracy is based on MeSH cosine
similarity, symbolically $A^{X_k|\text{MeSH}}$, is defined as follows (Waltman et al., 2017, 2019):
$$A^{X_k|\text{MeSH}} = \frac{1}{N} \sum_{i,j} I\left(c_i^{X_k} = c_j^{X_k}\right) r_{ij}^{\text{MeSH(norm)}}$$
where N is the number of publications in the dataset, $c_i^{X_k}$ is a positive integer denoting
the cluster to which publication i belongs with respect to the kth clustering solution
for X, $I(c_i^{X_k} = c_j^{X_k})$ is 1 if its condition is true, otherwise 0, and $r_{ij}^{\text{MeSH(norm)}}$ is the
normalized MeSH cosine similarity of i with j.
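The accuracy formula can be read as: add up the normalized MeSH similarity of all publication pairs that end up in the same cluster, then divide by N. A minimal sketch under that reading follows; the function name, the dict-based inputs keyed by ordered pair (i, j), and the toy similarity values are all assumptions for illustration.

```python
def clustering_accuracy(cluster_of, mesh_relatedness):
    """Accuracy of one clustering solution with respect to MeSH relatedness.

    cluster_of: dict {publication: cluster id}.
    mesh_relatedness: dict {(i, j): normalized MeSH similarity of i with j};
        pairs that are absent are treated as having similarity 0.
    """
    n = len(cluster_of)
    same_cluster_total = sum(
        r for (i, j), r in mesh_relatedness.items()
        if cluster_of[i] == cluster_of[j]      # indicator I(c_i = c_j)
    )
    return same_cluster_total / n


# Toy similarity values, not taken from the study.
mesh = {("A", "B"): 0.5, ("B", "A"): 0.25, ("A", "C"): 0.5,
        ("C", "A"): 1.0, ("B", "C"): 0.75}

two_clusters = clustering_accuracy({"A": 0, "B": 0, "C": 1}, mesh)   # (0.5 + 0.25) / 3
one_cluster = clustering_accuracy({"A": 0, "B": 0, "C": 0}, mesh)    # 3.0 / 3
```

The single-cluster solution trivially collects every pair, which is why accuracy has to be compared at a given granularity rather than on its own.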
13. Results
• We visualize the evaluation results by using
granularity-accuracy (GA) plots (Waltman et al., 2017, 2019).
• We present three figures containing GA plots.
– DC and the other citation-based approaches
– DC and the text-based approaches
– DC and best performing approaches
17. Conclusions and future research
• Enhancing direct citations with indirect citation relations
(BC-CC) or text relations (BM25) gives rise to substantial
performance gains relative to direct citations
• Combination of direct citations and text (BM25) performs best
• These results assume that MeSH terms serve as an appropriate
evaluation measure
18. Conclusions and future research
• An extended version of our paper has been submitted to the
journal Quantitative Science Studies.
– One more approach is added: extended direct citations (EDC).
– EDC shows the best performance.
19. Conclusions and future research
• It does not follow that two clustering solutions with similar accuracy
also have similar groupings of publications into clusters. In view
of this, in future studies we aim to compare the clustering
solutions further, to gain more insight into how the
solutions based on different relatedness measures differ from
each other.
Compared to the study by Chen et al. (2017), a considerably larger publication set is used in our study, as well as a more sophisticated evaluation methodology. Moreover, in contrast to the earlier work, we use a different approach regarding the combination of direct citations and text relations.
In contrast to our study, Waltman et al. (2017, 2019) did not evaluate hybrid relatedness approaches (approaches combining citation and text relations). Further, in their analysis citation-only approaches were compared only to other citation-only approaches, and text-only approaches only to other text-only approaches. In our study, comparisons across these groups of approaches are possible because MeSH is used as an independent evaluation criterion.
Since direct citations are used in the study, we needed a sufficiently long publication period.
BC: Only cited references pointing to publications covered by the CWTS in-house version of WoS are taken into account.
With this weight, one has the possibility to boost direct citations, which might be considered as stronger signals of the relatedness of two publications compared to a bibliographic coupling or a co-citation relation (Waltman & van Eck, 2012).
The accuracy measure quantifies how similar the publications belonging to the same clusters are with respect to MeSH as an evaluation criterion.
In a GA plot, the horizontal axis represents granularity (as defined earlier), whereas the vertical axis represents accuracy. For a given approach, like DC, a point in the plot represents the accuracy and granularity of a clustering solution obtained using a certain value of the resolution parameter gamma. Further, a line connects the points of the approach, where accuracy values for granularity values between points are estimated by interpolation. Based on the interpolations, the performance of the approaches can be compared at a given granularity level.
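The interpolation step can be sketched as below. The notes do not say which interpolation scheme is used, so linear interpolation on a log10 granularity axis is an assumption here, as are the function name and the point values, which are illustrative numbers and not results from the study.

```python
import math


def interpolate_accuracy(points, granularity):
    """Estimate accuracy at a given granularity from (granularity, accuracy) points.

    Assumption: linear interpolation between neighbouring points after taking
    log10 of the granularity. Points must have distinct, positive granularities.
    """
    pts = sorted(points)
    x = math.log10(granularity)
    for (g0, a0), (g1, a1) in zip(pts, pts[1:]):
        x0, x1 = math.log10(g0), math.log10(g1)
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return a0 + t * (a1 - a0)
    raise ValueError("granularity outside the range covered by the points")


# Illustrative points for two hypothetical approaches.
approach_a = [(0.001, 10.0), (0.01, 6.0), (0.1, 2.0)]
approach_b = [(0.002, 12.0), (0.02, 7.0), (0.2, 3.0)]

# Compare both approaches at the same granularity level.
level = 0.01
a_at_level = interpolate_accuracy(approach_a, level)
b_at_level = interpolate_accuracy(approach_b, level)
```

Because the clustering solutions of different approaches rarely have exactly the same granularity, the comparison is made on the interpolated lines rather than on the raw points.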
CC exhibits the worst performance among the citation-based approaches. DC is outperformed by BC and the two DC-BC-CC variants, whereas BC performs slightly worse than the DC-BC-CC variants, which perform equally well.
BM25 performs better than DC but is outperformed by all three DC-BM25 variants. Of these, the ones with alpha equal to 50 and 100 perform about equally well, and better than the variant that puts less emphasis on DC (alpha = 25).
Enhancing DC by BM25 yields the best performance in our analysis, while DC-BC-CC, where DC is enhanced by the combination of BC and CC, has the second best performance, followed by BC.