This document discusses developing a risk-of-bias corpus from randomized controlled trials. Annotations were conducted on 10 RCT full texts using the Revised Cochrane Risk of Bias 2.0 tool as guidelines. Inter-annotator agreement was around 75% for identifying text spans and response judgments. Disagreements included annotators marking different text spans or different sections, and diverging on the polarity and degree of the risk-of-bias judgment. Future work includes refining the guidelines through an iterative process to improve annotation quality and expanding the corpus size.
1. First Steps Towards a Risk of Bias Corpus of Randomized Controlled Trials
Presenter – Anjani Dhrangadhariya
MIE2023 - Göteborg, Sweden, 23.05.23
Authors: Anjani Dhrangadhariya, Roger Hilfiker, Martin Sattelmayer, Katia
Giacomino, Rahel Caliesch, Simone Elsig, Nona Naderi, Henning Müller
2. Randomized Controlled Trial
• In theory, an RCT accurately measures intervention effects on patient outcomes, but in practice, biases enter at any stage:
  • Design/Planning
  • Execution
  • Analysis
  • Outcomes reporting
• Systematic Reviews
• Utility for:
  • Medical professionals
  • Health policies
  • Surgeons
3. Risk of Bias (RoB)
• The risk of bias specifically pertains to systematic errors in the design, conduct, or reporting of a study that can potentially lead to a deviation from the true effect being measured.
• Example RoB assessment guidelines (year):
  • Physiotherapy Evidence Database (PEDro), 1999
  • Risk of Bias Assessment Tool for Nonrandomized Studies (RoBANS), 2004
  • Cochrane Risk of Bias assessment guidelines, 2008
  • Risk of Bias in Non-randomized Studies of Interventions (ROBINS-I), 2016
  • Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2), 2017
  • Newcastle-Ottawa Scale (NOS), 2018
  • Revised Cochrane Risk of Bias for RCTs 2.0 tool (RoB 2), 2019
4. RoB information extraction
• Thorough assessment
• Manual assessment
  • Time-consuming
  • Cognitively demanding
• Two experts needed for manual assessment
  • A third for conflict resolution
• Automation is imperative
5. Related Work
• RoB labelled corpus
  • Wang et al. 2022
  • Preclinical animal studies, not human RCTs
• RobotReviewer
  • Highlights RoB evidence in PDFs
  • Freely available, but built on closed-access data
  • Cochrane RoB v1 (RoB 2.0?)
• RoB automation
  • Marshall et al. 2015; Millard et al. 2016
  • Trained on the Cochrane Database of Systematic Reviews (CDSR)
  • Closed access
7. Revised Cochrane RoB 2.0 tool
• Can the guidelines be used to annotate a text corpus?
• Extensive guidelines
• Step-by-step instructions
• Divides RoB into 5 domains:
  • Randomization process
  • Deviations from intended interventions
  • Missing outcomes data
  • Outcomes measurement
  • Selection of reported result
• Each domain is assessed using several signalling questions
Sterne, J.A., Savović, J., Page, M.J., Elbers, R.G., Blencowe, N.S., Boutron, I., Cates, C.J., Cheng, H.Y., Corbett, M.S., Eldridge, S.M. and Emberson, J.R., 2019. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ, 366.
8. Revised Cochrane RoB 2.0 tool
• Reviewers manually go through the RCT to identify text describing the answer to a signalling question.
• Based on the answer to the signalling question, select one of the five response judgements:
  Yes / Probably Yes / Probably No / No / No Information
9. Revised Cochrane RoB 2.0 tool
• Example: signalling question 2.1 - Were the participants aware of their assigned intervention during the trial?
• Example judgement: 2.1 No Good (answering No signals low risk)
• In total: 5 risk domains, 22 signalling questions
10. Annotation schema
• Follows the revised Cochrane RoB 2.0
• 110 span labels (22 signalling questions × 5 response judgements):
  • 1.1 Yes Good
  • 1.1 Probably Yes Good
  • 1.1 Probably No Bad
  • 1.1 No Bad
  • 1.1 No Information
  • 1.2 Yes Good
  • 1.2 Probably Yes Good
  • 1.2 Probably No Bad
  • …
• Label anatomy, e.g. 1.1 Yes Good: risk domain (1), signalling question (1.1), SQ response (Yes), direction (Good)
• Good = low risk, Bad = high risk
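To make the schema concrete, here is a minimal Python sketch of how the 110 labels can be enumerated. The per-domain question counts (3, 7, 4, 5, 3) follow the published RoB 2 tool and multiply out to the 22 questions above; the YES_MEANS_HIGH_RISK set is a hypothetical, incomplete stand-in for the per-question polarity that the RoB 2 guidance defines, shown here only to illustrate why the direction attached to a response varies by question.

# Signalling questions per RoB 2 domain (3+7+4+5+3 = 22 questions).
SQ_COUNTS = {1: 3, 2: 7, 3: 4, 4: 5, 5: 3}

RESPONSES = ["Yes", "Probably Yes", "Probably No", "No", "No Information"]

# For questions like 1.1, "Yes" signals low risk (Good); for questions like
# 2.1 ("Were participants aware of their assigned intervention?"), "Yes"
# signals high risk (Bad). Hypothetical set, not the full RoB 2 mapping.
YES_MEANS_HIGH_RISK = {"2.1"}

def direction(sq_id: str, response: str) -> str:
    """Map a response judgement to Good (low risk) or Bad (high risk)."""
    if response == "No Information":
        return ""  # this response carries no direction in the schema
    affirmative = response in ("Yes", "Probably Yes")
    inverted = sq_id in YES_MEANS_HIGH_RISK
    return "Bad" if affirmative == inverted else "Good"

labels = []
for dom, n_questions in SQ_COUNTS.items():
    for q in range(1, n_questions + 1):
        sq_id = f"{dom}.{q}"
        for resp in RESPONSES:
            labels.append(f"{sq_id} {resp} {direction(sq_id, resp)}".strip())

print(len(labels))  # 110, matching the schema's label count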
11. Pilot Annotation
• Ten RCT full-text PDFs (published 2000-2019)
• Four annotators
  • 2 scientists
  • 1 doctoral student
  • 1 scientific collaborator
• Two NLP experts
  • 1 professor
  • 1 doctoral student
• tagtog PDF annotation tool (https://www.tagtog.com/)
12. Evaluation
• F1-measure as inter-annotator agreement (IAA)
• Disregards out-of-span (unannotated) tokens
1. IAA_SQ: do the annotator pairs annotate the same text span to answer a signalling question (SQ)?
2. IAA_response: if the annotator pairs annotate the same text to answer a signalling question, do they also select the same response judgment?
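As a minimal sketch of this agreement computation (function and variable names are ours, not from the paper): spans are represented as character-offset pairs, and only annotated characters enter the score, so unannotated text is disregarded by construction. How to score the case where neither annotator marks anything is a design choice; the results slides report "zero or no annotation" separately.

def span_tokens(spans):
    """Expand (start, end) character spans into the set of covered offsets."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered

def pairwise_f1(spans_a, spans_b):
    """Character-level F1 between two annotators' spans for one label.

    Only annotated characters enter the computation, so agreement on
    unannotated text contributes nothing, as in the paper's IAA setup.
    """
    a, b = span_tokens(spans_a), span_tokens(spans_b)
    if not a and not b:
        return 1.0  # neither annotated anything; a design choice
    tp = len(a & b)
    precision = tp / len(a) if a else 0.0
    recall = tp / len(b) if b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Annotator A marks a phrase, annotator B the full sentence containing it:
print(round(pairwise_f1([(120, 141)], [(95, 210)]), 2))
# ~0.31: strict overlap heavily penalizes the granularity gap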
13. Results - IAA_SQ
• Zero or no annotation, by domain:
  • Domain 2 - 52%
  • Domain 3 - 54%
  • Domain 4 - 50%
  • Domain 5 - 61% (protocol)
• Less subjective questions yield better IAA
[Table in slides: interpretation of the pairwise F1-measure values.]
14. Results - IAA_response
• IAA on the SQ response judgment, averaged over all annotator pairs: ~75%
• Zero agreement: 52.63%
• No annotation: 22%
[Table in slides: interpretation of the pairwise F1-measure values.]
15. Error Inspection - 1. Text span disagreement
• The guidelines do not limit annotators to a fixed annotation granularity
• Phrases vs. full sentences

Example, SQ 4.1: Was the method of measuring the outcome inappropriate?
Sentence: "…The primary outcome measure was a 0–10 NRS pain score, which reflected the average pain experienced by the patient for ten days prior to follow-up…"
Phrase: "…a 0–10 NRS pain score…"
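During error inspection, such cases can be separated from genuine content disagreements by checking whether one annotator's span is nested inside the other's. A minimal sketch with hypothetical character offsets (the function name is ours):

def granularity_disagreement(span_a, span_b):
    """True when one span is nested inside the other, i.e. a phrase vs.
    full-sentence disagreement rather than a disagreement about which
    text answers the signalling question."""
    (a0, a1), (b0, b1) = span_a, span_b
    nested = (b0 <= a0 and a1 <= b1) or (a0 <= b0 and b1 <= a1)
    return nested and span_a != span_b

# Phrase "…a 0–10 NRS pain score…" inside the full-sentence span:
print(granularity_disagreement((120, 141), (95, 210)))  # True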
16. Error Inspection - 2. Different sections
• Annotators use different regions (Methods section, Results section, Table, …) of the full text to arrive at identical labels.
• Same judgment, different text evidence

Example, SQ 2.6: Was an appropriate analysis used to estimate the effect of assignment to intervention?
One annotator: "…This study was guided by the HAPA, which has been widely used to address the gap between intention to change and a person's actual change in behaviour [25-27]…"
Another annotator: "…intention-to-treat analysis was done with missing data substituted by the last-observation-carried-forward procedure…"
Shared label: 2.1 Yes Good
17. Error Inspection – 3. Polarity disagreement
… 71 allocated routine services, 67 allocated
intervention service, 69 assessed at 8 weeks,
64 assessed at 8 week...
3.1 Were data for the outcome of interest
available for all, or nearly all, participants
randomized?
• Selecting response judgment
options with different polarities
• Yes vs. No
• Three of the four annotators
responded to 3.1 with Yes, but
one chose Probably no.
• All or nearly all (cut-off?)
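A small sketch of why this excerpt splits annotators: the "nearly all" cut-off is exactly the parameter the guidance leaves unspecified, so annotators applying different personal thresholds reach different judgements. The function name and cut-off values below are hypothetical.

def judge_sq_3_1(n_randomized, n_assessed, cutoff):
    """Illustrative response rule for SQ 3.1 (outcome data available for
    'all, or nearly all' participants?). The cutoff is the unspecified
    parameter each annotator fills in from experience."""
    availability = n_assessed / n_randomized
    return "Yes" if availability >= cutoff else "Probably No"

# From the excerpt: 71 + 67 = 138 randomized, 69 + 64 = 133 assessed.
for cutoff in (0.90, 0.95, 0.99):
    print(cutoff, judge_sq_3_1(138, 133, cutoff))
# 133/138 ≈ 0.964: "Yes" at 0.90 and 0.95, but "Probably No" at 0.99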
18. Error Inspection - 4. Degree disagreement
• Lenient annotators choose the definitive options: Yes, No
• Stringent annotators choose the hedged options: Probably yes, Probably no

Example, SQ 1.1: Was a random sequence generation method used to assign participants to intervention groups?
Excerpt: "…Patients were randomly allocated to either intervention by a computer-generated schedule stratified by sex and attendance at a day hospital…"
19. Conclusions
1. RoB 2.0 assessment guidelines cannot be directly used as RoB
corpus annotation guidelines.
2. RoB assessment and RoB text annotation tasks are both highly
subjective, but the annotation guidelines can be refined with an
iterative process to improve both.
21. Annotation team
Dr. Roger Hilfiker
Dr. Martin Sattelmayer
Rahel Caliesch
Katia Giacomino
Dr. Nona Naderi
22. References
1. Wang, Q., Liao, J., Lapata, M., & Macleod, M. (2022). Risk of bias assessment in preclinical literature using natural language processing. Research Synthesis Methods, 13(3), 368-380.
2. Macleod, M. R., O'Collins, T., Howells, D. W., & Donnan, G. A. (2004). Pooling of animal experimental data reveals influence of study design and publication bias. Stroke, 35(5), 1203-1208.
3. Deleger, L., Li, Q., Lingren, T., Kaiser, M., Molnar, K., Stoutenborough, L., Kouril, M., Marsolo, K., & Solti, I. (2012). Building gold standard corpora for medical natural language processing tasks. In AMIA Annual Symposium Proceedings (Vol. 2012, p. 144). American Medical Informatics Association.
4. Sterne, J. A., Savović, J., Page, M. J., Elbers, R. G., Blencowe, N. S., Boutron, I., Cates, C. J., Cheng, H. Y., Corbett, M. S., Eldridge, S. M., & Emberson, J. R. (2019). RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ, 366.
Randomized controlled trials, or RCTs, aim to accurately measure treatment effects on patient outcomes.
In theory, they minimize bias, but in practice, biases tend to creep into any of the trial stages.
When RCTs with such biases are used to write systematic reviews, they reduce the validity and utility of the review.
Biases themselves cannot be measured directly from RCT reports, but the risk of bias can be estimated by identifying systematic flaws in study design, planning, execution, or outcomes reporting.
There are several risk-of-bias assessment guidelines that help thoroughly assess several bias risks in RCT literature.
The latest published guidelines are the revised Cochrane RoB 2.0 guidelines.
These guidelines help you thoroughly assess biases from RCT full texts, but the process of manual RoB assessment is extremely time-consuming, resource-intensive, and cognitively demanding.
Manual bias assessment is challenged by the rapidly rising publication of RCTs, and therefore, automatic RoB information extraction is imperative.
There has been some work on automating RoB information extraction, by Marshall et al. and Millard et al., but the datasets used to train their machine learning models are closed access.
Later, they developed a tool called RobotReviewer, which is freely available but built on closed-access data that is not available to the community, and it automates assessment using the older risk-of-bias guidelines.
Recently, a RoB labelled corpus was released by Wang et al., but the corpus is based on preclinical animal studies, not human RCTs.
So currently, we have no open-access corpus annotated with risk-of-bias judgments, and neither do we have guidelines to build one.
These gaps prompted us to conduct this pilot project.
RoB 2 is an extensive, instructional set of guidelines that helps you assess, step by step, the overall risk of bias of any RCT study.
So before building our own annotation guidelines, we thought we might be able to use the RoB 2 tool to annotate a text corpus as well.
To understand whether we can use RoB 2 for this purpose, we need to examine how it structures the bias assessment procedure.
It divides the biases into 5 domains, each domain loosely translating to each of the trial stages.
Each domain is assessed using several signalling questions.
The reviewers manually go through each signalling question as it appears in the guidelines, and they try to identify text to answer this question in the RCT they are assessing.
Once an answer text is found, they use this information to judge the small portion of risk corresponding to that signalling question.
Based on this judgment, they choose one of the five response options. Note that the direction of a response depends on the question: for some questions, "Yes" means the answer suggests a risk of bias, while for others, "Yes" means everything is alright and there is no risk of bias for that question.
Take, for example, the signalling question 2.1.
It asks whether the participants were aware of their assigned intervention during the trial.
The reviewers identify the answer to this question in the text; let's say they found that the participants were properly blinded and were unaware of their assigned intervention, meaning the bias risk is low and all is good for this signalling question.
The reviewers need to do this for all 22 signalling questions in the RoB 2 tool, so this exact manual procedure could be translated into an annotation process.
We need an annotation schema before starting to annotate the corpus.
We keep our annotation scheme very similar to how the assessment is structured in the RoB2 guidelines.
Each of our span labels contains information about the domain the text is labelled for, the signalling question and also the response judgment.
As the overall task of RoB assessment and annotation is very complex, we wanted to ensure that the way the labels are designed makes it easier for the annotators to annotate.
We then proceeded to annotate 10 full-text RCTs by four experts with varied RoB assessment expertise.
This signalling question asks whether the outcomes data were available for all, or nearly all, participants randomized, but it does not clarify the exact cut-off at which participant dropout increases the risk.
Therefore, the annotators make subjective response judgments depending on what exact percentage of participant dropout they consider acceptable in their experience.