This document discusses developing a risk-of-bias corpus from randomized controlled trials. Annotations were conducted on 10 RCT full texts using the Revised Cochrane Risk of Bias 2.0 tool as guidelines. Inter-annotator agreement was around 75% for identifying text spans and response judgments. Disagreements included annotators marking different text spans or different sections, and diverging on the polarity and degree of the risk-of-bias judgment. Future work includes refining the guidelines through an iterative process to improve annotation quality and expanding the corpus size.
1. First Steps Towards a Risk of Bias Corpus of Randomized Controlled Trials
Presenter – Anjani Dhrangadhariya
MIE2023 - Göteborg, Sweden, 23.05.23
Authors: Anjani Dhrangadhariya, Roger Hilfiker, Martin Sattelmayer, Katia
Giacomino, Rahel Caliesch, Simone Elsig, Nona Naderi, Henning Müller
2. Randomized Controlled Trial
• In theory, an RCT accurately measures intervention effects on patient outcomes, but in practice, biases enter at any stage:
  • Design/Planning
  • Execution
  • Analysis
  • Outcomes reporting
• Systematic Reviews
• Utility for:
  • Medical professionals
  • Health policies
  • Surgeons
3. Risk of Bias (RoB)
• The risk of bias specifically pertains to systematic errors in the design, conduct, or reporting of a study that can potentially lead to a deviation from the true effect being measured.
• Example RoB assessment guidelines (year):
  • Physiotherapy Evidence Database (PEDro), 1999
  • Risk of Bias Assessment Tool for Nonrandomized Studies (RoBANS), 2004
  • Cochrane Risk of Bias assessment guidelines, 2008
  • Risk of Bias in Non-randomized Studies of Interventions (ROBINS-I), 2016
  • Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2), 2017
  • Newcastle-Ottawa Scale (NOS), 2018
  • Revised Cochrane Risk of Bias for RCTs 2.0 tool (RoB 2), 2019
4. RoB information extraction
• Thorough assessment
• Manual assessment
  • Time-consuming
  • Cognitively demanding
• Two experts needed for manual assessment
  • A third for conflict resolution
• Automation is imperative
5. Related Work
• RoB labelled corpus
  • Wang et al. 2022
  • Preclinical animal studies, not human RCTs
• RobotReviewer
  • Highlights RoB evidence in PDFs
  • Freely available, but built on closed-access data
  • Cochrane RoB v1 (RoB 2.0?)
• RoB automation
  • Marshall et al. 2015; Millard et al. 2016
  • Trained on the Cochrane Database of Systematic Reviews (CDSR)
  • Closed access
7. Revised Cochrane RoB 2.0 tool
• Can the guidelines be used to annotate a text corpus?
• Extensive guidelines
• Step-by-step instructions
• Divides RoB into 5 domains:
  • Randomization process
  • Deviations from intended interventions
  • Missing outcomes data
  • Outcomes measurement
  • Selection of reported result
• Each domain is assessed using several signalling questions
Sterne, J.A., Savović, J., Page, M.J., Elbers, R.G., Blencowe, N.S., Boutron, I., Cates, C.J., Cheng, H.Y., Corbett, M.S., Eldridge, S.M. and Emberson, J.R., 2019. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ, 366.
8. Revised Cochrane RoB 2.0 tool
• Reviewers manually go through the RCT to identify text describing the answer to a signalling question.
• Based on the answer to the signalling question, select one of the five response judgements:
  Yes / Probably Yes / Probably No / No / No Information
9. Revised Cochrane RoB 2.0 tool
• Example: signalling question 2.1 - Were the participants aware of their assigned intervention during the trial?
• Example judgement: 2.1 No Good (answering No signals low risk)
• In total: 5 risk domains, 22 signalling questions
10. Annotation schema
• Follows the revised Cochrane RoB 2.0
• 110 span labels (22 signalling questions × 5 response judgements):
  • 1.1 Yes Good
  • 1.1 Probably Yes Good
  • 1.1 Probably No Bad
  • 1.1 No Bad
  • 1.1 No Information
  • 1.2 Yes Good
  • 1.2 Probably Yes Good
  • 1.2 Probably No Bad
  • …
• Label anatomy, e.g. 1.1 Yes Good: risk domain (1), signalling question (1.1), SQ response (Yes), direction (Good)
• Good = low risk, Bad = high risk
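To make the schema concrete, here is a minimal Python sketch of how the 110 labels can be enumerated. The per-domain question counts (3, 7, 4, 5, 3) follow the published RoB 2 tool and multiply out to the 22 questions above; the YES_MEANS_HIGH_RISK set is a hypothetical, incomplete stand-in for the per-question polarity that the RoB 2 guidance defines, shown here only to illustrate why the direction attached to a response varies by question.

# Signalling questions per RoB 2 domain (3+7+4+5+3 = 22 questions).
SQ_COUNTS = {1: 3, 2: 7, 3: 4, 4: 5, 5: 3}

RESPONSES = ["Yes", "Probably Yes", "Probably No", "No", "No Information"]

# For questions like 1.1, "Yes" signals low risk (Good); for questions like
# 2.1 ("Were participants aware of their assigned intervention?"), "Yes"
# signals high risk (Bad). Hypothetical set, not the full RoB 2 mapping.
YES_MEANS_HIGH_RISK = {"2.1"}

def direction(sq_id: str, response: str) -> str:
    """Map a response judgement to Good (low risk) or Bad (high risk)."""
    if response == "No Information":
        return ""  # this response carries no direction in the schema
    affirmative = response in ("Yes", "Probably Yes")
    inverted = sq_id in YES_MEANS_HIGH_RISK
    return "Bad" if affirmative == inverted else "Good"

labels = []
for dom, n_questions in SQ_COUNTS.items():
    for q in range(1, n_questions + 1):
        sq_id = f"{dom}.{q}"
        for resp in RESPONSES:
            labels.append(f"{sq_id} {resp} {direction(sq_id, resp)}".strip())

print(len(labels))  # 110, matching the schema's label count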
11. Pilot Annotation
• Ten RCT full-text PDFs (published 2000-2019)
• Four annotators
  • 2 scientists
  • 1 doctoral student
  • 1 scientific collaborator
• Two NLP experts
  • 1 professor
  • 1 doctoral student
• tagtog PDF annotation tool (https://www.tagtog.com/)
12. Evaluation
• F1-measure as inter-annotator agreement (IAA)
• Disregards out-of-span (unannotated) tokens
1. IAA_SQ: do the annotator pairs annotate the same text span to answer a signalling question (SQ)?
2. IAA_response: if the annotator pairs annotate the same text to answer a signalling question, do they also select the same response judgment?
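As a minimal sketch of this agreement computation (function and variable names are ours, not from the paper): spans are represented as character-offset pairs, and only annotated characters enter the score, so unannotated text is disregarded by construction. How to score the case where neither annotator marks anything is a design choice; the results slides report "zero or no annotation" separately.

def span_tokens(spans):
    """Expand (start, end) character spans into the set of covered offsets."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered

def pairwise_f1(spans_a, spans_b):
    """Character-level F1 between two annotators' spans for one label.

    Only annotated characters enter the computation, so agreement on
    unannotated text contributes nothing, as in the paper's IAA setup.
    """
    a, b = span_tokens(spans_a), span_tokens(spans_b)
    if not a and not b:
        return 1.0  # neither annotated anything; a design choice
    tp = len(a & b)
    precision = tp / len(a) if a else 0.0
    recall = tp / len(b) if b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Annotator A marks a phrase, annotator B the full sentence containing it:
print(round(pairwise_f1([(120, 141)], [(95, 210)]), 2))
# ~0.31: strict overlap heavily penalizes the granularity gap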
13. Results - IAA_SQ
• Zero or no annotation, by domain:
  • Domain 2 - 52%
  • Domain 3 - 54%
  • Domain 4 - 50%
  • Domain 5 - 61% (protocol)
• Less subjective questions yield better IAA
[Table in slides: interpretation of the pairwise F1-measure values.]
14. Results - IAA_response
• IAA on the SQ response judgment, averaged over all annotator pairs: ~75%
• Zero agreement: 52.63%
• No annotation: 22%
[Table in slides: interpretation of the pairwise F1-measure values.]
15. Error Inspection - 1. Text span disagreement
• The guidelines do not limit annotators to a fixed annotation granularity
• Phrases vs. full sentences

Example, SQ 4.1: Was the method of measuring the outcome inappropriate?
Sentence: "…The primary outcome measure was a 0–10 NRS pain score, which reflected the average pain experienced by the patient for ten days prior to follow-up…"
Phrase: "…a 0–10 NRS pain score…"
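During error inspection, such cases can be separated from genuine content disagreements by checking whether one annotator's span is nested inside the other's. A minimal sketch with hypothetical character offsets (the function name is ours):

def granularity_disagreement(span_a, span_b):
    """True when one span is nested inside the other, i.e. a phrase vs.
    full-sentence disagreement rather than a disagreement about which
    text answers the signalling question."""
    (a0, a1), (b0, b1) = span_a, span_b
    nested = (b0 <= a0 and a1 <= b1) or (a0 <= b0 and b1 <= a1)
    return nested and span_a != span_b

# Phrase "…a 0–10 NRS pain score…" inside the full-sentence span:
print(granularity_disagreement((120, 141), (95, 210)))  # True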
16. Error Inspection - 2. Different sections
• Annotators use different regions (Methods section, Results section, Table, …) of the full text to arrive at identical labels.
• Same judgment, different text evidence

Example, SQ 2.6: Was an appropriate analysis used to estimate the effect of assignment to intervention?
One annotator: "…This study was guided by the HAPA, which has been widely used to address the gap between intention to change and a person's actual change in behaviour [25-27]…"
Another annotator: "…intention-to-treat analysis was done with missing data substituted by the last-observation-carried-forward procedure…"
Shared label: 2.1 Yes Good
17. Error Inspection – 3. Polarity disagreement
… 71 allocated routine services, 67 allocated
intervention service, 69 assessed at 8 weeks,
64 assessed at 8 week...
3.1 Were data for the outcome of interest
available for all, or nearly all, participants
randomized?
• Selecting response judgment
options with different polarities
• Yes vs. No
• Three of the four annotators
responded to 3.1 with Yes, but
one chose Probably no.
• All or nearly all (cut-off?)
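A small sketch of why this excerpt splits annotators: the "nearly all" cut-off is exactly the parameter the guidance leaves unspecified, so annotators applying different personal thresholds reach different judgements. The function name and cut-off values below are hypothetical.

def judge_sq_3_1(n_randomized, n_assessed, cutoff):
    """Illustrative response rule for SQ 3.1 (outcome data available for
    'all, or nearly all' participants?). The cutoff is the unspecified
    parameter each annotator fills in from experience."""
    availability = n_assessed / n_randomized
    return "Yes" if availability >= cutoff else "Probably No"

# From the excerpt: 71 + 67 = 138 randomized, 69 + 64 = 133 assessed.
for cutoff in (0.90, 0.95, 0.99):
    print(cutoff, judge_sq_3_1(138, 133, cutoff))
# 133/138 ≈ 0.964: "Yes" at 0.90 and 0.95, but "Probably No" at 0.99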
18. Error Inspection - 4. Degree disagreement
• Lenient annotators choose the definitive options: Yes, No
• Stringent annotators choose the hedged options: Probably yes, Probably no

Example, SQ 1.1: Was a random sequence generation method used to assign participants to intervention groups?
Excerpt: "…Patients were randomly allocated to either intervention by a computer-generated schedule stratified by sex and attendance at a day hospital…"
19. Conclusions
1. RoB 2.0 assessment guidelines cannot be directly used as RoB
corpus annotation guidelines.
2. RoB assessment and RoB text annotation tasks are both highly
subjective, but the annotation guidelines can be refined with an
iterative process to improve both.
21. Annotation team
Dr. Roger Hilfiker
Dr. Martin Sattelmayer
Rahel Caliesch
Katia Giacomino
Dr. Nona Naderi
22. References
1. Wang, Q., Liao, J., Lapata, M., & Macleod, M. (2022). Risk of bias assessment in preclinical literature using natural language processing. Research Synthesis Methods, 13(3), 368-380.
2. Macleod, M. R., O'Collins, T., Howells, D. W., & Donnan, G. A. (2004). Pooling of animal experimental data reveals influence of study design and publication bias. Stroke, 35(5), 1203-1208.
3. Deleger, L., Li, Q., Lingren, T., Kaiser, M., Molnar, K., Stoutenborough, L., Kouril, M., Marsolo, K., & Solti, I. (2012). Building gold standard corpora for medical natural language processing tasks. In AMIA Annual Symposium Proceedings (Vol. 2012, p. 144). American Medical Informatics Association.
4. Sterne, J. A., Savović, J., Page, M. J., Elbers, R. G., Blencowe, N. S., Boutron, I., Cates, C. J., Cheng, H. Y., Corbett, M. S., Eldridge, S. M., & Emberson, J. R. (2019). RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ, 366.
Randomized controlled trials, or RCTs, aim to accurately measure treatment effects on patient outcomes.
In theory, they minimize bias, but in practice, biases tend to creep into any of the trial stages.
When RCTs with such biases are used to write systematic reviews, they reduce the validity and utility of the review.
Biases themselves cannot be measured directly from RCT reports, but the risk of bias can be estimated by identifying systematic flaws in study design, planning, execution, or outcomes reporting.
There are several risk-of-bias assessment guidelines that help thoroughly assess several bias risks in RCT literature.
The latest published guidelines are the revised Cochrane RoB 2.0 guidelines.
These guidelines help you thoroughly assess biases from RCT full texts, but the process of manual RoB assessment is extremely time-consuming, resource-intensive, and cognitively demanding.
Manual bias assessment is challenged by the rapidly rising publication of RCTs, and therefore, automatic RoB information extraction is imperative.
There has been some work on automating RoB information extraction, by Marshall et al. and Millard et al., but the datasets used to train their machine learning models are closed access.
Later, they developed a tool called RobotReviewer, which is freely available but built on closed-access data that is not available to the community, and it automates assessment using the older risk-of-bias guidelines.
Recently, a RoB labelled corpus was released by Wang et al., but the corpus is based on preclinical animal studies, not human RCTs.
So currently, we have no open-access corpus annotated with risk-of-bias judgments, and neither do we have guidelines to build one.
These gaps prompted us to conduct this pilot project.
RoB 2 is an extensive, instructional set of guidelines that helps you assess, step by step, the overall risk of bias of any RCT study.
So before building our own annotation guidelines, we thought we might be able to use the RoB 2 tool to annotate a text corpus as well.
To understand whether we can use RoB 2 for this purpose, we need to examine how it structures the bias assessment procedure.
It divides the biases into 5 domains, each domain loosely translating to each of the trial stages.
Each domain is assessed using several signalling questions.
The reviewers manually go through each signalling question as it appears in the guidelines, and they try to identify text to answer this question in the RCT they are assessing.
Once an answer text is found, they use this information to judge the small portion of risk corresponding to that signalling question.
Based on this judgment, they choose one of the five response options. Note that the direction of a response depends on the question: for some questions, "Yes" means the answer suggests a risk of bias, while for others, "Yes" means everything is alright and there is no risk of bias for that question.
Take, for example, the signalling question 2.1.
It asks whether the participants were aware of their assigned intervention during the trial.
The reviewers identify the answer to this question in the text; let's say they found that the participants were properly blinded and were unaware of their assigned intervention, meaning the bias risk is low and all is good for this signalling question.
The reviewers need to do this for all 22 signalling questions in the RoB 2 tool, so this exact manual procedure could be translated into an annotation process.
We need an annotation schema before starting to annotate the corpus.
We keep our annotation scheme very similar to how the assessment is structured in the RoB2 guidelines.
Each of our span labels contains information about the domain the text is labelled for, the signalling question and also the response judgment.
As the overall task of RoB assessment and annotation is very complex, we wanted to ensure that the way the labels are designed makes it easier for the annotators to annotate.
We then proceeded to annotate 10 full-text RCTs by four experts with varied RoB assessment expertise.
This signalling question asks whether the outcomes data were available for all, or nearly all, participants randomized, but it does not clarify the exact cut-off at which participant dropout increases the risk.
Therefore, the annotators make subjective response judgments depending on what exact percentage of participant dropout they consider acceptable in their experience.