Postgraduate Medical Education and Training Board

Developing and maintaining an assessment system - a PMETB guide to good practice

Guidance, January 2007
Foreword

Developing and maintaining an assessment system - a PMETB guide to good practice completes the guidance for medical Royal Colleges and Faculties who are developing assessment systems based on curricula approved by PMETB. As the title implies, this is a good practice guide rather than a cook book providing recipes for assessment systems. This guide covers the assessment Principles 3, 4 and 6. As indicated by PMETB, the Colleges and Faculties developing assessment systems have until August 2010 to comply with all nine Principles of assessment produced by PMETB (1). These Principles highlight the issues that need to be addressed for transparency and fairness to trainees, and to encourage curriculum designers not to forget the duty of care owed to trainers and trainees alike.

The work was undertaken by the Assessment Working Group, which consisted of people from different disciplines of medicine who are considered experts in designing assessments. This good practice guide explains some of the challenges faced by anyone who is devising an assessment system. As far as possible we have developed this guidance in the context of practicality, feasibility in respect of quality management, utility, sources of evidence required for competency progression, standard setting and integrating various assessments, bearing in mind the fine balance between training and service requirements. We have avoided being prescriptive and have not produced a toolbox of PMETB approved assessment tools. Instead, we recommend that the Colleges and Faculties should consult the guidance produced by the Academy of Medical Royal Colleges (AoMRC) as well as Modernising Medical Careers (MMC) to choose assessment tools which comply with PMETB's Principles for an assessment system for postgraduate medical training (1).

PMETB has been fortunate in securing the services of highly skilled and enthusiastic experts who worked in their own time on the Assessment Working Group to produce this document. I would like to extend my grateful thanks to all these dedicated people on behalf of PMETB. I would principally like to mention the dedication and effort of the work stream leaders, Dr Gareth Holsgrove, Dr Helena Davies and Professor David Rowley, who have worked through all hours in developing this guide.

Dr Has Joshi FRCGP
Chair of the Assessment Committee and the Assessment Working Group, PMETB
January 2007
The Assessment Working Group

• Dr Has Joshi FRCGP: Chair of the Assessment Committee and the Assessment Working Group
• Jonathan D Beard: Consultant Vascular Surgeon and Education Tutor, Royal College of Surgeons of England
• Dr Nav Chana: General Practitioner, RCGP Assessor
• Helena Davies: Senior Lecturer in Late Effects/Medical Education, University of Sheffield
• John C Ferguson: Consultant Surgeon, Southern General Hospital, Glasgow
• Professor Tony Freemont: Chair of the Examiners in Histopathology, Royal College of Pathologists
• Miss L A Hawksworth: Director of Certification, PMETB
• Dr Gareth Holsgrove: Medical Education Adviser, Royal College of Psychiatrists
• Dr Namita Kumar: Consultant Rheumatologist and Physician
• Dr Tom Lissauer FRCPCH: Officer for Exams, Royal College of Paediatrics and Child Health, and Consultant Neonatologist, St Mary's Hospital, London
• Dr Andrew Long: Consultant Paediatrician and Director of Medical Education, Princess Royal University Hospital
• Dr Amit Malik: Specialist Registrar in Psychiatry, Nottinghamshire Healthcare NHS Trust
• Dr Keith Myerson: Member of Council, Royal College of Anaesthetists
• Mr Chris Oliver: Co-Convener Examinations, Royal College of Surgeons of Edinburgh and Consultant Trauma Orthopaedic Surgeon, Edinburgh Orthopaedic Trauma Unit
• Jan Quirke: Board Secretary, PMETB
• Professor David Rowley: Director of Education, Royal College of Surgeons Edinburgh
• Dr David Sales: RCGP Assessment Fellow
• Professor Dame Lesley Southgate: PMETB Board member to November 2006 and Chair of the Assessment Committee to January 2006
• Dr Allister Vale: Medical Director, MRCP (UK) Examination
• Winnie Wade: Director of Education, Royal College of Physicians
• Val Wass: Professor of Community Based Medical Education, Division of Primary Care, University of Manchester
• Laurence Wood: RCOG representative to the Academy of Royal Colleges, Associate Postgraduate Dean West Midlands

Editors

PMETB would like to thank the following individuals for their editorial assistance in the production of this guide to good practice:
• Helena Davies - Senior Lecturer in Late Effects/Medical Education, University of Sheffield
• Dr Has Joshi FRCGP - Chair of the Assessment Committee and the Assessment Working Group
• Dr Gareth Holsgrove - Medical Education Adviser, Royal College of Psychiatrists
• Professor David Rowley - Director of Education, Royal College of Surgeons Edinburgh
Table of Contents

Introduction ... 6
Chapter 1: An assessment system based on principles ... 7
  Introduction ... 7
  Utility ... 7
    What is meant by utility? ... 7
    Why is the utility index important? ... 8
  Defining the components ... 8
    Reliability ... 8
    Validity ... 9
    Educational impact ... 10
    Cost, acceptability and feasibility ... 12
Chapter 2: Transparent standard setting in professional assessments ... 13
  Introduction ... 13
  Types of standard ... 13
  Prelude to standard setting ... 14
    Standard setting methods ... 15
    Test based methods ... 15
    Trainee or performance based methods ... 16
    Combined and hybrid methods - a compromise ... 16
  Hybrid standard setting in performance assessment ... 16
  Standard setting for skills assessments ... 17
  A proposed method for standard setting in skills assessments using a hybrid method ... 17
  Standard setting for workplace based assessment ... 19
  Anchored rating scales ... 19
  Decisions about borderline trainees ... 19
    Making decisions about borderline trainees ... 19
  Summary ... 20
  Conclusion ... 20
Chapter 3: Meeting PMETB approved Principles of assessment ... 21
  Introduction ... 21
  Quality assurance and workplace based assessment ... 21
  The learning agreement ... 22
  Appraisal ... 22
  The annual assessment ... 23
  Additional information ... 23
    Quality assuring summative exams ... 23
  Summary ... 25
Chapter 4: Selection, training and evaluation of assessors ... 26
  Introduction ... 26
  Selection of assessors ... 26
  Assessor training ... 26
  Feedback for assessors ... 27
Chapter 5: Integrating assessment into the curriculum - a practical guide ... 28
  Introduction ... 28
  Blueprinting ... 28
  Sampling ... 28
  The assessment burden (feasibility) ... 29
  Feedback ... 29
Chapter 6: Constructing the assessment system ... 30
  Introduction ... 30
  Purposes ... 30
  One classification of assessments ... 30
References ... 32
Further reading ... 34
Appendices ... 36
  Appendix 1: Reliability and measurement error ... 36
    Reliability ... 36
    Measurement error ... 37
  Appendix 2: Procedures for using some common methods of standard setting ... 39
    Test based methods ... 39
    Trainee based methods ... 40
    Combined and compromise methods ... 41
  Appendix 3: AoMRC, PMETB and MMC categorisation of assessments ... 42
    Purpose ... 42
    The categories ... 42
  Appendix 4: Assessment good practice plotted against GMP ... 44
Glossary of terms ... 46
Introduction

This guide explains some of the challenges which face anyone devising an assessment system in response to the Principles for an assessment system for postgraduate medical training laid out by PMETB (1). The original principles have not needed any fundamental changes since they were written and serve to highlight the issues which should be addressed to ensure transparency and fairness for trainees, and encourage curricular designers to ensure a proper duty of care to trainers and trainees alike. Most importantly, the principles assure the general public that trainees who undergo professional accreditation will be assessed properly and only those who have achieved the required level of competence are allowed to progress.

In producing this document we wish to acknowledge that most colleges and many committed individuals, usually as volunteers, have considerable expertise in the area of assessment. Where possible we have drawn on that expertise and we hope this is reflected in the text. What we wish to do is to provide a long term framework for continuing to improve assessments for all parties, from trainees to patients. Educational trends change but principles do not, and by providing this document PMETB wishes to set a benchmark against which a programme of continuing quality improvement can progress.

A full glossary of terms related to assessment in the context of medical education can be found on page 46. In particular, to ensure consistency with other PMETB guidance, we largely refer to 'assessment systems' rather than 'assessment programmes' and 'quality management' rather than 'quality control'. The term 'assessor' should be assumed to encompass examiners for formal exams as well as those undertaking assessments in other contexts, including the workplace. 'Assessment instrument' is used throughout to refer to individual assessment methods. The guide is a reference document rather than a narrative and it is anticipated that it will help organisations collate all the information likely to be asked of them by PMETB in any quality assurance activity.
Chapter 1: An assessment system based on principles

Introduction

This chapter aims to:
• define what is meant by utility in relation to assessment;
• explain the importance of utility and its evolution from the original concept;
• provide clear workable definitions of the components of utility that enable the reader to understand their relevance and importance to their own assessment context;
• summarise the existing evidence base in relation to utility, both to provide a guide for the interested reader and to reassure those responsible for assessment that there is a body of research that can be referred to;
• highlight gaps where further research is needed.

Utility

The 'utility index' described by Cees van der Vleuten (2) in 1996 serves as an excellent framework for assessment design and evaluation (3).

What is meant by utility?

Figure 1: Utility = educational impact x validity x reliability x cost x acceptability x feasibility* (*feasibility was not in van der Vleuten's original utility index but is included explicitly here because of its importance)

The original utility index described by Cees van der Vleuten consisted of five components:
• Reliability
• Validity
• Educational impact
• Cost efficiency
• Acceptability

Given the massive change in postgraduate training in the UK and the significantly increased assessment burden which has occurred, it is important that feasibility is explicitly acknowledged as an additional sixth component (although it is implicit in cost effectiveness and acceptability). It acknowledges that optimising any assessment tool or programme is about balancing the six components of the utility index. Choice of assessment instruments and aspirations for high validity and reliability are limited by the constraints of feasibility, e.g. resources to deliver the tests and acceptability to the trainees. The relative importance of the components of the utility index for a given assessment will depend on both the purpose and nature of the assessment system. For example, a high stakes examination on which progression into higher specialist training is dependent will need high reliability and validity, and may focus on this at the expense of educational impact. In contrast, an assessment which focuses largely on providing a trainee with feedback to inform their own personal development planning would focus on educational impact, with less of an emphasis on reliability (Figure 2). Figure 2 illustrates the relative importance of reliability vs educational impact, depending on the purpose of the assessment.
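As a purely illustrative sketch of the balancing act the utility index describes, the snippet below scores a hypothetical assessment instrument against the six components and weights them differently for a high stakes versus a formative purpose. The scores, weights and the exponent-weighted product are invented for illustration and are not PMETB values or a prescribed calculation.

```python
# Illustrative sketch only: weighing one instrument against the six utility
# components. All numbers are invented; this is not a PMETB formula.

COMPONENTS = ["reliability", "validity", "educational_impact",
              "cost_efficiency", "acceptability", "feasibility"]

def utility(scores, weights):
    """Weighted multiplicative utility: components trade off rather than add up,
    so a very low score on any one component drags the whole product down."""
    u = 1.0
    for c in COMPONENTS:
        u *= scores[c] ** weights.get(c, 1.0)
    return u

# A high stakes exam might weight reliability and validity heavily...
high_stakes_weights = {"reliability": 2.0, "validity": 2.0, "educational_impact": 0.5}
# ...whereas a formative workplace assessment might weight educational impact.
formative_weights = {"reliability": 0.5, "validity": 1.0, "educational_impact": 2.0}

workplace_tool = {"reliability": 0.6, "validity": 0.9, "educational_impact": 0.9,
                  "cost_efficiency": 0.8, "acceptability": 0.8, "feasibility": 0.7}

print("High stakes weighting:", round(utility(workplace_tool, high_stakes_weights), 3))
print("Formative weighting:  ", round(utility(workplace_tool, formative_weights), 3))
```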
Why is the utility index important?

There is an increasing recognition that no single assessment instrument can adequately assess clinical performance and that assessment planning should focus on assessment systems with triangulation of data in order to build up a complete picture of a doctor's performance (4, 5). The purpose of both the overall assessment system and its individual assessment instruments must be clearly defined for PMETB. The rationale for developing and implementing the individual components within the assessment system must be transparent, justifiable and based on supportive evidence from the literature, where possible, whilst at the same time recognising the constraints of the 'real world'. The utility index is an important component of the framework presented to PMETB and allows a series of questions that should be asked of each assessment instrument - and of the assessment system as a whole. An understanding of these individual components of the utility index will help when planning or reviewing assessment programmes in order to ensure that, where possible, all of its components have been addressed adequately.

Defining the components

Reliability

What is the quality of the results? Are they consistent and reproducible? Reliability is of central importance in assessment because trainees, assessors, regulatory bodies and the public alike want to be reassured that assessments used to ensure that doctors are competent would reach the same conclusions if it were possible to administer the same test again on the same doctor in the same circumstances. Reliability is a quantifiable measure which can be expressed as a coefficient and is most commonly approached using classical test theory or generalisability analysis (6-8, 3, 9). A perfectly reproducible test would have a coefficient of 1.0; that is, 100% of the trainees would achieve the same rank order on retesting. In reality, tests are affected by many sources of potential error such as examiner judgments, cases used, trainee nervousness and test conditions. Traditionally, a reliability coefficient of greater than 0.8 has been considered an appropriate cut off for high stakes assessments. It is recognised, however, that reliability coefficients at this level will not be achievable for some assessment tools, but they may nevertheless be a valuable part of an assessment programme, both to provide additional evidence for triangulation and/or because of their effect on learning.

Estimation of reliability as part of the overall quality management (QM) of an assessment programme will require specialist expertise, over and above that to be found in those simply involved as assessors. This means evidence of the application of appropriate psychometric and statistical support for the evaluation of the programme should be provided. Exploration of sources of bias is essential as part of the overall evaluation of the programme, and collection of, for example, demographic data to allow exploration of effects such as age, gender, race, etc, must be planned at the outset. Intrinsic to the validity of any assessment is analysis of the scores to quantify their reproducibility. An assessment cannot be viewed as valid unless it is reliable. PMETB will require evidence of the reliability of each component appropriate to the weight given to that component in the utility equation.
Figure 2: Utility function (C van der Vleuten). U = R x V x E, where U = utility, R = reliability, V = validity and E = educational impact. For in-training formative assessment, educational impact carries most of the weight; for a high stakes assessment, reliability and validity dominate.
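To make the idea of a reliability coefficient concrete, the following is a minimal sketch (with invented marks) of one classical test theory estimate, Cronbach's alpha, computed from a trainees by cases score matrix. A generalisability analysis, as referred to above, would go further and partition variance attributable to assessors and cases; that is not shown here.

```python
# Minimal illustration (invented data): Cronbach's alpha for a trainees x cases
# score matrix. Classical test theory only; a generalisability study would also
# separate assessor and case variance components.
from statistics import variance

scores = [  # rows = trainees, columns = cases (hypothetical marks out of 10)
    [6, 7, 5, 8],
    [4, 5, 4, 6],
    [8, 9, 7, 9],
    [5, 6, 6, 7],
    [7, 6, 8, 8],
]

def cronbach_alpha(matrix):
    n_items = len(matrix[0])
    item_vars = [variance([row[i] for row in matrix]) for i in range(n_items)]
    total_var = variance([sum(row) for row in matrix])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

print(round(cronbach_alpha(scores), 2))  # e.g. 0.92 for the invented data above
```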
Remember that sufficient testing time is essential in order to achieve adequate reliability. It is becoming increasingly clear that, whatever the format, total testing time, ensuring breadth of content sampling and sufficient individual assessments by individual assessors, is critical to the reliability of any clinical competence test (10) (Box 1).

Box 1: Reliability as a function of testing time (reliability coefficient by total testing time)

Method                          1 hour   2 hours   4 hours   8 hours
MCQ [1]                         0.62     0.76      0.93      0.93
Case based short essay [2]      0.68     0.73      0.84      0.82
PMP [1]                         0.36     0.53      0.69      0.82
Oral exam [3]                   0.50     0.69      0.82      0.90
Long case [4]                   0.60     0.75      0.86      0.90
OSCE [5]                        0.47     0.64      0.78      0.88
mini-CEX [6]                    0.73     0.84      0.92      0.96
Practice video assessment [7]   0.62     0.76      0.93      0.93
Incognito SPs [8]               0.61     0.76      0.92      0.93

A significant current challenge is to introduce sampling frameworks into workplace based assessments of performance which sample sufficiently to address issues of content specificity. Because content specificity (differences in performance across different clinical problem areas) and assessor variability consistently represent the two greatest threats to reliability, sampling of both clinical content and assessors is essential and this should be reflected in assessment system planning. PMETB will require an explanation of the weight placed on each assessment tool within the modified utility index. For example, a workplace based assessment aimed at testing performance has a higher weighting for validity (at the apex of Miller's Pyramid - Figure 3) but may not achieve a high stakes reliability coefficient of 0.8, as it is difficult to standardise content. On the other hand, the inclusion of a defensible clinical competency assessment in an artificial summative examination environment to decide on progression may be justified on the grounds of high reliability, but only achieved at the expense of high face validity.

Validity

Reliability is a measure of how reproducible a test is: if you administered the same assessment again to the same person, would you get the same outcome? Validity is a measure of how completely an assessment tests what it is designed to test. There is usually a trade off between validity and reliability, as the assessment with perfect validity and perfect reliability does not exist. However, it is important to recognise that if a test is not reliable it cannot be valid. Validity is a conceptual term which should be approached as a hypothesis and cannot be expressed as a simple coefficient (11). It is evaluated against the various facets of clinical competency. Traditionally, a number of facets of validity have been defined separately (12) (Box 2), acknowledging that evaluating the validity of an assessment requires multiple sources of evidence. An alternative approach, arguing that validity is a unitary concept which requires these multiple sources of evidence to evaluate and interpret the outcomes of an assessment, has also been proposed (11). Box 2 summarises the traditional facets of validity and can provide a useful framework for evaluating validity. Predictive and consequential validity are important but poorly explored aspects of assessment, particularly workplace based assessment. Consequential validity is integral to evaluation of educational impact. Predictive validity may not be able to be evaluated for many years, but plans to determine predictive validity should be described and this will be facilitated by high quality centralised data management.
Sources for Box 1: [1] Norcini et al., 1985; [2] Stalenhoef-Halling et al., 1990; [3] Swanson, 1987; [4] Wass et al., 2001; [5] Petrusa, 2002; [6] Norcini et al., 1999; [7] Ram et al., 1999; [8] Gorter, 2002.
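The pattern in Box 1, reliability rising with total testing time, can be projected with the Spearman-Brown formula, which estimates the reliability of a test lengthened by a factor k. The sketch below is illustrative only; the coefficients in Box 1 come from the cited empirical studies, not from this formula.

```python
# Spearman-Brown projection: reliability of a test lengthened k-fold, given the
# reliability r of the original test. Illustrative only; not the source of Box 1.

def spearman_brown(r, k):
    return (k * r) / (1 + (k - 1) * r)

# If one hour of testing yields r = 0.62 (as for the MCQ row in Box 1),
# the projection for longer testing times is roughly:
for hours in (1, 2, 4, 8):
    print(hours, "h ->", round(spearman_brown(0.62, hours), 2))
# Prints approximately 0.62, 0.77, 0.87, 0.93
```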
Box 2: Traditional facets of validity

Face validity
• Test facet being measured: compatibility with the curriculum's educational philosophy.
• Questions being asked: What is the test's face value? Does it match up with the educational intentions?

Content validity (may also be referred to as direct validity)
• Test facet being measured: the content of the curriculum.
• Questions being asked: Does the test include a representative sample of the subject matter?

Construct validity (may also be referred to as indirect validity)
• Test facet being measured: does the evidence in relation to the assessment support a sensible underpinning construct or constructs?
• Questions being asked: What is the construct, and does the evidence support it? E.g. differentiation between novice and expert on a test of overall clinical assessments; do two assessments designed to test different things have a low correlation?

Predictive validity
• Test facet being measured: the ability to predict an outcome in the future, e.g. professional success after graduation.
• Questions being asked: Does the test predict future performance and level of competency?

Consequential validity
• Test facet being measured: the educational consequence or impact of the test.
• Questions being asked: Does the test produce the desired educational outcome?

Educational impact

Assessment must have clarity of educational purpose and be designed to maximise learning in areas relevant to the curriculum. Based on the assumption that assessment drives the learning that underpins training, it should be used strategically to promote desirable learning strategies, in contrast to some of the learning behaviours that have been promoted by traditional approaches to assessment within medicine. Careful planning is essential. Agreement on how to maximise educational impact must be an integral part of planning assessment, and the rationale and thinking underpinning this must be evident to those reviewing the assessment programme. Several different factors contribute to overall educational impact and a number of questions will therefore need to be considered.

a) What is the educational intent or purpose of the assessment?

In the past a clear distinction between summative and formative assessment has been made. However, in line with modern assessment theory, the PMETB principles emphasise the importance of giving students feedback on all assessments, encouraging reflection and deeper learning (PMETB Principle 5 (1)). The purpose of assessment should be clearly described. For example, is it for final certification, is it to determine progress from one stage to another, is it to determine whether to exclude an individual from their training programme, etc? Feedback should be provided that is relevant to the purpose as well as the content of the assessment, in order that personal development planning in relation to the relevant curriculum can take place effectively. If assessment focuses only on certification and exclusion, the all important potential for a beneficial influence on the learning process will be lost. All those designing and delivering assessment should explore ways of enabling feedback to be provided at all stages and make their intentions transparent to trainees. A quality enhanced assessment system cannot be effective without high quality feedback. In order to plan appropriate feedback it is essential for there to be clarity of purpose for the assessment system: for example, what aptitudes are you aiming to assess, at what level of expertise, and how was the content of the assessment defined relative to the curriculum - a process known as blueprinting (13).
b) What level of competence are you trying to assess? Is it knowledge, competence or performance?

A helpful and widely utilised framework for describing levels of competence is provided by Miller's Pyramid (14) (Figure 3). The base represents the knowledge components of competence: 'Knows' (basic facts) followed by 'Knows How' (applied knowledge). The progression to 'Knows How' highlights that there is more to clinical competency than knowledge alone.
'Shows How' represents a behavioural rather than a cognitive function, i.e. it is 'hands on' and not 'in the head'. Assessment at this level requires an ability to demonstrate a clinical competency in a controlled environment, e.g. an objective structured clinical examination (OSCE). The ultimate goal for a highly valid assessment of clinical ability is to test performance - the 'Does' of Miller's Pyramid, i.e. what the doctor actually does in the workplace. It must be clear at which level you are aiming to test.

c) At what level of expertise is the assessment set?

Any assessment design must accommodate the progression from novice through competency to expertise. It must be clear against what level the trainee is being assessed. A number of developmental progressions have been described for knowledge, including those in Bloom's taxonomy (15) (Figure 4). Frameworks are also being developed for the clinical competency model (16). Most of the Royal Colleges are working on assessment frameworks that describe a progression in terms of level of expertise as trainees move through specialty training. When designing an assessment system it is important to identify the level of expertise anticipated at that point in training. The question 'Is the assessment and standard appropriate for the particular level of training under scrutiny?' must always be asked. It is not uncommon to find questions in postgraduate examinations assessing basic factual knowledge at undergraduate level rather than applied knowledge reflective of the trainees' postgraduate experience.

d) Is the clinical content clearly defined?

Once the purpose of the assessment is agreed, test content must be carefully planned against the curriculum and intended learning outcomes, a process known as blueprinting (12, 17, 18) (see also page 28). The aim of an assessment blueprint is to ensure that sampling within the assessment system ensures adequate coverage of:

i) A conceptual framework - a framework against which to map assessment is essential. PMETB recommends Good Medical Practice (GMP) (19) as the broad framework for all UK postgraduate assessments.

ii) Content specificity - blueprinting must also ensure that the contextual content of the curriculum is covered. Content needs careful planning to ensure trainees are comprehensively and fairly assessed across their entire training period. Professionals do not perform consistently from task to task or across the range of clinical content (20). Wide sampling of content is essential (13). Schuwirth and van der Vleuten (3) highlight the importance of considering the context as well as the content of assessment. Context-rich methods test application of knowledge, whereas context-free questions test only the underpinning knowledge base. Sampling broadly to cover the full range of the curriculum is of paramount importance if fair and reliable assessments are to be guaranteed. Blueprinting is essential to the appropriate selection of assessment methods; it is not until the purpose and the content of the assessments have been decided that the assessment methods should be chosen.

iii) Selection of assessment methods once blueprinting has been undertaken should take account of the likely educational impact; the nature of assessment methods will influence approaches to learning as well as the stated content coverage. (A simple illustrative sketch of a blueprint follows below.)
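By way of illustration, a blueprint can be thought of as a simple grid mapping curriculum outcomes to a conceptual framework such as GMP and to the assessment methods intended to sample them. The sketch below uses invented outcomes, domain labels and methods, and simply checks the grid for gaps in sampling; it is not a PMETB template.

```python
# Illustrative blueprint sketch (invented outcomes and methods): map each
# curriculum outcome to a GMP-style domain and the assessment method(s) that
# sample it, then flag outcomes left unsampled.

blueprint = {
    # curriculum outcome: (framework domain, assessment methods sampling it)
    "acute chest pain assessment":     ("Good clinical care", {"MCQ", "mini-CEX"}),
    "breaking bad news":               ("Relationships with patients", {"OSCE"}),
    "prescribing in renal impairment": ("Good clinical care", {"MCQ"}),
    "handover to colleagues":          ("Working with colleagues", set()),  # not yet sampled
}

uncovered = [outcome for outcome, (_, methods) in blueprint.items() if not methods]
print("Outcomes with no planned assessment:", uncovered)

by_method = {}
for outcome, (_, methods) in blueprint.items():
    for m in methods:
        by_method.setdefault(m, []).append(outcome)
print("Sampling by method:", by_method)
```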
e) Triangulation - how do the different components relate to each other to ensure educational impact is achieved?

It is important to develop an assessment system which builds up evidence of performance in the workplace and avoids reliance on examinations alone. Triangulation of observed contextualised performance tasks of 'Does' can be assessed alongside high stakes competency based tests of 'Shows How' and knowledge tests where appropriate (Figure 5).

Figure 3: Miller's Pyramid - 'Knows' and 'Knows How' (the knowledge components, tested by, for example, MCQs and CCS), 'Shows How' (competence, tested by, for example, an OSCE) and, at the apex, 'Does' (performance/action, tested by work based assessment).

Figure 4: Bloom's taxonomy - a progression towards expertise from Knowledge (define, describe), through Comprehension (interpret, discuss), Application (demonstrate), Analysis (order), Synthesis (integrate, design) to Evaluation (appraise, discriminate).

Individual assessment instruments should be chosen in the light of the content and purpose of that component of the
assessment system. PMETB will require evidence of how the different methods used relate to each other to ensure an appropriate educational balance has been achieved.

Figure 5: Triangulation - evidence from exams and work based assessment is triangulated.

Cost, acceptability and feasibility

PMETB recognises that assessment takes place in a real world where pragmatism must be balanced against assessment idealism. It is essential that consideration is given to issues of feasibility, cost and acceptability, recognising that assessments will need to be undertaken in a variety of contexts with a wide range of assessors and trainees. Explicit consideration of feasibility is an essential part of evaluation of any assessment programme.

a) Feasibility and cost

Assessment is inevitably constrained by feasibility and cost. Trainee numbers, venues for structured examinations, the use of real patients, timing of exit assessments and the availability of assessors in the workplace all place constraints on assessment design. These factors should be part of your explanation to justify the design of your assessment package. Management of the overall assessment system, including the infrastructure to support it, is an important contributor to feasibility. In general, centralisation is likely to increase cost effectiveness. All assessments incur costs and these must be acknowledged and quantified.

b) Acceptability

Both the trainees' and the assessors' perspectives must be taken into account. At all levels of education, trainees naturally tend to feel overloaded by work and prioritise those aspects of the curriculum which are assessed. To overcome this, the assessment package must be designed to mirror and drive the educational intent. It must be acceptable to the learner. The balance is a fine one: creating too many burdensome, time consuming assessment 'hurdles' can detract from the educational opportunities of the curriculum itself (21, 22). Consideration of the acceptability of assessment programmes to assessors is also important. All assessment programmes are dependent on the goodwill of assessors, who are usually balancing participation in assessment against many other conflicting commitments. Formal evaluation of acceptability is an important component of QM and approaches to this should be documented.

The high stakes of professional assessments need to be acknowledged not only for trainees and the potential colleagues of those being assessed, but also for the general public on whom professionals practise. It is therefore essential that any assessment system is transparent, understandable and demonstrably comprehensible to the general public as well as other stakeholders.
Chapter 2: Transparent standard setting in professional assessments

Introduction

Standard setting is the process used to establish the level of performance required by an examining body for an individual trainee to be judged as competent. This might be simply in the recall or (preferably) the application of factual knowledge; competence in specific skills or technical procedures; performance, day in day out, in the workplace; or a combination of some or all of these. Whatever the aspect and level of performance, the standard is the answer to the question 'How much is enough?' (23), and is the point that separates those trainees who pass the assessment from those who do not. In other words, it is the pass mark or, in North America, the cut or cutting score. It should be noted, too, that there will almost inevitably be a group of trainees with marks close to the pass mark that the assessment cannot reliably place on one side or the other. Having considered some methods for standard setting, this section will discuss reliability and measurement error, and how to identify these borderline trainees.

However, even before describing the processes and outcomes of standard setting, it must be recognised that although the concept of standard setting might seem straightforward, its methods and the debate surrounding them are not. In fact, there is a cohort of educational academics (such as Gene Glass, 1978 (24)) who are highly critical of the whole concept of standard setting. Certainly, they are not without a case. It is widely known that the standard set for a given test can vary according to the methods used. Experience also shows that different assessors set different standards for the same test using the same method. The aim of this guide, however, is to be practical rather than philosophical and, therefore, PMETB agrees with Cizek's conclusion that 'the particular approach to standard setting selected may not be as critical to the success of the endeavour as the fidelity and care with which it is conducted' (25). Since PMETB's priority is to ensure that passing standards are set with due diligence and at sufficiently robust levels to ensure patient safety, assessors should choose methods that they are happy with. It is essential that the people using those methods are appropriately trained and approach the task in a fair and professional manner.

The literature describes a wide variety of methods (26) and the procedures for many are set out very clearly (27). Nevertheless, it quickly becomes plain that there is no single best standard setting method for all tests, although there is often a particularly appropriate method for each assessment. However, there are three main requirements in the choice of method. It must be:
• defensible, to the extent that it can assure the stakeholders about its validity;
• explicable, through the rationale behind the decisions made;
• stable, as it is not defensible if the standards vary over time (28).

Simply selecting the most appropriate method of standard setting for each element in an examination is not enough. As mentioned above, the selection and training of the judges or subject experts who set the standard for passing assessments is as important as the chosen methodology (29).

Types of standard

There are two different kinds of standard - relative and absolute.
Relative standards are based on a comparison between trainees, and they pass or fail according to how well they perform in relation to the other trainees. An exam in which there is a fixed pass rate (for example, the top 80% of the top 200 trainees pass) uses a relative standard. By contrast, when an absolute standard is applied, trainees pass or fail according to their own performance, irrespective of how any of the other trainees perform. It is generally accepted that unless there is a particularly good reason to pass or fail a predetermined number of trainees, an absolute standard (based on individual trainee performance) should be used. The methods described in this chapter are for setting absolute standards. This is because absolute or criterion-referenced standards are preferred for any assessment used to inform licensing decisions.

The use of relative standards might result in passing trainees with little regard to their ability. For example, if all the trainees in a cohort were exceptionally skilled, the use of norm-referenced standards (passing the top n% of the trainees) would result in failing (i.e. misclassifying) a certain proportion who, in fact, possess adequate ability. This is certainly unfair and at variance with the purpose of a test of competence. Moreover, since relative standards will vary over time with the ability of the trainees being assessed, the reliability of any competence based classifications could be questionable. Therefore, if valid measures of competence are desired, it is essential to set standards with reference to some absolute and defined performance criterion (30). In other words, standards should be set using absolute methods.
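A small worked example (with invented scores) of the misclassification problem described above: in an exceptionally able cohort, a norm-referenced rule fails trainees who comfortably meet an absolute criterion, whereas a criterion-referenced pass mark does not.

```python
# Invented example: an exceptionally strong cohort. A norm-referenced rule
# passing the top 80% fails two trainees who clearly meet the criterion;
# an absolute (criterion-referenced) pass mark does not.

scores = [92, 90, 88, 87, 86, 85, 84, 83, 82, 81]  # all well above the criterion
absolute_pass_mark = 60

norm_referenced_pass = sorted(scores, reverse=True)[: int(len(scores) * 0.8)]
criterion_referenced_pass = [s for s in scores if s >= absolute_pass_mark]

print("Pass top 80%:   ", len(norm_referenced_pass), "of", len(scores))      # 8 of 10
print("Pass mark >= 60:", len(criterion_referenced_pass), "of", len(scores)) # 10 of 10
```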
Absolute standard setting methods can be broadly classified as either assessment centred or individual trainee (performance) centred. In test centred methods, theoretical decisions based on test content are used to derive a standard, whereas in trainee centred methods, judgments regarding actual trainee performance are used to determine the passing score. It is implicit that each of these methods will ensure that all performance test material is subjected to standard setting as an integral part of test development.

In order to set the standard, three things need to be established:
• the purpose of the assessment;
• the domains to be assessed;
• the level of the trainee at the time of that particular assessment.

Based on these characteristics, there are various established methods of standard setting that can be used. However, before describing them it is necessary to point out that there has been something of a tradition in UK medical education for standards to be set without giving proper consideration to these points. For example, there are still examinations in which the pass mark has been set quite arbitrarily, often long before the exams themselves have even been written, and subsequently enshrined in the regulations, making change particularly difficult. There are ways around this, but it would still be far better if pass marks were not predetermined in this way. Moreover, the examination methods are often also stipulated, but not the content of the exam. Correctly, of course, what is to be assessed should be established before selecting the methods. Clearly, such procedures are quite unacceptable in contemporary postgraduate medical education, particularly when the consequences of passing or failing assessments can be so important. Indeed, despite all the talk about maintaining (or, contemporarily, 'driving up') standards, the process of determining what the standards are has been an extraordinarily lax affair (31). PMETB requirements are proving to be a powerful incentive in bringing about long overdue improvements, and examining bodies are increasingly striving to ensure that all aspects of their assessments are conducted properly, defensibly and transparently. Standard setting is an important element in this.

Prelude to standard setting

Before the standard can be set there must be agreement about the purpose of the assessment, what will be assessed and how, and the level of expertise that trainees might be expected to demonstrate. For example, the assessment might be made to check that progress through the curriculum is satisfactory, or to identify problems and difficulties at an early stage when they will probably be easier to resolve. On the other hand, it might be to confirm completion of a major stage of professional development, such as graduation or Royal College membership. This will also enable the standard setters to agree on the trainees' expected level of expertise. Experience has shown that the level of expertise is very often overestimated, particularly by content experts, which is one of several good reasons why assessors should undertake the assessment themselves before they set the standards. This overestimate of trainees' ability can lead to the pass mark being set unrealistically high, or to the assessment containing an excessive proportion of difficult items.
Contrary to the belief held by many assessors that difficult exams sort out the best trainees, the most effective assessment items are generally found to be those that are moderately difficult and discriminate well, while covering a wide sample of the prescribed curriculum. Even trainee centred methods of standard setting, such as the borderline group method and the contrasting groups method (described below and expanded upon in Appendix 2), depend on the discriminatory power of the items - individually in the borderline group method and across the exam as a whole in contrasting groups.

The content of the assessment will be determined by the curriculum, in accordance with Principle 2 (1). In the case of workplace based assessment, the content might be described in terms of competencies and other observable behaviours and rated against descriptions of levels of performance. In formal assessments as part of set piece examinations, the content should reflect the relative importance of aspects of the curriculum, so that essential and important elements predominate.

The standard in workplace based assessment is therefore usually determined by specific levels of performance for items often prescribed on a checklist, and this is discussed below. In formal examination style assessments the standard will take account of the importance and difficulty of the individual items of assessment, and a method is described below which includes this consideration. The methods described below are principally used as part of formal examinations, but this guide also discusses some issues of standard setting in workplace based assessment.
Standard setting methods

There are several methods for standard setting. They can be seen as falling into four categories:
• relative methods;
• absolute methods based on judgments about the trainees;
• absolute methods based on judgments about the test items;
• combined and compromise methods.

As indicated above, this guide will not consider relative methods in any more detail. This leaves methods based on judgments about the trainees, methods based on judgments about individual assessment items, and combined and compromise methods. In test centred methods, theoretical decisions based on test content are used to derive a standard, whereas in trainee centred methods, judgments regarding actual trainee performance are used to determine the appropriate passing score. Test based and compromise methods include the three main methods of standard setting in formal knowledge based exams, though there are several variations of each method. Trainee based methods are currently gaining in popularity.

The simplest test based method is Angoff's (32). Ebel's (33) method is slightly more complicated, yet probably leads to a better examination design. Both are based on judgments about the assessment items. This guide also describes two trainee based methods and the Hofstee method, a combined/compromise method which is more complex and best used with large cohorts of trainees.

Test based methods

In general, test based methods require assessors to act as subject experts making judgments regarding the anticipated performance of 'just passing' trainees (i.e. a 'borderline pass') on defined content or skills. The Angoff procedure is probably the best known example of an assessment centred method and has subsequently undergone various modifications.

1) Angoff's method

Originally developed for standard setting in multiple choice examinations, this method has also been used to set standards on the history taking and physical examination checklist items that are often used for scoring cases in skills assessments. Here, the assessors are required to make judgments, as subject experts, as to the probability of a 'just passing' trainee answering the particular question or performing (correctly) the indicated task. The assessors' mean scores are used to calculate a standard for the case. However, this method is better suited to standard setting in knowledge tests as it has some significant disadvantages for performance testing. First, it is very time consuming and labour intensive, especially when there are multiple checklists. Second, it may yield overly stringent standards. Third, and more importantly, since the resulting standard is a mean assessment across items and/or tasks, the use of this method makes the implicit assumption that ratings on tasks are independent. However, this assumption is often untenable with performance assessments because of the phenomenon of case specificity: essentially, individual checklist items are often interrelated within a task. As a result, the assessors' judgments are not totally independent, potentially invalidating the use of this method for setting standards (34).
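A minimal sketch of the basic Angoff computation for a knowledge test, using invented judgments: each judge estimates the probability that a 'just passing' trainee would answer each item correctly, and the cut score is the sum, across items, of the mean estimates.

```python
# Basic Angoff sketch (invented judgments): rows = judges, columns = items.
# Each entry is the judged probability that a borderline ('just passing')
# trainee answers that item correctly.

judgments = [
    [0.7, 0.5, 0.8, 0.6, 0.4],   # judge 1
    [0.6, 0.6, 0.9, 0.5, 0.5],   # judge 2
    [0.8, 0.4, 0.7, 0.6, 0.5],   # judge 3
]

n_items = len(judgments[0])
item_means = [sum(j[i] for j in judgments) / len(judgments) for i in range(n_items)]
cut_score = sum(item_means)  # expected raw score of a borderline trainee

print("Expected borderline score:", round(cut_score, 2), "out of", n_items)
print("Pass mark as a percentage:", round(100 * cut_score / n_items, 1), "%")
```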
An alternative approach is to instruct the subject experts to make judgments regarding the number of checklist items that a 'just passing' trainee would be expected to obtain credit for. While this may reduce the problem of checklist item dependencies and substantially shorten the time taken to set standards, the task of deciding how many items constitute a borderline pass remains challenging with regard to rules of combination and compensation. As a result, the precision of the standard derived may be compromised. See Appendix 2 for further details.

2) Ebel's method

Only slightly more complicated than Angoff's method, Ebel's method can be considerably more useful in practice, especially when building and managing question banks. Holsgrove and Kauser Ali's (35) modification of Ebel's method was developed for a large group of postgraduate medical exams, namely the Membership and Fellowship exams of the College of Physicians and Surgeons, Pakistan. This modification has the advantage of not only helping examiners to set a passing standard, but also of producing an examination with appropriate coverage of essential, important and supplementary material, with a balance of difficult, moderate and easy items. These three factors are important in improving the stability of examinations when repeated many times.

In its original form, Ebel's method was suitable for simple right/wrong items, such as multiple choice questions (MCQs) of the 'one best answer' type. The Holsgrove and Kauser Ali modification allows it to be used for more complex items such as OSCE stations, and work is underway to explore its potential for workplace based assessment. See Appendix 2 for more details.
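The sketch below illustrates the general shape of an Ebel grid with invented figures: items are classified by relevance and difficulty, the judges agree the proportion of items in each cell that a borderline trainee should get right, and the cut score is the weighted sum over the actual item counts. It is an outline of the underlying arithmetic only, not the Holsgrove and Kauser Ali modification itself.

```python
# Ebel's method sketch (invented figures). Each cell of the relevance x difficulty
# grid holds (number of items in the exam, judged proportion a borderline trainee
# would answer correctly). The cut score is the weighted sum across cells.

grid = {
    ("essential", "easy"):         (20, 0.90),
    ("essential", "moderate"):     (15, 0.70),
    ("essential", "hard"):         (5,  0.50),
    ("important", "easy"):         (10, 0.80),
    ("important", "moderate"):     (10, 0.60),
    ("important", "hard"):         (5,  0.40),
    ("supplementary", "easy"):     (3,  0.70),
    ("supplementary", "moderate"): (2,  0.50),
}

total_items = sum(n for n, _ in grid.values())
cut_score = sum(n * p for n, p in grid.values())

print("Items in exam:", total_items)
print("Cut score:", round(cut_score, 1), f"({round(100 * cut_score / total_items, 1)}%)")
```

The same grid also exposes the balance of the paper (how many easy, moderate and hard items there are in each relevance band), which is the second advantage of the method noted above.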
Trainee or performance based methods

Trainee or performance based methods have been used increasingly as the standard setting method of choice in clinical skills and performance assessments. These methods are more intuitively appealing to assessors as they afford greater ease when making judgments about specific performances. Additionally, assessors find the process and results more credible because the standard is derived from judgments based on the actual test performances (30). Instead of providing judgments based on test materials, the panel of subject experts is invited to review a series of trainee performances and make judgments about the demonstrated level of proficiency.

1) Borderline group method

Described by Livingston and Zieky in 1982 (23), this method requires expert judges to observe multiple trainees on a single station or case (rather than following a single trainee around the circuit) and give a global rating for each on a three point scale:
• pass;
• borderline;
• fail.

The performance is also scored, either by the same assessor or another, on a checklist. Trained simulated patients might be considered sufficiently expert to serve as assessors, especially in communication and interpersonal skills. The global ratings on the three point scale are used to establish the checklist 'score' that will be used for the passing standard. A variety of modifications of this method exist (36), but it is important in all of them that the examiners are able to determine a borderline performance level of skills in the domain sampled. See Appendix 2 for more detail.

2) Contrasting groups method

Procedures such as the contrasting groups method (37) and associated modifications have focused on the actual performance of contrasting groups of trainees identified by a variety of methods. This method requires that the trainees are divided into two groups, which can be variously labelled as pass/fail, satisfactory/unsatisfactory, competent/not competent, etc. There are various ways of doing this, for example using external criteria or specific competencies set out in the curriculum and specified on the multiple item score sheet. However, the group into which they are placed depends on their global rating across the performance criteria. Assessors rate each trainee's performance at each station or case, using a specific score sheet for each. After the assessment, scores from each of the two contrasting groups are expressed graphically and the passing standard is provisionally set where the two groups intersect. In practice this almost always produces an overlap in the score distributions of the contrasting groups. However, this method allows for further scrutiny and adjustment, so that if the point of intersection is found to allow trainees to pass who should rightly have failed (or vice versa), the pass mark can be adjusted appropriately. See Appendix 2 for more detail.
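The following sketch, using invented station data, shows how the two trainee based methods above might convert global judgments into a checklist pass mark. For brevity both methods are applied to the same data, and the contrasting groups calculation is a crude approximation (the midpoint between the two group means); in practice the two score distributions are plotted and the intersection inspected and adjusted by the panel, as described above.

```python
# Sketch of two trainee based methods on one station (invented data).
# Each tuple: (checklist score out of 20, assessor's global rating).

results = [
    (18, "pass"), (16, "pass"), (15, "pass"), (14, "pass"),
    (13, "borderline"), (12, "borderline"), (11, "borderline"),
    (10, "fail"), (9, "fail"), (7, "fail"),
]

# Borderline group method: the pass mark is the mean checklist score of the
# trainees whose performance the assessors rated 'borderline'.
borderline_scores = [s for s, r in results if r == "borderline"]
borderline_cut = sum(borderline_scores) / len(borderline_scores)

# Contrasting groups method (rough approximation): provisional cut placed
# between the 'pass' and 'fail' score distributions, here simply the midpoint
# of the two group means.
pass_scores = [s for s, r in results if r == "pass"]
fail_scores = [s for s, r in results if r == "fail"]
contrasting_cut = (sum(pass_scores) / len(pass_scores)
                   + sum(fail_scores) / len(fail_scores)) / 2

print("Borderline group cut:", round(borderline_cut, 1))
print("Contrasting groups cut (approx):", round(contrasting_cut, 1))
```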
Combined and hybrid methods - a compromise
There are various approaches to standard setting that combine aspects of other methods or, in the case of the Hofstee method described below, both relative and absolute methods. A proposed hybrid method is described later in this section.

Hofstee’s method
This is probably the best known of the compromise methods, which combines aspects of both relative and absolute standard setting. It takes account both of the difficulty of the individual assessment items and of the maximum and minimum acceptable failure rates for the exam, and was designed for use in professional assessments with a large number of trainees. See Appendix 2 for more detail.
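A minimal sketch of the Hofstee calculation is given below in Python, using invented exam scores and invented panel judgments. The panel supplies the lowest and highest acceptable pass marks and failure rates; the pass mark is then read off where the observed failure-rate curve crosses the compromise line joining those limits.

import numpy as np

# Invented data: 300 exam scores (percentages), for illustration only.
scores = np.random.default_rng(0).normal(62, 10, 300).clip(0, 100)

k_min, k_max = 50.0, 65.0   # lowest / highest acceptable pass mark (panel judgment)
f_min, f_max = 0.05, 0.30   # lowest / highest acceptable failure rate (panel judgment)

def failure_rate(cut):
    """Observed proportion of trainees scoring below a given cut score."""
    return float(np.mean(scores < cut))

def compromise_line(cut):
    """Straight line from (k_min, f_max) to (k_max, f_min)."""
    return f_max + (f_min - f_max) * (cut - k_min) / (k_max - k_min)

# The Hofstee pass mark is where the observed failure-rate curve meets the line.
cuts = np.linspace(k_min, k_max, 1001)
pass_mark = cuts[int(np.argmin([abs(failure_rate(c) - compromise_line(c)) for c in cuts]))]

print(f"Hofstee pass mark: {pass_mark:.1f}%  (failure rate {failure_rate(pass_mark):.0%})")

If the observed curve never enters the rectangle defined by the panel’s limits, the panel would need to revisit its judgments; this sketch simply picks the closest point.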
Hybrid standard setting in performance assessment
Unlike knowledge assessments, which have been extensively researched to guide their standard setting methods and whose standards can be determined by a defined group of modest size, performance assessments have not been so well informed by a standard setting evidence base. Neither test nor trainee based approaches can readily be applied at the case level to a high fidelity clinical assessment in which standards are implicit in the grading of the individual cases.

A hybrid approach has been proposed (38) and is described below. Assessors are required to identify a passing standard of performance on each individual assessment that they observe, using generic and case specific guidance. Each case would thus be passed or failed. The standard setting issue then relates to how many individual assessments need to be passed (and/or not failed) in order to pass the assessment as a whole. In order to address this, the panel of assessors, as subject experts, would collectively agree the standard for converting the individual assessment grades to an overall pass or fail, and it would do this by agreeing a decision algorithm. The assessors could carefully review the pass/fail algorithm and collectively support it. Methods such as the Delphi technique, which helps avoid decisions being over influenced by powerful or dominant characters, could facilitate this process. The standard may then be verified and refined by applying either performance based approach to examples of individual assessments and/or to the trainee’s overall performance in the full battery of assessments.

Standard setting for skills assessments
Numerous standard setting methods have been proposed for knowledge tests, such as multiple choice examinations (32). However, whilst the underlying principles, such as identification of the borderline or ‘just passing’ trainee, are common, they are not necessarily appropriate for standard setting in performance assessments. Although a range of standard setting methods has been used in skills assessments, they have been shown to yield different (30) and inconsistent (39) results, particularly if different sets of assessors are also used (30). This is the point at which idealism meets reality, and it presents us with a problem.
Performance or clinical skills assessments are playing an increasingly important role in certification and licensing decisions, for example through standardised patient assessments in simulated encounters or OSCEs. In order for the pass/fail decisions to be fair and valid, justifiable standards must be set. However, the methodology is not as well developed for performance standard setting as it is for knowledge based assessments, and the influence that the assessor panel has on the process appears to be a greater factor in this form of assessment. Thus, on the one hand there is a requirement for the standards to be defensible, explicable and stable (28), yet on the other there are reported problems with both the methods and their implementation (30, 39). Therefore, assessments of complex and integrated skills, which need to include assessment of the performance of whole tasks, pose considerable standard setting challenges - in particular ensuring that the standard is stable over time (‘linear test equating’).
In order to set acceptable standards, it is a prerequisite that due care is taken to ensure that the assessments are standardised, the scores are accurate and reliable, and the resulting decisions regarding competence are realistic, fair and defensible. The borderline group and contrasting groups methods are well suited to standard setting in controlled assessment systems testing skills and performance. A further method is reviewed below.
A proposed method for standard setting in skills assessments using a hybrid method
Assessors must observe trainees’ actual performance - whether live, recorded (for example on DVD or video) or in an authentic simulation which includes a suitable breadth of the trainee’s performance - and then make a direct assessment concerning competence based on such observations. It is imperative to concentrate the assessor’s attention on the pass/fail decision and to ensure that they are properly informed as to the definition of ‘just passing’ behaviour. For clinician standard setters this task is intuitively appealing, as it articulates with their clinical experience. However, one potential shortcoming of trainee centred methods, usually attributable to insufficient training of experts as assessors, is the tendency to attribute performance to skills or factors that are not directly targeted by the assessment. The attribution of positive ratings based on irrelevant factors (halo effects - ‘he or she was very kind and polite’) is one such phenomenon. This, and other potential sources of assessor bias, can be addressed by offering adequate training to assessors about making judgments, including the provision of suitable performance descriptors. In addition, it is imperative that the assessor’s task is clear and unambiguous and that any misinterpretation of the task is rectified. Working as a group, assessors can discuss and establish a collective and defensible recommendation for what constitutes a passing standard. The standard may then be modified by the application of additional criteria (see below) or appropriate statistical management. The passing standard for each individual assessment, a clinical case, for example, is set by the assessors as a result of their expertise, training and insights into the performance of ‘just passing’ trainees in real life.
1) Grading system for individual assessment of clinical cases
This section is based on the paper by Wakeford and Patterson (38) mentioned above, to whose work David Sales contributed. For pass/fail licensing decisions, designed to confirm that a doctor is sufficiently safe and proficient to undertake unsupervised independent practice, the trainee’s performance on each of the skills assessments should be assessed in a specified number of domains (such as history taking, examination, communication or practical skills, etc) and each of these graded on a scale - for illustrative example, using four points as follows:
• clear pass;
• bare (marginal) pass;
• bare (marginal) fail;
• clear fail.
There will also be a global, overarching judgment for each individual assessment, which will be the overall grade for that particular assessment. This overall individual assessment grade is not determined by the simple aggregation of the domain grades, as that would imply equal weight for each. Although the individual domain grades may be taken into account, the fact that one domain was weakly represented in that individual assessment will also need to be accounted for. For example, in a resuscitation assessment, communication with the patient would be far less important than ensuring a clear airway.
The overall assessment grade for a particular case or scenario will use the same four grades but, since there is no borderline grade, marginal fails and marginal passes can be seen by the assessor as fails and passes respectively.
The essence of such a grading system is that it is based on expert assessments of what is acceptable behaviour in the overall passing criteria of the assessment. The essential focus of the assessment is upon the trainees’ global performance during a particular overall unit of assessment, and not on their performance on the necessarily artificial constructs of the domains within that individual assessment. An accurate overall assessment grade is the key to producing a credible overall result.

2) Scoring methods
Such an assessment would produce a number of scores based on the four point system for any individual trainee. Converting a number of scores into an overall result is not straightforward. It raises the following issues:
• psychometric - especially of compensation vs combination;
• stakeholder - including what constitutes competence;
• institutional - including financial issues of pass rates;
• trainee - such as ‘case blackballing’.
In view of the complexity of these issues, it is not surprising that there is no easy answer to the problem of converting a series of individual assessment scores into overall assessment results, particularly where there are a number of ‘marginal’ grades involved.

3) Is turning the results of assessments into a series of numbers likely to help?
It is clearly necessary to combine the total number of assessment scores in some way so as to produce an overall pass/fail standard. This process must be fair and must ensure that unacceptable combinations of individual assessment ‘grades’ do not lead to a pass.
This can be achieved in two ways:
• A set of rules can be devised, which might say something like ‘to pass, a trainee must have at least n clear passes and no clear fails’ or ‘allowing compensation between clear passes and clear fails, and between marginal passes and marginal fails, the trainee shall pass if they have a neutral ‘score’ or above’, with possible codicils such as ‘… n clear fails will fail’. This might be termed a categorical approach.
• Alternatively, a scoring system could be devised with different marks being given for each grade (such as 0, 1, 2, 3, or 4, 8, 10, 12) and a pass mark set.
The main difficulty with the former is that it does not produce a ‘score’ that could subsequently be processed statistically. The difficulties with the latter are that, being non-specific, it may well not prevent unacceptable combinations, and there will be argument about the relative scores attached to individual assessments.
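By way of illustration, the Python sketch below contrasts the two approaches for a single trainee’s set of individual assessment grades. The rule thresholds, the marks attached to each grade and the pass mark are invented assumptions for this example, not recommended values; as the text notes, the two approaches can reach different conclusions about the same trainee, which is exactly where the argument arises.

# Invented example: 12 individual assessment grades for one trainee.
grades = ["clear pass"] * 7 + ["marginal pass"] * 3 + ["marginal fail"] * 2

def categorical_decision(grades, min_clear_passes=6, max_clear_fails=0):
    """Categorical approach: explicit rules of combination, no overall score."""
    return (grades.count("clear pass") >= min_clear_passes
            and grades.count("clear fail") <= max_clear_fails)

def numerical_decision(grades, pass_mark=30):
    """Numerical approach: marks per grade are summed and compared with a pass mark."""
    marks = {"clear fail": 0, "marginal fail": 1, "marginal pass": 2, "clear pass": 3}
    return sum(marks[g] for g in grades) >= pass_mark

print("Categorical decision:", "pass" if categorical_decision(grades) else "fail")  # pass
print("Numerical decision:  ", "pass" if numerical_decision(grades) else "fail")    # fail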
In practice, if the passing ‘score’ approaches the maximum score, the numerical approach can possess all the advantages of the categorical approach. In this situation the examiners could consider the scoring approach; otherwise a categorical combination algorithm may be preferable.

Standard setting for workplace based assessment
Assessment in the workplace
Assessment of a trainee’s performance in the workplace is extremely important in ensuring competent practice and good patient care, and in monitoring their progress and attainment. However, it is a relative newcomer to the assessment scene, particularly in the UK, and is probably not particularly well suited to the kind of standard setting methods described above and conventionally applied in well controlled ‘examining’ environments. As mentioned earlier, work is underway to evaluate the contribution that the Holsgrove and Kauser Ali modification to Ebel’s method might make in this area, but at present the most promising approach is probably to use anchored rating scales with performance descriptors. This is particularly appropriate with competency based curricula where intended learning outcomes are described in terms of observable behaviours, both negative and positive.

Anchored rating scales
An anchored rating scale is essentially a Likert-type scale with descriptors at various points - typically at each end and at some point around the middle. For example, in the rating scales used in most of the assessment forms in the Foundation Programme, describing performance for each item at point 1 (the poorest rating), 6 (the highest) and 4 (the standard for completion) might be deemed sufficient. To take an example from the assessment programme in the specialty curriculum of the Royal College of Psychiatrists (http://www.rcpsych.ac.uk/training/workplace-basedassessment/wbadownloads.aspx), the descriptors for performance at ST1 level in the mini-CEX (which RCPsych have modified for psychiatry training and renamed the mini-Assessed Clinical Encounter [mini-ACE]) describe performance in history taking at points 1, 4 and 6 on the rating scale:
1) Very poor, incomplete and inadequate history taking.
4) Structured, methodical, sensitive and allowing the patient to tell their story; no important omissions.
6) Excellent history taking with some aspects demonstrated to a very high level of expertise and no flaws at all.
However, to help assessors to achieve accuracy and consistency, the performance descriptors used also describe the other three points on the scale:
1) Very poor, incomplete and inadequate history taking.
2) Poor history taking, badly structured and missing some important details.
3) Fails to reach the required standard; history taking is probably structured and fairly methodical, but might be incomplete though without major oversights.
4) Structured, methodical, sensitive and allowing the patient to tell their story; no important omissions.
5) A good demonstration of structured, methodical and sensitive history taking, facilitating the patient in telling their story.
6) Excellent history taking with some aspects demonstrated to a very high level of expertise and no flaws at all.
It seems inevitable that standard setting in workplace based assessment will be an area of considerable research and development activity over the next few years.
Decisions about borderline trainees
It is very important that there is a proper policy, agreed in advance, regarding the identification of borderline trainees and, having identified them, what to do about them. The traditional practices will no longer suffice.

Making decisions about borderline trainees
Some common current practices for making decisions about borderline trainees are unacceptable and indefensible. For example, the only groups of borderline trainees usually considered at present are those within, say, a couple of percentage points of the pass mark (the range is typically arbitrary rather than evidence based). Borderline ‘passes’ are usually ignored - they are treated as clear passes even though mathematically there is always a group who happen to fall on the ‘pass’ side of the cutting point who, in terms of confidence intervals, are interchangeable with a similar group on the ‘fail’ side. Decisions about what happens to borderline trainees once they have been correctly identified should rest with individual assessment boards. However, they must be fair, transparent and defensible, and it is essential that borderline trainees on both sides of the pass mark are treated in exactly the same way.
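One evidence based way of defining that group on both sides of the pass mark is to use the standard error of measurement (SEM) of the assessment rather than an arbitrary percentage band. The Python sketch below shows the calculation; the pass mark, score spread and reliability figure are invented for illustration, and the width of the band (here one SEM either side of the cut score) is a policy choice for the assessment board, not a rule.

import math

# Invented figures for an exam scored in percentage points.
pass_mark = 60.0
score_sd = 8.0         # standard deviation of trainee scores (assumed)
reliability = 0.85     # e.g. Cronbach's alpha for this sitting (assumed)

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
sem = score_sd * math.sqrt(1 - reliability)

# Treat every trainee within one SEM of the cut score - on both sides -
# as borderline, to be handled by the same pre-agreed procedure.
lower, upper = pass_mark - sem, pass_mark + sem
print(f"SEM = {sem:.1f} marks; borderline band runs from {lower:.1f} to {upper:.1f}")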
The criteria of fairness, transparency and defensibility would probably exclude one of the most common ways of making decisions about borderline trainees, which is to conduct a viva voce examination. The vast majority of vivas are plagued with problems, such as inconsistency within and between assessors, variability in the material covered (not infrequently asking about things that are not in the curriculum) and, above all, extremely poor reliability. It is clearly inappropriate to use perhaps the least reliable assessment method to make pass/fail decisions that even the most reliable methods have been unable to make.

Summary
This chapter is concerned with issues arising from the requirement for assessments to comply with the PMETB Principles (1) and, in particular, the two questions that must be addressed in meeting Principle 4:
i) What is the measurement error around the agreed level of proficiency?
ii) What steps are taken to account for measurement error, particularly in relation to borderline performance?
The first step, identified in the title itself, is to establish what the level of proficiency actually should be - in other words, how is the standard agreed? This chapter has outlined some of the principles of standard setting and described, in Appendices 1 and 2, some methods for establishing the pass mark for assessments. However, PMETB’s Principles require more than having a standard that has been properly set. Clear and defensible procedures are needed for identifying and making decisions about borderline trainees. This chapter has also described how, even when the pass mark has been correctly set, there will almost certainly be a group of trainees with marks on either side of it who cannot confidently be declared to have either passed or failed. This is because all assessments, like all other measurement systems, inevitably have an element of measurement error. Appendices 1 and 2 describe how this measurement error can be calculated and note that it is often surprisingly large. The appendices also illustrate how measurement error can be reduced by improving the reliability of the assessment. Having calculated the measurement error for their assessment, assessors can identify the borderline trainees; they will also need to have agreed in advance how decisions about those borderline trainees will be made.
By breaking the requirements of Principle 4 down into three elements:
• How is the standard set?
• How is the measurement error calculated?
• How are borderline trainees identified and treated?
PMETB hopes that this chapter will be helpful in assisting those responsible for assessing doctors, both in the examination hall and in the workplace, to ensure that their assessments meet the required standards.

Conclusion
The majority of methods of standard setting have been developed for knowledge (MCQ-type) tests and address the need for setting a passing score within a distribution of marks in which there is no notional pre-existing standard. Much of the currently published evidence relating to standard setting in performance assessments relates to undergraduate medical examinations, which produce checklist scores that may have little other than conceptual relevance to skills tests which assess global performance relevant to professional practice.
Regardless of the method used to set standards in performance assessments, it is imperative that data are collected both to support the assessment system that was used and to establish the credibility of the standard. Generalisability theory (7, 8, 40, 41) can be used to inform standard setting decisions by determining conditions (e.g. number of assessors, number of assessments and types of assessments) that would minimise sources of measurement error and result in a more defensible pass/fail standard. Where performance assessments are used for licensing decisions, the responsible organisation must ensure that passing standards achieve the intended purposes (e.g. public protection) and avoid any serious negative consequences.
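As a sketch of how generalisability theory can inform such decisions, the Python fragment below projects a decision (D) study from variance components that would, in practice, come from a generalisability study of the assessment in question. The variance components here are invented; the point is simply to show how the projected dependability coefficient and standard error change as more assessments are sampled per trainee.

import math

# Invented variance components for a simple design (trainees crossed with assessments).
var_trainee = 0.40       # variance attributable to real differences between trainees
var_error_single = 1.20  # all other variance (cases, assessors, interactions, error) for one assessment

def d_study(n_assessments):
    """Project dependability and the absolute SEM when a trainee's result
    is the mean of n sampled assessments."""
    error = var_error_single / n_assessments
    dependability = var_trainee / (var_trainee + error)
    return dependability, math.sqrt(error)

for n in (1, 4, 6, 8, 10, 12):
    phi, sem = d_study(n)
    print(f"n = {n:2d} assessments: dependability = {phi:.2f}, SEM = {sem:.2f}")

Projections of this kind are what underpin statements about how many assessors or encounters are needed before a pass/fail decision becomes defensible.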
Chapter 3: Meeting the PMETB approved Principles of assessment

Introduction
PMETB’s Principles of assessment (1) are now well established. This chapter provides some of the background source material which will permit interested groups to begin to understand some of the thinking behind the Principles. PMETB is at pains to explain why this amount of work is demanded of already hard pressed groups who are trying, against a background of established practice, to provide robust evidence which supports what they do.
For the purposes of this section, ‘quality management’ (QM) replaces the traditional term ‘quality control’. Given the complexities of training and education as part of the delivery of medical services, our ability to directly control quality is inevitably limited. Quality, and the risks of falling short of the required standards inherent in the Principles, must nevertheless be managed. It is the role of postgraduate deaneries, with training programme directors, to manage quality. It is the role of PMETB to assure that QM takes place at the highest level achievable in the circumstances.

Quality assurance and workplace based assessment
Workplace based assessment should comply with the universal principles of QM and quality assurance that one might expect in any assessment. However, the position of workplace based assessment is somewhat different. This is not to say it should not be quality managed, but because of its nature there are issues concerning its best use. It might be summarised as an assessment methodology that has demonstrably high validity, whilst being more challenging in terms of its reliability. In practice, however, the evidence is that some workplace based assessments have very reasonable reliability. Norcini’s work has demonstrated that the mini-CEX can have acceptable reliability with six to ten separate but similar assessments (based on 95% CIs using generalisability) (42). He emphasises the need to re-examine measurement characteristics in different settings and the need for sampling across assessors and across the clinical problems on which the assessments are based. The use of the 95% CI emphasises the need for more interactions where performance is borderline, in order to establish whether the trainee is performing safely or not. It is very important that the assessors are trained; Holmboe described a training method which was in practice very simple, consisting of less than one day of intensive training (43). A number of groups have demonstrated that multi-source feedback (MSF) from both colleagues and patients can also be defensibly reliable, although larger numbers of patient assessors than colleague assessors are needed (44-49). In the case of borderline trainees, however, it makes sense that more assessments are required to distinguish between trainees who are in fact safe and those where doubts remain. Extensive sampling for borderline trainees may be needed to identify precisely the problems behind their difficulties, so that a plan can be formed to find remedial solutions where possible.
The main value of workplace based assessments is that they provide immediate feedback. The information acquired during a workplace based assessment can also provide evidence of a trainee’s progression and therefore contribute evidence suitable for recording in their learning portfolio. This can then be compared to the agreed outcomes set by trainer and trainee in the educational agreement. It is essential that both trainer and trainee are aware that feedback and assessment of performance contributing to the learning portfolio of evidence are taking place simultaneously during workplace based assessment. Although PMETB acknowledges that this dual role of a workplace based assessment is in some ways not ideal, because it may inhibit the learning opportunity from short loop feedback, it is necessary to be pragmatic, particularly because the number of assessors and the time available for assessment are both limited. Therefore, educational supervisors will sometimes be tutors or mentors and on other occasions will actually be assessors. The agreement reached within the PMETB Assessment Working Group is that this is acceptable, provided it is entirely transparent to trainer and trainee in what capacity they are meeting on a particular occasion.
The learning agreement
Workplace based assessments taken in isolation are of limited value. They should be contextualised to a learning agreement that refers to the written curriculum and sets the agenda for a particular training episode. A series of learning agreements must ensure that the whole of the curriculum is covered by the end of training. There will inevitably be small gaps and some overlaps, but by and large what trainer and trainee are creating is an agreement on a direction of travel which is usefully thought of as an educational trajectory. The aim of the trajectory is to cover the whole of the curriculum to a level of competence defined by a series of outcomes, in this case based on GMP (19). Assessments will be used to provide evidence that the direction and pace of travel are timely, appropriate and valid. They will inform the educational appraisal, which in turn informs the specialty trainee assessment process (STrAP) - referred to throughout this document as the annual assessment - which determines whether or not a trainee is able to proceed to the next stage of their training.

Appraisal
After a suitable period of training, usually every four or six months depending on the structure of training rotations, a formative, low stakes educational appraisal must take place, based on the evidence provided by learning agreements and a large number of workplace based assessments. PMETB recognises there will be different ways of achieving the educational appraisal process that precedes submission of evidence to the annual assessment, which is a high stakes event determining whether progression in training can take place or remediation is required. Normally, assessments will occur annually, but where remedial action is required for a trainee they may occur more frequently. It is important, therefore, to have an opportunity for the trainee and the group of trainers a trainee has worked with during the specified training period to review the evidence during an appraisal which is distinct and unequivocally separate from the annual assessment described below. This has to take place at school or programme level and is best carried out locally, where training during that period has actually taken place. It is an indisputably formative review with the specific objective of ensuring there have been no immediate problems, such as the inability to have sufficient training or assessment opportunities, and of providing timely feedback when difficulties of any kind have arisen. The aim is to resolve problems as soon as they are identified, rather than presenting difficulties which were otherwise remediable to the annual assessment.
The reason for separating appraisal from the annual review is to ensure that the subsequent annual assessment, which has members external to the training process, does not simply consider the future of a trainee on the basis of the raw data of a series of scores. It has been made clear elsewhere in this document that assessment must not simply be a ‘summation of the alphas’ (referring to Cronbach’s alpha). The way in which appraisal is separated from the annual review will vary from programme to programme and from discipline to discipline. For example, in anaesthesia, where there is a large pool of trainers with trainees frequently rotating amongst them, the educational supervisor may not act as assessor for the whole or any of the training period.
In many ways this is ideal, as it keeps the mentoring role and the assessing activity completely separate, to the particular advantage of effective mentorship. In trauma and orthopaedic surgery, on the other hand, trainees spend six months with a trainer and for the vast majority of the time the trainer will also be the workplace based assessor. Provided it is clear when the trainer/assessor is in which role, a practical way forward can be envisaged. Some assessments may be carried out by assessors other than the principal designated trainer during a particular training period, and over the year a trainee will have worked for at least two trainers who will also be acting as assessors.
A structured trainer’s report covering the whole period of work must be made by the designated educational supervisor ahead of the evidence being submitted to the annual assessment. This must contain evidence about the development of the trainee in the round and not just a list of completed assessments; the latter are really presented as supporting evidence. This means that the trainee is aware that the individual workplace based assessments contribute to the whole judgment and that they will not be ‘hung out to dry’ over one less than satisfactory assessment event. PMETB hopes this encourages trainees and trainers to develop an adult-to-adult learning and teaching style, so maximising the learning opportunities inherent in workplace based assessment.
The annual assessment
PMETB expects deans and programme directors to ensure that the annual assessment (currently encompassed in RITA processes (50)) is a consistently well structured and conducted process. The annual review process should have stakeholders from the deaneries, training programmes, external assessors and lay membership. An example of good practice is for a member of the appropriate SAC from outwith the training programme to attend the annual review to give the process externality. Clearly, the postgraduate dean gives educational externality from a particular programme and, internally, the training programme director has an overview of a particular trainee. The exact composition of annual assessment panels is laid out in the shortly to be published Gold Guide, which replaces the current Orange Guide.
The annual assessment is a high stakes event for a trainee, but should contain no surprises if the educational appraisal process has worked effectively. The annual assessment should be a quality assuring exercise, ensuring that the conclusions that the Head of Training and the trainers have reached about a particular trainee during the training interval are reasonable and that the trainee has achieved the standards expected, as described in the structured trainer’s report. The panel would look at the evidence and would either confirm or might differ from the decision reached by the training programme director and their committee about a particular trainee. Certainly, the annual assessment panel would have to assure themselves that the evidence provided was appropriate. This will be very high stakes because the exercise might result in a trainee being removed from training, being asked to repeat training or being given focused training. For the most part this will be a paper or virtual exercise, although the Gold Guide suggests the option of reviewing borderline trainees should always be applied.
The second part of the annual review will be a facilitatory event where the training programme director, in conjunction with members of his or her training committee and the trainee, would agree the content of the next period of training, based on an overview of the trainee’s educational trajectory and with the intention of completing more parts of the written curriculum. This must be based on a face-to-face discussion with at least one designated trainer or mentor.
The role of PMETB in this process is to be assured that the QM mechanisms and the decision making are appropriate. It is important that there is a demonstrably transparent and fair process for trainees, and one that assures the general public that those treating them are fit for purpose.

Additional information
The annual assessment for trainees needs to take into account issues around health and probity not dealt with elsewhere. In doing this, the review in effect provides all the required evidence for NHS appraisal processes. MSF may provide information about health and probity.

Quality assuring summative exams
Colleges set exit examinations as a quality managing and assuring process, which triangulates with evidence from a learning portfolio that includes workplace assessments and accumulated experience such as may be found in log books.
Inevitably, the artificially created environment of a summative college exam has less intrinsic validity. The corollary is that there is considerably more control over reliability. This means that correlating exam results with workplace based assessment is one indicator that the process is working well (an example of a process described as triangulation). Note, however, that exams and workplace based assessments are not directly comparable assessments. It would therefore be wrong to assume that one is the control or check on the other; they provide complementary information that is invaluable as part of a triangulation exercise.
The exam must be as reliable as possible. What many formal exams are best at doing is testing knowledge and its application. Psychometric analysis demonstrates clearly that MCQs of various types do this most reliably. With structuring, clinicals and orals can contribute to formal examinations. Assessment experts such as Geoff Norman are not intrinsically against orals and clinicals, but simply point out that they require considerable work to make them robust and need prolonged examination time to be sure they are reliable. This is one reason why there is a trend to move to much longer assessments of knowledge, application of knowledge and clinical decision making.
[Figure: Appraisal, Assessment, Annual review, Evidence]
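As a minimal sketch of the kind of triangulation check described above, the Python fragment below correlates college exam marks with mean workplace based assessment ratings across a cohort. All of the paired values are invented; in practice a weak or negative association would prompt scrutiny of both components rather than any automatic conclusion, since, as noted, the two provide complementary rather than interchangeable information.

from statistics import correlation  # available from Python 3.10

# Invented paired data: each position is one trainee's college exam mark (%)
# and their mean workplace based assessment rating on a 1-6 scale.
exam_marks = [72, 55, 63, 81, 49, 68, 74, 58, 66, 77]
wpba_means = [4.8, 3.9, 4.2, 5.3, 3.5, 4.6, 4.9, 4.0, 4.1, 5.0]

# A broadly positive association is one (weak) indicator that the two parts of
# the assessment system are telling a consistent story about the same trainees.
r = correlation(exam_marks, wpba_means)
print(f"Exam vs workplace based assessment: Pearson r = {r:.2f}")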
In order to quality assure an assessment system based on formal examinations, it is necessary to address the following:
• Purpose of exam - should be explicit to examiners and trainees, and available in comprehensible form to the general public.
• Content of exam - should match the agreed syllabus and look to test at the level of ‘competent’, but might also encourage excellence.
• Selection of assessment instruments used in the exam - should meet the utility criteria laid out in Chapter 1.
• Questions, answers and marking scheme - it should be clear whether the marking is normative or criteria based. The standard should also be related to methodology and purpose. Most professional exams are criteria based.
• Standard setting procedures - should be selected and applied as explained above in Chapter 2.
• Examination materials - need to be of proper quality and available to all examiners. Materials and props should be approved by examination boards, and the introduction of new materials by individual examiners should be put through the same standard setting processes as any other exam material or question. For example, unshared computer images of which an individual examiner may be fond should only be permitted if the image meets criteria of standard, quality and viability agreed by all assessors.
• Running the exam - should be reasonable and of equal quality for all trainees so that they can perform to the best of their ability. Where clinical environments are used, the standards must not be compromised by everyday service activities going on around the exam. It is better to have reserved or purpose designed facilities and not to impose an exam, for example, on a busy ward or clinic where unexpected events or schedules such as mealtimes or visiting impinge on the exam environment.
• Conduct of assessors - should be scrutinised regularly in terms of behaviour and performance. An example of good practice is to appoint examiner/exam assessors. These individuals should be experienced examiners who are appointed in open competition and with the approval of their peers. Their role would include:
• making multiple visits to inspect the venue or observe trainee assessments;
• sitting unobtrusively as observers and not interfering;
• evaluating assessors against transparent criteria.
Useful feedback comments might concern:
• quotes and examples of positive and negative behaviour;
• interpersonal skills of assessors with trainees;
• level and appropriateness of questions;
• assessment technique.
They would be expected to prepare a report on the whole process and on individual assessors, which can be fed back to conveners and other assessors to assure consistency of performance. Assessors of exams must, of course, be trained to be fit for purpose.
The examining body should also undertake a review of written policies, available to trainees, assessors and as far as possible the general public, which determine:
• marking and analysis of results;
• provision of feedback;
• selection of examiners and officials;
• test development.
It is also necessary to have policies on the following:
• examination security;
• data protection;
• documents, computers, firewalls and buildings;
• checking and distribution of results;
• plagiarism and cheating;
• malpractice by college or trainees;
• mobile phones and electronic devices;
• examiner training.
Summary
PMETB recognises there is no perfect solution. Assessing professionals in the exacting working environment of healthcare means making some compromises, which PMETB accepts make some educators uncomfortable. PMETB assessment principles are predicated on the overriding value that, provided assessment instruments are valid and reliable, they need to assess people holistically and not just represent them as a set of results based on a battery of assessments. The intellectual thrust of the Assessment Working Group in PMETB is to respect the role of peer assessment from genuine experts. Provided these experts acknowledge they also need to learn how to be expert assessors and trainers as well as expert clinicians, there is a genuine way forward.
Chapter 4: Selection, training and evaluation of assessors

Introduction
The role of assessor is an important and responsible one, for which individuals should be properly selected, trained and evaluated. Very importantly, individuals undertaking assessment should recognise that they are professionally accountable for the decisions they make. All assessments, including work based assessments, must be taken seriously and their importance for the trainee, and in terms of patient safety, fully acknowledged. Submission of assessment judgments which are not actually based on direct observation of, or discussion with, the trainee by the assessor (e.g. handing the form to the trainee to fill in themselves, or filling in a form on the basis of ‘I know you are OK’) is a probity issue with respect to GMP. Honest and reliable assessment is also essential in enabling assessors to fulfil their responsibilities in relation to GMP. Given that assessors are often trainers in the same environment, it is important that it is made clear to trainees when they are acting as an assessor rather than a trainer.

Selection of assessors
Selection of assessors should be undertaken against a transparent set of criteria in the public domain and therefore available to both assessors and trainees. Particularly in relation to work based assessment, this may include guidance on assessor characteristics such as grade or occupational group. Criteria for selection of assessors may include:
• commitment to the assessment process they are participating in;
• willingness to undergo training;
• willingness to have their performance as an assessor evaluated and to respond to feedback from this;
• being up-to-date, both in their field and in relation to assessment processes;
• being non-discriminatory and able to provide evidence of diversity training;
• understanding of assessment principles;
• willingness and ability to contribute to standard setting processes;
• willingness and ability to deliver feedback effectively;
• willingness and ability to undertake assessments in a consistent manner.

Assessor training
There is evidence that assessor training enhances assessor performance in all types of assessment (43, 51). All assessment systems should include a programme of training for assessors. It is recognised that for some types of assessment - in particular, large scale work based assessment - delivery of face-to-face training for all assessors is likely to take some time, but as a minimum, written guidance and an explicit plan to deliver any necessary additional training should be provided. All assessor training should be seen as a natural part of Continuing Professional Development (CPD) and based on evidence (52). Evaluation of assessor training should be integral to the training programme, and where concerns or gaps are identified in relation to training, these should be responded to and the training modified if necessary. Cascade models, in which centralised training is provided and then cascaded out at a more local level, are attractive and cost efficient, but ensuring standardisation of training is more difficult in this context. Provision of written and visual training materials, and observation of local training, will help achieve as much consistency as possible.
Assessor training should include:
• an overview of the assessment system and specifics in relation to the particular area that is the focus of the training;
• clarification of their responsibilities in relation to assessment, both specifically and more generally in terms of professional accountability;
• principles of assessment, particularly with reference to the assessment process they are participating in; e.g. assessors participating in a standard setting group will need training specifically in standard setting methodologies, and assessors for work based assessment will need to understand the principles behind work based assessment;
• diversity training to ensure that judgments are non-discriminatory (or a requirement for this to have been undertaken in another context);
• where assessors have a role which requires them to give face-to-face feedback to trainees, emphasis on the importance of the quality of this feedback and provision for training in feedback skills.
Ongoing training for assessors should be provided to ensure that they remain up-to-date, and CPD approval should be sought for this.
Feedback for assessors
Evaluation of assessor performance and provision of feedback for assessors should be planned within the development of the assessment system. This should include feedback both on their own performance as an assessor and feedback on the QM of the assessment process they are involved in. Feedback for assessors should include formal recognition of their contribution (i.e. a ‘thank you’). Assessors are largely unpaid and give their time in the context of many other conflicting pressures. Planning for evaluation of the assessors should include mechanisms for dealing with assessors about whom concerns are raised. In the first instance, this would usually involve the offer of additional training targeted at addressing the area(s) of concern.
