Guidance

Appendix B: validity framework

Published 3 October 2024

Summary

This validity framework is an appendix to the national curriculum test handbook and provides validity evidence gathered throughout every stage of the development of the national curriculum assessments (NCAs). It has been produced to help those with an interest in assessment to understand the validity argument that supports the NCAs. The tests specifically detailed in this validity framework are the:

  • optional key stage 1 (KS1) tests in mathematics, English reading and English grammar, punctuation and spelling
  • statutory KS1 phonics screening check[footnote 1]
  • statutory key stage 2 (KS2) tests in mathematics, English reading and English grammar, punctuation and spelling

Who is this publication for?

This publication is for test developers and others with an interest in assessment.

Claim 1: Tests are representative of the subject and national curriculum

1.1 The assessable areas of the curriculum are clearly defined as content domains

The following list explains how the content domains were developed to ensure they are clearly defined:

  • The Standards and Testing Agency (STA) developed the content domains for the national curriculum assessments (NCAs) based on the 2014 national curriculum.
  • The content domains are defined in the national curriculum assessments test frameworks and the assessment framework for the development of the year 1 phonics screening check.
  • The content domains set out the elements of the programme of study that are assessed in the tests.
  • STA’s expert test development researchers (TDRs) developed the content domains in consultation with the Department for Education (DfE) curriculum division. STA also appointed independent curriculum advisors to support the development of the NCAs.
  • STA asked a panel of education specialists to review a draft of the content domains before they were finalised. The range of stakeholders who were involved in producing the content domains gives assurance that they are appropriate.
  • STA published the draft frameworks in March 2014 and the final version in June 2015. The phonics assessment framework was published in February 2012. No concerns have been raised with STA about the content domains.

1.2 There are areas that cannot be assessed in paper and pencil tests and are better suited to different forms of assessment

Not all areas of the 2014 national curriculum are assessable in paper-based tests. Therefore, there are aspects that are not assessed in the NCAs but are left to teacher assessment. These areas were discussed in the same way, and with the same stakeholders, as the assessable content domains. No concerns were raised with STA regarding the non-assessable content of the tests. The non-assessable areas of the mathematics and English grammar, punctuation and spelling tests are listed in the national curriculum assessments test frameworks.

1.3 The rating scales within the cognitive domains provide an accurate reflection of the intended scope of teaching and learning outlined within the national curriculum

The following list explains how the cognitive domains were developed to ensure they are an accurate reflection of the intended scope of teaching and learning outlined within the 2014 national curriculum:

  • The cognitive domains are defined in the national curriculum assessments test frameworks and the assessment framework for the development of the year 1 phonics screening check.
  • Before developing the cognitive domains, STA reviewed the domains for similar sorts of tests. The cognitive domains were based on the research by Hughes et al (1998)[footnote 2], Webb (1997)[footnote 3] and Smith and Stein (1998)[footnote 4], Bloom (1956)[footnote 5], and ACER (2012)[footnote 6].
  • STA synthesised and amended these existing models to take account of the specific demands of the subjects and the cognitive skills of primary-aged children. The models that resulted allow TDRs to rate items across different areas of cognitive demand.
  • Panels of teachers reviewed the test frameworks to validate the cognitive domains. STA asked the teachers to comment on the extent to which the cognitive domains set out the appropriate thinking skills for the subject and age group. In addition, pairs of TDRs independently classified items against the cognitive domains and compared their classifications.
  • TDRs made refinements to the cognitive domains based on both the consistency between TDR judgements and the comments gathered from the teacher panels. This ensured the cognitive domains published in the test frameworks were valid and usable.

1.6 Test items are rigorously reviewed and validated by a range of appropriate stakeholders

All items in the NCAs and phonics screening check are developed according to STA’s rigorous test development process, which was designed to ensure a range of stakeholders review and validate items throughout development. These stages are:

  • Item writing: STA item writers, TDRs and external curriculum advisors review items. The reviewers suggest improvements to items and STA makes the improvements before the next stage.
  • Expert review 1 and 2: A wide range of stakeholders review the items to confirm they are appropriate. This stakeholder group includes teachers, subject experts, special educational needs and disability (SEND) experts, inclusion experts, equity experts and local authority staff. TDRs collate the feedback and decide on the amendments to the items in a resolution meeting with STA staff and curriculum advisors.
  • Item finalisation after trialling: STA test development researchers and psychometricians review items after each trial using the evidence of how the item performed. TDRs can recommend changes to items based on this evidence. Items that are changed may be considered ready for inclusion in a technical pre-test (TPT); however, no changes to items are allowed following that TPT. If the change is more significant, TDRs may decide that the item needs further review.
  • Expert review 3: STA holds a final expert review after constructing the live test. At this meeting, STA asks stakeholders to review the completed test. If the panel identifies a problem with any items, STA may replace these items.

Not every item requires expert review 1 or item validation trial (IVT), but all items are taken through expert review 2 and TPT.

Appendix A (the technical appendix of the test handbook) contains information about the item-writing agencies and expert review panels.

STA keeps the evidence relating to the review and validation of individual items in its item bank.

1.7 Test items and item responses from trialling are suitably interrogated to ensure only the desired construct is being assessed (and that construct-irrelevant variance is minimised) - a range of questions are included that are appropriate to the curriculum and classroom practice

Following each trial, an item finalisation meeting takes place involving TDRs and psychometricians. The purpose of the meeting is to review all available evidence and make decisions on the most appropriate next stage for each item. Examples of the evidence that is reviewed include:

  • classical analysis and item response theory (IRT) analysis of the performance of items, including difficulty and discrimination (an illustrative sketch of these classical statistics is given after this list)
  • differential item functioning (DIF) analysis, by gender for the IVT and by gender and English as an additional language (EAL) for the TPT
  • analysis of coding outcomes and coder feedback
  • reviews of children’s scripts to evaluate responses to items and understand how children are interacting with questions
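
For illustration only, the classical statistics referred to in the first bullet typically include an item’s facility (the mean mark gained as a proportion of the maximum available) and its discrimination (for example, the correlation between the item score and the total score on the remaining items). The sketch below shows one minimal way such statistics might be calculated; the function and variable names are illustrative and it does not reproduce STA’s own analysis code.

```python
import numpy as np

def classical_item_stats(scores: np.ndarray, max_marks: np.ndarray):
    """Illustrative classical item analysis for a trial dataset.

    scores    : pupils x items array of item marks
    max_marks : maximum available mark for each item
    Returns the facility and a corrected item-total discrimination per item.
    """
    facility = scores.mean(axis=0) / max_marks            # proportion of available marks gained
    total = scores.sum(axis=1)
    discrimination = np.empty(scores.shape[1])
    for i in range(scores.shape[1]):
        rest = total - scores[:, i]                       # total score excluding the item itself
        discrimination[i] = np.corrcoef(scores[:, i], rest)[0, 1]
    return facility, discrimination
```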

Phonics screening check responses are given orally and are therefore not coded or analysed in this way. References to coding and the script archive in this validity framework are consequently not applicable to the phonics screening check.

After the IVT (where applicable), the following outcomes are available for each item:

  • Proceed to expert review 2 stage unamended, since there is sufficient evidence that the question is performing as intended.
  • Proceed to expert review 2 stage with amendments since, although there is some evidence that the item is not performing as intended, the issue has been identified and corrected.
  • Revert to expert review 1 stage with amendments, since the issues identified are considered major and the item will need to be included in an additional IVT.
  • Archive the item, as major issues have been identified that cannot be corrected.

After the TPT, the following outcomes are available for each item:

  • Item is available for inclusion in a live test, since the evidence shows it is performing as intended.
  • Item requires minor amendments and will need to be re-trialled before inclusion in a live test.
  • Item is archived, as a major issue has been identified that cannot be corrected.

Any item that is determined to be available for inclusion in a live test has therefore demonstrated that it assesses the appropriate construct. Evidence related to individual items is stored within the item bank and is not repeated here, although it is available should specific issues be identified.

Items selected for the tests may be drawn from TPTs conducted up to 3 years earlier.

1.8 The final tests adequately sample the content of the assessable national curriculum (whilst meeting the requirements within the test frameworks) - a range of questions are included that are appropriate to the curriculum and classroom practice

At test construction, the constraints and specifications listed in the national curriculum assessments test frameworks and the year 1 phonics screening check assessment framework for each test are taken into account, to ensure coverage across the curriculum and in terms of item type, marks awarded and cognitive domains.

Teachers, subject experts, markers, inclusion experts and independent curriculum advisors review the final tests at expert review 3.

The TDR presents the tests, along with the comments provided by experts and panels, at STA’s project board 3, and the deputy director for assessment development signs off the test.

Claim 2: Test results provide a fair and accurate measure of pupil performance

2.1 Item-level data is used when the tests are constructed to ensure only items that are functioning appropriately in terms of psychometric properties and qualitative information are included in the tests

The following list indicates how STA collects and uses item-level data:

  • STA trials all test materials in a TPT in which approximately 1,000 pupils from a stratified sample of schools see each item. This trial provides STA with enough item-level data to be confident it knows how an item will perform in a live test.
  • STA reviews qualitative and quantitative data from the TPT and reports on each item’s reliability and validity as an appropriate assessment for its attributed programme of study.
  • TDRs remove from the pool of available items any items that do not function well or that had poor feedback from teachers or pupils. These items may be amended and re-trialled in a future trial.
  • STA holds a test construction meeting to select the items for the live test booklets. The meeting’s participants consider:
    • the item’s facility (its level of difficulty)
    • the ability of the item to differentiate between differing ability groups
    • the accessibility of the item
    • the item type
    • presentational aspects
    • question contexts
    • coverage in terms of assessing the content and cognitive domains – for each year and over time
    • conflicts between what is assessed within test or check booklets, and across the test as a whole
  • At this stage, TDRs and psychometricians may swap items in or out of the test to improve its overall quality and suitability.
  • TDRs and psychometricians use a computer algorithm and item-level data to construct a test that maximises information around the expected standard, as well as across the ability range, while minimising the standard error of measurement (SEM) across the ability range. The TDRs and psychometricians consider the construction information alongside the test specification constraints and their own expertise to make a final decision on test construction.
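
As an illustration of the construction criterion in the final bullet: under a two-parameter IRT model for dichotomous items, the test information function is the sum of the item information functions, and the standard error of measurement at a given ability is the reciprocal of the square root of that information. The sketch below is a minimal illustration under those assumptions; it is not STA’s construction algorithm and the names are hypothetical.

```python
import numpy as np

def test_information(theta, a, b):
    """Test information for dichotomous 2PL items at each ability value in theta.

    theta : array of ability points
    a, b  : arrays of item discrimination and difficulty parameters
    """
    theta = np.asarray(theta, dtype=float)[:, None]
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # 2PL probability of success
    return (a ** 2 * p * (1.0 - p)).sum(axis=1)  # sum of item information

def sem(theta, a, b):
    """Standard error of measurement: the reciprocal of the square root of test information."""
    return 1.0 / np.sqrt(test_information(theta, a, b))
```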

2.2 Qualitative data is used in test construction to ensure only items that are effectively measuring the desired construct are included in the tests

STA collects qualitative data from a range of stakeholders throughout the test development cycle and uses the data to develop items that are fit for purpose. STA consults stakeholders through:

  • 3 independent expert review panels:
    • teacher panel (at expert reviews 1*, 2 and 3)
    • inclusion panel (at expert review 1*)
    • test review group panel (at expert reviews 1*, 2 and 3)
  • teacher and administrator questionnaires
  • responses captured by codes at trialling
  • reviews of pupil responses
  • observations of trialling
  • pupil focus groups during trial administrations at item-writing stage conducted by the item-writing agency and at IVT and TPT conducted by administrators or teachers
  • coding and marker meetings including their reports
  • curriculum expert reports

*Tests (including the phonics screening check) that do not have an IVT will not have an expert review 1.

TDRs and psychometricians analyse qualitative data, alongside the quantitative data gathered, at each stage of the process in preparation for trials and live tests. TDRs revisit quantitative and qualitative data throughout the development process to ensure they are making reliable judgements about the item and the construct it is measuring. STA considers the results of the analysis at key governance meetings:

  • item finalisation
  • resolution
  • project board

By the time the TPT is complete, a range of qualitative data has been collected and analysed, including:

  • pre-trial qualitative data from previous expert reviews and trials
  • coded item responses from trialling
  • script archive trawl based on codes captured at trialling
  • teacher and administrator questionnaires, which include evidence given by focus groups of pupils
  • coders’ reports from trialling
  • curriculum advisor report from resolution
  • modified agency report comments

TDRs and psychometricians analyse this data alongside quantitative data before item finalisation. The TDR summarises the information and presents it at an item finalisation meeting.

The lead test development researcher (LTDR), the TDR, the senior psychometrician and either the head of psychometrics, head of assessment development or the deputy director for assessment development attend item finalisation meetings. The attendees consider the information the TDR presents and decide whether items are suitable for live test construction.

The TDRs and psychometricians select items for live test construction based on the outcomes of item finalisation. They use qualitative data to confirm that the items selected are suitable. The TDRs and psychometricians consider:

  • each item’s suitability in meeting the curriculum reference it is intended to assess
  • stakeholders’ views on the demand and relevance of the item
  • any perceived construct-irrelevant variance
  • curriculum suitability
  • enemy checks – items that cannot appear in the test together
  • context
  • positioning and ordering of items
  • unintentional sources of easiness or difficulty

A combination of stakeholders review the proposed live tests at expert review 3. This group includes teachers, inclusion, curriculum, assessment and subject experts. At this meeting, panellists can challenge items, and the TDRs may use the item data either to counter the challenge or to support it. If the panel deems an item unacceptable, the TDRs may swap it with a suitable item from the TPT.

The TDR collates the data from expert review 3 and presents it alongside the quantitative data for the live test at project board 3. The purpose of this meeting is to scrutinise and critically challenge the data to ensure the test meets the expectations published in the national curriculum assessments test frameworks and year 1 phonics screening check assessment framework.

For optional KS1 tests, STA holds a mark scheme finalisation meeting. At this meeting, curriculum advisors and TDRs go through the mark scheme thoroughly to ensure all exemplars are appropriate. Minor tweaks to the wording of the mark schemes are allowed, but there must be no changes to the way the mark schemes are applied.

In addition, STA holds a one-day mark scheme user acceptance testing (UAT) meeting, at which approximately 10 panellists, who are current KS1 teachers, trial the proposed mark scheme on pupil responses to ensure the mark scheme is fit for purpose.

For KS2 tests, STA holds a one- to two-day mark scheme finalisation meeting. At this meeting, an expert group of senior markers review the live test and responses from trialling and suggest improvements to the mark scheme to ensure that markers can apply it accurately and reliably. These amendments do not affect the marks awarded for each question.

After this meeting, STA and the external marking agency use the amended mark scheme and the trialling responses to develop marker training materials. The purpose of these materials is to ensure that markers can consistently and reliably apply the mark scheme.

The data collected from expert review 3 is then presented alongside the quantitative data for the live test at project board 3. At this board meeting, the data is scrutinised and critically challenged to ensure the test meets the expectations as stated in the test frameworks.

2.3 A range of items that are age appropriate and cover the full ability range are included in the final tests

The following list outlines examples of the evidence sources used to ensure that an appropriate range of items are included in the final test:

  • External item-writing agencies and expert STA test development researchers write the items that make up the tests.
  • STA gives item writers a clear brief and item writing guidance to support the creation of items and selection of appropriate texts.
  • During the item-writing stage, for items that have more variation in wording or that interact with a text, agencies conduct very small-scale trials with approximately 30 pupils. This helps to gauge whether children can interpret texts and items correctly. It also provides the item-writing agency with insights into the most age-appropriate language to use in the items.
  • The TDR reviews the texts and items after the small-scale trials have been completed to ensure that they meet the requirements of the national curriculum. A range of experts, including independent curriculum advisors, review the materials at this stage as part of expert review 1. STA gives the panel members a terms of reference document that asks them to consider whether the materials are appropriate for children at the end of KS1 and KS2.
  • STA also invites test administrators and teachers to give feedback on the test materials in a questionnaire. The questionnaire has a specific area for feedback on whether the items are appropriate for the intended age group.
  • The tests are made up of a range of different cognitive domains, as specified in the test and assessment frameworks.
  • TDRs and psychometricians place items in the test booklet in order of difficulty as part of the test construction process. Where possible, the easiest items are placed at the beginning of the test and the most difficult ones are at the end. The TDRs and psychometricians make decisions on the difficulty of each item using information from both classical analysis and IRT (and, for reading tests, where the responses can be found in the texts). The data on individual items helps to make up a picture of the overall test characteristics.
  • Most of the test information on ability is focused around the expected standard, although items are selected to ensure there is information at both the lower end and at the higher end of the ability range.

2.4 Qualitative and quantitative information is used to ensure the tests do not disproportionately advantage or disadvantage any subgroups

The following list demonstrates the ways STA ensures tests do not disproportionately advantage or disadvantage any subgroups:

  • TDRs interpret a wide range of evidence to ensure the tests do not disproportionately advantage or disadvantage the following subgroups: non-EAL and EAL, girls and boys, non-SEN and SEN, pupils with visual impairments (modified paper), and braillists (modified paper).
  • Expert panels of teachers, educational experts and inclusion specialists review the items and consider whether they are suitable for inclusion in a trial. The inclusion panels consist of representation from hearing and visual impairment experts, a SEND representative, an EAL representative, dyslexia and dyscalculia representatives, equity advisors and an educational psychologist. Within this review process, panellists highlight any potential bias and suggest ways to remove it. The TDRs consider all the available evidence and present it in a resolution meeting to decide which recommendations to implement.
  • Data relating to the performance of EAL and non-EAL pupils, and of girls and boys, is identified in classical analysis after the TPT. The TDRs use this quantitative information (facility and percentage omitted), along with the qualitative evidence from the teacher questionnaires and administrator reports, to flag any items that appear to be disproportionately advantaging or disadvantaging a group. STA acknowledges that pupils in these groups have a wide range of ability, so treats this information with some caution during the decision-making process for each item.
  • STA also conducts a statistical analysis – differential item functioning (DIF) – after the trial. The purpose of this is to identify differences in item performance between groups (EAL and non-EAL pupils, and girls and boys). Moderate and large levels of DIF are flagged (an illustrative sketch of one common DIF procedure is given after this list). As DIF only indicates differential item performance between groups that have the same overall performance, the test development team considers qualitative evidence from the teacher questionnaires and previous expert review panels to help determine whether the item is biased or unfair.
  • The TDRs and psychometricians consider the balance of items with negligible DIF at test construction alongside all other test constraints.
  • Alongside the development of the standard test, STA works closely with a modified test agency to produce papers that are suitable for pupils who require a modified paper. TDRs and modifiers carefully consider any modification to minimise the possibility of disadvantaging or advantaging certain groups of pupils who use modified papers. STA and the modifier make these modifications and ensure minimal change in the item’s difficulty.
  • If an item cannot be modified in a way that maintains the construct of the original question, it is replaced with an item from the same content domain with similar characteristics.
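
This validity framework does not specify which DIF statistic STA uses, so the sketch below illustrates one widely used approach, the Mantel-Haenszel procedure for a dichotomous item, with pupils matched on total score. The function names, and the simplified flagging thresholds based on the ETS delta scale, are assumptions made for illustration rather than a description of STA’s published method.

```python
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """Illustrative Mantel-Haenszel DIF statistic for one dichotomous item.

    item  : array of 0/1 scores on the studied item
    total : total test score, used as the matching variable
    group : 0 for the reference group, 1 for the focal group
    Returns the common odds ratio and the delta-MH statistic.
    """
    num, den = 0.0, 0.0
    for t in np.unique(total):                        # one stratum per total score
        m = total == t
        a = np.sum((group[m] == 0) & (item[m] == 1))  # reference group, correct
        b = np.sum((group[m] == 0) & (item[m] == 0))  # reference group, incorrect
        c = np.sum((group[m] == 1) & (item[m] == 1))  # focal group, correct
        d = np.sum((group[m] == 1) & (item[m] == 0))  # focal group, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    odds_ratio = num / den
    delta_mh = -2.35 * np.log(odds_ratio)             # ETS delta scale
    return odds_ratio, delta_mh

# Simplified flags (assumption): |delta| < 1 negligible, 1 to 1.5 moderate, above 1.5 large.
# In practice a significance test is also applied before an item is flagged.
```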

2.5 Pupil responses are interrogated to ensure pupils are engaging with the questions as intended

The following list demonstrates how STA interrogates pupil responses for the end of key stage tests:

  • STA collects pupil responses for the tests in the trials.
  • STA codes responses for each item to collect information on the range of creditworthy and non-creditworthy responses pupils might give. TDRs develop coding frames. Independent curriculum advisors and senior coders review the coding frames. TDRs refine the coding frames both before and during trialling based on this feedback.
  • When coding is complete, the trialling agency provides STA with a PDF script archive of the scanned pupil scripts and a report from the lead coders.
  • STA psychometricians provide classical and distractor analysis to TDRs at IVT and TPT, plus IRT analysis at TPT.
  • TDRs analyse the data, review the report and scrutinise pupil scripts. TDRs may target specific items that are behaving unexpectedly and use the pupil scripts to provide insight into whether pupils are engaging with the questions as intended. TDRs can request script IDs to help them target specific responses from children based on the codes awarded.
  • At TPT, TDRs also randomly select scripts across the ability range and aim to look through the majority of the 1,000 responses – particularly for the extended response items. TDRs present the information they have collected from script reviews with other evidence at the item finalisation meeting. TDRs use this evidence to make recommendations for each item.

2.6 The rationale for what is creditworthy is robust and valid and can be applied unambiguously

The following list demonstrates how STA determines what is creditworthy in the end of key stage tests:

  • TDRs include indicative mark allocations in the coding frames they have developed for IVT and TPT. TDRs discuss creditworthy and non-creditworthy responses with stakeholders at the expert review panels. Senior coders review the coding frames during the coding period. If it is necessary, TDRs may add codes or examples to the coding frames to reflect pupil responses.
  • TDRs draft mark schemes for each question after constructing the tests. TDRs use the trialling coding frames to inform the content of the mark schemes and select pupil responses from the trial to use as examples in the mark scheme. These responses are clear examples of each mark point. TDRs may also include responses that are not creditworthy.
  • At KS2, STA holds a mark scheme finalisation meeting, composed of TDRs, psychometricians, independent curriculum advisers and senior trialling coders. The participants review the live test and responses from trialling and suggest improvements to the mark scheme so that markers can apply it reliably and consistently.
  • Optional KS1 tests are marked internally in schools. As part of the expert review 3 meeting, a panel of teachers and subject experts conduct UAT of the mark schemes. TDRs collate pupil scripts for each question from the trialling process and allocate marks according to the proposed mark scheme. The panel members mark the pupil scripts and their marking is compared with that done by TDRs to see whether the mark scheme can be applied consistently and unambiguously.

The phonics screening check is marked by the staff member administering the check within school. As the phonics screening check has such a specific focus in terms of the construct being assessed, there is no need for coding response types at TPT. The scoring guidance for the pseudo-words within the phonics check is therefore refined through qualitative reviews by test developers, which include input from expert review panellists.

2.7 For end of KS1 and KS2 tests, mark schemes are trialled to ensure that all responses showing an appropriate level of understanding are credited and that no responses demonstrating misconceptions or too low a level of understanding are credited

The following list demonstrates how STA trials the mark schemes:

  • STA develops mark schemes alongside their associated items.
  • Item-writing agencies and TDRs draft mark schemes during the initial item-writing stage. TDRs and external curriculum reviewers review these mark schemes.
  • TDRs refine the mark schemes through 2 rounds of large-scale trialling. Approximately 300 pupils see each item in the IVT. TDRs draft coding frames so they can group pupil responses into types rather than marking them correct or incorrect. Coding allows TDRs to understand how pupils are responding to questions and whether their answers are correct or incorrect. TDRs and psychometricians consider the qualitative data gathered from coding along with quantitative data to make recommendations for changes to the mark schemes. This ensures the mark scheme includes an appropriate range of acceptable responses and examples of uncreditworthy responses.
  • The trialling agency provides STA with a digital script archive of all the pupil answer booklets. TDRs are able to review pupil scripts to view example pupil responses. Reviewing the script archive in this way enables TDRs to ensure coding frames reflect pupil responses.
  • A second trial is administered – the TPT – during which approximately 1,000 pupils see each item. TDRs amend coding frames using the information gathered during the IVT. After TPT administration is complete and before coding commences, a group of lead coders reviews a subset of TPT scripts to ensure the coding frames reflect the range of pupil responses. TDRs and lead coders agree amendments to the coding frames before coding begins.
  • When coding is complete, lead coders write a report for STA that contains their reflections on the coding process, highlights any specific coding issues and makes recommendations on whether each item could be included in a live test. This report forms part of the qualitative evidence reviewed by TDRs.
  • After TPT coding is complete, TDRs consider the lead coder reports and other statistical and qualitative information to make recommendations on which items are performing as required. At this stage, TDRs review pupil scripts and consider the data gathered from coding to ensure all responses that demonstrate the required understanding are credited and responses that do not demonstrate the required understanding are not credited.
  • Once TDRs and psychometricians have constructed the live KS1 and KS2 tests, TDRs use the coding information and pupil responses from the TPT to draft mark schemes. The wording of the mark scheme is then finalised. In a small number of cases, STA may need to partially or wholly re-mark a question in the live test to account for changes to the mark scheme after finalisation.

2.8 The mark schemes and scoring guidance provide appropriate detail and information to enable markers to mark reliably

The following list demonstrates how STA ensures the mark schemes and phonics screening check scoring guidance are appropriate:

  • TDRs develop the mark schemes and scoring guidance using coding frames that were used in the trialling process. STA uses coding frames to capture the range of responses that pupils give, both creditworthy and non-creditworthy. This allows TDRs to understand how effective an item is and to identify any issues that could affect the accuracy of marking.
  • TDRs draft initial coding frames, which are refined during expert review and trialling. A range of stakeholders reviews the coding frames before they are used. This group includes the STA curriculum advisors, psychometricians and some senior coders.
  • TDRs may make further amendments to the coding frames during coding to reflect the range of pupil responses seen. They may also include additional codes to capture previously unexpected responses. TDRs may amend the wording of codes to better reflect how pupils are responding or to support coders in coding accurately.
  • Following the IVT, TDRs update coding frames to include exemplar pupil responses and to reflect the qualitative data that the senior coders provide. Their feedback focuses on whether the coding frames proved fit for purpose, identifying any issues coders faced in applying the coding frames and making suggestions for amendments.
  • Following each trial, the trialling agency provides an archive of scanned pupil scripts and psychometricians provide analysis of the scoring of each item. After the IVT, TDRs receive classical and distractor analysis. After the TPT, TDRs receive classical, distractor and IRT analysis. TDRs analyse this data and review pupil responses in the script archive in preparation for an item finalisation meeting, where they make recommendations about each item and comment on the effectiveness of the coding frames.
  • After live test construction, TDRs use the coding information and pupil responses from the TPT to draft mark schemes. To maintain the validity of the data collected from the TPT, STA makes only minor amendments between the TPT coding frame and the live mark scheme. The TDRs may refine the wording of the mark scheme or the order of the marking points for clarity and they may include exemplar pupil responses from the script archive.
  • For KS2, STA holds a mark scheme finalisation meeting, composed of TDRs, psychometricians, independent curriculum advisers and senior coders from the trials. The focus of the meeting is to agree that the mark scheme is a valid measure of the test construct and that markers can apply it consistently and fairly. Optional KS1 tests are marked internally in schools. As part of the expert review 3 meeting, a UAT is conducted on the mark scheme by a panel of current KS1 teachers, who apply the mark scheme to a range of scripts selected from the TPT archive by the TDR. The outcomes of this test may result in further amendments for clarification and the addition of further exemplification to the mark scheme to ensure it is accessible and can be applied consistently in schools.
  • For the phonics screening check, schools are provided with a list of acceptable pronunciations for the pseudo-words drawn from the grapheme-phoneme correspondences listed in the assessment framework. The scoring guidance states that “when decoding a pseudo-word, all plausible alternative and regional pronunciations are acceptable”. The scoring guidance is checked by teachers as part of the expert review 3 panel.

2.9 Markers are applying the mark schemes as intended

The optional KS1 tests are marked internally in schools and the results are not reported, therefore STA does not have evidence that the markers apply the mark schemes as intended. However, STA designed the test development process to result in marking that is as consistent as possible. This is done through the thorough development of mark schemes with expert feedback at various stages, the input of lead coders who provide feedback on the process of using the coding frames, and UAT to provide evidence that KS1 teachers can apply the mark scheme as intended.

The phonics screening check is also marked internally in schools using the scoring guidance provided. The check is administered to children individually, and the answers are marked as correct or incorrect on the answer sheet at the time of administration.

At KS2, to ensure that markers apply the mark scheme as intended, STA follows these processes:

  • the development of mark schemes, as previously outlined
  • the training process, as previously outlined
  • the quality assurance process during marking
  • the quality assurance process following marking

At the training meeting, the markers see examples for each item, with accompanying commentary on what is, and what is not, in scope when awarding credit for responses (numbers vary according to item difficulty), and receive training on how the mark scheme should be applied.

Each marker then completes a number of practice scripts for each item. Supervisory markers provide feedback on their performance to ensure that they understand how to apply the mark scheme.

Before they are allowed to mark, markers must complete a set of 5 standardisation (also known as qualification) scripts for each item. This checks that their marking agrees with the agreed marks for that item. Supervisory markers provide feedback on their performance to address any issues and, if necessary, markers may complete a second set of 5 standardisation (qualification) scripts.

The external marking agency undertakes quality assurance during live marking by inserting seed items in marker allocations. These items have a predetermined code agreed by TDRs, lead coders and deputy coders, and they appear randomly in each batch of marking. The marking agency will suspend a marker from that item if there is a difference between the agreed mark and the mark awarded by the marker. The marker cannot continue marking that item until they have received further training on the mark scheme. If a marker fails for a third time, the external marking agency stops them from marking that item. Check marking also takes place to ensure marking is accurate.

Supervisory markers quality assure the marking and provide feedback to markers if they have made errors in applying the mark scheme. The supervisory marker may review complete batches at any time, if necessary, and may ask markers to review their submitted batches to update marks after receiving additional training. The supervisory marker may follow the agreed procedures for stopping a marker at any point if they have concerns about accuracy.

During live marking, markers may flag any responses that do not fit the mark scheme for guidance from their supervisory marker. A marker may also check their marking with their supervisory marker. If necessary, supervisory markers may escalate queries to lead markers or TDRs to be resolved.

After live marking is complete, the external marking agency provides STA with a report on marking. This report contains some qualitative data on the quality of marking for each item.

If schools dispute the application of the mark scheme, they can send pupil scripts for review by a senior marker.

Claim 3: Pupil performance is comparable within and across schools

3.1 Potential bias against particular subgroups is managed and addressed when constructing tests

The following list demonstrates how STA considers potential bias:

  • In test development, bias is identified as any construct-irrelevant element that results in consistently different scores for specific groups of test takers. The development of the NCAs explicitly takes into account such elements and how they can affect performance across particular subgroups, based on gender, SEND, disability, whether English is spoken as a first or additional language, and socioeconomic status.
  • STA contracts equity advisors who comment on the materials at relevant stages throughout development to ensure no materials are biased against or in favour of particular subgroups. For English reading, the texts are reviewed before being accepted as suitable materials for the tests.
  • Quantitative data is collected for each question to ensure bias is minimised. DIF is calculated for each question to show whether any differential functioning is present for or against pupils of particular genders or who are or are not native English speakers. This statistical measure helps test developers to determine whether bias is present for any items flagged with DIF and guide test construction in order to minimise bias.
  • The fairness, accessibility and bias of each test question are also assessed in expert review sessions. Texts, items, contexts and illustrations are scrutinised in teacher panels, test review groups (TRGs: comprising senior academic and educational experts) and inclusion panels (visual and audio impairment, SEND, EAL, culture and religion, and educational psychology experts). Questions that raise concerns about bias or unfairness are identified and further examined in-house, to either minimise the identified bias or remove the question from the test if no revision is possible.
  • For those pupils who are unable to access the NCAs as they are, alternative test versions are made available - for example, braille versions and large print versions. While it is essential that tests are made available in modified formats, the content of the modified test is kept as close to the original as possible to rule out test-critical changes or any further bias introduced through modification. To ensure this is the case, modification experts are consulted throughout the test development process.

Further information about diversity and inclusion in the NCAs can be found in section 7 of the relevant subject’s test framework.

3.2 Systems are in place to ensure the security of test materials during development, delivery and marking

The following list demonstrates how STA ensures security:

  • All staff within STA who handle test materials have undertaken security of information training and have signed confidentiality agreements.
  • Throughout the test development process, external stakeholders are asked to review test items, predominantly as part of expert reviews. All those involved in expert review panels are required to sign confidentiality forms, and the requirements on them for maintaining security are clearly and repeatedly stated at the start and throughout the meetings. Teacher panels are provided with a pack of items in the meeting to comment on, which are signed back into STA at the end of the day. TRGs review the items in advance of the meeting. Items are sent to TRG members via STA’s approved, secure methods of delivery and they are provided with clear instructions on storing and transporting materials. Materials are collected back in via a sign-in process after the TRG meeting.
  • When items are trialled as part of IVT or TPT, the trialling agency must adhere to the security arrangements within the trialling framework. This includes administrators undertaking training at least every 2 years, with a heavy emphasis on security. Administrators and teachers present during trialling sign confidentiality agreements. Administrators receive the items for trialling visits, via an approved courier service, and take the items to the school. They are responsible for ensuring all materials are collected after the visit before returning them to the trialling agency via the approved courier.
  • All print, collation and distribution services for NCAs are outsourced to commercial suppliers. Strict security requirements are part of the service specifications and contracts. STA assesses the supplier’s compliance with its security requirements by requiring suppliers to complete a Departmental Security Assurance Model (DSAM) assessment, which ensures all aspects of information technology, physical security and data handling are fit for purpose and identifies any residual risk. These arrangements are reviewed during formal STA supplier site visits. All suppliers operate a secure track and trace service for the transfer of proofs and final live materials between suppliers and STA, and the delivery of materials to schools.
  • STA provides schools with guidance about handling NCA test materials securely, administering the tests, using access arrangements appropriately and returning KS2 test scripts for marking. Local authorities have a statutory duty to make monitoring visits to at least 10% of their schools which are participating in the phonics screening check and KS2 tests. These visits are unannounced and may take place before, during or after the test or check periods. The monitoring visits check that schools are storing materials securely, administering tests correctly and packaging and returning materials as required. At the end of administration, headteachers must complete and submit a statutory headteacher’s declaration form (HDF) to STA. The HDF confirms that the tests or checks have been administered according to the published guidance and that any issues have been reported to STA.
  • Each year approximately 4,600 markers are involved in the marking of KS2 tests. The marking agency restricts the number of markers who have access to live materials prior to test week in order to maintain the confidentiality of the test material, while still allowing for an adequate test of the developed marker training materials and also ensuring a high-quality marking service is achieved. Approximately 20 senior supervisory markers have access to live test content from the November before the test in May, in order to develop marker training material. Around 60 supervisory and non-supervisory markers for English reading, or 6 for mathematics and English grammar, take part in UAT of the developed training materials during January and February. The external marking agency gives around 500 supervisory markers access to materials in March and April, before the tests are administered in May. This is to enable supervisory markers to receive training in their use, ahead of their training of marker teams. The remaining markers are trained following the administration of the tests.
  • Markers must sign a contract that stipulates the requirement to maintain the confidentiality of materials before they have sight of any material. Confidentiality of material is emphasised to all markers at the start of every meeting and training session. The external marking agency will not offer a supervisory role to markers if their own child is to sit the KS2 tests, although they may receive a standard marking contract. This ensures that they do not see materials until after the tests have been administered.
  • The external marking agency holds marker training events at venues that meet agreed venue specifications, which ensures that they comply with strict security procedures.
  • For the phonics screening check, STA provides schools with guidance about handling the check materials securely, administering the tests, and using access arrangements appropriately. Local authorities have a statutory responsibility to monitor the administration of the phonics screening check. At least 10% of schools administering the phonics screening check in a local authority must receive a visit.

3.3 Guidance on administration is available, understood and implemented consistently across schools

STA publishes guidance on GOV.UK throughout the test cycle to support schools with test orders, pupil registration, keeping test materials secure, test administration and packing test scripts. This guidance is developed to ensure consistency of administration across schools.

The KS1 tests in mathematics, English reading and English grammar, punctuation and spelling are non-statutory. The KS1 phonics screening check is statutory.

For KS2 tests and the phonics screening check, local authorities make unannounced monitoring visits to a sample of schools administering the tests. The local authority will check whether the school is following the published test administration guidance on:

  • keeping the materials secure
  • administering the tests or check
  • packaging and returning KS2 test scripts

STA will carry out a full investigation if a monitoring visitor reports:

  • administrative irregularities
  • potential maladministration

These investigations are used to make decisions on the accuracy or correctness of pupils’ results.

3.4 The available access arrangements are appropriate

The following list provides details on access arrangements:

  • Access arrangements are adjustments that can be made to support pupils who have issues accessing the test and ensure they are able to demonstrate their attainment. Access arrangements are included to increase access without providing an unfair advantage to the pupil. The support given must not change the test questions and the answers must be the pupil’s own.
  • Access arrangements address accessibility issues rather than specific SEND. They are based primarily on normal classroom practice and the available access arrangements are, in most cases, similar to those for other tests such as GCSEs and A levels.
  • STA publishes guidance on GOV.UK about the range of access arrangements available to enable pupils with specific needs to take part in the KS1 and KS2 tests. Access arrangements can be used to support pupils:
    • who have difficulty reading
    • who have difficulty writing
    • with a hearing impairment
    • with a visual impairment
    • who use sign language
    • who have difficulty concentrating
    • who have processing difficulties
  • The range of access arrangements available includes:
    • early opening to modify test materials - for example, photocopying on to coloured paper
    • additional time
    • transcripts
    • word processors or other technical or electronic aids
    • rest breaks
    • written or oral translations
    • scribes, readers and prompters in mathematics and English grammar, punctuation and spelling tests
    • apparatus in mathematics tests
    • compensatory marks in English grammar, punctuation and spelling tests for pupils with a hearing impairment who are unable to access the spelling test
  • Headteachers and teachers must consider whether any of their pupils will need access arrangements before they administer the tests.
  • Schools can contact the national curriculum assessments helpline for specific advice about how to meet the needs of individual pupils.
  • Ultimately, however, a small number of pupils may not be able to access the tests, despite the provision of additional arrangements.

3.5 The processes and procedures that measure marker reliability, consistency and accuracy are fit for purpose – information is acted on appropriately, effectively and in a timely fashion

3.5.1  Optional end of key stage 1 tests

Optional KS1 assessments are internally marked in schools. Owing to the stage of assessment, the mark schemes are more straightforward, and reliability is easier to achieve than with complex mark schemes. Section 2.8 contains information on how STA seeks to maximise reliability and usability during the development of the mark schemes. Those marking the tests participate in external moderation activities provided by local authorities.

3.5.2  Key stage 2 tests

The external marking agency carries out a range of checks to ensure that only markers who demonstrate acceptable marking accuracy and consistency mark NCAs:

  • Following training, markers must complete a set of 5 standardisation (qualification) scripts for each item before receiving permission to mark that item. This checks that their marking matches the agreed marks for that item. If their Absolute Mark Difference (AMD)[footnote 7] for any one item is outside the agreed level of tolerance, the marker will have failed standardisation (qualification). A supervisory marker will provide feedback and the marker may complete a second set of 5 standardisation (qualification) scripts. This step ensures that markers who cannot demonstrate accurate application of the mark scheme and the marking principles will not take part in the live marking.
  • The external marking agency undertakes quality assurance during live marking through the placement of seed items. These items have a predetermined code agreed by TDRs, lead coders and deputy coders and appear randomly in each batch of marking. The external marking agency will suspend a marker from marking an item if there is a difference between the agreed mark and the mark awarded by the marker. The marker cannot continue marking that item until they have received further training on the mark scheme. If a marker fails for a third time, the external marking agency stops them from marking that item. If the marking agency stops a marker from marking an item, they will redistribute that marker’s items for other markers to re-mark.
  • This process ensures that all markers are applying the mark scheme accurately and consistently throughout the marking window and is the standard approach to ensuring the reliability of marking.
  • The external marking agency and STA’s TDRs and psychometricians set AMD bands and validity thresholds for each item so that markers can be monitored to ensure that marking errors are minimised. The AMD bands and validity thresholds are set by the marking programme leader and agreed with the test development manager and the psychometrician. These are based on the complexity of the items and the number of marks available.
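
Footnote 7 defines AMD in full; as an illustration only, the sketch below assumes that AMD is the average absolute difference between the marks a marker awards and the agreed marks across a set of standardisation scripts for an item, compared against the agreed tolerance for that item. The function name, the assumed definition and the example values are hypothetical.

```python
def amd_check(marker_marks, agreed_marks, tolerance):
    """Illustrative AMD calculation for one item (assumed definition of AMD).

    marker_marks : marks the marker awarded to the standardisation scripts
    agreed_marks : the agreed (definitive) marks for the same scripts
    tolerance    : agreed AMD tolerance for the item
    Returns the AMD and whether the marker is within tolerance.
    """
    diffs = [abs(m - a) for m, a in zip(marker_marks, agreed_marks)]
    amd = sum(diffs) / len(diffs)
    return amd, amd <= tolerance

# Example with 5 standardisation scripts for a 3-mark item (values are illustrative).
amd, within_tolerance = amd_check([2, 3, 1, 2, 3], [2, 3, 2, 2, 3], tolerance=0.4)
```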

3.5.3 Phonics screening check

The phonics screening check is marked internally using the comprehensive scoring guidance provided. Marks are submitted to the headteacher who then submits this information to the local authority.

3.6 The statistical methods used for scaling, equating, aggregating and scoring are appropriate

Methods that are used for scaling and equating NCAs are described in section 13.5 of the test handbook.

These methods have been discussed and agreed at the Test Development Technical Board meeting and judged to be appropriate by the STA Technical Advisory Group, which consists of international experts in the field of test development and psychometrics.

There are no statistical methods used for scoring NCAs. The tests are scored or marked as described in section 12 of the test handbook. The processes for training markers and quality assuring the marking ensure that the mark schemes are applied consistently across pupils and schools.

Claim 4: Differences in test difficulty from year to year are taken account of, allowing for accurate comparison of performance year on year

4.1 STA ensures appropriate difficulty when constructing tests

STA has detailed test specifications that outline the content and cognitive domain coverage of items. Trial and live tests are constructed using this coverage information to produce balanced tests. Live tests, and some of the trial tests, are constructed using a computer algorithm with constraints on specific measurement aspects to provide a starting point for test construction. This is further refined using STA’s subject and psychometric expertise.

TPTs are conducted to establish the psychometric properties of items. STA is able to establish robust difficulty measures for each item (using a two-parameter IRT analysis model) and, consequently, it is possible to predict overall test difficulty. These difficulty measures are anchored back to the 2016 test (not applicable for the phonics screening check), which allows both new and old items to be placed on the same measurement scale and thereby ensures a like-for-like comparison.
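
For a dichotomous item, a two-parameter model of this kind can be written as follows, where θ denotes pupil ability, b_i the item’s difficulty and a_i its discrimination. This is a standard statement of the model, included for illustration (polytomous items use a graded extension of it); the notation is not taken from STA’s specification.

\[ P_i(\theta) = \frac{1}{1 + \exp\left(-a_i(\theta - b_i)\right)} \]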

4.2 TPT data accurately predicts performance on the live test

IRT is a robust model used for predicting performance on the live test. It allows STA to use the item information from a TPT and to estimate item parameters via linked items. Furthermore, D2 analysis[footnote 8] is used to compare item performance across 2 tests, booklets or blocks. This allows STA to look at potential changes in the performance of the items between 2 occurrences.

As long as sufficient linkage is maintained and the model fits the data, based on meeting stringent IRT assumptions, pre-test data can give a reliable prediction of item performance on a live test.

At project board 3, STA sets the threshold of the expected standard for optional KS1 assessments and the phonics screening check and predicts the threshold of the expected standard for KS2.

For KS2, once the test has been administered and the live data is available, this analysis is run again (this time including the live data from the current year and previous years) to obtain the final scaled score outcome, which also helps STA to judge the accuracy of its previous estimation.

4.3 When constructing the test, the likely difficulty is predicted and the previous year’s difficulty is taken into account

The first tests of the new 2014 national curriculum in English reading, English grammar, punctuation and spelling, and mathematics at KS1 and KS2 took place in 2016. The first phonics screening check was administered in 2012. STA aims for all subsequent tests to have a similar level of difficulty.

This is ensured by developing the tests according to a detailed test specification and by trialling items. Based on the TPT data, STA constructs tests that have similar test characteristic curves to the tests of previous years. The expected score is plotted against ability. Differences are examined at key points on the ability axis: near the top, at the expected standard, near the bottom, and at 2 additional mid-points in between. The overall difficulty with respect to these 5 points is monitored during live test construction, with differences from one year to the next minimised as far as possible.
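
As an illustration of this comparison, the test characteristic curve is the expected raw score at each ability value, obtained by summing the expected item scores. The sketch below assumes dichotomous 2PL items and uses hypothetical item parameters and ability points; it is illustrative only and is not STA’s construction tooling.

```python
import numpy as np

def tcc(theta, a, b):
    """Test characteristic curve: expected raw score at each ability value (2PL items)."""
    p = 1.0 / (1.0 + np.exp(-a * (np.asarray(theta, dtype=float)[:, None] - b)))
    return p.sum(axis=1)

# Hypothetical item parameters for the proposed test and the previous year's test.
a_new, b_new = np.array([1.2, 0.8, 1.5]), np.array([-0.5, 0.0, 0.7])
a_prev, b_prev = np.array([1.1, 0.9, 1.4]), np.array([-0.4, 0.1, 0.6])

# 5 illustrative ability points spanning the range (placeholder values, not STA's).
key_points = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Differences close to zero at each point indicate a similar level of difficulty.
difference = tcc(key_points, a_new, b_new) - tcc(key_points, a_prev, b_prev)
```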

As another measure of difficulty comparability, the scaled score range is also estimated and is checked to ensure that it covers the expected and appropriate range compared with previous years. The scaled score range for optional KS1 subjects is 85 to 115. The scaled score range for KS2 subjects is 80 to 120. There are no scaled scores for the phonics screening check. Scaled score representation is monitored year on year and conversion tables are available on GOV.UK.

4.4 When constructing the test, the approach for predicting the likely standard is fit for purpose

Using the IRT data from the TPT, STA is able to estimate the expected score for every item at the expected standard (an ability value obtained from the 2016 standard-setting exercise). This estimation is possible because the IRT item parameter estimates have been obtained using a model that also includes previous years’ TPT and live items, allowing STA to place the parameters on the same scale as the 2016 live test. So, during test construction, the sum of the expected item scores at that specific ability point is an estimate of where, in terms of raw score, the standard (scaled score of 100) will be.
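
Expressed as a formula, using notation introduced here purely for illustration, the predicted raw score at the expected standard is the sum of the model-expected item scores at the ability value θ₁₀₀ obtained from the 2016 standard setting:

\[ \hat{R}(\theta_{100}) = \sum_{i=1}^{n} \mathrm{E}\left[ X_i \mid \theta_{100} \right] \]

where X_i is the score on item i and the expectations come from the calibrated IRT model; for a dichotomous item the expectation is simply the modelled probability of success. This is the test characteristic curve described in section 4.3, evaluated at a single ability point.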

Once a final test is established, additional analysis is conducted to scale the parameters to the 2016 scale in order to produce a scaled score conversion table, which estimates the standard for the test.

The process was approved by the STA Technical Advisory Group in 2017.

Standard setting for the phonics screening check took place in 2011 to set the expected standard, using TPT data ahead of the first phonics screening check in 2012. Previous years’ phonics screening check TPT items are included in an IRT model with the current year’s TPT items to obtain expected scores for test construction. However, scaling of the parameters to the 2011 scale is not conducted as no scaled score conversion table is provided.

4.5 STA maintains the accuracy and stability of equating functions from year to year

The expected standard was set in 2016 using the ‘Bookmark’ method, with panels of teachers, as outlined in section 13 of the test handbook.

The standard set in 2016 has been maintained in subsequent years using IRT methodology, as outlined in section 13.5 of the test handbook. This means the raw score equating to a scaled score of 100 (the expected standard) in each year requires the same level of ability, although the raw score itself may vary according to the difficulty of the test. If the overall difficulty of the test decreases, then the raw score required to meet the standard will increase. If the overall difficulty increases, then the raw score needed to meet the standard will decrease. Similarly, each raw score point is associated with a point on the ability range, which is converted to a scaled score point from 85 to 115 for optional KS1 tests and 80 to 120 for KS2. As described above, the standard for the phonics screening check was set in 2011, but scaled scores are not applied for the development of live tests.

In order to relate the new tests in each year to the standard determined in 2016, a two-parameter graded response IRT model with concurrent calibration is used. The IRT model includes data from the 2016 live administration and data from TPTs, including anchor items repeated each year and the items selected for the live test. The parameters from the IRT model are scaled using the Stocking-Lord scaling methodology to place them on the same scale as used in 2016 to determine the standard and scaled scores. These scaled parameters are used in a summed score likelihood IRT model to produce a summed score conversion table, which is then used to produce the raw to scaled score conversions. This methodology was reviewed by and agreed with the STA Technical Advisory Group in 2017.
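
In one standard presentation of the Stocking-Lord method (shown here for illustration rather than as STA’s exact specification), the linear transformation constants A and B are chosen to minimise the squared difference between the test characteristic curves of the common items under the 2 calibrations:

```latex
F(A, B) = \sum_{k} \Bigl( T_{2016}(\theta_k) - T^{*}_{\text{new}}(\theta_k; A, B) \Bigr)^2
```

Here the sum runs over a set of ability points, T_2016 is the common-item test characteristic curve under the 2016-scale parameters, and T*_new is the same curve after the new calibration’s parameters have been transformed (for a dichotomous item, a_i* = a_i / A and b_i* = A b_i + B, with graded response thresholds transformed in the same way as b_i). The transformed parameters then feed the summed score likelihood model that produces the conversion table.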

To ensure the methodology used remains appropriate, assumption checking for the model is undertaken and evidence for the following key assumptions is reviewed annually. Evidence from this assumption checking analysis is presented at standards maintenance meetings to inform the sign-off of the raw score to scaled score conversion tables. The assumptions are as follows:

  • Item fit: that the items fit the model. An item fit test is used but, owing to the very large numbers of pupils included in the model, the results are often statistically significant even where misfit is small, so item characteristic curves, modelled against actual data, are also inspected visually to identify any lack of fit.
  • Local independence: that all items perform independently of one another, so that the probability of scoring on an item is not affected by the presence of any other item in the test. This assumption is tested using the Q3 procedure, in which the differences between expected and actual item scores (the residuals) are correlated for each pair of items; a minimal sketch of this check follows this list. Items with an absolute correlation higher than 0.2 are examined for a lack of independence.
  • Unidimensionality: that all items relate to a single construct. Unidimensionality is examined using both exploratory and confirmatory factor analysis, with results compared against key metrics.
  • Anchor stability: that anchor items perform in similar ways in different administrations, given any differences in the performance of the cohort overall. Anchor items are examined for changes in facility and discrimination. The D2 statistic is used to identify any items that differ in terms of their IRT parameters, by looking at differences in expected score at different points in the ability range.
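
The sketch below illustrates the Q3 check referred to in the local independence bullet above, under simplifying assumptions: observed and model-expected item scores are taken as given, residual correlations are computed for every pair of items, and pairs exceeding the 0.2 threshold in absolute value are flagged. Function and variable names are illustrative, not STA’s.

```python
import numpy as np
from itertools import combinations

def q3_flags(observed, expected, threshold=0.2):
    """Q3 local independence check (illustrative sketch).

    observed: pupils x items array of observed item scores
    expected: pupils x items array of model-expected item scores,
              evaluated at each pupil's estimated ability
    Returns (item_i, item_j, correlation) for every pair of items whose
    residual correlation exceeds the threshold in absolute value.
    """
    residuals = observed - expected
    flagged = []
    for i, j in combinations(range(residuals.shape[1]), 2):
        r = np.corrcoef(residuals[:, i], residuals[:, j])[0, 1]
        if abs(r) > threshold:
            flagged.append((i, j, r))
    return flagged
```

Flagged pairs are then examined for a lack of independence, as described above.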

Additionally, detailed logs are maintained recording any changes to anchor items. Following a review of this evidence, any anchor items thought to be performing differently are unlinked in the subsequent IRT analysis.

Claim 5: The meaning of test scores is clear to stakeholders

5.1 Appropriate guidance is available to ensure the range of stakeholders – including government departments, local government, professional bodies, teachers and parents – understand the reported scores

Before the introduction of the new NCAs and scaled scores in 2016, STA had a communication plan to inform stakeholders of the changes taking place. This included speaking engagements with a range of stakeholders at various events and regular communications with schools and local authorities through assessment update emails.

STA provides details on GOV.UK about scaled scores at KS1 and scaled scores at KS2. This information is available to anyone but is primarily aimed at headteachers, teachers, governors and local authorities. STA also produces leaflets about the phonics screening check and results at the end of KS2 for teachers to use with parents.

The evidence above confirms that appropriate guidance is available to ensure the range of stakeholders understand the reported scores.

5.2 Media coverage is monitored to ensure scores are reported as intended and unintended reporting is addressed where possible

Media coverage is monitored following live test administration, and coverage of the NCAs and their scores is captured as part of this. Social media is monitored during test week, in part to identify any potential cases of maladministration.

  1. The phonics screening check will be referred to as a test throughout this validity framework to aid precision. 

  2. Hughes, S., Pollitt, A. and Ahmed, A. (1998). ‘The development of a tool for gauging demands of GCSE and A-Level exam questions’. Paper presented at the BERA conference, Queen’s University Belfast. 

  3. Webb, N.L. (1997). ‘Criteria for alignment of expectations and assessments in mathematics and science education’. Research Monograph No. 8, Council of Chief State School Officers. 

  4. Smith, M.S. and Stein, M.K. (1998). ‘Selecting and creating mathematical tasks: from research to practice’. Mathematics Teaching in the Middle School, 3, pp. 344–350. 

  5. Bloom, B., Engelhart, M., Furst, E., Hill, W., and Krathwohl, D. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York: David McKay Company. 

  6. Lumley, T., Routitsky, A., Mendelovits, J. and Ramalingam, D. (2012). ‘A Framework for Predicting Item Difficulty in Reading Tests’. Australian Council for Educational Research (ACER). 

  7. The AMD is the difference between the marks awarded to an item on a standardisation set by a marker and the predetermined definitive mark assigned by the senior marking team. 

  8. O’Neil, T. and Arce-Ferrer, A. (2012). ‘Empirical Investigation of Anchor Item Set Purification Processes in 3PL IRT Equating’. Paper presented at the NCME annual meeting, Vancouver, Canada.