How are test scores affected by day-to-day changes of a student? Do different people rate students’ performances the same? These questions are addressed through the understanding of reliability. This lesson will define reliability, explain how reliability is measured, and explore methods to enhance reliability of assessments in the classroom.
Student One: I’m glad that is over. It’s nerve-racking to perform and be evaluated by three teachers.
Student Two: I agree. I also worry about how each individual teacher will score us. I hope they use the same criteria!
Student One: Oh, you are referring to the reliability of the scores. Do you know about reliability?
Student Two: Not really. I’ve never used that term before.
Student One: Oh! I’ll explain! Reliability is defined as the extent to which an assessment yields consistent information about the knowledge, skills, or abilities being assessed. A reliable assessment is replicable, meaning it will produce consistent scores or observations of student performance. For example, our singing performances should result in similar scores from the three teachers. If one teacher gives us a score of 10 out of 10, and another gives us a score of 2 out of 10, the scores are not considered reliable.
Student Two: Oh, okay. So it seems like many factors could impact the reliability of a test or performance.
Student One: You are right.
Conditions That Impact Reliability
Student One: There are many conditions that impact reliability. They include:
- Day-to-day changes in the student (such as energy level, motivation, emotional stress, and hunger)
- Physical environment (which includes classroom temperature, outside noises, and distractions)
- Administration of the assessment (which includes changes in test instructions and differences in how the teacher responds to questions about the test)
- Test length (generally, the longer the test, the higher the reliability, because more questions give a larger sample of what the student knows)
- Subjectivity of the test scorer
Measurement of Reliability: Reliability Coefficient
Student Two: So, how is reliability measured?
Student One: Reliability is determined by comparing two sets of scores for a single assessment (such as two rater scores for the same person) or two scores from two tests that assess the same concept. These two scores can be derived in different ways depending on the type of reliability being assessed.
Once we have two sets of scores for a group of students or observers, we can determine how similar they are by computing a statistic known as the reliability coefficient. The reliability coefficient is a numerical index of reliability, typically ranging from 0 to 1. A number closer to 1 indicates high reliability.
A low reliability coefficient indicates more error in the assessment results, usually due to temporary factors that we previously discussed. Reliability is considered good or acceptable if the reliability coefficient is .80 or above.
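One common way to compute a reliability coefficient from two paired sets of scores is the Pearson correlation. The sketch below is only an illustration under that assumption: the function name `reliability_coefficient` and the two teachers' scores are hypothetical, and a correlation can in principle fall below 0 even though reliability coefficients are typically reported between 0 and 1.

```python
from math import sqrt


def reliability_coefficient(scores_a, scores_b):
    """Pearson correlation between two paired sets of scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    # How the two sets of scores vary together and separately
    covariation = sum((a - mean_a) * (b - mean_b)
                      for a, b in zip(scores_a, scores_b))
    variation_a = sum((a - mean_a) ** 2 for a in scores_a)
    variation_b = sum((b - mean_b) ** 2 for b in scores_b)
    if variation_a == 0 or variation_b == 0:
        raise ValueError("each set of scores must contain some variation")
    return covariation / sqrt(variation_a * variation_b)


# Hypothetical scores (out of 10) that two teachers gave the same five students
teacher_1 = [8, 6, 9, 4, 7]
teacher_2 = [7, 6, 9, 5, 8]

r = reliability_coefficient(teacher_1, teacher_2)
verdict = "acceptable" if r >= 0.80 else "below the .80 guideline"
print(f"reliability coefficient: {r:.2f} ({verdict})")
```

With these made-up scores the coefficient comes out at about .90, which is above the .80 guideline. The same function could be applied to two administrations of one test or to two versions of a test.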
Types of Reliability
Student One: There are multiple types of reliability.
Inter-Rater Reliability
This type of reliability is used to assess the degree to which different observers or scorers give consistent estimates or scores. In other words, do different people score students’ performances similarly? For example, we performed in front of three teachers who scored us individually.
High inter-rater reliability would indicate each teacher rated us similarly.
Test-Retest Reliability
This type of reliability is used to assess the consistency of scores of an assessment from one time to another. The construct to be measured does not change; only the time at which the assessment is administered changes. For example, if we are given a test in science today and then given the same test next week, we could use those scores to determine test-retest reliability. Test-retest reliability is best used to assess things that are stable over time, such as intelligence.
Reliability is typically higher when little time has passed between administrations of assessments.
Parallel-Forms Reliability
This type of reliability is determined by comparing two different assessments that were constructed using the same content domain. For example, if our science teacher created an assessment with 100 questions that measure the same science content, she would divide the test up into two versions with 50 questions each and then give both versions of the test to her students. She would compare each student’s score on version 1 with that student’s score on version 2 to assess parallel-forms reliability.
Internal Consistency Reliability
This form of reliability is used to assess the consistency of scores across items within a single test. For example, if our science teacher wants to test the internal consistency reliability of her test questions on the scientific method, she would include multiple questions on the same concept.
High internal consistency would result in all of the scientific method questions being answered similarly. However, if students’ answers to those questions were inconsistent, then internal consistency reliability is low.
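The lesson does not name a specific statistic for internal consistency. One widely used option is Cronbach's alpha, which compares the variance of individual questions with the variance of students' total scores. The sketch below assumes that statistic; the function name `cronbach_alpha` and the answer data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table with one row per student
    and one column per question."""
    k = len(item_scores[0])  # number of questions
    if k < 2:
        raise ValueError("need at least two questions")
    # Variance of each question across students, summed over questions
    columns = list(zip(*item_scores))
    item_variance_sum = sum(pvariance(column) for column in columns)
    # Variance of each student's total score
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores must vary between students")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical results: 1 = correct, 0 = incorrect, on four
# scientific method questions answered by five students
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(answers)
print(f"Cronbach's alpha: {alpha:.2f}")
```

For this made-up table alpha works out to .80. Students who answer one scientific method question correctly tend to answer the others correctly too, which is what pushes the value up.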
Increasing Reliability of Classroom Assessments
Student One: Educators can increase or enhance the reliability of their assessments.
- They can give several similar tasks or questions in an assessment to look for consistency of student performance.
- They must define each task clearly so that temporary factors, such as unclear test instructions, do not impact performance.
- If possible, educators should avoid assessing students’ learning and performance when students are sick or when external factors, such as uncontrollable noise in the classroom, are present.
- One final way of increasing reliability of classroom assessments is for educators to identify specific concrete criteria and use a rubric with which to evaluate student performance.
Reliability ensures the consistency of scores or observations of student performance. External and internal temporary factors may impact reliability, such as day-to-day changes in the student, physical environment factors, and subjectivity of the scorer. Reliability is measured through the reliability coefficient, a numerical index that typically ranges from 0 to 1. A coefficient close to 1 indicates high reliability, while a coefficient close to 0 indicates low reliability.
The different types of reliability – inter-rater, test-retest, parallel-forms, and internal consistency – measure different aspects, but all use the standard reliability coefficient range. Generally, a reliability of .80 or above indicates good or acceptable reliability.
After watching this lesson, you should be able to:
- Define reliability and list conditions that influence it
- Explain how reliability is measured and how it can be increased in assessments
- Identify and describe the types of reliability