Making Better Tests With The Rasch Measurement Model (2018)
This study had two aims. The first was to explain the process of using the Rasch measurement
model to validate tests in an easy-to-understand way for those unfamiliar with the Rasch
measurement model. The second was to validate two final exams with several shared items.
The exams were given to two groups of students with slightly differing English listening
proficiency. The two exams, a low-advanced and a high-advanced exam, were given to 76 and
45 Japanese university students, respectively. Each exam had 56 questions with 26 shared
questions linking the two exams. After conducting a simple Rasch analysis, it was determined
that up to 33 questions needed to be modified or deleted from subsequent versions of the exam.
The unexpected number of recommended modifications and deletions suggests that, even for
experienced teachers, the Rasch measurement model can be of tremendous value by offering
greater precision in the assessment of students, as well as greater assistance in the validation
of tests.
Literature Review
“Tests do not have reliabilities and validities, only test responses do...test
responses are a function not only of the items, tasks, or stimulus conditions but of the
persons responding and the context of measurement” (Messick, 1989, p. 14).
Test validity can be defined as how accurately a test measures what it is
supposed to measure. Is a listening test actually measuring listening ability? Is an
advanced reading test actually measuring advanced reading ability? Are the questions
at the appropriate difficulty level for the students? Are the questions worded clearly,
or are they confusing students? Teachers need to remember Messick's quote whenever
they give their students a test, as it is important to make sure that their test is measuring
what it is supposed to be measuring.
One way to assess the validity of a test is to use the Rasch measurement
model. While this paper will focus on how language teachers might use the Rasch
measurement model, teachers of any subject can use the Rasch measurement model to
better assess their students and/or validate their tests. The same principles of improved
assessment and validation being demonstrated in this paper can be applied to any
subject where testing occurs. Traditionally, language teachers have used Classical Test
Theory (often referred to as CTT) when making and giving tests (Novick, 1966). With
CTT, a person answers questions correctly or incorrectly and gets points for correct
answers. While CTT can be easy to score, the imprecise nature of the assessment
makes it best for low-stakes testing (Nunnally, 1978). In contrast, the Rasch
measurement model offers teachers several valuable benefits, most importantly, (1) a
means of assessing the validity of a test's questions and (2) a more accurate assessment
of the ability of students (Andrich, 1988; Bond & Fox, 2007; Linacre, 1997; McNamara,
2011; Runnels, 2012).
Perhaps a good way to summarize the Rasch measurement model is that it is
a method of analyzing response data, in which both the questions on the test (referred
to as items in this paper) and the people taking the test (referred to as persons in this
paper) are incorporated into a predictive mathematical model. The Rasch
measurement model uses the response data from a test's questions to predict how each
person should respond to each question. In this process, ordinal data of correct and
incorrect responses are converted into interval data (examples of interval data are
frequently seen in the physical sciences, such as units of distance, weight, and speed).
For example, rather than answers being marked simply as correct or incorrect (ordinal
data), the Rasch measurement model is able to assign a specific value to each question,
so an easy question might have a difficulty measure of 0.75 logits while a difficult
question might have a difficulty measure of 3.40 logits. The conversion of ordinal data
into interval data is done for both items and persons. Items are given a difficulty
measure, which is a number representing the difficulty of a question. This item
difficulty can be used to assess the appropriateness of questions. Similarly, persons are
given a person ability measure, which is a number representing the ability of people in
the construct that is being measured (in the case of this paper, English listening ability
for university students in Japan). The Rasch measurement model also produces a slew
of other data which indicates how well the real responses matched the model's
predicted responses, and this data can be further used to validate a test.
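Stated more concretely, the dichotomous Rasch model expresses the probability of a correct response as a function of the gap between a person's ability and an item's difficulty, both in logits. The following minimal sketch in Python is purely illustrative and is not part of the original study:

import math

def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the dichotomous Rasch model."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# The two example items mentioned above, answered by a person of average
# ability (0 logits): the 0.75-logit item is answered correctly about 32%
# of the time, the 3.40-logit item only about 3% of the time.
print(rasch_probability(0.0, 0.75))
print(rasch_probability(0.0, 3.40))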
To illustrate the difference between CTT and the Rasch measurement model,
imagine a physics test with two questions, "What is the formula for force?" and "How does
Einstein's theory of relativity work?". John answers only the first question correctly and
Mary answers both questions correctly. With CTT, John would get a grade of 50% and
Mary a grade of 100%. Does this mean that Mary is twice as smart as John? Because
John answered a basic question and Mary answered a basic and an advanced question,
Mary is probably much smarter than John, but it is difficult to say that she is exactly
twice as smart as John. The Rasch measurement model weights items based on how
many people answered the questions correctly, and simultaneously produces difficulty
measures for items and ability measures for people. These difficulty and
ability measures give very precise assessments of where items stand in relation to other
items, and where people stand relative to other people (Sadiq, Tirmizi, & Jamil, 2015).
In the previous example with John and Mary, the basic question might have a difficulty
measure of -0.56 and the advanced question might have a difficulty measure of 2.40,
while John might have a person ability measure of -0.36 and Mary might have a person
ability measure of 2.80. Based on this, the Rasch measurement model offers a much
more accurate assessment of an item's real difficulty level or a person's true ability
level. This difference in accuracy between CTT and the Rasch measurement model can
have real-life consequences for language teachers. In a study by Weaver, Jones, and
Bulach (2008), several students entering a university as freshmen were placed in
different ability levels depending on whether their placement exam was scored with
CTT or with Rasch measurement, illustrating how more precise assessment methods,
such as the Rasch measurement model, can lead to better student placement when
entering a university.
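Continuing the sketch above with the illustrative measures for John and Mary, the same formula shows how the shared logit scale translates into expected performance (all values are the hypothetical ones from the example, not real data):

# John (-0.36) and Mary (2.80) facing the basic (-0.56) and advanced (2.40) items.
for name, ability in [("John", -0.36), ("Mary", 2.80)]:
    for item, difficulty in [("basic", -0.56), ("advanced", 2.40)]:
        print(name, item, round(rasch_probability(ability, difficulty), 2))
# John: roughly 0.55 on the basic item but only 0.06 on the advanced item.
# Mary: roughly 0.97 on the basic item and 0.60 on the advanced item.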
Another feature of the Rasch measurement model is that it makes it easier for
teachers to improve their tests. One way it does this is by putting the difficulty level
of the items and the ability level of the persons on a shared scale, so the items and
persons can be easily compared, as shown in the Wright Map in Figure 1. The Wright
Map in Figure 1 includes several x's on the left side of the vertical line which represent
the people who took the test. The top x (at 2 logits) represents the person with the
highest ability, and the bottom x (at -1 logits) represents the person with the lowest
ability. On the right side of the vertical line, numbers from 1-56 represent the questions
on the test. The highest item is 20, which was the most difficult question on the test,
and the lowest items are 55 and 56, which were the two easiest questions on the test.
When a person and an item are perfectly matched, such as the top x and item 36, the
person has a 50% chance of answering that question correctly. For the top x, the only
item that was above their ability was question 20. Being able to easily see how the
people and items match can be useful if teachers want to know if their test was too easy
or too difficult. If the test was too easy, the items on the right would be below the
persons on the left. If the test was too difficult, the items would be above the persons.
This visual inspection is one way that the external validity of a test can be confirmed
(Baghaei & Amrahi, 2011).
In the case of Figure 1, items 15, 8, 17, 14, 53, 54, 55, and 56 fell below the
person with the lowest ability, with items 14, 53, 54, 55, and 56 far below the lowest
person's ability, suggesting that these items should be made more difficult or removed
from the test. Related to the visual benefit of seeing how the items and persons match
on the logit scale, the Rasch measurement model places items in a hierarchy along the
logit scale (from difficult at the top to easy at the bottom), which allows test makers to
make a priori hypotheses about the difficulty of questions on the test (Beglar, 2010),
representing another way to confirm the validity of the test.
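Because persons and items share one logit scale, a rudimentary Wright-map-style display can be generated directly from the measures. The sketch below uses invented values that only loosely echo Figure 1; it is not the study's data:

# Persons (as counts of X per logit band) and items lined up on one scale.
person_measures = [2.0, 1.3, 0.9, 0.4, 0.4, -0.2, -1.0]
item_measures = {"Q20": 2.6, "Q36": 2.0, "Q1": 0.1, "Q55": -3.3, "Q56": -3.3}

for level in range(3, -5, -1):  # logit bands from +3 down to -4
    persons = sum(1 for p in person_measures if level <= p < level + 1)
    items = [q for q, d in item_measures.items() if level <= d < level + 1]
    print(f"{level:+3d} | {'X' * persons:<8}| {' '.join(items)}")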
Finally, the Rasch measurement model is able to measure unintended
constructs within a test. In the earlier example with John and Mary, if a third question
was on the test, such as "What is the composition of water?", the Rasch measurement
model is able to identify this as a chemistry question, and not a physics question (even
if the test-maker has not realized this). This is referred to as dimensionality and can be
especially useful for teachers and researchers who are making tests and surveys that
should focus on one construct. All tests and surveys are multidimensional to some
degree (Baghaei & Amrahi, 2011), but the Rasch measurement model can identify
exactly how much multidimensionality is present in a test, and it is up to the test-maker
to decide if this amount of multidimensionality is tolerable (Baghaei & Amrahi, 2011;
Runnels, 2012).
The use of the Rasch measurement model to assess students or validate tests
and surveys has become more common in the TESOL field (Baghaei & Amrahi, 2011;
Baghaei & Carstensen, 2013; Beglar, 2010; Cox & Clifford, 2014; Huhta, Alanen,
Tarnanen, Martin, & Hirvela, 2014; McNamara, 2011; Runnels, 2012; Tiffin-Richards &
Pant, 2013; Wu & Dou, 2015). For teachers who want to more accurately assess students
or improve the validity of their tests, it is important to understand the basic principles
of the Rasch measurement model. This paper will guide readers through the process
of making and assessing a test with the Rasch measurement model.
PERSON - MAP - ITEM <more>|<rare>
3 +
|
|
| 20
|
|T
|
|
2 X + 36
| 41
|
|
T|
XX | 22
| 23
XXXX |S 25 39
1 S+ 35
X | 13 37 4 46 48 49 7
XXX | 44
XXXXXX |
X | 19 3 38
XXXXXX M| 10 28 34 40 43 51 9
XXXX | 29
XXXXXXX | 30 32
0 X +M 24 26 45 52
XX S| 1 21 6
XX | 12 2 42 47
XXXX |
| 11 31 5
| 16
T| 18 27 33 50
|
-1 X +
|S
| 15 8
| 17
|
|
| 14
|
-2 + 53
|
|
|T
|
|
| 54
|
-3 +
|
|
|
| 55 56
|
|
|
-4 +
<less>|<frequent>
-----------------------------------------------------------------------
Figure 1. Wright Map for High-Advanced Test
Besides explaining the Rasch measurement model, the goal of this study was
to give an example of test creation and assessment. Two separate exams were created
for this study, for two groups of advanced students.
Having two levels of students within the advanced level (a high-advanced
group and a low-advanced group) created a dilemma in how to fairly assess students.
It was necessary to give all students in the advanced level a final exam, but if the exam
was too difficult, it would punish the low-advanced group. Conversely, if the exam
was too easy, it would not be challenging enough for the high-advanced group. If two
distinct exams were created, one for each group, it would lead to distorted grades when
comparing the two groups of students. For example, should a student in the low-
advanced group who scored a 90% on the easier exam be considered equal to a student
in the high-advanced group who scored a 90% on the more difficult exam? How much
should the former student's exam score be discounted so a fair comparison could be
made with the latter student? Because the Rasch measurement model can collectively
assess the relative difficulty of questions on an exam, if the two exams shared several
items (illustrated in Figure 2), it would be possible to accurately compare the two
groups of students, even if the exams were significantly different in difficulty level
(albeit with some shared items).
Figure 2. Two tests linked by shared questions (Group 1 questions, shared questions, Group 2 questions).
When two tests share items, and all items (shared and non-shared) are
computed simultaneously, it is known as the concurrent equating method, one of three ways
to link tests (Masters & Keeves, 1999). The concurrent equating method has been
shown to have higher consistency and better measurement of items (Baker & Al-Karni,
1991).
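In practice, concurrent equating simply means analysing one combined data set in which every student has a response recorded for the items they saw and a blank for the items they did not. A conceptual sketch follows; the contiguous grouping of shared items here is a simplification, and the actual column order used in this study is described later in the command file section:

# One response string per student; spaces mark items that student was never
# given, which fall outside the CODES list and are treated as missing
# ("not administered") rather than incorrect.
N_SHARED, N_UNIQUE = 26, 30

def combined_responses(shared, own_unique, group):
    blanks = " " * N_UNIQUE
    if group == "high-advanced":
        return shared + own_unique + blanks   # shared + high items + (low items blank)
    return shared + blanks + own_unique       # shared + (high items blank) + low items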
After the tests were given, a simple Rasch analysis was conducted on the test
data to confirm the validity of the test's questions.
Participants
This research included 121 first-year students in the advanced English level
of an intercultural communication program at a large private university in Tokyo.
Students were drawn from five different listening classes. Within the advanced level,
there were two groups of students: a low-advanced and a high-advanced group. The low-
advanced group included 76 students from three classes and had TOEFL iBT scores
roughly in the range of 55-65, while the high-advanced group included 45 students (some
of whom were returnees) from two classes and had TOEFL iBT scores roughly in the
range of 65-80. Because students were in the same level (advanced), they needed to be
graded together. However, because there was a significant difference in the ability
between the two groups, they could not take the same test (a single test would be too
difficult for the low-advanced group, or too easy for the high-advanced group). Using
the Rasch measurement model to link two tests with several shared questions would
solve this problem.
Instruments
Separate tests were created for the low-advanced and high-advanced groups
in a listening course with each test including 56 multiple choice questions. There were
26 questions that were shared between the two tests, and there were 30 questions that
were exclusive to each test.
Each test included two vocabulary and seven listening comprehension
sections. The questions that were the same on both tests included the two vocabulary
sections and two listening comprehension sections, which were based on content from
the course textbook. The questions that were exclusive to each test included five
listening comprehension sections and were based on content taken from the website
www.ted.com.
Procedures
Making level-appropriate tests. The criteria for the tests were that they would
take one hour to complete, use some of the textbook's content, test the listening ability
of students, and be easy to grade because over 120 students would need to be assessed.
First, because listening passages would need to be included within the test's
one-hour time limit, only 25 minutes would be available for answering questions (with
35 minutes for listening passages). It was thought that 56 multiple-choice questions would
be suitable for the test (giving students around 30 seconds to answer each question).
Second, some teachers suggested that a quarter of the questions be vocabulary
questions. A quarter of the 56 questions would be around 13-14, leaving approximately
42 for listening comprehension. If 42 questions were reserved for listening
comprehension, and seven listening passages would be used in the test, then each
listening passage would include six comprehension questions. Ultimately, the test had
56 total questions, of which 14 were vocabulary questions, and 42 were listening
comprehension questions.
Third, five-minute listening passages from the website www.ted.com that
were the appropriate difficulty level for the low-advanced and high-advanced groups
were used in the test. The website at www.ted.com has an extensive library of videos
that are available for copyright-free download. Ten listening passages that were
roughly five minutes in length were used, with the five that seemed to be easier
assigned to the low-advanced test, and the five that seemed to be more difficult
assigned to the high-advanced test.
Finally, each of the 56 questions followed a multiple-choice format, which
allowed for easy scoring of the test.
Once the responses were collected, they were entered into a Winsteps command file, a plain text file containing both the test specifications and the response data. At the top of the command file is the name of the text file, followed by the title
of the data (neither of these is essential to the analysis). Next are the headings "NI",
which indicates the number of items in the test, "ITEM1", which indicates the space
where the item responses will begin, and "NAME1" which indicates the space where
person names will begin. This is followed by "ITEM", which indicates the term used
for the test's questions, "PERSON", which indicates the term used for the people
completing the test, and "CODES", which indicates the range of possible answer
choices for the test's questions (on the tests in this study, the vocabulary questions had
answer options from A-J while the listening comprehension questions had answer
options from A-D). This is followed by "KEY1", which indicates the correct answer
choices for all of the items on the test (the first 19 answers were for shared questions,
the next 30 answers were for the high-advanced test, the next 30 answers were for the
low-advanced test, and the final 7 answers were for shared questions), "&END;", which
is necessary code to end this portion of the command file, and, finally, the listing of all
of the items.
In the example command file, only the first nine items on the test were listed
because listing all 86 items would have required too much space for this article. Of
note, spelling does not need to be perfect because these are only labels that will be used
in the data output, and as long as the test-maker can identify the item, items do not
need to be spelled perfectly (hence the spelling error in item nine). If the test-maker
wants, the item can be labelled with a number rather than the full question. When the
list of items is finished, "ENDNAMES;" should be included, followed by the specific
responses for each student on the test. For example, the first student listed was labelled
as "A1 Bob Harris" (a pseudonym). This identified the student as being in class A1 (the
high-advanced group) with the name Bob Harris. Bob answered the first 49 items on
the test as "C" for item 1, "D" for item 2, "E" for item 3, "A" for item 4, and so on, then
did not answer items 50-79 (because these questions were only on the low-advanced
test), and then answered items 50-86. The last response was followed by a space and
then the student's identifier (in this case, their class and name). In the example
command file, only some students who took the test were listed because listing all 121
students would have required too much space for this article. For an example of a
student from the low-advanced group, the fifth student listed was labelled as "A3 Peter
Venkman" (a pseudonym). This identified the student as being in class A3 (the low-
advanced group) with the name Peter Venkman. Peter answered the first 19 items,
then did not answer items 20-49 (because these questions were only on the high-
advanced test), and then answered items 50-86.
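As a concrete illustration, the pieces described above could be assembled programmatically. The sketch below follows the prose description of the command file; the keyword spellings, column positions, answer key, and student responses are placeholders and should be checked against the Winsteps documentation rather than read as the study's actual file:

n_items = 86  # 19 shared + 30 high-advanced + 30 low-advanced + 7 shared

header = [
    "listening_exams.txt",             # name of the text file (not essential)
    "Advanced listening final exams",  # title of the data (not essential)
    f"NI = {n_items}",                 # number of items
    "ITEM1 = 1",                       # column where item responses begin (placeholder)
    f"NAME1 = {n_items + 2}",          # column where person names begin (placeholder)
    "ITEM = ITEM",                     # term used for the test's questions
    "PERSON = PERSON",                 # term used for the people taking the test
    "CODES = ABCDEFGHIJ",              # possible answer choices (A-J vocabulary, A-D listening)
    "KEY1 = " + "A" * n_items,         # placeholder answer key
    "&END;",
]

item_labels = [f"Item {i}" for i in range(1, n_items + 1)]  # numbers or full questions

# One line per student: 86 response columns, a space, then the identifier.
# Blanks mark items the student never saw. Responses here are placeholders.
students = [
    ("C" * 49 + " " * 30 + "B" * 7, "A1 Bob Harris"),      # high-advanced: skips items 50-79
    ("D" * 19 + " " * 30 + "C" * 37, "A3 Peter Venkman"),  # low-advanced: skips items 20-49
]
data_lines = [responses + " " + name for responses, name in students]

with open("listening_exams_control.txt", "w") as f:
    f.write("\n".join(header + item_labels + ["ENDNAMES;"] + data_lines) + "\n")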
To run the command file in Winsteps, open Winsteps, go to File from the drop-
down menu, then select the Open File option. Next, a dialog box will open, and then
select the command file. Once the command file has been selected, press the Enter key
twice and Winsteps will generate the Rasch data.
When assessing the Rasch data generated by Winsteps, there are several
variables that should be examined. An example of the variables produced by Winsteps
is shown in Table 1.
Winsteps allows for the Rasch data to be analyzed in several different ways,
such as examining the ability and behaviour of the people who completed the tests or
examining the difficulty and reliability of the items on the test. The data shown in
Table 1 is an examination of the difficulty and reliability of the items on the test. This
data can be obtained by going to the Output Files drop-down menu in Winsteps and
then choosing the ITEM File = IFILE option. Next, a dialog box will open, and the user
will be given some choices on how the output should be generated (such as in an Excel
file, a text file, or an SPSS file). Unless the user has experience with SPSS, it is probably
easiest to choose the Excel file option (a text file will not allow the data to be easily
viewed by the user). The Excel output file will include 17 columns of data. Not all of
this data is essential for analysis, so only ten columns of data have been included in
Table 1.
Table 1 (continued)
Entry  Measure  Count  Score  Error  IN MSQ  IN ZSTD  OUT MSQ  OUT ZSTD  Item
73      1.22     76     23    0.25    1.06     0.62     1.10      0.82   43b According to the speaker, what is the best way to communicate?
58      1.16     76     24    0.25    1.08     0.87     1.12      1.08   28b According to the speaker, her brother Samuel...
72      1.16     76     24    0.25    1.08     0.86     1.08      0.76   42b What does the speaker suggest for the future?
75      1.16     76     24    0.25    1.00    -0.01     1.01      0.13   45b What is the main problem with using pills?
22      1.13     45     13    0.34    0.95    -0.29     1.08      0.46   22a What was NOT an example of ingenuity by the prisoners?
69      1.09     76     25    0.25    0.97    -0.31     1.00      0.04   39b What are the two main opposing forces identified by the speaker?
23      1.01     45     14    0.33    1.06     0.51     1.11      0.71   23a What is the speaker's reason for many released criminals going back to prison?
63      0.97     76     27    0.24    0.97    -0.32     0.98     -0.14   33b The air inside buildings...
25      0.90     45     15    0.33    1.02     0.23     1.08      0.54   25a Why should society help prisoners more?
39      0.90     45     15    0.33    0.85    -1.19     0.83     -1.16   39a What experience does the speaker describe at the beginning of his lecture?
77      0.51     76     35    0.24    0.97    -0.51     0.97     -0.55   47b Which example of lasers is NOT mentioned by the speaker?
44      0.50     45     19    0.31    0.94    -0.64     0.93     -0.66   44a What was the main theme of this lecture?
9       0.47    121     55    0.19    0.94    -1.25     0.94     -1.13   9 Which event marked the beginning of mainstream acceptance of hip hop?
50      0.46     76     36    0.24    1.00    -0.07     0.99     -0.17   20b What was the main theme of this lecture?
4       0.40    121     57    0.19    0.90    -2.46     0.89     -2.25   4 pundit
13      0.40    121     57    0.19    1.00    -0.03     0.99     -0.15   13 According to Dr. Lee, hip hop culture has gone beyond the music to focus on a lifestyle which includes...
59      0.35     76     38    0.23    1.04     0.74     1.04      0.73   29b How does the speaker define autism?
10      0.33    121     59    0.19    1.02     0.49     1.02      0.45   10 Which fashion trend was NOT mentioned by Dr. Lee as part of hip hop fashion?
38      0.31     45     21    0.31    0.94    -0.74     0.92     -0.89   38a What was the main theme of this lecture?
28      0.21     45     22    0.31    1.15     1.95     1.18      2.02   28a What is a negative aspect to colonizing Mars?
32     -0.08     45     25    0.31    1.09     1.20     1.18      1.81   32a What was the main theme of this lecture?
68     -0.10     76     46    0.24    1.00     0.02     1.01      0.20   38b What was the main theme of this lecture?
78     -0.10     76     46    0.24    1.00    -0.03     1.00     -0.01   48b Which process is NOT described as part of the "three-headed device"?
81     -0.13    121     72    0.19    1.04     0.69     1.06      0.97   51 Contingency
76     -0.16     76     47    0.24    0.95    -0.66     0.95     -0.66   46b According to the speakers, where are HIV reservoirs NOT located?
45     -0.17     45     26    0.31    1.02     0.24     1.01      0.10   45a What is the main purpose of the stolen chair example at the beginning of the lecture?
64     -0.22     76     48    0.24    1.01     0.20     1.01      0.15   34b Which activity is NOT mentioned as part of mechanical ventilation?
24     -0.27     45     27    0.31    1.01     0.19     1.03      0.31   24a How many criminals commit a crime within five years of being released?
26     -0.27     45     27    0.31    0.92    -0.88     0.91     -0.75   26a What was the main theme of this lecture?
65     -0.59     76     54    0.26    1.02     0.25     1.02      0.23   35b According to the speaker, where does the healthcare industry rank in energy use?
67     -0.59     76     54    0.26    0.96    -0.29     0.94     -0.44   37b Which government department did the speaker compare hospitals to?
11     -0.67    121     86    0.21    1.01     0.19     1.04      0.38   11 When was the best time for hip hop?
80     -0.67    121     86    0.21    1.04     0.47     1.02      0.23   50 Appalled
31     -0.69     45     31    0.33    1.03     0.27     1.01      0.11   31a How many planets does the speaker say are in our galaxy?
53     -0.73     76     56    0.27    1.00     0.07     0.97     -0.16   23b The speaker mentioned specific research involving the brain. How much was the decrease in pain for the people in the research study?
33     -0.91     45     33    0.35    1.10     0.63     1.12      0.63   33a Which medical problem does the speaker NOT use as an example of research progress?
1      -0.94    121     92    0.22    0.93    -0.60     0.89     -0.81   1 aspirations
6      -1.03    121     94    0.22    0.86    -1.23     0.77     -1.68   6 revenue
The first column is labelled Entry, and this represents the order in which
questions were entered into the command file. Recall that there were 86 total items in
the command file, so the first row, labelled 52, is the 52nd item entered into the
command file.
The second column is labelled Measure, and this represents the difficulty level
of each item. Because this study is attempting to make a more difficult test for the high-
advanced group, this column's information is very important. In the first row, the 52nd
item entered into the command file, which was question 22 on the low-advanced test,
had a difficulty measure of 2.43. This was the highest difficulty measure for all of the
items on both tests, which means it was the most difficult question. We can
immediately see a problem in that the low-advanced test should not include the most
difficult questions. Of the 13 most difficult questions, ten were from the low-advanced
test (the numbers accompanied with a b in the tenth column Item indicate questions on
the low-advanced test). When we modify this test, these items should either be made
easier, deleted, or switched to the high-advanced test.
The third column is labelled Count, and this represents the number of students
who answered this item. Items either had 45, which was the number of students
answering high-advanced questions, 76, which was the number of students answering
low-advanced questions, or 121, which was the number of students answering shared
questions.
The fourth column is labelled Score, and this represents the total number of
students who answered this question correctly. For example, in the first row, the 52nd
item, which was question 22 on the low-advanced test, was answered correctly by nine
students. Conversely, in the third-last row, the 56th item, which was question 26 on
the low-advanced test, was answered correctly by 72 students. This column gives some
indication of the difficulty of each item; however, this variable is not weighted and
represents a CTT type of assessment.
The fifth column is labelled Error and this represents the accuracy of the
difficulty measure variable (which is shown in column two). The greater the error in
column five, the less precise the difficulty measure, and high error is usually more
evident in items that are either very easy or very difficult (because these items tend to
be below or above the ability of most people, and as a result, are more difficult to
assess).
The sixth column is labelled IN MSQ and represents the infit mean square,
which indicates how well the actual responses matched the predicted responses of the
Rasch measurement model. Put more simply, the Rasch measurement model can
predict how items will be answered based on the answer patterns within the entire
group. For example, if person A is answering all items correctly, and item 1 is the
easiest item (because everyone is answering it correctly), the Rasch measurement
model will predict that person A has a very good chance of answering item 1 correctly.
Infit and outfit indicate how closely person A's actual responses match the predicted
responses; a value of 1.0 indicates perfect fit (the actual response matches the predicted
response). However, if person A unexpectedly answers item 1 incorrectly, this will be
represented with higher infit and outfit values. A high infit and/or outfit for a person
means that the person is answering unpredictably (perhaps because they are cheating,
guessing, or having a problem). A high infit or outfit for an item means that the item is
being answered unpredictably (maybe the question is worded in a confusing way,
which is causing students to answer it inconsistently). Basically, the item IN MSQ
measures how reliably a question is being answered. If the item IN MSQ is within the
recommended range of 0.70 to 1.30 (Bond & Fox, 2007), then it usually indicates that
people understood the item correctly. However, if the item IN MSQ was outside of the
recommended range, it usually indicates that something strange was happening when
people were answering this item.
The seventh column is labelled IN ZSTD and also represents the infit value of
the item; however, it is standardized to minimize distortion that could occur because
of the sample size. For example, fit problems are sometimes not obvious in the IN MSQ
variable when the sample size is very large, while fit problems are always obvious in
the IN ZSTD variable. IN ZSTD should fall within the range of -2 to +2 (Baghaei &
Amrahi, 2011). If the IN ZSTD falls below this range, it is said to overfit the model,
which indicates items that followed the Rasch model predictions too much (i.e. answer
patterns were too predictable). If the IN ZSTD is above this range, it is said to underfit
the model, which indicates items that did not follow the Rasch model predictions
enough. Underfit is regarded as more of a problem than overfit.
The eighth column is labelled OUT MSQ, and the ninth column is labelled
OUT ZSTD. Like infit, outfit gives an indication of how well the actual responses
matched the Rasch model's predicted responses. The difference between outfit and
infit is that outfit weights all items equally, whereas infit gives more weight to items
targeted near a person's ability level (Sadiq et al., 2015). As a result, researchers tend
to prefer infit over outfit because infit is not as vulnerable to skewed data stemming
from extreme unpredictability (such as a person with very high ability incorrectly
answering a very easy question).
Finally, the tenth column is labelled Item and represents the label given to each
item in the Winsteps command file. For the two tests in this study, shared items were
labelled with a number, low-advanced test items were labelled with a number and a b
(for example, the item in the first row is 22b which represents question 22 on the low-
advanced test), and high-advanced test items were labelled with a number and an a.
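For teachers who export the item file to Excel, the screening described above can also be automated. A brief sketch follows; the file name and column names are assumptions and should be renamed to match the actual headings in the exported file:

import pandas as pd  # requires pandas plus an Excel reader such as openpyxl

# Screen exported item statistics against the ranges discussed above.
items = pd.read_excel("item_statistics.xlsx")

misfit = items[
    (items["IN_MSQ"] < 0.70) | (items["IN_MSQ"] > 1.30) | (items["IN_ZSTD"].abs() > 2)
]
print(misfit[["ENTRY", "MEASURE", "IN_MSQ", "IN_ZSTD", "ITEM"]])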
Results
To confirm that the tests were set at the appropriate difficulty level, it was
necessary to compare the difficulty estimates of the low-advanced test sections to those
of the high-advanced test. The average difficulty estimates for each section of each test
are shown in Table 2, with higher difficulty estimates indicating more difficult
sections, and lower difficulty estimates indicating easier sections.
Difficulty estimates of the shared item sections of vocabulary 1, listening
comprehension 1, and listening comprehension 2 were relatively similar, at -0.59, -0.15,
and -0.83, respectively. However, the difficulty estimate for the shared item section
of vocabulary 2 was much lower at -2.09, indicating that the questions in this section
might have been too easy.
Looking at the average difficulty estimates of the low-advanced sections, the
listening comprehension 3 (0.69), listening comprehension 5 (0.53), and listening
comprehension 6 (0.99) sections were more difficult than all but one of the high-
advanced sections (listening comprehension 3 at 0.80). In particular, low-advanced's
listening comprehension 6 section was the most difficult section on either test, and
would need to be made easier, deleted, or switched to the high-advanced test.
Table 2
Average Item Difficulty by Test Section
Item entry numbers   Type of items    Test section                 Average Difficulty Measure
1-7                  Shared           vocabulary 1                 -0.59
8-13                 Shared           listening comprehension 1    -0.15
14-19                Shared           listening comprehension 2    -0.83
20-25                high-advanced    listening comprehension 3     0.80
26-31                high-advanced    listening comprehension 4    -0.31
32-37                high-advanced    listening comprehension 5     0.42
38-43                high-advanced    listening comprehension 6     0.45
44-49                high-advanced    listening comprehension 7     0.31
50-55                low-advanced     listening comprehension 3     0.69
56-61                low-advanced     listening comprehension 4    -0.39
62-67                low-advanced     listening comprehension 5     0.53
68-73                low-advanced     listening comprehension 6     0.99
74-79                low-advanced     listening comprehension 7     0.29
80-86                Shared           vocabulary 2                 -2.09
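The section averages in Table 2 are simply the mean of the item difficulty measures within each block of entry numbers. A minimal sketch (the measures dictionary would come from the item export described earlier; the function name is illustrative):

# Average item difficulty per test section, as reported in Table 2.
# "measures" maps an item's entry number to its difficulty measure.
def section_average(measures, first_entry, last_entry):
    values = [measures[e] for e in range(first_entry, last_entry + 1)]
    return sum(values) / len(values)

# Example: entries 1-7 form the shared vocabulary 1 section, so
# section_average(measures, 1, 7) would reproduce the -0.59 shown in Table 2.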
Discussion
The results of the analysis done on the two tests show why it is important for
teachers to check the validity of their tests. Despite having experience in constructing
listening exams over several years, the researcher still made several incorrect
assumptions about the questions on both tests. The researcher misjudged the difficulty
level of seven items, as well as three entire sections (18 items). Combined, this
represents 25 out of a possible 86 items, almost a third of all items. Further to this point,
the Rasch measurement model indicated that eight items had poor fit, likely indicating
poorly-worded questions or answers. The Rasch measurement model identified these
problems whereas CTT would not have, which should result in an improved second
version of the test.
Table 3
Summary of Item and Section Violations
Item or Section                                          Violation           Course of Action
Item 84                                                  Too easy            Make more difficult
Item 85                                                  Too easy            Make more difficult
Item 86                                                  Too easy            Make more difficult
Item 52                                                  Too difficult       Make easier
Item 55                                                  Too difficult       Make easier
Item 58                                                  Too difficult       Make easier
Listening comprehension 5 section, low-advanced test     Too difficult       Switch to high-advanced test
Listening comprehension 6 section, low-advanced test     Too difficult       Switch to high-advanced test
Item 75                                                  Too difficult       Make easier
Listening comprehension 4 section, high-advanced test    Too easy            Switch to low-advanced test
Item 3                                                   Overfit the model   Improve wording of item and answers
Item 4                                                   Overfit the model   Improve wording of item and answers
Item 7                                                   Overfit the model   Improve wording of item and answers
Item 19                                                  Underfit the model  Improve wording of item and answers
Item 46                                                  Overfit the model   Improve wording of item and answers
Item 20                                                  Underfit the model  Improve wording of item and answers
Item 36                                                  Underfit the model  Improve wording of item and answers
Item 28                                                  Underfit the model  Improve wording of item and answers
While this study focused on the Rasch data concerning items, the Rasch data
concerning persons can also provide valuable insights. The information gleaned from
person fit statistics can help teachers identify students who may be answering
erratically, either in a way that lowers a student's grade (such as nervousness,
carelessness, or lack of focus) or increases a student's grade (such as guessing or
cheating). This information can alert the teacher to a course of action that might be
necessary to help the students. Additionally, a teacher might inspect the Wright Map
and realize that several items are in the same ___location along the vertical axis. This
would indicate redundant items, and the teacher could delete several extraneous items
and still have a valid test. Shorter tests that maintain their validity are more efficient
and can free up class time for other activities that help students learn.
Benefits are not limited to teachers. Rasch can benefit learners by placing
them in a class that is appropriate to their ability level. As indicated earlier, there is
research that has demonstrated that students might be put in a different class based on
whether their placement exam was scored with CTT or the Rasch measurement model.
Being in a class that is too difficult (or too easy) can
have potentially negative effects on a student’s The information gleaned
confidence, anxiety, and motivation, so it is essential from person fit statistics
for placement to be as accurate as possible. can help teachers identify
Additionally, the Rasch measurement model makes it students who may be
easy to customize tests to a specific ability level, as was answering erratically…
illustrated in this article for low-advanced and high-
advanced students. Occasionally, schools will create a single standardized exam that
every student must take, but this can have a negative effect on lower-proficiency
students as their confidence can be damaged when taking a test that is well-beyond
their ability. Linking two tests that place all students on the same grading scale can
help teachers preserve the confidence of lower-proficiency students by giving them a
test in which they can succeed.
Finally, the research community can benefit from the Rasch measurement
model. Many assumptions have been made about how motivation, anxiety,
personality, and other affective variables relate to learning. However, if these
assumptions are based on surveys and tests that had poor validity, then the conclusions
drawn by this research may be false. For example, relatively little research has shown
that personality influences language learning (Dewaele & Furnham, 1999); however, if
the personality surveys that were used to evaluate students had flawed items (indicated
by item fit), or the language tests suffered from multidimensionality (and were not
measuring what they were supposed to measure), then it is difficult to conclude that
personality really has no influence on language learning.
Suffice it to say, teachers, learners, and the research community can all benefit
from greater use of the Rasch measurement model in education.
Conclusion
Testing is used in virtually all educational contexts around the world, in both
limited (such as a class quiz) and broad ways (such as a common exam for an entire
grade of students). With tests occupying such an important role in student assessment,
it is essential that teachers ensure that their tests are as well-constructed as possible.
When comparing raw scores (CTT) with the information provided by the Rasch
measurement model, there is much to be gained from a Rasch approach. If it can be
agreed that the Rasch measurement model provides better and more accurate
information than raw scores, then the only excuse for not using the Rasch measurement
model is that the process might be too complicated. Hopefully, this paper has been
able to simplify the process so teachers have a better understanding of how to conduct
a basic Rasch analysis. The potential benefits of using the Rasch measurement model
far outweigh the learning curve associated with the model.
References
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage.

Baghaei, P., & Amrahi, N. (2011). Validation of a multiple choice English vocabulary test with the Rasch model. Journal of Language Teaching and Research, 2(5), 1052-1060.

Baghaei, P., & Carstensen, C. H. (2013). Fitting the mixed Rasch model to a reading comprehension test: Identifying reader types. Practical Assessment, Research & Evaluation, 18(5), 1-13.

Dewaele, J. M., & Furnham, A. (1999). Extraversion: The unloved variable in applied linguistic research. Language Learning, 49(3), 509-544.

Huhta, A., Alanen, R., Tarnanen, M., Martin, M., & Hirvela, T. (2014). Assessing learners' writing skills in a SLA study: Validating the rating process across tasks, scales and languages. Language Testing, 31(3), 307-328.

Linacre, J. M. (1997). KR-20/Cronbach alpha or Rasch person reliability: Which tells us the truth? Rasch Measurement Transactions, 11, 580-581.

Linacre, J. M. (2009). Winsteps (Version 3.68). Beaverton, OR: Winsteps.com.

Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.

Runnels, J. (2012). Using the Rasch model to validate a multiple choice English achievement test. International Journal of Language Studies, 6(4), 141-153.

Sadiq, M., Tirmizi, S. H., & Jamil, M. (2015). Using Rasch model for the calibration of test items in mathematics, grade 9. Journal of Research and Reflection in Education, 9(2), 82-102.

Tiffin-Richards, S. P., & Pant, H. A. (2013). Setting standards for English foreign language assessment: Methodology, validation, and a degree of arbitrariness. Educational Measurement: Issues and Practice, 32(2), 15-25.

Weaver, C., Jones, A., & Bulach, J. (2008). Comparing placement decisions based on raw test scores and Rasch ability scores. The Language Teacher, 32(6), 3-8.

Wu, S., & Dou, T. (2015). Validation of an oral English test based on many-faceted Rasch model. Journal of Language Teaching and Research, 6(4), 866-872.
Dr. Omar Karlin is an Assistant Professor at Toyo University, where he teaches English
language courses. He obtained his Doctorate in Education in 2015 from Temple University,
and his research interests include test construction and validation, the intersection of
personality and language proficiency, and teaching listening.
Sayaka Karlin is an Adjunct Professor at Toyo Gakuen University, where she teaches English
language courses. She obtained her Master’s in Economics from the University of Manchester,
and completed her Master’s in Education from Temple University in 2017. Her research
interests include vocabulary, reading proficiency, and teaching adults.