Final Mat 701
the Test.
Justin Fraser-deHaan
With the increasing role over the past decade of state-wide standardized tests
such as the MCAS, much effort has been placed on improving students' scores.
Meanwhile, much consternation has developed over whether the increasing focus on
these tests in the classroom has the opposite of the desired effect: whether, in fact,
spending time teaching to the test harms overall achievement in mathematics.
The purpose of this research was to determine if this concern is valid. The
researcher's hypothesis was that students in a classroom where the MCAS was
entirely deemphasized would fare better, in both the short and long term, on neutral
(non-MCAS) measures of student achievement than those students who spent
notable sections of class time preparing for the MCAS. To test the hypothesis,
teachers at five Massachusetts middle schools were given training and instruction
for the coming academic year, being instructed either to refrain from giving their
students any specific preparation for the MCAS or to administer a specific program
of MCAS test-preparation throughout the year. Following the academic year, a two-
way ANOVA was performed to see what effect these two strategies had on students
of various achievement levels. This same test was performed at the end of the
subsequent year, with students no longer divided into classrooms that had specific
test-preparation policies in place, in order to determine whether the two treatments
had lasting effects. It is hoped that the results of this study could lead policy
makers to reexamine the focus that has been placed on tests like the MCAS in
driving and assessing student achievement.
Introduction & Literature Review
curricula around preparing students for a standardized test. Perhaps the most
organize their instruction either around the actual items found on a test or around a
set of look-alike items (Popham, 2001). Depending on the nature of the test, for
specific problem types, fall on the outskirts of teaching to the test as we here define
it. In our research, we will be looking at the short- and long-term effects of teaching
Previous research has shown that, while test preparation can improve student
assessments in an effort to raise scores, are de-emphasizing topics not tested and
researcher wonders whether any negative effects of teaching to the test might carry
High-stakes testing
Teaching to the test is a matter tied closely to the question of motivation and
incentives. Few teachers would modify their curriculum, and fewer students would
modify their study habits, around a test with little or no cost or reward associated
with the outcome. We need only concern ourselves then with what happens around
high-stakes tests, where there is some real cost or benefit tied to performance
(Volante, 2004). These tests, which have become increasingly common since the
passing of No Child Left Behind in 2001, are a frequent source of debate as the
against the desire of the government and the public to know how schools are
Though it is clear that the public has a right to know how well its educational
system is working, just as it is obvious that there must be some way to assess the
performance of teachers and students alike, accomplishing this task is anything but
a simple matter. At the heart of the matter lies the problem of what to test and who
makes the decision of what to test. Ideally, test problems might be modeled in such
a way that a student's ability to solve them could demonstrate exactly their
mastery of the curriculum (Bond, 2004). In such a case, there would be no concern
over teaching to the test, as this practice would represent exactly the goals of the
curriculum. However, most, if not all, standardized tests miss this lofty goal by a
wide margin, and it is not clear that the goal is even realistic (Rothstein, 2011).
Even if a test were to so perfectly capture the essence of the curriculum and test-
makers could construct problems that, in learning how to solve them, would teach
students the very skills that were being tested, we would still be left with an
incomplete way to assess the teacher's contribution, as the test would likely not
cover other important roles the teacher should play, such as modeling good
In reality, high-stakes testing has led to, and continues to lead to, many
negative outcomes. Test taking often takes away a week or more of instructional
time directly, and frequently leads teachers to devote even more time to test
curriculum (Jerald, 2006). Using the test as a primary means of evaluating teachers
and administrators also means that teachers face a strong incentive to focus their
do well on the test, without learning how to apply the skill if it were tested in a
different way (Jerald, 2006). Because few viable strategies exist to deter this
practice among teachers, we are left with only intrinsic motivators to prevent it,
that is, ethics (Popham, 2001), meaning it would be exactly the bad teachers who
often look good under this system. Add to this a long history of school districts
scores (Shepard, 1990) and a propensity for otherwise good students to drop out
under the pressure of passing tests like the MCAS (Capodilupo et al., 2000) and we
That said, not all of the research on high-stakes testing points in the negative
direction. One current, large-scale study shows that there is some correlation
their performance on more rigorous, higher-order tests (Gates Foundation, 2010).
However, in reviewing the data of the same study, Rothstein (2011) points out the
effect is only slightly better than 50/50. Lazear (2005) argues by analogy that the
strong incentives placed on students and teachers are beneficial for what he calls
high-cost learners. The argument is that for students who have little intrinsic
motivation to learn anything at all, the strong incentive placed on learning the
material covered on the high-stakes test will at least get them to learn something,
where they might otherwise learn nothing. Lazear attempts to extend the same
reasoning towards poor teachers, arguing that without the incentive to teach the
items of the high-stakes test they might otherwise teach nothing, though the
argument here sounds particularly hollow and naive. It is worth noting that under
While nobody questions that some teachers get better results from their
students than others, quantifying and measuring this difference has proven a
difficult task. Most recent studies, including the Gates Foundation's MET study,
have been built around some variation of what is known as value-added modeling,
a method which seeks to judge teachers based upon their contribution to gains in
student achievement (Scherrer, 2012). Some studies have shown that a one-
extra instruction (Goldhaber, 2012), while at least one reputable study has shown
the same difference between a good and a poor teacher to be worth a full year of
extra instruction (Hanushek, 1992). This means that an effective teacher has a
massive impact on a student's academic growth, and anything that might impede
worth investigating.
could make a world of difference to a student, it is also worth noting that, under
the same logic, a good teacher following after a poor teacher can more than make
up for the negative effects on growth experienced the year before. In fact, one
study has shown that the impact of a given teacher fades by 50% each year in the
two years following a given instructional year (Kane & Staiger, 2008). It is
therefore difficult, if not impossible, to quantify truly long-term effects that a single
Test Preparation
the validity of the tests themselves. After all, if a student can improve their
score substantially on a given test, simply by preparing for that specific test,
has been shown, for example, that students can raise their SAT scores by
preparing for that test, but that this preparation itself has no effect on the
same students' ACT scores if they then take that test with no additional preparation
(Lazear, 2005). Since both tests claim to be a measure of the same thing,
results like these must call into question the validity of the tests. In fact, if
validity of that test for measuring anything but performance on that test
given test, such as the MCAS, bring up their students' scores without
actually having a real positive effect on the thing the test purports to
measure. That is, unless the test itself is a perfect reflection of the goals of
the curriculum, then teaching to the test will always be a factor pulling away
standardized tests are made with such high stakes for teachers and students
alike, test-preparation will always exist, tugging away from the intended
Methods
Participants
This study will use a cluster sampling strategy, with an eye towards closely
matching the overall demographics of the region, and will include approximately
1000 sixth-grade students from across Massachusetts. Five middle schools will be
chosen, such that the sample mirrors the racial and socioeconomic diversity of the
state as a whole. The sample will include every sixth-grade student from the
schools chosen to participate, so this sample should accurately reflect the student
students with greater past performance. Students will be divided into four groups
Proficient, Needs Improvement, and Warning), though this will not affect placement
and will be used for data analysis purposes only. Students will, however, be divided
into two equally sized groups depending on the classroom and math teacher they
are assigned to; for simplicity's sake, these groups will be designated as Group A
and Group B. Students in Group A are those assigned to teachers who have been
instructional time on teaching to the test for the sixth grade MCAS mathematics
test, whereas students in Group B will be assigned to teachers who have been
instructed not to spend any time preparing students specifically for the MCAS. In
year two of the study, students will be randomly assigned again for seventh grade,
with no regard to what group they were in during the sixth grade. As the goal of the
study is to assess the effect that teaching to the test has on long-term academic
achievement in Math, this sample will allow the researcher to see these effects
The researcher recognizes that this sample is biased against students outside
of Massachusetts and that, given the outstanding record of performance of the state
in question, there is a certain likelihood that the results will not be perfectly
generalizable across a broader population. Though the sample size will be large at
the outset, the researcher recognizes that the nature of the study likely introduces a
degree of mortality bias as students from the sample cohort drop out of school or
simply move away. Though students move for many reasons, there is some reason
to suspect that mobility will be higher among low socio-economic status urban
groups and these same groups will also have a greater representation of students
who will drop out. Both factors will certainly affect the results, leaving a more
homogeneous sample than desired in the second year of the study. However, as these
same students are also more likely to be part of the two lower-achieving
Measures
this research. A particular challenge for the researcher was finding a method of
measuring the degree to which each teacher teaches to the test. The solution
teachers in MCAS test preparation, while instructing others to treat the MCAS as if it
did not exist. (Details on this training will follow in the Procedures section of this
paper.) A more in-depth discussion of the construct involving student achievement
and learning and the measures used to assess them follows below.
students will be analyzed, in part, upon their Grade 5 Math MCAS results, that is,
While the researcher appreciates the irony of using the MCAS to separate students
to test whether students are learning higher-order problem solving skills, MCAS
scores are, at least, moderately valid for the purposes of separating students into
broad achievement groups. This instrument was also chosen for the great degree of
accessibility, as the vast majority of students in the sample will have already been
students' MCAS scores does not play a role in the researcher's primary hypothesis,
scores on the MCAS mathematics test will be collected and analyzed both during
the sixth and seventh grade for the students in the sample, as the researcher feels
that they would be remiss in not collecting data on the direct effects of test
teaching to the test has on student achievement in mathematics, finding a valid and
classes and scores on an alternate form of assessment, not prepared for by either
group.
This study will follow the Measures of Effective Teaching Project (Gates
reasoning and problem-solving skills. According to the MET Project (Gates
Foundation, 2010), this test is highly reliable, has strong evidence for validity when
evidence that the test is fair across different groups of students. This assessment
was also chosen for its minimal impact on teaching hours, as it can be administered
their respective sixth- and seventh-grade math classes. While grades may not be
the ideal way to measure student learning, they are widely recognized as the
standard way to measure student achievement, and will here be treated as such.
Procedures
Sampling
The students for this study will be chosen using a cluster sampling technique.
Five middle schools will be chosen, such that the sample mirrors the racial and
socioeconomic diversity of the state as a whole, and all students in the incoming
sixth-grade class will be part of the sample. Though schools that have a low degree
of mobility might be desirable in order to restrict mortality bias, data on this factor
is not readily available, and choosing schools based upon this factor might simply
introduce alternate forms of bias, such as oversampling from more stable and
affluent populations.
Each student in the sample will be assigned a four-digit number at random,
allowing the researcher to track each student's record anonymously. At this time, data
from each student's fifth-grade MCAS math test will also be collected.
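The anonymous identifier assignment described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the study's actual procedure: the helper name `assign_ids`, the optional `seed` argument, and the choice to draw from 1000 to 9999 (so that every number has exactly four digits) are all hypothetical. With approximately 1000 students, the 9000 available numbers leave ample room for unique identifiers.

```python
import random

def assign_ids(student_keys, seed=None):
    """Map each student key to a unique random four-digit ID (hypothetical helper).

    `student_keys` is assumed to be a list of distinct keys, one per student.
    """
    rng = random.Random(seed)
    # Sampling without replacement guarantees that no two students share an ID.
    ids = rng.sample(range(1000, 10000), len(student_keys))
    return dict(zip(student_keys, ids))
```

Only the resulting number would then travel with the student's scores; the key-to-number mapping would have to be stored separately for the records to remain anonymous.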
Teacher Training
During the weeks leading up to the beginning of the school year, teachers at
each school will be randomly separated into two equal groups. The first group, the
Group A teachers, will be given two paid training days in August, during which time
they will be taught test preparation strategies for the MCAS, while the second
group, the Group B teachers, will have a shorter training intended to direct them
away from teaching to the test. Teachers in both groups will be assigned a random
progress.
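The random separation of each school's teachers into two equal groups might be carried out along the following lines. This is a minimal sketch: the function name `split_teachers` and the `seed` argument are assumptions made for illustration, and when a school has an odd number of teachers this sketch simply places the extra teacher in Group B, a choice the proposal does not specify.

```python
import random

def split_teachers(teachers, seed=None):
    """Randomly split one school's teachers into Group A and Group B (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(teachers)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # First half of the shuffled list becomes Group A, the remainder Group B.
    return {"A": shuffled[:half], "B": shuffled[half:]}
```

Running the split separately for each school keeps the two groups balanced within every school rather than only across the sample as a whole.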
The Group A teachers will be provided with MCAS Middle School Mathematics
test-preparation training and material from Summit Educational Group, a local test-
preparation company. The teachers will be directed to use one-quarter of their total
instructional time in each of their math classes teaching students how to approach
the MCAS specifically and providing practice with clone problems, that is, problems
of the same type and format students are likely to see on the MCAS itself. In
practice tests to their students in advance of the real MCAS. The emphasis of
teachers will also be instructed to not otherwise modify their grading or methods of
instruction and assessment, except where necessary to accommodate the time
advised to instruct as if the MCAS itself did not exist, focusing all of their
Group B will be advised that they will not themselves be assessed on the MCAS
performance of their students during the coming year. (Note that this will, as a
condition of the study, be true for both groups, but only Group B will be notified.)
Students will be assigned as usual to their sixth grade math teachers (NB:
only school districts that do not employ leveling in assigning sixth grade math
classes will be eligible to participate in this study). One week after taking the MCAS
Math test, students in both groups will take the Balanced Assessment of
Mathematics and their scores will be collected anonymously and associated with
their previously assigned number. Students are likely to be aware of the fact that
the BAM does not count in the same way the MCAS does, so there is some risk to
validity here, in that the assessment may underrate the learning of less motivated
The students from both groups will be assigned to their seventh grade math
teachers will not have been instructed in any particular way about test-preparation,
and will instruct and assess the students as usual. At the end of the seventh grade,
the students will again take the Balanced Assessment of Mathematics and their
year-end math grades will again be collected. Any students who have left their
respective schools by the end of seventh grade will be dropped from the final data
analysis.
Data Analysis
cohort in their sixth grade year, the researcher will conduct several tests to examine
the initial results. First, the researcher will conduct a two-way ANOVA on both
groups, analyzing the improvement in each group's MCAS scores. The analysis will
scores from the fifth-grade MCAS to the sixth-grade MCAS will be equal regardless of
hypothesis will be that the mean improvement on the sixth-grade MCAS will be
higher across all MCAS-achievement groups for students in Group A than for Group
than or equal to .05, the researcher will reject the null hypothesis and conclude that
classroom was the reason for the score improvement. If the p-value of this test is
greater than .05, the researcher will fail to reject the null hypothesis,
The researcher will then conduct a second two-way ANOVA under the same
Mathematics administered at the end of the cohort's sixth-grade year. This analysis
will look at the effect of being in a classroom where the teacher teaches to the
test on overall student achievement in mathematics. Here, the null hypothesis will
be that mean scores on the BAM will depend on achievement level only and be
that the mean scores on the sixth-grade BAM will be higher across all MCAS-
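achievement groups for students in Group B than for Group A, H1: $\mu_{A4} < \mu_{B4}$, $\mu_{A3} < \mu_{B3}$, $\mu_{A2} < \mu_{B2}$, $\mu_{A1} < \mu_{B1}$. Other alternative hypotheses, based upon prior research
$\mu_{A3} = \mu_{B3} < \mu_{A2} < \mu_{B2}$, $\mu_{A1} < \mu_{B1}$; H3: $\mu_{B4} < \mu_{A4}$, $\mu_{B3} < \mu_{A3}$, $\mu_{A2} < \mu_{B2}$, $\mu_{A1} < \mu_{B1}$.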
That is, the researcher anticipates the possibility that only high-achieving students
high achieving Group B and low achieving Group A students will see improvement.
Again, if the p-value of this test is less than or equal to .05, the researcher will reject
the null hypothesis and conclude that their results are statistically significant, that
being in a test-neutral vs. a test-focused classroom will have some effect on overall
equal to .05, the researcher will fail to reject the null hypothesis, meaning that their
results are not statistically significant. Subsequently, the researcher will perform
the same tests with the same hypotheses and the same expected outcomes, but
looking at year-end grades in math as the dependent variable rather than BAM
scores.
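The two-way ANOVA described above, with treatment group (A or B) and prior MCAS achievement level as the two factors, can be sketched in plain Python. This is a hedged illustration rather than the researcher's actual analysis: the function name `two_way_anova_f` and the `cells` data layout are assumptions, and the sketch handles only a balanced design in which every group-by-level cell holds the same number of scores (at least two). The real sample is unlikely to be balanced across achievement levels, in which case a statistics package would be needed. The sketch also stops at the F statistics; converting them to p-values requires the F distribution, which is not shown here.

```python
from statistics import mean

def two_way_anova_f(cells):
    """Return F statistics for a balanced, fully crossed two-way ANOVA.

    `cells` maps (group, level) -> list of scores (hypothetical layout).
    """
    groups = {g for g, _ in cells}
    levels = {l for _, l in cells}
    n = len(next(iter(cells.values())))          # scores per cell
    a, b = len(groups), len(levels)
    if len(cells) != a * b or any(len(v) != n for v in cells.values()) or n < 2:
        raise ValueError("sketch handles balanced, fully crossed designs only")

    grand = mean(x for v in cells.values() for x in v)
    cell_mean = {k: mean(v) for k, v in cells.items()}
    # In a balanced design the marginal mean equals the mean of the cell means.
    g_mean = {g: mean(cell_mean[(g, l)] for l in levels) for g in groups}
    l_mean = {l: mean(cell_mean[(g, l)] for g in groups) for l in levels}

    ss_group = b * n * sum((g_mean[g] - grand) ** 2 for g in groups)
    ss_level = a * n * sum((l_mean[l] - grand) ** 2 for l in levels)
    ss_inter = n * sum(
        (cell_mean[(g, l)] - g_mean[g] - l_mean[l] + grand) ** 2
        for g in groups for l in levels
    )
    ss_error = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    ms_error = ss_error / (a * b * (n - 1))      # within-cell mean square
    return {
        "group": (ss_group / (a - 1)) / ms_error,
        "level": (ss_level / (b - 1)) / ms_error,
        "interaction": (ss_inter / ((a - 1) * (b - 1))) / ms_error,
    }
```

For instance, with cells `{("A", 1): [1, 3], ("A", 2): [5, 7], ("B", 1): [2, 4], ("B", 2): [6, 8]}` the sketch gives F = 1.0 for group, 16.0 for level, and 0.0 for the interaction. The interaction term is the one that speaks to the alternative hypotheses in which the treatment helps some achievement levels and not others.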
After the MCAS and BAM have been administered to the cohort at the end of
their seventh grade year, the researcher will conduct several further tests to
analyze the final results. If the null hypothesis on MCAS score improvement in the
sixth grade has been rejected, the researcher will then conduct a paired
classroom in the sixth-grade affects students in the subsequent grade level. The
null hypothesis will be that the mean score of Group A on the sixth-grade MCAS will
seventh-grade. The alternative hypothesis will be that the mean MCAS math score will
be lower for this group in seventh grade than in sixth grade (H1: $\mu_{\text{sixth-grade}} > \mu_{\text{seventh-grade}}$). If
the p-value of this test is less than or equal to .05, the researcher will reject the null
hypothesis and conclude that their results are statistically significant, that being in a
test-preparation in sixth grade does not have a lasting effect leading to maintained
improvement in the seventh grade. If the p-value of this test is greater than
.05, the researcher will fail to reject the null hypothesis, meaning that their
Finally, the researcher will compare the seventh-grade BAM scores and
grades of students in Group A against those in Group B. This analysis will examine
a teacher who teaches to the test. The researcher will first look at mean scores
seventh-grade year, conducting essentially the same two-way ANOVA as was done
the preceding year. Here again, the null hypothesis will be that mean scores on the
BAM will depend on achievement level only and be equal regardless whether the
$= \mu_{B2} > \mu_{A3} = \mu_{B3} > \mu_{A4} = \mu_{B4}$. The alternative hypothesis will be that the mean
$\mu_{A1} < \mu_{B1}$. Other alternative hypotheses, based upon prior research results, will
$\mu_{A2} < \mu_{B2}$, $\mu_{A1} < \mu_{B1}$; H3: $\mu_{B4} < \mu_{A4}$, $\mu_{B3} < \mu_{A3}$, $\mu_{A2} < \mu_{B2}$, $\mu_{A1} < \mu_{B1}$. Again, if the p-value
of this test is less than or equal to .05, the researcher will reject the null
hypothesis and conclude that their results are statistically significant, that being in a
test-neutral vs. a test-focused classroom will have some effect on overall student
achievement in mathematics and that effect will carry over into performance in the
seventh grade. If the p-value of this test is greater than .05, the
researcher will fail to reject the null hypothesis, meaning that their results are not
statistically significant. Subsequently, the researcher will perform the same tests
with the same hypotheses and the same expected outcomes, but looking at year-
end grades in math as the dependent variable rather than BAM scores.
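The paired comparison of Group A's sixth- and seventh-grade MCAS scores described earlier in this section, together with the .05 decision rule used throughout, might be sketched as below. This assumes that the intended paired test is a paired-samples t-test, which the text does not confirm; the function names `paired_t` and `decide` are hypothetical. The sketch returns only the t statistic and its degrees of freedom; obtaining the p-value requires the t distribution, which is not shown.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(sixth, seventh):
    """t statistic and degrees of freedom for paired scores (assumed paired-samples t-test)."""
    if len(sixth) != len(seventh) or len(sixth) < 2:
        raise ValueError("need two equally long score lists with at least two pairs")
    # Positive differences mean the sixth-grade score was higher, matching H1.
    diffs = [s6 - s7 for s6, s7 in zip(sixth, seventh)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject the null hypothesis when p <= alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"
```

Because students who leave their schools are dropped from the final analysis, both score lists would contain only students with a score in each year, listed in the same order.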
REFERENCES
Bill & Melinda Gates Foundation. (2010). Learning about Teaching: Initial Findings
from the Measures of Effective Teaching Project. MET Project Policy Brief. Bill &
Melinda Gates Foundation.
Bond, L. (2004). Teaching to the Test. Carnegie Perspectives. Carnegie Foundation
for the Advancement of Teaching.
Capodilupo, C., & Wheelock, A. (2000). MCAS: Making the Massachusetts Dropout
Crisis Worse. MCAS Alert. National Center for Fair and Open Testing (FairTest).
Goldhaber, D., Liddle, S., Theobald, R., & Walch, J. (2012). Teacher Effectiveness
and the Achievement of Washington Students in Mathematics. The WERA
Educational Journal, 4(2), 6-12.
Hanushek, E. A. (1992). The Trade-off Between Child Quantity and Quality. Journal of
Political Economy, 100(1), 84-117.
Jerald, C. D. (2006). "Teach to the Test"? Just Say No. Issue Brief. Center for
Comprehensive School Reform and Improvement.
Kane, T.J. & Staiger, D.O. (2008). Estimating Teacher Impacts on Student
Achievement: An Experimental Evaluation. Working paper #14607, National Bureau
of Economic Research.
Lazear, E. P. (2005). Speeding, Tax Fraud, and Teaching to the Test. CSE Report 659.
National Center for Research on Evaluation, Standards, and Student Testing
(CRESST).
Rothstein, J. (2011). Review of Learning About Teaching: Initial Findings from the
Measures of Effective Teaching Project. National Education Policy Center.
Scherrer, J. (2012). What's the value of VAM? Phi Delta Kappan, 93(8), 58-60.
Volante, L. (2004). Teaching to the Test: What Every Educator and Policy-Maker
Should Know. Canadian Journal Of Educational Administration And Policy, (35).