History
TIMSS 1999 represents the
continuation of a long series of studies conducted by the International
Association for the Evaluation of Educational Achievement (IEA). Since
its inception in 1959, the IEA has conducted more than 15 studies
of cross-national achievement in the curricular areas of mathematics,
science, language, civics, and reading. The Third International Mathematics
and Science Study (TIMSS), conducted in 1994-1995, was the largest
and most complex IEA study, and included both mathematics and science
at third and fourth grades, seventh and eighth grades, and the final
year of secondary school. In 1999, TIMSS again
assessed eighth-grade students in both mathematics and science to
measure trends in student achievement since 1995. TIMSS 1999 was also
known as TIMSS-Repeat, or TIMSS-R.(1)
To provide U.S. states and school districts with an opportunity to
benchmark the performance of their students against that of students
in the high-performing TIMSS countries, the International Study Center
at Boston College, with the support of the National Center for Education
Statistics and the National Science Foundation, established the TIMSS
1999 Benchmarking Study. Through this project, the TIMSS mathematics
and science achievement tests and questionnaires were administered
to representative samples of students in participating states and
school districts in the spring of 1999, at the same time the tests
and questionnaires were administered in the TIMSS countries. Participation
in TIMSS Benchmarking was intended to help states and districts understand
their comparative educational standing, assess the rigor and effectiveness
of their own mathematics and science programs in an international
context, and improve the teaching and learning of mathematics and
science.
Participants in TIMSS Benchmarking
Thirteen states availed themselves of the opportunity to participate in the
Benchmarking Study. Eight public school districts and six consortia
also participated, for a total of fourteen districts and consortia.
They are listed in Exhibit
1 of the Introduction, together with the 38 countries that took
part in TIMSS 1999.
Developing the TIMSS 1999 Science Test
The TIMSS curriculum framework underlying the science tests was
developed for TIMSS in 1995 by groups of science educators with input
from the TIMSS National Research Coordinators (NRCs). As shown in
Exhibit
A.1, the science curriculum framework contains three dimensions
or aspects. The content aspect represents the subject matter content
of school science. The performance expectations aspect describes,
in a non-hierarchical way, the many kinds of performances or behaviors
that might be expected of students in school science. The
perspectives aspect focuses on the development of students' attitudes, interests, and motivation in science. Because the frameworks were developed
to include content, performance expectations, and perspectives for
the entire span of curricula from the beginning of schooling through
the completion of secondary school, some aspects may not be reflected
in the eighth-grade TIMSS assessment.(2) Working within the framework, test developers created science test specifications for TIMSS in 1995 that included items representing a wide range of science topics and eliciting a range of skills from the students. The 1995 tests were developed through an international consensus process involving input from science experts and measurement specialists, ensuring that they reflected current thinking and priorities in the sciences.
About one-third of the items in the 1995 assessment were kept secure
to measure trends over time; the remaining items were released for
public use. An essential part of the development of the 1999 assessment,
therefore, was to replace the released items with items of similar
content, format, and difficulty. With the assistance of the Science
and Mathematics Item Replacement Committee, a group of internationally
prominent mathematics and science educators nominated by participating
countries to advise on subject-matter issues in the assessment, over
300 mathematics and science items were developed as potential replacements.
After an extensive process of review and field testing, 98 items were
selected for use as replacements in the 1999 science assessment.
Exhibit
A.2 presents the six content areas included in the 1999 science
test and the numbers of items and score points in each area. Distributions
are also included for the five performance categories derived from
the performance expectations aspect of the curriculum framework. About
one-fourth of the items were in the free-response format, requiring
students to generate and write their own answers. The free-response questions were designed to take about one-third of students' test time; some asked for short answers, while others required extended responses in which students showed their work or provided explanations for their answers.
The remaining questions used a multiple-choice format. In scoring
the tests, correct answers to most questions were worth one point.
Consistent with the approach of allotting students longer response
time for the constructed-response questions than for multiple-choice
questions, however, responses to some of these questions (particularly
those requiring extended responses) were evaluated for partial credit,
with a fully correct answer being awarded two points (see later section
on scoring). The total number of score points
available for analysis thus somewhat exceeds the number of items.
Every effort was made to help ensure that the tests represented the
curricula of the participating countries and that the items exhibited
no bias towards or against particular countries. The
final forms of the tests were endorsed by the NRCs of the participating
countries.(3)
TIMSS Test Design
Not all of the students in the TIMSS assessment responded to all
of the science items. To ensure broad subject-matter coverage without
overburdening individual students, TIMSS used a rotated design that
included both the mathematics and science items. Thus, the same students
participated in both the mathematics and science testing. As in 1995,
the 1999 assessment consisted of eight booklets, each requiring 90
minutes of response time. Each participating student was assigned
one booklet only. In accordance with the design, the mathematics and
science items were assembled into 26 clusters (labeled A through Z).
The secure trend items were in clusters A through H, and items replacing
the released 1995 items in clusters I through Z. Eight of the clusters were designed to take 12 minutes to complete; 10 of the clusters, 22 minutes; and eight of the clusters, 10 minutes. In all,
the design provided 396 testing minutes, 198 for mathematics and 198
for science. Cluster A was a core cluster assigned to all booklets.
The remaining clusters were assigned to the booklets
in accordance with the rotated design so that representative samples
of students responded to each cluster.(4)
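As a rough check on the cluster arithmetic described above, the following minimal Python sketch (ours, for illustration; the actual cluster-to-booklet rotation is documented in the technical report cited in footnote 4) verifies that the cluster timings sum to the 396 testing minutes the design provides.

```python
# Minimal sketch of the TIMSS 1999 cluster timing arithmetic.
# The cluster counts and durations follow the text; the booklet
# rotation itself is not reproduced here (see footnote 4).

CLUSTER_MINUTES = (
    [12] * 8     # eight 12-minute clusters
    + [22] * 10  # ten 22-minute clusters
    + [10] * 8   # eight 10-minute clusters
)

assert len(CLUSTER_MINUTES) == 26   # clusters A through Z
assert sum(CLUSTER_MINUTES) == 396  # 198 mathematics + 198 science
print(sum(CLUSTER_MINUTES), "testing minutes across", len(CLUSTER_MINUTES), "clusters")
```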
Background Questionnaires
TIMSS in 1999 administered a broad array of questionnaires to collect
data on the educational context for student achievement and to measure
trends since 1995. National Research Coordinators, with the assistance
of their curriculum experts, provided detailed information on the
organization, emphases, and content coverage of the mathematics and
science curriculum. The students who were tested answered questions
pertaining to their attitudes towards mathematics and science, their
academic self-concept, classroom activities, home background, and
out-of-school activities. The mathematics and science teachers of
sampled students responded to questions about teaching emphasis on
the topics in the curriculum frameworks, instructional practices,
professional training and education, and their views on mathematics
and science. The heads of schools responded to questions
about school staffing and resources, mathematics and science course
offerings, and teacher support.
Translation and Verification
The TIMSS instruments were prepared in English and translated into
33 languages, with 10 of the 38 countries collecting data in two languages.
In addition, it was sometimes necessary to modify
the international versions for cultural reasons, even in the nine
countries that tested in English. This process represented an enormous
effort for the national centers, with many checks along the way. The
translation effort included (1) developing explicit guidelines for
translation and cultural adaptation; (2) translation of the instruments
by the national centers in accordance with the guidelines, using two
or more independent translations; (3) consultation with subject-matter
experts on cultural adaptations to ensure that the meaning and difficulty
of items did not change; (4) verification of translation quality by
professional translators from an independent translation company;
(5) corrections by the national centers in accordance with the suggestions
made; (6) verification by the International Study
Center that corrections were made; and (7) a series of statistical
checks after the testing to detect items that did not perform comparably
across countries.(5)
Population Definition and Sampling
TIMSS in 1995 had as its target population students enrolled in the
two adjacent grades that contained the largest proportion of 13-year-old
students at the time of testing, which were seventh- and eighth-grade
students in most countries. TIMSS in 1999 used the same definition
to identify the target grades, but assessed students in the upper
of the two grades only, which was the eighth grade in most countries,
including the United States.(6) The eighth grade
was the target population for all of the Benchmarking participants.
The selection of valid and efficient samples was essential to the
success of TIMSS and of the Benchmarking Study. For TIMSS internationally,
NRCs, including Westat, the sampling and data collection coordinator
for TIMSS in the United States, received training in how to select
the school and student samples and in the use of the sampling software,
and worked in close consultation with Statistics Canada, the TIMSS
sampling consultants, on all phases of sampling. As well as conducting
the sampling and data collection for the U.S. national TIMSS sample,
Westat was also responsible for sampling and data collection in each
of the Benchmarking states, districts, and consortia.
To document the quality of the school and student samples in each
of the TIMSS countries, staff from Statistics Canada and the International
Study Center worked with the TIMSS sampling referee (Keith Rust, Westat)
to review sampling plans, sampling frames, and sampling implementation.
Particular attention was paid to coverage of the target population
and to participation by the sampled schools and students. The data
from the few countries that did not fully meet all of the sampling
guidelines are annotated in the TIMSS international reports, and are
also annotated in this report. The TIMSS samples for the Benchmarking
participants were also carefully reviewed in light of the TIMSS sampling
guidelines, and the results annotated where appropriate. Since Westat
was the sampling contractor for the Benchmarking project, the role
of sampling referee for the Benchmarking review was filled by Pierre
Foy, of Statistics Canada.
Although all countries and Benchmarking participants were expected
to draw samples representative of the entire internationally desired
population (all students in the upper of the two adjacent grades with
the greatest proportion of 13-year-olds), the few countries where
this was not possible were permitted to define a national desired
population that excluded part of the internationally desired population.
Exhibit
A.3 shows any differences in coverage between the international
and national desired populations. Almost all TIMSS countries achieved
100 percent coverage (36 out of 38), with Lithuania and Latvia the
exceptions. Consequently, the results for Lithuania are annotated,
and because coverage fell below 65 percent for Latvia, the Latvian
results are labeled "Latvia (LSS)," for Latvian-Speaking Schools. Additionally, because of scheduling difficulties, Lithuania
was unable to test its eighth-grade students in May 1999 as planned.
Instead, the students were tested in September 1999, when they had
moved into the ninth grade. The results for Lithuania are annotated
to reflect this as well. Exhibit A.3 also shows that the sampling
plans for the Benchmarking participants all incorporated 100 percent
coverage of the desired population. Four of the 13 states (Idaho,
Indiana, Michigan, and Pennsylvania), as well as the Southwest Pennsylvania Math and Science Collaborative, included both public and private schools.
In operationalizing their desired eighth-grade population, countries
and Benchmarking participants could define a population to be sampled
that excluded a small percentage (less than 10 percent) of certain
kinds of schools or students that would be very difficult or resource-intensive
to test (e.g., schools for students with special needs or schools
that were very small or located in extremely rural areas). Exhibit
A.3 also shows that the degree of such exclusions was small. Among
countries, only Israel reached the 10 percent limit, and among Benchmarking
participants, only Guilford County and Montgomery County did so. All
three are annotated as such in the achievement chapters of this report.
Within countries, TIMSS used a two-stage sample design, in which
the first stage involved selecting about 150 public and private schools
in each country. Within each school, countries were to use random
procedures to select one mathematics class at the eighth grade. All
of the students in that class were to participate in the TIMSS testing.
This approach was designed to yield a representative sample of about
3,750 students per country. Typically, between 450 and 3,750 students
responded to each achievement item in each country, depending on the
booklets in which the items appeared.
States participating in the Benchmarking study were required to sample
at least 50 schools and approximately 2,000 eighth-grade students.
School districts and consortia were required to sample at least 25
schools and at least 1,000 students. Where there were fewer than 25
schools in a district or consortium, all schools were to be included,
and the within-school sample increased to yield the total of 1,000
students.
Exhibits
A.4 and A.5
present achieved sample sizes for schools and students, respectively,
for the TIMSS countries and for the Benchmarking participants. Where
a district or consortium was part of a state that also participated,
the state sample was augmented by the district or consortium sample,
properly weighted in accordance with its size. Schools in a state
that were sampled as part of the U.S. national TIMSS sample were also
used to augment the state sample. For example, the Illinois sample
consists of 90 schools, 41 from the state Benchmarking sample (including
five schools from the national TIMSS sample), 27 from the Chicago
Public Schools, 17 from the First in the World Consortium, and five
from the Naperville School District.
Exhibit
A.6 shows the participation rates for schools, students, and overall,
both with and without the use of replacement schools, for TIMSS countries
and Benchmarking participants. All of the countries met the guideline for sampling participation (85 percent of both the schools and students, or a combined rate, the product of school and student participation, of 75 percent), although Belgium (Flemish), England, Hong Kong, and the Netherlands did so only after including replacement schools, and are annotated accordingly in the achievement chapters.
With the exception of Pennsylvania and Texas, all the Benchmarking
participants met the sampling guidelines, although Indiana did so
only after including replacement schools. Indiana
is annotated to reflect this in the achievement chapters, and Pennsylvania
and Texas are italicized in all exhibits in this report.
Data Collection
Each participating country was responsible for carrying out all aspects
of the data collection, using standardized procedures developed for
the study. Training manuals were created for school coordinators and
test administrators that explained procedures for receipt and distribution
of materials as well as for the activities related to the testing
sessions. These manuals covered procedures for test security, standardized
scripts to regulate directions and timing, rules for answering students' questions, and steps to ensure that identification on the test booklets
and questionnaires corresponded to the information on the forms used
to track students. As the data collection contractor for the U.S.
national TIMSS, Westat was fully acquainted with the TIMSS procedures,
and applied them in each of the Benchmarking jurisdictions in the
same way as in the national data collection.
Each country was responsible for conducting quality control procedures
and describing this effort in the NRC's report documenting procedures
used in the study. In addition, the International Study Center considered
it essential to monitor compliance with standardized procedures through
an international program of quality control site visits. NRCs were
asked to nominate one or more persons unconnected with their national
center, such as retired school teachers, to serve as quality control
monitors for their countries. The International Study Center developed
manuals for the monitors and briefed them in two-day training sessions
about TIMSS, the responsibilities of the national centers in conducting
the study, and their own roles and responsibilities. In all, 71 international
quality control monitors participated in this training.
The international quality control monitors interviewed the NRCs about
data collection plans and procedures. They also visited a sample of 15 schools in each country, where they observed testing sessions and interviewed school coordinators.(7) Quality control monitors interviewed
school coordinators in all 38 countries, and observed a total of 550
testing sessions. The results of the interviews conducted by the international
quality control monitors indicated that, in general, NRCs had prepared
well for data collection and, despite the heavy demands of the schedule
and shortages of resources, were able to conduct the data collection
efficiently and professionally. Similarly, the TIMSS tests appeared
to have been administered in compliance with international procedures,
including the activities before the testing session, those during
testing, and the school-level activities related to receiving,
distributing, and returning material from the national centers.
As a parallel quality control effort for the Benchmarking project,
the International Study Center recruited and trained a team of 18
quality control observers, and sent them to observe the data collection
activities of the Westat test administrators in a sample of about
10 percent of the schools in the study (98 schools in all).(8)
In line with the experience internationally, the
observers reported that the data collection was conducted successfully
according to the prescribed procedures, and that no serious problems
were encountered.
Scoring the Free-Response Items
Because about one-third of the written test time was devoted to free-response
items, TIMSS needed to develop procedures for reliably evaluating
student responses within and across countries. Scoring used two-digit
codes with rubrics specific to each item. The first digit designates
the correctness level of the response. The second digit, combined
with the first, represents a diagnostic code identifying specific
types of approaches, strategies, or common errors and misconceptions.
Although not used in this report, analyses of responses based on the
second digit should provide insight into ways to help students better
understand science concepts and problem-solving approaches.
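To make the coding scheme concrete, here is a minimal sketch in Python of how a two-digit code might be decomposed; the particular code values and point mapping are hypothetical, since the actual rubrics are specific to each item.

```python
# Hypothetical sketch of interpreting a TIMSS-style two-digit score code.
# The first digit carries the correctness level; the second digit,
# together with the first, identifies a diagnostic category.

def split_code(code: int) -> tuple[int, int]:
    """Return (correctness digit, diagnostic digit) for a two-digit code."""
    return code // 10, code % 10

# Assumed rubric fragment for one extended-response item:
# first digit 2 = fully correct (2 points), 1 = partial (1 point),
# 7 = incorrect (0 points). Real rubrics are item-specific.
CORRECTNESS_POINTS = {2: 2, 1: 1, 7: 0}

for code in (21, 10, 76):
    level, diagnostic = split_code(code)
    print(code, "->", CORRECTNESS_POINTS[level], "point(s), diagnostic digit", diagnostic)
```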
To ensure reliable scoring procedures based on the TIMSS rubrics,
the International Study Center prepared detailed guides containing
the rubrics and explanations of how to implement them, together with
example student responses for the various rubric categories. These
guides, along with training packets containing extensive examples
of student responses for practice in applying the rubrics, were used
as a basis for intensive training in scoring the free-response items.
The training sessions were designed to help representatives of national
centers who would then be responsible for training personnel in their
countries to apply the two-digit codes reliably. In the United States,
the scoring was conducted by National Computer Systems (NCS) under
contract to Westat. To ensure that student responses from the Benchmarking
participants were scored in the same way as those from the U.S. national
sample, NCS had both sets of data scored at the same time and by the
same scoring staff.
To gather and document empirical information about the within-country
agreement among scorers, TIMSS arranged to have systematic subsamples
of at least 100 students' responses to each item coded independently
by two readers. Exhibit
A.7 shows the average and range of the within-country percent
of exact agreement between scorers on the free-response items in the
science test for 37 of the 38 countries. A high percentage of exact
agreement was observed, with an overall average of 95 percent across
the 37 countries. The TIMSS data from the reliability studies indicate
that scoring procedures were robust for the science items, especially
for the correctness score used for the analyses in this report. In
the United States, the average percent exact agreement was 94 percent
for the correctness score and 89 percent for the diagnostic score.
Since the Benchmarking data were combined with the
U.S. national TIMSS sample for scoring purposes, this high level of
scoring reliability applies to the Benchmarking data also.
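The agreement statistic itself is simple: the percentage of double-scored responses on which the two independent scorers assigned exactly the same code. A minimal sketch, with fabricated codes:

```python
# Percent of exact agreement between two independent scorers, as used
# in the within-country reliability studies. The code lists below are
# fabricated for illustration only.

def percent_exact_agreement(scorer_a: list[int], scorer_b: list[int]) -> float:
    assert len(scorer_a) == len(scorer_b)
    matches = sum(a == b for a, b in zip(scorer_a, scorer_b))
    return 100.0 * matches / len(scorer_a)

scorer_a = [21, 10, 76, 20, 11]
scorer_b = [21, 10, 70, 20, 11]  # one disagreement on the diagnostic digit
print(percent_exact_agreement(scorer_a, scorer_b))  # 80.0
```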
Test Reliability
Exhibit
A.8 displays the science test reliability coefficient for each
country and Benchmarking participant. This coefficient is the median
KR-20 reliability across the eight test booklets. Among countries,
median reliabilities ranged from 0.62 in Morocco to 0.86 in Singapore.
The international median, 0.80, is the median of the reliability coefficients
for all countries. Reliability coefficients among
Benchmarking participants were generally close to the international
median, ranging from 0.82 to 0.86 across states, and from 0.77 to
0.85 across districts and consortia.
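For reference, the KR-20 coefficient for a booklet of k dichotomously scored items can be computed from the item proportions-correct and the variance of students' total scores. A minimal unweighted sketch (the operational computation applied sampling weights and handled partial-credit items):

```python
# Minimal KR-20 sketch for one booklet of dichotomously scored items.
# responses[i][j] is 1 if student i answered item j correctly, else 0.
# Sampling weights and partial-credit items are ignored here.

def kr20(responses: list[list[int]]) -> float:
    n = len(responses)     # number of students
    k = len(responses[0])  # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    item_var = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion correct on item j
        item_var += p * (1 - p)
    return (k / (k - 1)) * (1 - item_var / var_total)

responses = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1]]
print(round(kr20(responses), 2))  # 0.69 for this toy data
```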
Data Processing
To ensure the availability of comparable, high-quality data for analysis,
TIMSS took rigorous quality control steps to create the international
database.(9) TIMSS prepared manuals and software
for countries to use in entering their data, so that the information
would be in a standardized international format before being forwarded
to the IEA Data Processing Center in Hamburg for creation of the international
database. Upon arrival at the Data Processing Center, the data underwent
an exhaustive cleaning process. This involved several iterative steps
and procedures designed to identify, document, and correct deviations
from the international instruments, file structures, and coding schemes.
The process also emphasized consistency of information within national
data sets and appropriate linking among the many student, teacher,
and school data files. In the United States, the creation of the data
files for both the Benchmarking participants and the U.S. national
TIMSS effort was the responsibility of Westat, working closely with
NCS. After the data files were checked carefully by Westat, they were
sent to the IEA Data Processing Center, where they underwent further
validity checks before being
forwarded to the International Study Center.
IRT Scaling and Data Analysis
The general approach to reporting the TIMSS achievement data was
based primarily on item response theory (IRT) scaling methods.(10)
The science results were summarized using a family of 2-parameter
and 3-parameter IRT models for dichotomously scored items (right or
wrong), and generalized partial credit models for items with 0, 1,
or 2 available score points. The IRT scaling method produces a score
by averaging the responses of each student to the items that he or
she took in a way that takes into account the difficulty and discriminating
power of each item. The methodology used in TIMSS includes refinements
that enable reliable scores to be produced even though individual
students responded to relatively small subsets of the total science
item pool. Achievement scales were produced for each of the six science
content areas (earth science, life science, physics, chemistry, environmental
and resource issues, and scientific inquiry and the nature of science),
as well as for science overall.
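As a point of reference, a standard form of the three-parameter logistic model for a multiple-choice item j gives the probability that student i with proficiency θ_i responds correctly as follows (the exact parameterization used operationally, including the scaling constant, is given in the scaling reference in footnote 10):

$$P(x_{ij} = 1 \mid \theta_i) = c_j + \frac{1 - c_j}{1 + \exp\left(-1.7\, a_j\,(\theta_i - b_j)\right)}$$

where a_j, b_j, and c_j are the item's discrimination, difficulty, and lower-asymptote (guessing) parameters; setting c_j = 0 gives the two-parameter form.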
The IRT methodology was preferred for developing comparable estimates
of performance for all students, since students answered different
test items depending upon which of the eight test booklets they received.
The IRT analysis provides a common scale on which performance can
be compared across countries. In addition to providing a basis for
estimating mean achievement, scale scores permit estimates of how
students within countries vary and provide information on percentiles
of performance. To provide a reliable measure of student achievement
in both 1999 and 1995, the overall science scale was calibrated using
students from the countries that participated in both years. When
all countries participating in 1995 at the eighth grade are treated
equally, the TIMSS scale average over those countries is 500 and the
standard deviation is 100. Since the countries varied in size, each
country was weighted to contribute equally to the mean and standard
deviation of the scale. The average and standard deviation of the
scale scores are arbitrary and do not affect scale interpretation.
When the metric of the scale had been established, students from the
countries that tested in 1999 but not 1995 were assigned scores on
the basis of the new scale. IRT scales were also created for each
of the six science content areas for the 1999 data. Students from
the Benchmarking samples were assigned scores on the overall science
scale as well as in each of the six science content areas using the
same item parameters and estimation procedures as for TIMSS internationally.
To allow more accurate estimation of summary statistics for student
subpopulations, the TIMSS scaling made use of plausible-value technology,
whereby five separate estimates of each student's score were generated on each scale, based on the student's responses to the items in the student's booklet and the student's background characteristics. The five score estimates are known as "plausible values," and the variability between them encapsulates the uncertainty inherent in the score estimation process.
Estimating Sampling Error
Because the statistics presented in this report
are estimates of performance based on samples of students, rather
than the values that could be calculated if every student in every
country or Benchmarking jurisdiction had answered every question,
it is important to have measures of the degree of uncertainty of the
estimates. The jackknife procedure was used to estimate the standard
error associated with each statistic presented in this report.(11)
The jackknife standard errors also include an error component due
to variation between the five plausible values generated for each
student. The use of confidence intervals, based on the standard errors,
provides a way to make inferences about the population means and proportions
in a manner that reflects the uncertainty associated with the sample
estimates. An estimated sample statistic plus or
minus two standard errors represents a 95 percent confidence interval
for the corresponding population result.
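A minimal sketch of how the two error components combine, assuming the usual rule of adding the imputation variance from the M = 5 plausible values to the jackknife sampling variance (the operational replication scheme is documented in the reference in footnote 11):

```python
import math

# Sketch: total standard error for a statistic estimated from five
# plausible values (PVs). sampling_var is the jackknife sampling
# variance; pv_estimates are the five per-PV estimates of the statistic.
# Assumed combination rule: total variance = sampling variance
# + (1 + 1/M) * (variance between the M plausible-value estimates).

def total_standard_error(sampling_var: float, pv_estimates: list[float]) -> float:
    m = len(pv_estimates)  # M = 5 in TIMSS
    mean = sum(pv_estimates) / m
    between = sum((x - mean) ** 2 for x in pv_estimates) / (m - 1)
    return math.sqrt(sampling_var + (1 + 1 / m) * between)

pv_estimates = [512.3, 511.8, 513.0, 512.1, 512.6]  # fabricated estimates
se = total_standard_error(4.2, pv_estimates)        # fabricated sampling variance
print(f"mean = {sum(pv_estimates) / 5:.1f}, 95% CI = +/- {2 * se:.1f}")
```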
Making Multiple Comparisons
This report makes extensive use of statistical hypothesis-testing
to provide a basis for evaluating the significance of differences
in percentages and in average achievement scores. Each separate test
follows the usual convention of holding to 0.05 the probability that
reported differences could be due to sampling variability alone. However,
in exhibits where statistical significance tests are reported, the
results of many tests are reported simultaneously, usually at least
one for each country and Benchmarking participant in the exhibit.
The significance tests in these exhibits are based on a Bonferroni procedure for multiple comparisons that holds to 0.05 the probability
of erroneously declaring a statistic (mean or percentage) for one
entity to be different from that for another entity. In the multiple
comparison charts (Exhibit
1.2 and those in Appendix B),
the Bonferroni procedure adjusts for the number of entities in the
chart, minus one. In exhibits where a country or
Benchmarking participant statistic is compared to the international
average, the adjustment is for the number of entities.(12)
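To illustrate, a sketch of the adjustment under a two-sided normal approximation: with k entities in a multiple-comparison chart, each pairwise test is held to a significance level of 0.05 divided by k - 1, which raises the critical value well above the unadjusted 1.96.

```python
from scipy.stats import norm

# Sketch of the Bonferroni adjustment described above (two-sided,
# normal approximation). Each test is held to alpha / (k - 1).

def bonferroni_critical_value(k_entities: int, alpha: float = 0.05) -> float:
    per_test_alpha = alpha / (k_entities - 1)
    return norm.ppf(1 - per_test_alpha / 2)

# A difference in means is declared significant only if it exceeds
# this critical value times the standard error of the difference.
print(round(bonferroni_critical_value(38), 2))  # roughly 3.2 for 38 entities
```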
Setting International Benchmarks of Student Achievement
International benchmarks of student achievement were computed at
each grade level for both mathematics and science. The benchmarks
are points in the weighted international distribution of achievement scores that separate the top 10 percent of students in the distribution, the top 25 percent of students, the top 50 percent, and the bottom 25 percent. The percentage of students in each country
and Benchmarking jurisdiction meeting or exceeding the international
benchmarks is reported. The benchmarks correspond to the 90th, 75th,
50th, and 25th percentiles of the international distribution of achievement.
When computing these percentiles, each country contributed as many
students to the distribution as there were students in the target
population in the country. That is, each country's contribution
to setting the international benchmarks was proportional to the estimated
population enrolled at the eighth grade.
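A minimal sketch of the weighted percentile computation described, where each student's weight reflects the number of students he or she represents, so that larger countries contribute proportionally more to the benchmark points (the data below are fabricated):

```python
# Weighted percentile sketch: each student's sampling weight reflects
# the population size he or she represents.

def weighted_percentile(scores: list[float], weights: list[float], pct: float) -> float:
    pairs = sorted(zip(scores, weights))
    threshold = pct / 100.0 * sum(weights)
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return score
    return pairs[-1][0]

scores = [430.0, 505.0, 520.0, 588.0, 616.0]     # fabricated scale scores
weights = [1200.0, 800.0, 1500.0, 900.0, 600.0]  # fabricated weights
print(weighted_percentile(scores, weights, 90))  # the Top 10% benchmark point
```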
In order to interpret the TIMSS scale scores and analyze achievement
at the international benchmarks, TIMSS conducted a scale anchoring
analysis to describe achievement of students at those four points
on the scale. Scale anchoring is a way of describing
students' performance at different points on a scale in terms
of what they know and can do. It involves a statistical component,
in which items that discriminate between successive
points on the scale are identified, and a judgmental component in
which subject-matter experts examine the items and generalize to students' knowledge and understandings.(13)
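As an illustration of the statistical component, a sketch under assumed criteria (the operational criteria are spelled out in the reference in footnote 13): an item might be said to anchor at a benchmark if most students at that point answer it correctly while students at the next lower benchmark do not.

```python
# Sketch of the statistical step of scale anchoring. The 65/50 percent
# thresholds are assumptions for illustration; the operational criteria
# are given in the cited reference.

def anchors_at_benchmark(p_at_benchmark: float, p_at_next_lower: float) -> bool:
    return p_at_benchmark >= 0.65 and p_at_next_lower < 0.50

# item: (proportion correct at one benchmark, at the next lower benchmark)
items = {"item_01": (0.78, 0.42), "item_02": (0.70, 0.61), "item_03": (0.55, 0.30)}
for name, (p_hi, p_lo) in items.items():
    print(name, "anchors:", anchors_at_benchmark(p_hi, p_lo))
```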
Science Curriculum Questionnaire
In an effort to collect information about the content of the intended
curriculum in science, TIMSS asked National Research Coordinators
and Coordinators from the Benchmarking jurisdictions to complete a
questionnaire about the structure, organization, and content coverage
of their curricula. Coordinators reviewed 42 science topics and reported
the percentage of their eighth-grade students for whom each topic
was intended in their curriculum. Although most topic descriptions
were used without modification, there were occasions when Coordinators
found it necessary to expand on or qualify the topic description to
describe their situation accurately. The country-specific adaptations
to the science curriculum questionnaire are presented in Exhibit
A.9. No adaptations to the list of topics were necessary for the
U.S. national version. Among Benchmarking participants, seven of the
states and none of the districts or consortia made adaptations, and
these are shown in Exhibit
A.10.
1. The TIMSS 1999 results for mathematics and science, respectively, are reported in Mullis, I.V.S., Martin, M.O., Gonzalez, E.J., Gregory, K.D., Garden, R.A., O'Connor, K.M., Chrostowski, S.J., and Smith, T.A. (2000), TIMSS 1999 International Mathematics Report: Findings from IEA's Repeat of the Third International Mathematics and Science Study at the Eighth Grade, Chestnut Hill, MA: Boston College; and in Martin, M.O., Mullis, I.V.S., Gonzalez, E.J., Gregory, K.D., Smith, T.A., Chrostowski, S.J., Garden, R.A., and O'Connor, K.M. (2000), TIMSS 1999 International Science Report: Findings from IEA's Repeat of the Third International Mathematics and Science Study at the Eighth Grade, Chestnut Hill, MA: Boston College.
2. The complete TIMSS curriculum frameworks can be found in Robitaille, D.F., et al. (1993), TIMSS Monograph No. 1: Curriculum Frameworks for Mathematics and Science, Vancouver, BC: Pacific Educational Press.
3. For a full discussion of the TIMSS 1999 test development effort, please see Garden, R.A., and Smith, T.A. (2000), "TIMSS Test Development" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
4. The 1999 TIMSS test design is identical to the design for 1995, which is fully documented in Adams, R., and Gonzalez, E. (1996), "TIMSS Test Design" in M.O. Martin and D.L. Kelly (eds.), Third International Mathematics and Science Study Technical Report, Volume I, Chestnut Hill, MA: Boston College.
5. More details about the translation verification procedures can be found in O'Connor, K., and Malak, B. (2000), "Translation and Cultural Adaptation of the TIMSS Instruments" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
6. The sample design for TIMSS is described in detail in Foy, P., and Joncas, M. (2000), "TIMSS Sample Design" in M.O. Martin, K.D. Gregory, and S.E. Stemler (eds.), TIMSS 1999 Technical Report, Chestnut Hill, MA: Boston College. Sampling for the Benchmarking project is described in Fowler, J., Rizzo, L., and Rust, K. (2001), "TIMSS Benchmarking Sampling Design and Implementation" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
7. Steps taken to ensure high-quality data collection in TIMSS internationally are described in detail in O'Connor, K., and Stemler, S. (2000), "Quality Control in the TIMSS Data Collection" in M.O. Martin, K.D. Gregory, and S.E. Stemler (eds.), TIMSS 1999 Technical Report, Chestnut Hill, MA: Boston College.
8. Quality control measures for the Benchmarking project are described in O'Connor, K., and Stemler, S. (2001), "Quality Control in the TIMSS Benchmarking Data Collection" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
9. These steps are detailed in Hastedt, D., and Gonzalez, E. (2000), "Data Management and Database Construction" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
10. For a detailed description of the TIMSS scaling, see Yamamoto, K., and Kulick, E. (2000), "Scaling Methods and Procedures for the TIMSS Mathematics and Science Scales" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
11. Procedures for computing jackknifed standard errors are presented in Gonzalez, E., and Foy, P. (2000), "Estimation of Sampling Variance" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
12. The application of the Bonferroni procedures is described in Gonzalez, E., and Gregory, K. (2000), "Reporting Student Achievement in Mathematics and Science" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
13. The scale anchoring procedure is described fully in Gregory, K., and Mullis, I. (2000), "Describing International Benchmarks of Student Achievement" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College. An application of the procedure to the 1995 TIMSS data may be found in Smith, T.A., Martin, M.O., Mullis, I.V.S., and Kelly, D.L. (2000), Profiles of Student Achievement in Science at the TIMSS International Benchmarks: U.S. Performance and Standards in an International Context, Chestnut Hill, MA: Boston College.