Week 8 DQ1
What factors are most important when synthesizing psychological test results into comprehensive reports? What is the best method of providing feedback on the assessment?

The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) should be required reading for this course AND a resource on the desk of all practitioners. The Standards have a chapter devoted to providing feedback on assessments. Although this seems simple in theory, in practice how do we provide feedback from personality tests to applicants? For example, what feedback would you provide to someone who achieved a low score on Emotional Stability (ES) and Agreeableness (A)?
Moreover, how would you address the question of “how do I improve my scores on ES and A” from an applicant?

ORIGINAL RESEARCH
published: 19 March 2019
doi: 10.3389/feduc.2019.00020

Edited by: Christopher Charles Deneen, Royal Melbourne Institute of Technology (RMIT University), Australia

Reviewed by: Peter Nyström, University of Gothenburg, Sweden; Cassim Munshi, Nanyang Technological University, Singapore

*Correspondence: Mary Roduta Roberts

Specialty section: This article was submitted to Assessment, Testing and Applied Measurement, a section of the journal Frontiers in Education

Received: 11 October 2018; Accepted: 25 February 2019; Published: 19 March 2019

Citation: Roduta Roberts M and Gotch CM (2019) Development and Examination of a Tool to Assess Score Report Quality. Front. Educ. 4:20. doi: 10.3389/feduc.2019.00020

Development and Examination of a
Tool to Assess Score Report Quality
Mary Roduta Roberts1* and Chad M. Gotch2

1 Department of Occupational Therapy, University of Alberta, Edmonton, AB, Canada; 2 Learning and Performance Research Center, Washington State University, Pullman, WA, United States

The need for quality in score reporting practices is represented in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). The purpose of this study was to introduce a ratings-based instrument to assess the quality of score reports and examine the reliability of scores obtained. Quality criteria were derived from best practices published within the literature (Hambleton and Zenisky, 2013). The rating scale was used to assess a sample of 40 English-language individual student score reports for K-12 accountability tests representing 42 states and five provinces in the United States and Canada. A two-facet generalizability study (i.e., sr x d x r) was completed with an overall reliability coefficient of G = 0.78. Application of the rating scale may provide a means to support empirical study of relationships between score report quality and stakeholder outcomes including interpretation, use, and impact.

Keywords: score report, communication, generalizability theory, interrater reliability, accountability testing

INTRODUCTION

Score reporting holds a unique vantage within the enterprise of large-scale testing. A score report
serves as the primary interface between the test developer and stakeholders. The vast resources
devoted to test development, administration, and security, as well as in the development of
educational accountability policy in many large-scale testing contexts, are all funneled into one
medium. From there, stakeholders (e.g., students, teachers, school districts, communities) assess
themselves, assess their educational experiences, and may respond in some way to the information
that has been communicated to them. Such response may be personal, in the form of maintaining
or changing study habits, or it may be broader in focus, with action taken toward school leadership
or living conditions in the community.

The need for quality in score reporting practices is represented in the Standards for Educational
and Psychological Testing (Ryan, 2006; American Educational Research Association, American
Psychological Association, & National Council on Measurement in Education, 2014; Zenisky and
Hambleton, 2015). In addition to the centrality of score reporting to the appropriate score
interpretations and uses discussed under the headings of Validity and Fairness, for example, the
topic of score reporting is taken on directly in Standards 6.0 and 6.10–6.13. Across these standards,
score reports are clearly identified as essential to the support of valid interpretations and uses
and the minimization of potential negative consequences. An accurate score interpretation or
defensible use of a score is dependent upon the quality of communication of examinee performance
(O'Leary et al., 2017a).

The centrality of score reporting to the testing enterprise
warrants focused and rigorous scholarly attention to the
development and functioning of reports. To date, much has been
written about the process of report development (e.g., Zapata-
Rivera et al., 2012; Zenisky and Hambleton, 2012). Similarly,
guidelines and demonstrations of report design have proliferated
(e.g., Goodman and Hambleton, 2004; Roduta Roberts and Gierl,
2010; Hambleton and Zenisky, 2013; Zenisky and Hambleton,
2015). Related to design considerations, a strong focus in the
literature has revolved around when it is appropriate to report
subscale scores (e.g., Sinharay, 2010; Lyrén, 2012; Feinberg and
Wainer, 2014a,b).

The most prominent set of score reporting guidelines was
introduced in 2013 by Hambleton and Zenisky, and restated
and further articulated 2 years later (Zenisky and Hambleton,
2015). These guidelines consider score reports as the culminating
element of the test development process. Hambleton and Zenisky
foregrounded connections between score reports and validity
within a testing system. The authors organized guidelines into
domains (which we will discuss in greater detail below) that
addressed such concerns as needs assessments, design, language,
and interpretive supports. Zenisky and Hambleton (2015) state
that the guidelines reflect "considerations gathered from practical
reporting experience, the testing literature and findings from
the literature on communication of statistical results outside
the domain of psychometrics" (p. 586). A key characteristic of
the guidelines is that they assume the testing context strongly
influences what content and form of presentation makes a report
useful, understandable, and accessible (i.e., good). Therefore,
the guidelines are presented generally and are intended to be
malleable to fit a wide variety of score reports.

An implicit assumption in score reporting guidelines is
that score reports developed using best practices will be of
a higher quality, which in turn will facilitate more accurate
interpretations and appropriate uses. The current research base,
however, cannot address with confidence the assumption that
higher quality reports relate to more positive outcomes. A
recent review of the literature (Gotch and Roduta Roberts,
2018) documents fundamental issues in score report research,
among them reliance upon small samples bounded within a
particular context. Additionally, approaches to assessing score
report quality have been indirect, often relying on qualitative
user feedback. While published guidelines may be logical and
intuitive, there exists minimal empirical support for score
reporting practices grounded in such guidelines. Given the state
of the current literature base, there is opportunity to strengthen
recommendations for sound practice and to explore unexpected
dynamics that may occur when score reports are sent into the real
world. Indeed, the field has called for more empirical work and for
feedback loops (Zenisky, 2015).

In this study, as a step toward closing this important gap in
the literature, we introduce and examine a rating-scale tool to
assess the quality of score reports from traditional large-scale
testing systems with typically more summative aims, and to
facilitate understanding of how score report quality functions
within a larger network of inputs and outcomes. We envision the
primary application of this tool to be in research settings, where

variables related to report design, population characteristics,
interpretations, uses, and other outcomes are of interest. The
purpose of the instrument is congruent with recent work by
O'Leary et al. (2017b).

To complement our introduction of the rating-scale tool,
we present an initial analysis in what will need to be a
larger validation effort. In the intended research applications,
reliability of ratings obtained from the tool would be important
to drawing sound conclusions. Therefore, we present an
examination of inter-rater reliability through a generalizability
theory framework. We situate this work within a validity
argument (Kane, 2013), providing necessary but not sufficient
support of a generalization claim that ratings obtained from a
single assessment align with a universe of hypothetical ratings
obtained in the research setting. That is, we examine the extent
to which report ratings can be interpreted as being reproducible
across raters (Cook and Hatala, 2016). In grounding the tool in a
thoroughly documented and widely acknowledged set of design
guidelines, we also support a domain description claim that
the rating-scale tool reflects salient report design considerations
suitable for research on score report functioning. The overall
goal of the proposal and examination of the rating scale tool
is to provide a means for strengthening the robustness of
empirical research in score reporting, an important component
of testing programs.

A LOGIC MODEL FOR SCORE REPORTING

To provide a roadmap for score report research, we present a
basic logic model that represents the connection between score
report development and outcomes of interest (Figure 1). This
logic model is not intended to replace the report development
models already proposed (Zapata-Rivera et al., 2012; Zenisky and
Hambleton, 2012). Rather, it takes on a different perspective, one
oriented toward empirical research. The model first considers
factors related to score report development. Considerations
around content, language, and design reflect the criteria set
forth by Hambleton and Zenisky (2013), and influence the
generation of a product: the score reporting document or
interface (in the case of a computer- or web-based reporting
environment). From this report, proximal outcomes include
stakeholder interpretations and uses. Stemming from proximal
outcomes or directly from the presentation of the reports,
themselves, are the distal outcomes. These include test-taker
outcomes, stakeholder capacity, and consequences of the testing
program. For example, did the test-taker reach improved
levels of achievement based on actions taken in response to
the score report? Did the test-taker or the test-taker's parents
use the score report to engage in partnership with the local
school district officials to influence the educational system?
What, if any, are the unintended effects of receiving the
score report as part of the assessment experience? Such
questions characterize a focus on the distal outcomes of
score reporting.

FIGURE 1 | A logic model representing the production of score reports and associated outcomes of interest.

This logic model can serve to characterize or plan empirical
score reporting research. For example, a study of the content of
reports of English language proficiency assessments (Faulkner-
Bond et al., 2013) focuses on just one element of the logic
model, specifically the centerpiece of score reporting artifacts
and practices. In contrast, van der Kleij et al. (2014) examined
the interpretations formed from alternative reports developed
within a single framework (i.e., design research approach). Thus,
their study addressed a relationship between elements in the logic
model, specifically score reporting artifacts and the proximal
outcome of user interpretations. Characterizing empirical score
report research using the logic model can also benefit the field
by facilitating identification of unexplored research questions.
The rating scale introduced in this paper further benefits the
field by providing a means to gather quantitative data for the
central element. Essentially, the rating scale allows the treatment
of score report quality to move from categorical (e.g., alternative
forms of a report) to a finer-grained level of measurement
(e.g., assigned scores). Leveraging the use of domains derived
from the Hambleton-Zenisky framework, scores from the rating
scale allow the field to address whether different ways of
considering content, language, and design differ in the quality
of the report they generate. Such an inquiry could be extended
to examine relationships to proximal and distal outcomes. Thus
the empirical studies by Faulkner-Bond et al. (2013) and van der
Kleij et al. (2014) may also contribute toward the development
of score reporting theory. Ideally, such a theory of score
reporting would not only describe principles for score report
development and evaluation but also facilitate explanation for
how design considerations may impact broader educational aims.
The combination of the logic model and rating scale instrument,
therefore, empowers educational researchers to answer the basic
question of whether better score reports do indeed lead to better
outcomes for key stakeholders.

Model of Communication
Any consideration of score report quality must be acknowledged
as rooted in a particular perspective. The score report rating
scale presented in this paper calls upon published guidelines

that come from a test development perspective and follow
a tradition embodied in the Standards for Educational and
Psychological Testing. Further, we have situated the rating scale
in the specified logic model, which posits causal connections
between report content and outcomes such as interpretation.
Therefore, we suggest the rating scale best represents the
cybernetics tradition within communications (Craig, 1999; Craig
and Muller, 2007; Gotch and Roduta Roberts, 2018). The
cybernetics communication lens, or information processing layer
as described by Behrens et al. (2013), can be described as focused
on the transmission of technical information with the goal of
minimal distortion upon reaching the receiver. Applied to score
reporting, quality of the score reports is judged by the extent to
which the content and design of the reports facilitate translation
and understanding.

METHODS

Sample
For the investigation of the reliability of ratings obtained from
the rating scale instrument, a sample of 40 English-language
individual student score reports for K-12 accountability tests
representing 42 states and five provinces in the United States
and Canada was examined. Reports were in electronic form,
obtained from state and province department of education
websites during the 2013–2014 school year. The score reports
of interest for this study were those associated with tests
administered to children in grades 3 through 8, covering at
a minimum the topics of mathematics, reading, and writing.
In the United States, such tests served No Child Left Behind
(United States Congress, 2002) accountability mandates. If
multiple sample reports were available across this grade span, the
report closest to grade five was obtained as the report of record,
as this grade level represented a midpoint in the testing years
and was heavily represented in the sample reports. Though not
a census of current reports disseminated in the two countries,
we found the sample to provide sufficient variability to allow for
examination of reliability in the K-12 accountability testing genre
of score reporting.

Instrument
We developed an instrument to assess score report quality
based on the review framework set forth by Hambleton
and Zenisky (2013). This framework, which marks the
most comprehensive attempt to date to develop a set of
guidelines for score report review, captures the quality of
reports within eight areas: (1) needs assessment, (2) content
(report introduction and description), (3) content (scores and
performance levels), (4) content (other performance indicators),
(5) content (other), (6) language, (7) design, and (8) interpretive
guides and ancillary materials. Within each dimension of
the guidelines, specific review questions are asked, such as
"Are concrete examples provided for the use of the test score
information?" and "Is a highlight or summary section included
to communicate the key score information?" By grounding
in this framework, the instrument we advance in this study
reflects "best practices in test development, experiences in score
report design, and knowledge of the score reporting literature"
(Hambleton and Zenisky, 2013, p. 485).

We intended for the instrument to be used by score report
researchers across a wide range of development (i.e., graduate
students to veteran researchers) who might want to assess
score report quality or engage stakeholders who may not
possess specialized knowledge in educational measurement. The
adaptation of the guidelines for the instrument contains little
jargon, and thus lends itself well to wide accessibility. Further,
such accessibility allows for more democratic investigations
of score report quality without necessarily privileging the
perspective of one group (e.g., testing professionals) over another
(e.g., parents).

Given this orientation, we chose to focus on five core
domains: (1) Report Introduction and Description, (2) Scores
and Performance Levels, (3) Supporting Material, (4) Language,
and (5) Design, plus a supplemental sixth domain, Interpretive
Guides and Ancillary Materials (see Appendix A). We did not
include a domain for the needs assessment area because users
of the rating scale, particularly those not directly involved with
the test and score report development process, may not have
access to such evidence. Areas 4 and 5 in the Hambleton-
Zenisky framework were collapsed into a single domain, as both
were concerned with support of the communication of scores
and performance levels. Finally, the Interpretive Guides and
Ancillary Materials domain was designated as supplemental. We
found such supplemental material to vary greatly across testing
programs. Sometimes interpretive information was provided in
a concise (i.e., fewer than four pages), stand-alone document
that presumably accompanied the score report on its delivery
to students and parents. In other cases, interpretive guidance
was contained within a lengthy document (i.e., >80 pages) that
included detailed information on such matters as the schedule
of test administration across grade levels, performance level
descriptors, and revisions to the testing system. Interpretive
guides may have also covered multiple types of score reports
(e.g., individual student, school, district) and multiple testing
programs (e.g., standard and modified). In the case of the
longer, more thorough supplemental guides, in practice a parent
or student would need to actively search a department of
education website for such guidance or type in a URL provided
on the score report document. Further, it is possible brief
interpretive guides may accompany delivery of the score reports
to students and parents, but not be accessible on the department
of education website.

The assignment of ratings in the instrument is given at
the domain level. We reasoned this was an appropriate grain
size given the purpose of the instrument. The intent of the
rating scale is not to diagnose discrete shortcomings in need
of remedy, but rather to enable efficient quantification of score
report quality. We also note that research in rater cognition has
documented that raters may apply both holistic and analytic
processes regardless of the grain size of the intended rating
(Crisp, 2012; Suto, 2012).

We decided to develop the instrument in a rating scale
form. While a true rubric form (Brookhart, 2013) might provide
more robust descriptions of what quality looks like within each

domain, these descriptions could vary greatly across testing
contexts (e.g., accountability, progress monitoring, certification).
Indeed, the guidelines themselves contain much conditional
language about score report content (e.g., "If present, are reports
from recent and relevant tests explained"). Therefore, we opted
for a rating scale design where the rater assigns a score to
reflect the extent to which a set of criteria were met by the
report. In the scoring process, a report may earn 0 points for
failing to meet any criterion sufficiently and up to three points
for exemplifying excellent quality in the domain. Therefore, a
report may be assigned a total score between 0 and 15 for the
five core domains. To develop the criteria within each domain,
we re-worded the Hambleton-Zenisky framework elements to
function as statements rather than questions, and eliminated
some redundancy to ease rater burden. We also eliminated
reference to procedures for translating the report into another
language. Similar to our rationale for not including a needs
assessment domain, language translation/adaptation procedures
are not necessarily documented in the report or publicly
available through ancillary materials. This issue has been reported
in another application of the Hambleton-Zenisky framework
(Gándara and Rick, 2017).
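To make the scoring scheme concrete, the sketch below represents the five core domains and the 0-3 domain-level ratings described above; it is not part of the published instrument, and the function name and example ratings are hypothetical.

```python
# Hypothetical sketch of the rating-scale scoring scheme described above:
# each of the five core domains receives an integer rating from 0 to 3,
# and the core total is their sum, ranging from 0 to 15.
CORE_DOMAINS = [
    "Report Introduction and Description",
    "Scores and Performance Levels",
    "Supporting Material",
    "Language",
    "Design",
]

def core_total(ratings: dict) -> int:
    """Sum domain ratings after checking they are on the 0-3 scale."""
    for domain in CORE_DOMAINS:
        value = ratings[domain]
        if not 0 <= value <= 3:
            raise ValueError(f"Rating for {domain!r} must be between 0 and 3")
    return sum(ratings[d] for d in CORE_DOMAINS)

# Example ratings for one report from one rater (values are illustrative only).
example = {
    "Report Introduction and Description": 2,
    "Scores and Performance Levels": 1,
    "Supporting Material": 1,
    "Language": 3,
    "Design": 2,
}
print(core_total(example))  # 9, out of a possible 15
```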

Procedure
Four advanced graduate students in educational measurement
were recruited to provide ratings in the core domains. These
raters were briefed in a group setting on score report literature
and the purpose of the present study by one of the authors.
The rating scale was presented to them, and they were given
an opportunity to ask questions about rating scale content to
clarify their understandings and interpretations. Then, raters
were assigned three sample score reports to score. These reports
were not a part of the study sample, and represented a previous
version of one state's Grade 3 accountability test, a diagnostic
algebra test, and a high school end-of-course examination. The
author then reconvened the group to share ratings and further
discuss rating scale content. The aim of this discussion was not
to obtain consensus on ratings, but rather to assure rating scale
content remained well understood in application and to address
new questions about how elements observed in the score reports
corresponded to rating scale criteria. For example, one point of
conversation concerned what counted as concrete examples for
test score use. The group also discussed how to provide a domain-
level rating when the criteria the domain comprises were met to
varying extents, which led to a review of the descriptors for each
score point. Once the group of raters felt comfortable with their
understanding, the study sample of score reports was provided,
and raters were given 3 weeks to complete their reviews. All
ratings were completed independently, and submitted to one of
the authors, who compiled the data for analysis.

Data Analysis
Report Ratings

We summarized ratings through a series of descriptive statistic
calculations. First, we summarized ratings within each rater-
domain combination. Then we summarized ratings for each
domain, collapsed across rater. Acknowledging that, in practice,
users may wish to have a total score produced by the rating scale,
we summed domain scores by rater and calculated descriptive
statistics, and then repeated the process with ratings
collapsed across rater. To gain an additional perspective on the
ratings assigned, we calculated Spearman rank-order correlations
between each of the domains for each rater.
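For illustration, the summaries described in this paragraph could be computed as in the sketch below. This is not the authors' code; the long-format column names (report, rater, domain, rating) and the simulated ratings are assumptions made for the example.

```python
# Minimal sketch of the descriptive summaries and Spearman correlations
# described above, using simulated ratings in long format.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
domains = ["Introduction", "Scores", "Supporting", "Language", "Design"]
long = pd.DataFrame(
    [{"report": rep, "rater": rater, "domain": dom, "rating": int(rng.integers(0, 4))}
     for rep in range(40) for rater in range(4) for dom in domains]
)

# Mean and SD within each rater-domain combination, then collapsed across raters.
by_rater_domain = long.groupby(["rater", "domain"])["rating"].agg(["mean", "std"])
by_domain = long.groupby("domain")["rating"].agg(["mean", "std"])
print(by_rater_domain.round(2))
print(by_domain.round(2))

# Total score per report and rater (sum over the five core domains).
totals = long.groupby(["report", "rater"])["rating"].sum().unstack("rater")
print(totals.describe().round(2))

# Spearman rank-order correlations between domains, computed separately per rater.
for rater, chunk in long.groupby("rater"):
    wide = chunk.pivot(index="report", columns="domain", values="rating")
    rho, _ = spearmanr(wide[domains])
    print(f"Rater {rater}:")
    print(pd.DataFrame(rho, index=domains, columns=domains).round(2))
```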

Reliability

The rating responsibilities of the instrument employed in this
study require judgments about the quality of score report
elements across various domains. In the present study, score
report quality was assessed by multiple individuals. As a first step
toward documenting the consistency of ratings obtained by the
rating scale, within an argument-based approach to validation,
we undertook a study of score reliability. Generalizability Theory
provided a valuable analytical framework, as it allowed for
the decomposition of independent sources of error variance
(Brennan, 1992). A two-facet generalizability study (G-study)
was conducted with score reports fully crossed with rating
scale domains and raters (i.e., sr x d x r) to assess reliability
of rater scores. A peer-reviewed, publicly available SAS macro
(Mushquash and O'Connor, 2006) was used to estimate variance
components. The R packages gtheory and boot were then used to
replicate the obtained variance components and obtain standard
errors for these components, using a bootstrap technique (Tong
and Brennan, 2007). The variance component estimates were
subsequently used in a decision study (D-study) using the
aforementioned SAS macro to estimate the effects of the number
of raters employed on reliability. Outcomes from the D-study
could be used by future researchers to plan for sufficient
personnel resources. In this study, we adopted a criterion of 0.80
for acceptable reliability, which is consistent with the intended
research-oriented application of the instrument (Nunnally and
Bernstein, 1994).
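The analysis itself was run with the SAS macro and R packages named above. Purely as an illustration of the underlying computation, the Python sketch below estimates variance components for a fully crossed sr x d x r design with one rating per cell from ANOVA mean squares, and forms a relative generalizability coefficient from them. The function names and simulated data are assumptions, and the relative-error term assumes both facets are treated as random.

```python
import numpy as np

def gstudy_two_facet(X):
    """Variance components for a fully crossed report x domain x rater design,
    one rating per cell. X has shape (n_reports, n_domains, n_raters)."""
    n_p, n_d, n_r = X.shape
    grand = X.mean()
    m_p, m_d, m_r = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
    m_pd, m_pr, m_dr = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    # Mean squares (sum of squares / degrees of freedom) for each effect.
    ms_p = n_d * n_r * np.sum((m_p - grand) ** 2) / (n_p - 1)
    ms_d = n_p * n_r * np.sum((m_d - grand) ** 2) / (n_d - 1)
    ms_r = n_p * n_d * np.sum((m_r - grand) ** 2) / (n_r - 1)
    ms_pd = n_r * np.sum((m_pd - m_p[:, None] - m_d[None, :] + grand) ** 2) / ((n_p - 1) * (n_d - 1))
    ms_pr = n_d * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2) / ((n_p - 1) * (n_r - 1))
    ms_dr = n_p * np.sum((m_dr - m_d[:, None] - m_r[None, :] + grand) ** 2) / ((n_d - 1) * (n_r - 1))
    resid = (X - m_pd[:, :, None] - m_pr[:, None, :] - m_dr[None, :, :]
             + m_p[:, None, None] + m_d[None, :, None] + m_r[None, None, :] - grand)
    ms_pdr = np.sum(resid ** 2) / ((n_p - 1) * (n_d - 1) * (n_r - 1))

    # Expected-mean-square equations for a fully random model; negative
    # estimates are truncated at zero.
    v = {"pdr,e": ms_pdr}
    v["pd"] = max((ms_pd - ms_pdr) / n_r, 0.0)
    v["pr"] = max((ms_pr - ms_pdr) / n_d, 0.0)
    v["dr"] = max((ms_dr - ms_pdr) / n_p, 0.0)
    v["p"] = max((ms_p - ms_pd - ms_pr + ms_pdr) / (n_d * n_r), 0.0)
    v["d"] = max((ms_d - ms_pd - ms_dr + ms_pdr) / (n_p * n_r), 0.0)
    v["r"] = max((ms_r - ms_pr - ms_dr + ms_pdr) / (n_p * n_d), 0.0)
    return v

def relative_g(v, n_d, n_r):
    """Relative G coefficient with score reports as the object of measurement."""
    rel_error = v["pd"] / n_d + v["pr"] / n_r + v["pdr,e"] / (n_d * n_r)
    return v["p"] / (v["p"] + rel_error)

# Simulated 0-3 ratings: 40 reports x 5 domains x 4 raters (illustration only).
rng = np.random.default_rng(0)
ratings = rng.integers(0, 4, size=(40, 5, 4)).astype(float)
components = gstudy_two_facet(ratings)
print(components)
print(relative_g(components, n_d=5, n_r=4))
```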

RESULTS

Distribution of Ratings
Average ratings within domain and rater ranged from 1.13 to
2.53, indicating a moderate to fairly high level of adherence
to the criteria specified by Hambleton and Zenisky (Table 1).
The standard deviations around these averages were generally in
the range of about 0.75 to 1.00, suggesting the rating scale was
sensitive to variability in the quality of the score reports. The

Language domain saw the highest ratings given with a combined
average of 2.45 (sd = 0.76), and Supporting Material was lowest
at 1.28 (sd = 0.98). The ranking of each domain was consistent
across all raters, and the average ratings provided were all within
half a point, except for Introduction and Description and Supporting
Material, where the ranges of average scores were 0.55 and
0.57, respectively. Figure 2 shows the shapes of distributions
of ratings varied across domains when collapsed across raters.
The distribution of total scores signals a potential ceiling effect
and negative skew. Correlations between domain ratings for
each rater were generally moderate (i.e., 0.35–0.65; Table 2). In
some cases the correlations were consistent across raters (e.g.,
Scores and Performance Levels with Supporting Material), while
in other cases (e.g., Introduction and Description with Design)
the coefficients were more spread out. Rater 1 had the lowest
correlations between the Introduction and Description domain
and other domains. Rater 4 also demonstrated some of the
lowest correlations, but these patterns did not hold across all
combinations of domains.

Reliability
The variance components and the proportions of variance from
the G-study analysis are presented in Table 3. The proportions of
variance associated with score reports, domain, and raters were
0.25, 0.19, and 0.02, respectively. For optimal measurement, the
proportion of variance associated with the object of measurement
(i.e., score reports) should be high relative to the proportion of
variance attributable to the other facets. The variance component
for raters was relatively small, indicating that the ratings did not
vary substantially between raters (i.e., no rater was particularly
harsh or lenient in their application of the criteria, relative
to other raters). The variance component for rating scale
domain indicates that there was variability in how these
aspects of the score reports were captured; that is, the score reports
performed differently across the five domains. A relatively large proportion of variance
associated with the rating scale domains can be interpreted as
a positive finding to the extent that different aspects of the
score reports can be meaningfully assessed and differences in
quality can be captured. Caution should be exercised where
interpretations of score report quality rest on an assumption
of unidimensionality.

TABLE 1 | Mean ratings by domain and rater.

Domain                          Rater 1       Rater 2       Rater 3       Rater 4       All raters combined
Introduction and description    1.90 (0.80)   1.73 (1.09)   2.28 (0.89)   1.85 (1.04)   1.94 (0.99)
Scores and performance levels   1.40 (0.80)   1.43 (0.92)   1.75 (0.73)   1.55 (0.77)   1.53 (0.82)
Supporting material             1.13 (0.84)   1.03 (1.01)   1.60 (0.89)   1.38 (1.04)   1.28 (0.98)
Language                        2.40 (0.66)   2.45 (0.74)   2.53 (0.74)   2.43 (0.86)   2.45 (0.76)
Design                          1.90 (0.73)   1.60 (0.94)   1.95 (0.95)   1.65 (0.85)   1.78 (0.89)
Total                           8.73 (2.85)   8.23 (3.85)   10.10 (3.28)  8.85 (3.39)   8.98 (4.43)

Standard deviations are presented in parentheses.

The proportions of variance attributable to the two-way
interactions ranged from 0.01 (d x r) to 0.10 (sr x d) to 0.14
(sr x r). The low proportion of variance associated with the two-way
interaction between rating scale domain and rater suggests there
was little inconsistency in the average ratings between raters from
one rating scale dimension to the next. The two-way interactions
involving the object of measurement (i.e., score report) account
for 0.24 of the total variance. This finding can be interpreted
as the rankings of score reports varying to some extent by
rating scale domain and rater. Finally, a large proportion of
variance (0.30) was associated with the confounded three-way interaction and error
term. This finding suggests potential rater idiosyncrasies and
other systematic influences (e.g., unidentified facets, confounded
effects) on the assessment of score report quality that have not
yet been accounted for. As a measure of score reliability, a
generalizability coefficient of G = 0.78 was obtained, just shy of
the 0.80 standard.

The results of the D-study investigating the effect of varying
the number of raters on the generalizability coefficient are
presented in Table 4. As expected, score reliability improves with
an increase in the number of raters, with the largest improvement
seen from employing one rater to employing two raters. The gains
in reliability diminish after employing more than three raters; five
raters are suggested to achieve a reliability >0.80.
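As a rough check on these figures (not a reproduction of the authors' D-study), substituting the variance proportions reported above (0.25 for score reports, 0.10 for sr x d, 0.14 for sr x r, and 0.30 for the residual) into the standard relative-error formula, with the five domains held at their observed number, yields approximately G = 0.53, 0.68, 0.74, 0.78, 0.81, and 0.82 for one through six raters, consistent with the pattern described. The snippet below assumes a fully random design and treats the reported proportions as variance components, which is harmless here because the G coefficient is a ratio.

```python
# Back-of-the-envelope D-study projection using the variance proportions
# reported in the text as stand-ins for the variance components.
def projected_g(n_raters, n_domains=5, var_sr=0.25, var_srd=0.10, var_srr=0.14, var_resid=0.30):
    relative_error = var_srd / n_domains + var_srr / n_raters + var_resid / (n_domains * n_raters)
    return var_sr / (var_sr + relative_error)

for n in range(1, 7):
    print(n, round(projected_g(n), 3))
# Approximate output: 0.532, 0.676, 0.743, 0.781, 0.806, 0.824
```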

DISCUSSION

The purpose of this study was to introduce and begin to examine
a tool for assessing score report quality to advance scholarly
work on score reporting. A rating scale was developed based
on quality criteria derived from best practices published within
the literature. Although quality is n