
Systematic Review article

A Critical Review of Research on Student Self-Assessment

Heidi L. Andrade

  • Educational Psychology and Methodology, University at Albany, Albany, NY, United States

This article is a review of research on student self-assessment conducted largely between 2013 and 2018. The purpose of the review is to provide an updated overview of theory and research. The treatment of theory involves articulating a refined definition and operationalization of self-assessment. The review of 76 empirical studies offers a critical perspective on what has been investigated, including the relationship between self-assessment and achievement, consistency of self-assessment and others' assessments, student perceptions of self-assessment, and the association between self-assessment and self-regulated learning. An argument is made for less research on consistency and summative self-assessment, and more on the cognitive and affective mechanisms of formative self-assessment.

This review of research on student self-assessment expands on a review published as a chapter in the Cambridge Handbook of Instructional Feedback (Andrade, 2018, reprinted with permission). The timespan for the original review was January 2013 to October 2016. A great deal of research has been published on the subject since then, including at least two meta-analyses; hence this expanded review, in which I provide an updated overview of theory and research. The treatment of theory presented here involves articulating a refined definition and operationalization of self-assessment through a lens of feedback. My review of the growing body of empirical research offers a critical perspective, in the interest of provoking new investigations into neglected areas.

Defining and Operationalizing Student Self-Assessment

Without exception, reviews of self-assessment ( Sargeant, 2008 ; Brown and Harris, 2013 ; Panadero et al., 2016a ) call for clearer definitions: What is self-assessment, and what is not? This question is surprisingly difficult to answer, as the term self-assessment has been used to describe a diverse range of activities, such as assigning a happy or sad face to a story just told, estimating the number of correct answers on a math test, graphing scores for dart throwing, indicating understanding (or the lack thereof) of a science concept, using a rubric to identify strengths and weaknesses in one's persuasive essay, writing reflective journal entries, and so on. Each of those activities involves some kind of assessment of one's own functioning, but they are so different that distinctions among types of self-assessment are needed. I will draw those distinctions in terms of the purposes of self-assessment which, in turn, determine its features: a classic form-fits-function analysis.

What is Self-Assessment?

Brown and Harris (2013) defined self-assessment in the K-16 context as a “descriptive and evaluative act carried out by the student concerning his or her own work and academic abilities” (p. 368). Panadero et al. (2016a) defined it as a “wide variety of mechanisms and techniques through which students describe (i.e., assess) and possibly assign merit or worth to (i.e., evaluate) the qualities of their own learning processes and products” (p. 804). Referring to physicians, Epstein et al. (2008) defined “concurrent self-assessment” as “ongoing moment-to-moment self-monitoring” (p. 5). Self-monitoring “refers to the ability to notice our own actions, curiosity to examine the effects of those actions, and willingness to use those observations to improve behavior and thinking in the future” (p. 5). Taken together, these definitions include self-assessment of one's abilities, processes , and products —everything but the kitchen sink. This very broad conception might seem unwieldy, but it works because each object of assessment—competence, process, and product—is subject to the influence of feedback from oneself.

What is missing from each of these definitions, however, is the purpose of the act of self-assessment. Their authors might rightly point out that the purpose is implied, but a formal definition requires us to make it plain: Why do we ask students to self-assess? I have long held that self-assessment is feedback ( Andrade, 2010 ), and that the purpose of feedback is to inform adjustments to processes and products that deepen learning and enhance performance; hence the purpose of self-assessment is to generate feedback that promotes learning and improvements in performance. This learning-oriented purpose of self-assessment implies that it should be formative: if there is no opportunity for adjustment and correction, self-assessment is almost pointless.

Why Self-Assess?

Clarity about the purpose of self-assessment allows us to interpret what otherwise appear to be discordant findings from research, which has produced mixed results in terms of both the accuracy of students' self-assessments and their influence on learning and/or performance. I believe the source of the discord can be traced to the different ways in which self-assessment is carried out, such as whether it is summative or formative. This issue will be taken up again in the review of current research that follows this overview. For now, consider a study of the accuracy and validity of summative self-assessment in teacher education conducted by Tejeiro et al. (2012), which showed that students' self-assigned marks tended to be higher than marks given by professors. All 122 students in the study assigned themselves a grade at the end of their course, but half of the students were told that their self-assigned grade would count toward 5% of their final grade. In both groups, students' self-assessments were higher than grades given by professors, especially for students with “poorer results” (p. 791) and those for whom self-assessment counted toward the final grade. In the group that was told their self-assessments would count toward their final grade, no relationship was found between the professors' and the students' assessments. Tejeiro et al. concluded that, although students' and professors' assessments tended to be highly similar when self-assessment did not count toward final grades, overestimations increased dramatically when students' self-assessments did count. Interviews of students who self-assigned highly discrepant grades revealed (as you might guess) that they were motivated by the desire to obtain the highest possible grades.

Studies like Tejeiro et al.'s (2012) are interesting in terms of the information they provide about the relationship between consistency and honesty, but the purpose of the self-assessment, beyond addressing interesting research questions, is unclear. There is no feedback purpose. The same is true of another study of summative self-assessment of competence, in which elementary-school children took the Test of Narrative Language and then were asked to self-evaluate “how you did in making up stories today” by pointing to one of five pictures, from a “very happy face” (rating of five) to a “very sad face” (rating of one) (Kaderavek et al., 2004, p. 37). The usual results were reported: older children and good narrators were more accurate than younger children and poor narrators, and males tended to overestimate their ability more frequently.

Typical of clinical studies of accuracy in self-evaluation, this study rests on a definition and operationalization of self-assessment with no value in terms of instructional feedback. If those children were asked to rate their stories and then revise or, better yet, if they assessed their stories according to clear, developmentally appropriate criteria before revising, the value of their self-assessments in terms of instructional feedback would skyrocket. I speculate that their accuracy would too. In contrast, studies of formative self-assessment suggest that when the act of self-assessing is given a learning-oriented purpose, students' self-assessments are relatively consistent with those of external evaluators, including professors (Lopez and Kossack, 2007; Barney et al., 2012; Leach, 2012), teachers (Bol et al., 2012; Chang et al., 2012, 2013), researchers (Panadero and Romero, 2014; Fitzpatrick and Schulz, 2016), and expert medical assessors (Hawkins et al., 2012).

My commitment to keeping self-assessment formative is firm. However, Gavin Brown (personal communication, April 2011) reminded me that summative self-assessment exists and we cannot ignore it; any definition of self-assessment must acknowledge and distinguish between its formative and summative forms. Hence the taxonomy in Table 1, which depicts self-assessment as serving formative and/or summative purposes and as focused on competence, processes, and/or products.

Table 1. A taxonomy of self-assessment.

Fortunately, a formative view of self-assessment seems to be taking hold in various educational contexts. For instance, Sargeant (2008) noted that all seven authors in a special issue of the Journal of Continuing Education in the Health Professions “conceptualize self-assessment within a formative, educational perspective, and see it as an activity that draws upon both external and internal data, standards, and resources to inform and make decisions about one's performance” (p. 1). Sargeant also stresses the point that self-assessment should be guided by evaluative criteria: “Multiple external sources can and should inform self-assessment, perhaps most important among them performance standards” (p. 1). Now we are talking about the how of self-assessment, which demands an operationalization of self-assessment practice. Let us examine each object of self-assessment (competence, processes, and/or products) with an eye for what is assessed and why.

What is Self-Assessed?

Monitoring and self-assessing one's learning processes are practically synonymous with self-regulated learning (SRL), or at least with central components of it such as goal-setting, monitoring, and metacognition. Research on SRL has clearly shown that self-generated feedback on one's approach to learning is associated with academic gains (Zimmerman and Schunk, 2011). Self-assessment of products, such as papers and presentations, is the easiest to defend as feedback, especially when those self-assessments are grounded in explicit, relevant, evaluative criteria and followed by opportunities to relearn and/or revise (Andrade, 2010).

Including the self-assessment of competence in this definition is a little trickier. I hesitated to include it because of the risk of sneaking in global assessments of one's overall ability, self-esteem, and self-concept (“I'm good enough, I'm smart enough, and doggone it, people like me,” Franken, 1992 ), which do not seem relevant to a discussion of feedback in the context of learning. Research on global self-assessment, or self-perception, is popular in the medical education literature, but even there, scholars have begun to question its usefulness in terms of influencing learning and professional growth (e.g., see Sargeant et al., 2008 ). Eva and Regehr (2008) seem to agree in the following passage, which states the case in a way that makes it worthy of a long quotation:

Self-assessment is often (implicitly or otherwise) conceptualized as a personal, unguided reflection on performance for the purposes of generating an individually derived summary of one's own level of knowledge, skill, and understanding in a particular area. For example, this conceptualization would appear to be the only reasonable basis for studies that fit into what Colliver et al. (2005) has described as the “guess your grade” model of self-assessment research, the results of which form the core foundation for the recurring conclusion that self-assessment is generally poor. This unguided, internally generated construction of self-assessment stands in stark contrast to the model put forward by Boud (1999) , who argued that the phrase self-assessment should not imply an isolated or individualistic activity; it should commonly involve peers, teachers, and other sources of information. The conceptualization of self-assessment as enunciated in Boud's description would appear to involve a process by which one takes personal responsibility for looking outward, explicitly seeking feedback, and information from external sources, then using these externally generated sources of assessment data to direct performance improvements. In this construction, self-assessment is more of a pedagogical strategy than an ability to judge for oneself; it is a habit that one needs to acquire and enact rather than an ability that one needs to master (p. 15).

As in the K-16 context, self-assessment is coming to be seen as having as much or more value in terms of pedagogy as in terms of assessment (Silver et al., 2008; Brown and Harris, 2014). In the end, however, I decided that self-assessing one's competence to successfully learn a particular concept or complete a particular task (which sounds a lot like self-efficacy—more on that later) might be useful feedback because it can inform decisions about how to proceed, such as the amount of time to invest in learning how to play the flute, or whether or not to seek help learning the steps of the jitterbug. An important caveat, however, is that self-assessments of competence are only useful if students have opportunities to do something about their perceived low competence—that is, when self-assessment serves the purpose of formative feedback for the learner.

How to Self-Assess?

Panadero et al. (2016a) summarized five very different taxonomies of self-assessment and called for the development of a comprehensive typology that considers, among other things, its purpose, the presence or absence of criteria, and the method. In response, I propose the taxonomy depicted in Table 1, which focuses on the what (competence, process, or product), the why (formative or summative), and the how (methods, including whether or not they involve standards, e.g., criteria) of self-assessment. The collection of example methods in the table is not exhaustive.

I put the methods in Table 1 where I think they belong, but many of them could be placed in more than one cell. Take self-efficacy , for instance, which is essentially a self-assessment of one's competence to successfully undertake a particular task ( Bandura, 1997 ). Summative judgments of self-efficacy are certainly possible but they seem like a silly thing to do—what is the point, from a learning perspective? Formative self-efficacy judgments, on the other hand, can inform next steps in learning and skill building. There is reason to believe that monitoring and making adjustments to one's self-efficacy (e.g., by setting goals or attributing success to effort) can be productive ( Zimmerman, 2000 ), so I placed self-efficacy in the formative row.

It is important to emphasize that self-efficacy is task-specific, more or less ( Bandura, 1997 ). This taxonomy does not include general, holistic evaluations of one's abilities, for example, “I am good at math.” Global assessment of competence does not provide the leverage, in terms of feedback, that is provided by task-specific assessments of competence, that is, self-efficacy. Eva and Regehr (2008) provided an illustrative example: “We suspect most people are prompted to open a dictionary as a result of encountering a word for which they are uncertain of the meaning rather than out of a broader assessment that their vocabulary could be improved” (p. 16). The exclusion of global evaluations of oneself resonates with research that clearly shows that feedback that focuses on aspects of a task (e.g., “I did not solve most of the algebra problems”) is more effective than feedback that focuses on the self (e.g., “I am bad at math”) ( Kluger and DeNisi, 1996 ; Dweck, 2006 ; Hattie and Timperley, 2007 ). Hence, global self-evaluations of ability or competence do not appear in Table 1 .

Another approach to student self-assessment that could be placed in more than one cell is traffic lights. The term traffic lights refers to asking students to use green, yellow, or red objects (or thumbs up, sideways, or down—anything will do) to indicate whether they think they have good, partial, or little understanding (Black et al., 2003). It would be appropriate for traffic lights to appear in multiple places in Table 1, depending on how they are used. Traffic lights seem to be most effective at supporting students' reflections on how well they understand a concept or have mastered a skill, which is in line with their creators' original intent, so they are categorized as formative self-assessments of one's learning—which sounds like metacognition.

In fact, several of the methods included in Table 1 come from research on metacognition, including self-monitoring, such as checking one's reading comprehension, and self-testing, e.g., checking one's performance on test items. These last two methods have been excluded from some taxonomies of self-assessment (e.g., Boud and Brew, 1995) because they do not engage students in explicitly considering relevant standards or criteria. However, new conceptions of self-assessment are grounded in theories of the self- and co-regulation of learning (Andrade and Brookhart, 2016), which include self-monitoring of learning processes with and without explicit standards.

However, my research favors self-assessment with regard to standards ( Andrade and Boulay, 2003 ; Andrade and Du, 2007 ; Andrade et al., 2008 , 2009 , 2010 ), as does related research by Panadero and his colleagues (see below). I have involved students in self-assessment of stories, essays, or mathematical word problems according to rubrics or checklists with criteria. For example, two studies investigated the relationship between elementary or middle school students' scores on a written assignment and a process that involved them in reading a model paper, co-creating criteria, self-assessing first drafts with a rubric, and revising ( Andrade et al., 2008 , 2010 ). The self-assessment was highly scaffolded: students were asked to underline key phrases in the rubric with colored pencils (e.g., underline “clearly states an opinion” in blue), then underline or circle in their drafts the evidence of having met the standard articulated by the phrase (e.g., his or her opinion) with the same blue pencil. If students found they had not met the standard, they were asked to write themselves a reminder to make improvements when they wrote their final drafts. This process was followed for each criterion on the rubric. There were main effects on scores for every self-assessed criterion on the rubric, suggesting that guided self-assessment according to the co-created criteria helped students produce more effective writing.

Panadero and his colleagues have also done quasi-experimental and experimental research on standards-referenced self-assessment, using rubrics or lists of assessment criteria that are presented in the form of questions ( Panadero et al., 2012 , 2013 , 2014 ; Panadero and Romero, 2014 ). Panadero calls the list of assessment criteria a script because his work is grounded in research on scaffolding (e.g., Kollar et al., 2006 ): I call it a checklist because that is the term used in classroom assessment contexts. Either way, the list provides standards for the task. Here is a script for a written summary that Panadero et al. (2014) used with college students in a psychology class:

• Does my summary transmit the main idea from the text? Is it at the beginning of my summary?

• Are the important ideas also in my summary?

• Have I selected the main ideas from the text to make them explicit in my summary?

• Have I thought about my purpose for the summary? What is my goal?

Taken together, the results of the studies cited above suggest that students who engaged in self-assessment using scripts or rubrics were more self-regulated, as measured by self-report questionnaires and/or think-aloud protocols, than were students in the comparison or control groups. Effect sizes were very small to moderate (η² = 0.06–0.42) and statistically significant. Most interesting, perhaps, is one study (Panadero and Romero, 2014) that demonstrated an association between rubric-referenced self-assessment activities and all three phases of SRL: forethought, performance, and reflection.
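As a point of reference not stated in the studies themselves, eta-squared expresses the proportion of variance in the outcome that is associated with the experimental effect (assuming the classical, non-partial form of the statistic):

$$\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}}$$

On that reading, values of 0.06 and 0.42 correspond to roughly 6% and 42% of outcome variance, respectively.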

There are surely many other methods of self-assessment to include in Table 1 , as well as interesting conversations to be had about which method goes where and why. In the meantime, I offer the taxonomy in Table 1 as a way to define and operationalize self-assessment in instructional contexts and as a framework for the following overview of current research on the subject.

An Overview of Current Research on Self-Assessment

Several recent reviews of self-assessment are available (Brown and Harris, 2013; Brown et al., 2015; Panadero et al., 2017), so I will not summarize the entire body of research here. Instead, I chose to take a bird's-eye view of the field, with the goal of reporting on what has been sufficiently researched and what remains to be done. I used the reference lists from those reviews, as well as other relevant sources, as a starting point. In order to update the list of sources, I directed two new searches 1 , the first of the ERIC database, and the second of both ERIC and PsycINFO. Both searches included two search terms, “self-assessment” OR “self-evaluation.” Advanced search options had four delimiters: (1) peer-reviewed, (2) January 2013–October 2016 and then October 2016–March 2019, (3) English, and (4) full-text. Because the focus was on K-20 educational contexts, sources were excluded if they were about early childhood education or professional development.

The first search yielded 347 hits; the second 1,163. Research that was unrelated to instructional feedback was excluded, such as studies limited to self-estimates of performance before or after taking a test, guesses about whether a test item was answered correctly, and estimates of how many tasks could be completed in a certain amount of time. Although some of the excluded studies might be thought of as useful investigations of self-monitoring, as a group they seemed too unrelated to theories of self-generated feedback to be appropriate for this review. Seventy-six studies were selected for inclusion in Table S1 (Supplementary Material), which also contains a few studies published before 2013 that were not included in key reviews, as well as studies solicited directly from authors.

Table S1 in the Supplementary Material contains a complete list of the studies included in this review, organized by the focus or topic of the study, as well as brief descriptions of each. The “type” column in Table S1 indicates whether the study focused on formative or summative self-assessment. This distinction was often difficult to make due to a lack of information. For example, Memis and Seven (2015) frame their study in terms of formative assessment, and note that the purpose of the self-evaluation done by the sixth grade students is to “help students improve their [science] reports” (p. 39), but they do not indicate how the self-assessments were done, nor whether students were given time to revise their reports based on their judgments or supported in making revisions. A sentence or two of explanation about the process of self-assessment in the procedures sections of published studies would be most useful.

Figure 1 graphically represents the number of studies in the four most common topic categories found in the table—achievement, consistency, student perceptions, and SRL. The figure reveals that research on self-assessment is on the rise, with consistency the most popular topic. Of the 76 studies in Table S1, 44 were inquiries into the consistency of students' self-assessments with other judgments (e.g., a test score or teacher's grade). Twenty-five studies investigated the relationship between self-assessment and achievement. Fifteen explored students' perceptions of self-assessment. Twelve studies focused on the association between self-assessment and self-regulated learning. One examined self-efficacy, and two qualitative studies documented the mental processes involved in self-assessment. These counts sum to more than 76 (n = 99) because several studies had multiple foci. In the remainder of this review I examine each topic in turn.
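To make the tally explicit, the topic counts just listed sum as follows:

$$44 + 25 + 15 + 12 + 1 + 2 = 99$$

which exceeds the 76 unique studies because several studies addressed more than one topic.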

Figure 1. Topics of self-assessment studies, 2013–2018.

Consistency

Table S1 (Supplementary Material) reveals that much of the recent research on self-assessment has investigated the accuracy or, more accurately, consistency, of students' self-assessments. The term consistency is more appropriate in the classroom context because the quality of students' self-assessments is often determined by comparing them with their teachers' assessments and then generating correlations. Given the evidence of the unreliability of teachers' grades ( Falchikov, 2005 ), the assumption that teachers' assessments are accurate might not be well-founded ( Leach, 2012 ; Brown et al., 2015 ). Ratings of student work done by researchers are also suspect, unless evidence of the validity and reliability of the inferences made about student work by researchers is available. Consequently, much of the research on classroom-based self-assessment should use the term consistency , which refers to the degree of alignment between students' and expert raters' evaluations, avoiding the purer, more rigorous term accuracy unless it is fitting.

In their review, Brown and Harris (2013) reported that correlations between student self-ratings and other measures tended to be weakly to strongly positive, ranging from r ≈ 0.20 to 0.80, with few studies reporting correlations >0.60. But their review included results from studies of any self-appraisal of school work, including summative self-rating/grading, predictions about the correctness of answers on test items, and formative, criteria-based self-assessments, a combination of methods that makes the correlations they reported difficult to interpret. Qualitatively different forms of self-assessment, especially summative and formative types, cannot be lumped together without obfuscating important aspects of self-assessment as feedback.

Given my concern about combining studies of summative and formative assessment, you might anticipate a call for research on consistency that distinguishes between the two. I will make no such call for three reasons. One is that we have enough research on the subject, including the 22 studies in Table S1 (Supplementary Material) that were published after Brown and Harris's (2013) review. Drawing only on studies included in Table S1 (Supplementary Material), we can say with confidence that summative self-assessment tends to be inconsistent with external judgments (Baxter and Norman, 2011; De Grez et al., 2012; Admiraal et al., 2015), with males tending to overrate and females to underrate (Nowell and Alston, 2007; Marks et al., 2018). There are exceptions (Alaoutinen, 2012; Lopez-Pastor et al., 2012) as well as mixed results, with students being consistent regarding some aspects of their learning but not others (Blanch-Hartigan, 2011; Harding and Hbaci, 2015; Nguyen and Foster, 2018). We can also say that older, more academically competent learners tend to be more consistent (Hacker et al., 2000; Lew et al., 2010; Alaoutinen, 2012; Guillory and Blankson, 2017; Butler, 2018; Nagel and Lindsey, 2018). There is evidence that consistency can be improved through experience (Lopez and Kossack, 2007; Yilmaz, 2017; Nagel and Lindsey, 2018), the use of guidelines (Bol et al., 2012), feedback (Thawabieh, 2017), and standards (Baars et al., 2014), perhaps in the form of rubrics (Panadero and Romero, 2014). Modeling and feedback also help (Labuhn et al., 2010; Miller and Geraci, 2011; Hawkins et al., 2012; Kostons et al., 2012).

An outcome typical of research on the consistency of summative self-assessment can be found in row 59, which summarizes the study by Tejeiro et al. (2012) discussed earlier: Students' self-assessments were higher than marks given by professors, especially for students with poorer results, and no relationship was found between the professors' and the students' assessments in the group in which self-assessment counted toward the final mark. Students are not stupid: if they know that they can influence their final grade, and that their judgment is summative rather than intended to inform revision and improvement, they will be motivated to inflate their self-evaluation. I do not believe we need more research to demonstrate that phenomenon.

The second reason I am not calling for additional research on consistency is that much of it seems somewhat irrelevant. This might be because the interest in accuracy is rooted in clinical research on calibration, which has very different aims. Calibration accuracy is the “magnitude of consent between learners' true and self-evaluated task performance. Accurately calibrated learners' task performance equals their self-evaluated task performance” (Wollenschläger et al., 2016). Calibration research often asks study participants to predict or postdict the correctness of their responses to test items. I caution against generalizing from clinical experiments to authentic classroom contexts because the dismal picture of our human potential to self-judge was painted by calibration researchers before study participants were effectively taught how to predict with accuracy, or provided with the tools they needed to be accurate, or motivated to do so. Calibration researchers know that, of course, and have conducted intervention studies that attempt to improve accuracy, with some success (e.g., Bol et al., 2012). Studies of formative self-assessment also suggest that consistency increases when it is taught and supported in many of the ways any other skill must be taught and supported (Lopez and Kossack, 2007; Labuhn et al., 2010; Chang et al., 2012, 2013; Hawkins et al., 2012; Panadero and Romero, 2014; Lin-Siegler et al., 2015; Fitzpatrick and Schulz, 2016).

Even clinical psychological studies that go beyond calibration to examine the associations between monitoring accuracy and subsequent study behaviors do not transfer well to classroom assessment research. After repeatedly encountering claims that, for example, low self-assessment accuracy leads to poor task-selection accuracy and “suboptimal learning outcomes” ( Raaijmakers et al., 2019 , p. 1), I dug into the cited studies and discovered two limitations. The first is that the tasks in which study participants engage are quite inauthentic. A typical task involves studying “word pairs (e.g., railroad—mother), followed by a delayed judgment of learning (JOL) in which the students predicted the chances of remembering the pair… After making a JOL, the entire pair was presented for restudy for 4 s [ sic ], and after all pairs had been restudied, a criterion test of paired-associate recall occurred” ( Dunlosky and Rawson, 2012 , p. 272). Although memory for word pairs might be important in some classroom contexts, it is not safe to assume that results from studies like that one can predict students' behaviors after criterion-referenced self-assessment of their comprehension of complex texts, lengthy compositions, or solutions to multi-step mathematical problems.

The second limitation of studies like the typical one described above is more serious: Participants in research like that are not permitted to regulate their own studying, which is experimentally manipulated by a computer program. This came as a surprise, since many of the claims were about students' poor study choices but they were rarely allowed to make actual choices. For example, Dunlosky and Rawson (2012) permitted participants to “use monitoring to effectively control learning” by programming the computer so that “a participant would need to have judged his or her recall of a definition entirely correct on three different trials, and once they judged it entirely correct on the third trial, that particular key term definition was dropped [by the computer program] from further practice” (p. 272). The authors note that this study design is an improvement on designs that did not require all participants to use the same regulation algorithm, but it does not reflect the kinds of decisions that learners make in class or while doing homework. In fact, a large body of research shows that students can make wise choices when they self-pace the study of to-be-learned materials and then allocate study time to each item ( Bjork et al., 2013 , p. 425):

In a typical experiment, the students first study all the items at an experimenter-paced rate (e.g., study 60 paired associates for 3 s each), which familiarizes the students with the items; after this familiarity phase, the students then either choose which items they want to restudy (e.g., all items are presented in an array, and the students select which ones to restudy) and/or pace their restudy of each item. Several dependent measures have been widely used, such as how long each item is studied, whether an item is selected for restudy, and in what order items are selected for restudy. The literature on these aspects of self-regulated study is massive (for a comprehensive overview, see both Dunlosky and Ariel, 2011 and Son and Metcalfe, 2000 ), but the evidence is largely consistent with a few basic conclusions. First, if students have a chance to practice retrieval prior to restudying items, they almost exclusively choose to restudy unrecalled items and drop the previously recalled items from restudy ( Metcalfe and Kornell, 2005 ). Second, when pacing their study of individual items that have been selected for restudy, students typically spend more time studying items that are more, rather than less, difficult to learn. Such a strategy is consistent with a discrepancy-reduction model of self-paced study (which states that people continue to study an item until they reach mastery), although some key revisions to this model are needed to account for all the data. For instance, students may not continue to study until they reach some static criterion of mastery, but instead, they may continue to study until they perceive that they are no longer making progress.

I propose that this research, which suggests that students' unscaffolded, unmeasured, informal self-assessments tend to lead to appropriate task selection, is better aligned with research on classroom-based self-assessment. Nonetheless, even this comparison is inadequate because the study participants were not taught to compare their performance to the criteria for mastery, as is often done in classroom-based self-assessment.

The third and final reason I do not believe we need additional research on consistency is that I think it is a distraction from the true purposes of self-assessment. Many if not most of the articles about the accuracy of self-assessment are grounded in the assumption that accuracy is necessary for self-assessment to be useful, particularly in terms of subsequent studying and revision behaviors. Although it seems obvious that accurate evaluations of their performance positively influence students' study strategy selection, which should produce improvements in achievement, I have not seen relevant research that tests those conjectures. Some claim that inaccurate estimates of learning lead to the selection of inappropriate learning tasks ( Kostons et al., 2012 ) but they cite research that does not support their claim. For example, Kostons et al. cite studies that focus on the effectiveness of SRL interventions but do not address the accuracy of participants' estimates of learning, nor the relationship of those estimates to the selection of next steps. Other studies produce findings that support my skepticism. Take, for instance, two relevant studies of calibration. One suggested that performance and judgments of performance had little influence on subsequent test preparation behavior ( Hacker et al., 2000 ), and the other showed that study participants followed their predictions of performance to the same degree, regardless of monitoring accuracy ( van Loon et al., 2014 ).

Eva and Regehr (2008) believe that:

Research questions that take the form of “How well do various practitioners self-assess?” “How can we improve self-assessment?” or “How can we measure self-assessment skill?” should be considered defunct and removed from the research agenda [because] there have been hundreds of studies into these questions and the answers are “Poorly,” “You can't,” and “Don't bother” (p. 18).

I almost agree. A study that could change my mind about the importance of accuracy of self-assessment would be an investigation that goes beyond attempting to improve accuracy just for the sake of accuracy by instead examining the relearning/revision behaviors of accurate and inaccurate self-assessors: Do students whose self-assessments match the valid and reliable judgments of expert raters (hence my use of the term accuracy ) make better decisions about what they need to do to deepen their learning and improve their work? Here, I admit, is a call for research related to consistency: I would love to see a high-quality investigation of the relationship between accuracy in formative self-assessment, and students' subsequent study and revision behaviors, and their learning. For example, a study that closely examines the revisions to writing made by accurate and inaccurate self-assessors, and the resulting outcomes in terms of the quality of their writing, would be most welcome.

Table S1 (Supplementary Material) indicates that by 2018 researchers began publishing studies that more directly address the hypothesized link between self-assessment and subsequent learning behaviors, as well as important questions about the processes learners engage in while self-assessing (Yan and Brown, 2017). One, a study by Nugteren et al. (2018; row 19 in Table S1, Supplementary Material), asked “How do inaccurate [summative] self-assessments influence task selections?” (p. 368) and employed a clever exploratory research design. The results suggested that most of the 15 students in their sample over-estimated their performance and made inaccurate learning-task selections. Nugteren et al. recommended helping students make more accurate self-assessments, but I think the more interesting finding is related to why students made task selections that were too difficult or too easy, given their prior performance: they based most task selections on interest in the content of particular items (not the overarching content to be learned), and infrequently considered task difficulty and support level. For instance, while working on the genetics tasks, students reported selecting tasks because they were fun or interesting, not because they addressed self-identified weaknesses in their understanding of genetics. Nugteren et al. proposed that students would benefit from instruction on task selection. I second that proposal: rather than directing our efforts at accuracy in the service of improving subsequent task selection, let us simply teach students to use the information at hand to select next best steps, among other things.

Butler (2018; row 76 in Table S1, Supplementary Material) has conducted at least two studies of learners' processes of responding to self-assessment items and how they arrived at their judgments. Comparing generic, decontextualized items to task-specific, contextualized items (which she calls after-task items), she drew two unsurprising conclusions: the task-specific items “generally showed higher correlations with task performance,” and older students “appeared to be more conservative in their judgment compared with their younger counterparts” (p. 249). The contribution of the study is the detailed information it provides about how students generated their judgments. For example, Butler's qualitative data analyses revealed that when asked to self-assess in terms of vague or non-specific items, the children often “contextualized the descriptions based on their own experiences, goals, and expectations” (p. 257), focused on the task at hand, and situated items in the specific task context. Perhaps as a result, the correlation between after-task self-assessment and task performance was generally higher than for generic self-assessment.

Butler (2018) notes that her study enriches our empirical understanding of the processes by which children respond to self-assessment. This is a very promising direction for the field. Similar studies of processing during formative self-assessment of a variety of task types in a classroom context would likely produce significant advances in our understanding of how and why self-assessment influences learning and performance.

Student Perceptions

Fifteen of the studies listed in Table S1 (Supplementary Material) focused on students' perceptions of self-assessment. The studies of children suggest that they tend to have unsophisticated understandings of its purposes ( Harris and Brown, 2013 ; Bourke, 2016 ) that might lead to shallow implementation of related processes. In contrast, results from the studies conducted in higher education settings suggested that college and university students understood the function of self-assessment ( Ratminingsih et al., 2018 ) and generally found it to be useful for guiding evaluation and revision ( Micán and Medina, 2017 ), understanding how to take responsibility for learning ( Lopez and Kossack, 2007 ; Bourke, 2014 ; Ndoye, 2017 ), prompting them to think more critically and deeply ( van Helvoort, 2012 ; Siow, 2015 ), applying newfound skills ( Murakami et al., 2012 ), and fostering self-regulated learning by guiding them to set goals, plan, self-monitor and reflect ( Wang, 2017 ).

Not surprisingly, positive perceptions of self-assessment were typically developed by students who actively engaged the formative type by, for example, developing their own criteria for an effective self-assessment response ( Bourke, 2014 ), or using a rubric or checklist to guide their assessments and then revising their work ( Huang and Gui, 2015 ; Wang, 2017 ). Earlier research suggested that children's attitudes toward self-assessment can become negative if it is summative ( Ross et al., 1998 ). However, even summative self-assessment was reported by adult learners to be useful in helping them become more critical of their own and others' writing throughout the course and in subsequent courses ( van Helvoort, 2012 ).

Achievement

Twenty-five of the studies in Table S1 (Supplementary Material) investigated the relation between self-assessment and achievement, including two meta-analyses. Twenty of the 25 clearly employed the formative type. Without exception, those 20 studies, plus the two meta-analyses ( Graham et al., 2015 ; Sanchez et al., 2017 ) demonstrated a positive association between self-assessment and learning. The meta-analysis conducted by Graham and his colleagues, which included 10 studies, yielded an average weighted effect size of 0.62 on writing quality. The Sanchez et al. meta-analysis revealed that, although 12 of the 44 effect sizes were negative, on average, “students who engaged in self-grading performed better ( g = 0.34) on subsequent tests than did students who did not” (p. 1,049).

All but two of the non-meta-analytic studies of achievement in Table S1 (Supplementary Material) were quasi-experimental or experimental, providing relatively rigorous evidence that their treatment groups outperformed their comparison or control groups in terms of everything from writing to dart-throwing, map-making, speaking English, and exams in a wide variety of disciplines. One experiment on summative self-assessment ( Miller and Geraci, 2011 ), in contrast, resulted in no improvements in exam scores, while the other one did ( Raaijmakers et al., 2017 ).

It would be easy to overgeneralize and claim that the question about the effect of self-assessment on learning has been answered, but there are unanswered questions about the key components of effective self-assessment, especially social-emotional components related to power and trust ( Andrade and Brown, 2016 ). The trends are pretty clear, however: it appears that formative forms of self-assessment can promote knowledge and skill development. This is not surprising, given that it involves many of the processes known to support learning, including practice, feedback, revision, and especially the intellectually demanding work of making complex, criteria-referenced judgments ( Panadero et al., 2014 ). Boud (1995a , b) predicted this trend when he noted that many self-assessment processes undermine learning by rushing to judgment, thereby failing to engage students with the standards or criteria for their work.

Self-Regulated Learning

The association between self-assessment and learning has also been explained in terms of self-regulation ( Andrade, 2010 ; Panadero and Alonso-Tapia, 2013 ; Andrade and Brookhart, 2016 , 2019 ; Panadero et al., 2016b ). Self-regulated learning (SRL) occurs when learners set goals and then monitor and manage their thoughts, feelings, and actions to reach those goals. SRL is moderately to highly correlated with achievement ( Zimmerman and Schunk, 2011 ). Research suggests that formative assessment is a potential influence on SRL ( Nicol and Macfarlane-Dick, 2006 ). The 12 studies in Table S1 (Supplementary Material) that focus on SRL demonstrate the recent increase in interest in the relationship between self-assessment and SRL.

Conceptual and practical overlaps between the two fields are abundant. In fact, Brown and Harris (2014) recommend that student self-assessment no longer be treated as an assessment, but as an essential competence for self-regulation. Butler and Winne (1995) introduced the role of self-generated feedback in self-regulation years ago:

[For] all self-regulated activities, feedback is an inherent catalyst. As learners monitor their engagement with tasks, internal feedback is generated by the monitoring process. That feedback describes the nature of outcomes and the qualities of the cognitive processes that led to those states (p. 245).

The outcomes and processes referred to by Butler and Winne are many of the same products and processes I referred to earlier in the definition of self-assessment and in Table 1 .

In general, research and practice related to self-assessment has tended to focus on judging the products of student learning, while scholarship on self-regulated learning encompasses both processes and products. The very practical focus of much of the research on self-assessment means it might be playing catch-up, in terms of theory development, with the SRL literature, which is grounded in experimental paradigms from cognitive psychology (de Bruin and van Gog, 2012); self-assessment research, in contrast, is ahead in terms of implementation (E. Panadero, personal communication, October 21, 2016). One major exception is the work done on Self-Regulated Strategy Development (Glaser and Brunstein, 2007; Harris et al., 2008), which has successfully integrated SRL research with classroom practices, including self-assessment, to teach writing to students with special needs.

Nicol and Macfarlane-Dick (2006) have been explicit about the potential for self-assessment practices to support self-regulated learning:

To develop systematically the learner's capacity for self-regulation, teachers need to create more structured opportunities for self-monitoring and the judging of progression to goals. Self-assessment tasks are an effective way of achieving this, as are activities that encourage reflection on learning progress (p. 207).

The studies of SRL in Table S1 (Supplementary Material) provide encouraging findings regarding the potential role of self-assessment in promoting achievement, self-regulated learning in general, and metacognition and study strategies related to task selection in particular. The studies also represent a solution to the “methodological and theoretical challenges involved in bringing metacognitive research to the real world, using meaningful learning materials” ( Koriat, 2012 , p. 296).

Future Directions for Research

I agree with Yan and Brown's (2017) statement that “from a pedagogical perspective, the benefits of self-assessment may come from active engagement in the learning process, rather than by being ‘veridical' or coinciding with reality, because students' reflection and metacognitive monitoring lead to improved learning” (p. 1,248). Future research should focus less on accuracy/consistency/veridicality, and more on the precise mechanisms of self-assessment (Butler, 2018).

An important aspect of research on self-assessment that is not explicitly represented in Table S1 (Supplementary Material) is practice, or pedagogy: Under what conditions does self-assessment work best, and how are those conditions influenced by context? Fortunately, the studies listed in the table, as well as others (see especially Andrade and Valtcheva, 2009 ; Nielsen, 2014 ; Panadero et al., 2016a ), point toward an answer. But we still have questions about how best to scaffold effective formative self-assessment. One area of inquiry is about the characteristics of the task being assessed, and the standards or criteria used by learners during self-assessment.

Influence of Types of Tasks and Standards or Criteria

Type of task or competency assessed seems to matter (e.g., Dolosic, 2018; Nguyen and Foster, 2018), as do the criteria (Yilmaz, 2017), but we do not yet have a comprehensive understanding of how or why. There is some evidence that it is important that the criteria used to self-assess are concrete, task-specific (Butler, 2018), and graduated. For example, Fastre et al. (2010) revealed an association between self-assessment according to task-specific criteria and task performance: in a quasi-experimental study of 39 novice vocational education students studying stoma care, they compared concrete, task-specific criteria (“performance-based criteria”) such as “Introduces herself to the patient” and “Consults the care file for details concerning the stoma” to vaguer, “competence-based criteria” such as “Shows interest, listens actively, shows empathy to the patient” and “Is discrete with sensitive topics.” The performance-based criteria group outperformed the competence-based group on tests of task performance, presumably because “performance-based criteria make it easier to distinguish levels of performance, enabling a step-by-step process of performance improvement” (p. 530).

This finding echoes the results of a study of self-regulated learning by Kitsantas and Zimmerman (2006) , who argued that “fine-grained standards can have two key benefits: They can enable learners to be more sensitive to small changes in skill and make more appropriate adaptations in learning strategies” (p. 203). In their study, 70 college students were taught how to throw darts at a target. The purpose of the study was to examine the role of graphing of self-recorded outcomes and self-evaluative standards in learning a motor skill. Students who were provided with graduated self-evaluative standards surpassed “those who were provided with absolute standards or no standards (control) in both motor skill and in motivational beliefs (i.e., self-efficacy, attributions, and self-satisfaction)” (p. 201). Kitsantas and Zimmerman hypothesized that setting high absolute standards would limit a learner's sensitivity to small improvements in functioning. This hypothesis was supported by the finding that students who set absolute standards reported significantly less awareness of learning progress (and hit the bull's-eye less often) than students who set graduated standards. “The correlation between the self-evaluation and dart-throwing outcomes measures was extraordinarily high ( r = 0.94)” (p. 210). Classroom-based research on specific, graduated self-assessment criteria would be informative.

Cognitive and Affective Mechanisms of Self-Assessment

There are many additional questions about pedagogy, such as the hoped-for investigation mentioned above of the relationship between accuracy in formative self-assessment, students' subsequent study behaviors, and their learning. There is also a need for research on how to help teachers give students a central role in their learning by creating space for self-assessment (e.g., see Hawe and Parr, 2014 ), and the complex power dynamics involved in doing so ( Tan, 2004 , 2009 ; Taras, 2008 ; Leach, 2012 ). However, there is an even more pressing need for investigations into the internal mechanisms experienced by students engaged in assessing their own learning. Angela Lui and I call this the next black box ( Lui, 2017 ).

Black and Wiliam (1998) used the term black box to emphasize the fact that what happened in most classrooms was largely unknown: all we knew was that some inputs (e.g., teachers, resources, standards, and requirements) were fed into the box, and that certain outputs (e.g., more knowledgeable and competent students, acceptable levels of achievement) would follow. But what, they asked, is happening inside, and what new inputs will produce better outputs? Black and Wiliam's review spawned a great deal of research on formative assessment, some but not all of which suggests a positive relationship with academic achievement ( Bennett, 2011 ; Kingston and Nash, 2011 ). To better understand why and how the use of formative assessment in general and self-assessment in particular is associated with improvements in academic achievement in some instances but not others, we need research that looks into the next black box: the cognitive and affective mechanisms of students who are engaged in assessment processes ( Lui, 2017 ).

The role of internal mechanisms has been discussed in theory but not yet fully tested. Crooks (1988) argued that the impact of assessment is influenced by students' interpretation of the tasks and results, and Butler and Winne (1995) theorized that both cognitive and affective processes play a role in determining how feedback is internalized and used to self-regulate learning. Other theoretical frameworks about the internal processes of receiving and responding to feedback have been developed (e.g., Nicol and Macfarlane-Dick, 2006 ; Draper, 2009 ; Andrade, 2013 ; Lipnevich et al., 2016 ). Yet, Shute (2008) noted in her review of the literature on formative feedback that “despite the plethora of research on the topic, the specific mechanisms relating feedback to learning are still mostly murky, with very few (if any) general conclusions” (p. 156). This area is ripe for research.

Conclusion

Self-assessment is the act of monitoring one's processes and products in order to make adjustments that deepen learning and enhance performance. Although it can be summative, the evidence presented in this review strongly suggests that self-assessment is most beneficial, in terms of both achievement and self-regulated learning, when it is used formatively and supported by training.

What is not yet clear is why and how self-assessment works. Those of you who like to investigate phenomena that are maddeningly difficult to measure will rejoice to hear that the cognitive and affective mechanisms of self-assessment are the next black box. Studies of the ways in which learners think and feel, the interactions between their thoughts and feelings and their context, and the implications for pedagogy will make major contributions to our field.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2019.00087/full#supplementary-material

1. I am grateful to my graduate assistants, Joanna Weaver and Taja Young, for conducting the searches.

Admiraal, W., Huisman, B., and Pilli, O. (2015). Assessment in massive open online courses. Electron. J. e-Learning , 13, 207–216.

Alaoutinen, S. (2012). Evaluating the effect of learning style and student background on self-assessment accuracy. Comput. Sci. Educ. 22, 175–198. doi: 10.1080/08993408.2012.692924

Al-Rawahi, N. M., and Al-Balushi, S. M. (2015). The effect of reflective science journal writing on students' self-regulated learning strategies. Int. J. Environ. Sci. Educ. 10, 367–379. doi: 10.12973/ijese.2015.250a

Andrade, H. (2010). “Students as the definitive source of formative assessment: academic self-assessment and the self-regulation of learning,” in Handbook of Formative Assessment , eds H. Andrade and G. Cizek (New York, NY: Routledge), 90–105.

Andrade, H. (2013). “Classroom assessment in the context of learning theory and research,” in Sage Handbook of Research on Classroom Assessment , ed J. H. McMillan (New York, NY: Sage), 17–34. doi: 10.4135/9781452218649.n2

Andrade, H. (2018). “Feedback in the context of self-assessment,” in Cambridge Handbook of Instructional Feedback , eds A. Lipnevich and J. Smith (Cambridge: Cambridge University Press), 376–408.

Andrade, H., and Boulay, B. (2003). The role of rubric-referenced self-assessment in learning to write. J. Educ. Res. 97, 21–34. doi: 10.1080/00220670309596625

Andrade, H., and Brookhart, S. (2019). Classroom assessment as the co-regulation of learning. Assessm. Educ. Principles Policy Pract. doi: 10.1080/0969594X.2019.1571992

Andrade, H., and Brookhart, S. M. (2016). “The role of classroom assessment in supporting self-regulated learning,” in Assessment for Learning: Meeting the Challenge of Implementation , eds D. Laveault and L. Allal (Heidelberg: Springer), 293–309. doi: 10.1007/978-3-319-39211-0_17

Andrade, H., and Du, Y. (2007). Student responses to criteria-referenced self-assessment. Assess. Evalu. High. Educ. 32, 159–181. doi: 10.1080/02602930600801928

Andrade, H., Du, Y., and Mycek, K. (2010). Rubric-referenced self-assessment and middle school students' writing. Assess. Educ. 17, 199–214. doi: 10.1080/09695941003696172

Andrade, H., Du, Y., and Wang, X. (2008). Putting rubrics to the test: The effect of a model, criteria generation, and rubric-referenced self-assessment on elementary school students' writing. Educ. Meas. 27, 3–13. doi: 10.1111/j.1745-3992.2008.00118.x

Andrade, H., and Valtcheva, A. (2009). Promoting learning and achievement through self- assessment. Theory Pract. 48, 12–19. doi: 10.1080/00405840802577544

Andrade, H., Wang, X., Du, Y., and Akawi, R. (2009). Rubric-referenced self-assessment and self-efficacy for writing. J. Educ. Res. 102, 287–302. doi: 10.3200/JOER.102.4.287-302

Andrade, H. L., and Brown, G. T. L. (2016). “Student self-assessment in the classroom,” in Handbook of Human and Social Conditions in Assessment , eds G. T. L. Brown and L. R. Harris (New York, NY: Routledge), 319–334.

Baars, M., Vink, S., van Gog, T., de Bruin, A., and Paas, F. (2014). Effects of training self-assessment and using assessment standards on retrospective and prospective monitoring of problem solving. Learn. Instruc. 33, 92–107. doi: 10.1016/j.learninstruc.2014.04.004

Balderas, I., and Cuamatzi, P. M. (2018). Self and peer correction to improve college students' writing skills. Profile. 20, 179–194. doi: 10.15446/profile.v20n2.67095

Bandura, A. (1997). Self-efficacy: The Exercise of Control . New York, NY: Freeman.

Barney, S., Khurum, M., Petersen, K., Unterkalmsteiner, M., and Jabangwe, R. (2012). Improving students with rubric-based self-assessment and oral feedback. IEEE Transac. Educ. 55, 319–325. doi: 10.1109/TE.2011.2172981

Baxter, P., and Norman, G. (2011). Self-assessment or self deception? A lack of association between nursing students' self-assessment and performance. J. Adv. Nurs. 67, 2406–2413. doi: 10.1111/j.1365-2648.2011.05658.x

Bennett, R. E. (2011). Formative assessment: a critical review. Assess. Educ. 18, 5–25. doi: 10.1080/0969594X.2010.513678

Birjandi, P., and Hadidi Tamjid, N. (2012). The role of self-, peer and teacher assessment in promoting Iranian EFL learners' writing performance. Assess. Evalu. High. Educ. 37, 513–533. doi: 10.1080/02602938.2010.549204

Bjork, R. A., Dunlosky, J., and Kornell, N. (2013). Self-regulated learning: beliefs, techniques, and illusions. Annu. Rev. Psychol. 64, 417–444. doi: 10.1146/annurev-psych-113011-143823

Black, P., Harrison, C., Lee, C., Marshall, B., and Wiliam, D. (2003). Assessment for Learning: Putting it into Practice . Berkshire: Open University Press.

Black, P., and Wiliam, D. (1998). Inside the black box: raising standards through classroom assessment. Phi Delta Kappan 80, 139–144; 146–148.

Blanch-Hartigan, D. (2011). Medical students' self-assessment of performance: results from three meta-analyses. Patient Educ. Counsel. 84, 3–9. doi: 10.1016/j.pec.2010.06.037

Bol, L., Hacker, D. J., Walck, C. C., and Nunnery, J. A. (2012). The effects of individual or group guidelines on the calibration accuracy and achievement of high school biology students. Contemp. Educ. Psychol. 37, 280–287. doi: 10.1016/j.cedpsych.2012.02.004

Boud, D. (1995a). Implementing Student Self-Assessment, 2nd Edn. Australian Capital Territory: Higher Education Research and Development Society of Australasia.

Boud, D. (1995b). Enhancing Learning Through Self-Assessment. London: Kogan Page.

Boud, D. (1999). Avoiding the traps: Seeking good practice in the use of self-assessment and reflection in professional courses. Soc. Work Educ. 18, 121–132. doi: 10.1080/02615479911220131

Boud, D., and Brew, A. (1995). Developing a typology for learner self-assessment practices. Res. Dev. High. Educ. 18, 130–135.

Bourke, R. (2014). Self-assessment in professional programmes within tertiary institutions. Teach. High. Educ. 19, 908–918. doi: 10.1080/13562517.2014.934353

Bourke, R. (2016). Liberating the learner through self-assessment. Cambridge J. Educ. 46, 97–111. doi: 10.1080/0305764X.2015.1015963

Brown, G., Andrade, H., and Chen, F. (2015). Accuracy in student self-assessment: directions and cautions for research. Assess. Educ. 22, 444–457. doi: 10.1080/0969594X.2014.996523

Brown, G. T., and Harris, L. R. (2013). “Student self-assessment,” in Sage Handbook of Research on Classroom Assessment , ed J. H. McMillan (Los Angeles, CA: Sage), 367–393. doi: 10.4135/9781452218649.n21

Brown, G. T. L., and Harris, L. R. (2014). The future of self-assessment in classroom practice: reframing self-assessment as a core competency. Frontline Learn. Res. 3, 22–30. doi: 10.14786/flr.v2i1.24

Butler, D. L., and Winne, P. H. (1995). Feedback and self-regulated learning: a theoretical synthesis. Rev. Educ. Res. 65, 245–281. doi: 10.3102/00346543065003245

Butler, Y. G. (2018). “Young learners' processes and rationales for responding to self-assessment items: cases for generic can-do and five-point Likert-type formats,” in Useful Assessment and Evaluation in Language Education , eds J. Davis et al. (Washington, DC: Georgetown University Press), 21–39. doi: 10.2307/j.ctvvngrq.5

Chang, C.-C., Liang, C., and Chen, Y.-H. (2013). Is learner self-assessment reliable and valid in a Web-based portfolio environment for high school students? Comput. Educ. 60, 325–334. doi: 10.1016/j.compedu.2012.05.012

Chang, C.-C., Tseng, K.-H., and Lou, S.-J. (2012). A comparative analysis of the consistency and difference among teacher-assessment, student self-assessment and peer-assessment in a Web-based portfolio assessment environment for high school students. Comput. Educ. 58, 303–320. doi: 10.1016/j.compedu.2011.08.005

Colliver, J., Verhulst, S., and Barrows, H. (2005). Self-assessment in medical practice: a further concern about the conventional research paradigm. Teach. Learn. Med. 17, 200–201. doi: 10.1207/s15328015tlm1703_1

Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Rev. Educ. Res. 58, 438–481. doi: 10.3102/00346543058004438

de Bruin, A. B. H., and van Gog, T. (2012). Improving self-monitoring and self-regulation: From cognitive psychology to the classroom. Learn. Instruct. 22, 245–252. doi: 10.1016/j.learninstruc.2012.01.003

De Grez, L., Valcke, M., and Roozen, I. (2012). How effective are self- and peer assessment of oral presentation skills compared with teachers' assessments? Active Learn. High. Educ. 13, 129–142. doi: 10.1177/1469787412441284

Dolosic, H. (2018). An examination of self-assessment and interconnected facets of second language reading. Read. Foreign Langu. 30, 189–208.

Draper, S. W. (2009). What are learners actually regulating when given feedback? Br. J. Educ. Technol. 40, 306–315. doi: 10.1111/j.1467-8535.2008.00930.x

Dunlosky, J., and Ariel, R. (2011). “Self-regulated learning and the allocation of study time,” in Psychology of Learning and Motivation , Vol. 54 ed B. Ross (Cambridge, MA: Academic Press), 103–140. doi: 10.1016/B978-0-12-385527-5.00004-8

Dunlosky, J., and Rawson, K. A. (2012). Overconfidence produces underachievement: inaccurate self evaluations undermine students' learning and retention. Learn. Instr. 22, 271–280. doi: 10.1016/j.learninstruc.2011.08.003

Dweck, C. (2006). Mindset: The New Psychology of Success. New York, NY: Random House.

Epstein, R. M., Siegel, D. J., and Silberman, J. (2008). Self-monitoring in clinical practice: a challenge for medical educators. J. Contin. Educ. Health Prof. 28, 5–13. doi: 10.1002/chp.149

Eva, K. W., and Regehr, G. (2008). “I'll never play professional football” and other fallacies of self-assessment. J. Contin. Educ. Health Prof. 28, 14–19. doi: 10.1002/chp.150

Falchikov, N. (2005). Improving Assessment Through Student Involvement: Practical Solutions for Aiding Learning in Higher and Further Education . London: Routledge Falmer.

Fastre, G. M. J., van der Klink, M. R., Sluijsmans, D., and van Merrienboer, J. J. G. (2012). Drawing students' attention to relevant assessment criteria: effects on self-assessment skills and performance. J. Voc. Educ. Train. 64, 185–198. doi: 10.1080/13636820.2011.630537

Fastre, G. M. J., van der Klink, M. R., and van Merrienboer, J. J. G. (2010). The effects of performance-based assessment criteria on student performance and self-assessment skills. Adv. Health Sci. Educ. 15, 517–532. doi: 10.1007/s10459-009-9215-x

Fitzpatrick, B., and Schulz, H. (2016). “Teaching young students to self-assess critically,” Paper presented at the Annual Meeting of the American Educational Research Association (Washington, DC).

Franken, A. S. (1992). I'm Good Enough, I'm Smart Enough, and Doggone it, People Like Me! Daily affirmations by Stuart Smalley. New York, NY: Dell.

Glaser, C., and Brunstein, J. C. (2007). Improving fourth-grade students' composition skills: effects of strategy instruction and self-regulation procedures. J. Educ. Psychol. 99, 297–310. doi: 10.1037/0022-0663.99.2.297

Gonida, E. N., and Leondari, A. (2011). Patterns of motivation among adolescents with biased and accurate self-efficacy beliefs. Int. J. Educ. Res. 50, 209–220. doi: 10.1016/j.ijer.2011.08.002

Graham, S., Hebert, M., and Harris, K. R. (2015). Formative assessment and writing. Elem. Sch. J. 115, 523–547. doi: 10.1086/681947

Guillory, J. J., and Blankson, A. N. (2017). Using recently acquired knowledge to self-assess understanding in the classroom. Sch. Teach. Learn. Psychol. 3, 77–89. doi: 10.1037/stl0000079

Hacker, D. J., Bol, L., Horgan, D. D., and Rakow, E. A. (2000). Test prediction and performance in a classroom context. J. Educ. Psychol. 92, 160–170. doi: 10.1037/0022-0663.92.1.160

Harding, J. L., and Hbaci, I. (2015). Evaluating pre-service teachers math teaching experience from different perspectives. Univ. J. Educ. Res. 3, 382–389. doi: 10.13189/ujer.2015.030605

Harris, K. R., Graham, S., Mason, L. H., and Friedlander, B. (2008). Powerful Writing Strategies for All Students . Baltimore, MD: Brookes.

Harris, L. R., and Brown, G. T. L. (2013). Opportunities and obstacles to consider when using peer- and self-assessment to improve student learning: case studies into teachers' implementation. Teach. Teach. Educ. 36, 101–111. doi: 10.1016/j.tate.2013.07.008

Hattie, J., and Timperley, H. (2007). The power of feedback. Rev. Educ. Res. 77, 81–112. doi: 10.3102/003465430298487

Hawe, E., and Parr, J. (2014). Assessment for learning in the writing classroom: an incomplete realization. Curr. J. 25, 210–237. doi: 10.1080/09585176.2013.862172

Hawkins, S. C., Osborne, A., Schofield, S. J., Pournaras, D. J., and Chester, J. F. (2012). Improving the accuracy of self-assessment of practical clinical skills using video feedback: the importance of including benchmarks. Med. Teach. 34, 279–284. doi: 10.3109/0142159X.2012.658897

Huang, Y., and Gui, M. (2015). Articulating teachers' expectations afore: Impact of rubrics on Chinese EFL learners' self-assessment and speaking ability. J. Educ. Train. Stud. 3, 126–132. doi: 10.11114/jets.v3i3.753

Kaderavek, J. N., Gillam, R. B., Ukrainetz, T. A., Justice, L. M., and Eisenberg, S. N. (2004). School-age children's self-assessment of oral narrative production. Commun. Disord. Q. 26, 37–48. doi: 10.1177/15257401040260010401

Karnilowicz, W. (2012). A comparison of self-assessment and tutor assessment of undergraduate psychology students. Soc. Behav. Person. 40, 591–604. doi: 10.2224/sbp.2012.40.4.591

Kevereski, L. (2017). (Self) evaluation of knowledge in students' population in higher education in Macedonia. Res. Pedag. 7, 69–75. doi: 10.17810/2015.49

Kingston, N. M., and Nash, B. (2011). Formative assessment: a meta-analysis and a call for research. Educ. Meas. 30, 28–37. doi: 10.1111/j.1745-3992.2011.00220.x

Kitsantas, A., and Zimmerman, B. J. (2006). Enhancing self-regulation of practice: the influence of graphing and self-evaluative standards. Metacogn. Learn. 1, 201–212. doi: 10.1007/s11409-006-9000-7

Kluger, A. N., and DeNisi, A. (1996). The effects of feedback interventions on performance: a historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychol. Bull. 119, 254–284. doi: 10.1037/0033-2909.119.2.254

Kollar, I., Fischer, F., and Hesse, F. (2006). Collaboration scripts: a conceptual analysis. Educ. Psychol. Rev. 18, 159–185. doi: 10.1007/s10648-006-9007-2

Kolovelonis, A., Goudas, M., and Dermitzaki, I. (2012). Students' performance calibration in a basketball dribbling task in elementary physical education. Int. Electron. J. Elem. Educ. 4, 507–517.

Koriat, A. (2012). The relationships between monitoring, regulation and performance. Learn. Instru. 22, 296–298. doi: 10.1016/j.learninstruc.2012.01.002

Kostons, D., van Gog, T., and Paas, F. (2012). Training self-assessment and task-selection skills: a cognitive approach to improving self-regulated learning. Learn. Instruc. 22, 121–132. doi: 10.1016/j.learninstruc.2011.08.004

Labuhn, A. S., Zimmerman, B. J., and Hasselhorn, M. (2010). Enhancing students' self-regulation and mathematics performance: the influence of feedback and self-evaluative standards. Metacogn. Learn. 5, 173–194. doi: 10.1007/s11409-010-9056-2

Leach, L. (2012). Optional self-assessment: some tensions and dilemmas. Assess. Evalu. High. Educ. 37, 137–147. doi: 10.1080/02602938.2010.515013

Lew, M. D. N., Alwis, W. A. M., and Schmidt, H. G. (2010). Accuracy of students' self-assessment and their beliefs about its utility. Assess. Evalu. High. Educ. 35, 135–156. doi: 10.1080/02602930802687737

Lin-Siegler, X., Shaenfield, D., and Elder, A. D. (2015). Contrasting case instruction can improve self-assessment of writing. Educ. Technol. Res. Dev. 63, 517–537. doi: 10.1007/s11423-015-9390-9

Lipnevich, A. A., Berg, D. A. G., and Smith, J. K. (2016). “Toward a model of student response to feedback,” in The Handbook of Human and Social Conditions in Assessment , eds G. T. L. Brown and L. R. Harris (New York, NY: Routledge), 169–185.

Lopez, R., and Kossack, S. (2007). Effects of recurring use of self-assessment in university courses. Int. J. Learn. 14, 203–216. doi: 10.18848/1447-9494/CGP/v14i04/45277

Lopez-Pastor, V. M., Fernandez-Balboa, J.-M., Santos Pastor, M. L., and Aranda, A. F. (2012). Students' self-grading, professor's grading and negotiated final grading at three university programmes: analysis of reliability and grade difference ranges and tendencies. Assess. Evalu. High. Educ. 37, 453–464. doi: 10.1080/02602938.2010.545868

Lui, A. (2017). Validity of the responses to feedback survey: operationalizing and measuring students' cognitive and affective responses to teachers' feedback (Doctoral dissertation). University at Albany—SUNY: Albany NY.

Marks, M. B., Haug, J. C., and Hu, H. (2018). Investigating undergraduate business internships: do supervisor and self-evaluations differ? J. Educ. Bus. 93, 33–45. doi: 10.1080/08832323.2017.1414025

Memis, E. K., and Seven, S. (2015). Effects of an SWH approach and self-evaluation on sixth grade students' learning and retention of an electricity unit. Int. J. Prog. Educ. 11, 32–49.

Metcalfe, J., and Kornell, N. (2005). A region of proximal learning model of study time allocation. J. Mem. Langu. 52, 463–477. doi: 10.1016/j.jml.2004.12.001

Meusen-Beekman, K. D., Joosten-ten Brinke, D., and Boshuizen, H. P. A. (2016). Effects of formative assessments to develop self-regulation among sixth grade students: results from a randomized controlled intervention. Stud. Educ. Evalu. 51, 126–136. doi: 10.1016/j.stueduc.2016.10.008

Micán, D. A., and Medina, C. L. (2017). Boosting vocabulary learning through self-assessment in an English language teaching context. Assess. Evalu. High. Educ. 42, 398–414. doi: 10.1080/02602938.2015.1118433

Miller, T. M., and Geraci, L. (2011). Training metacognition in the classroom: the influence of incentives and feedback on exam predictions. Metacogn. Learn. 6, 303–314. doi: 10.1007/s11409-011-9083-7

Murakami, C., Valvona, C., and Broudy, D. (2012). Turning apathy into activeness in oral communication classes: regular self- and peer-assessment in a TBLT programme. System 40, 407–420. doi: 10.1016/j.system.2012.07.003

Nagel, M., and Lindsey, B. (2018). The use of classroom clickers to support improved self-assessment in introductory chemistry. J. College Sci. Teach. 47, 72–79.

Ndoye, A. (2017). Peer/self-assessment and student learning. Int. J. Teach. Learn. High. Educ. 29, 255–269.

Nguyen, T., and Foster, K. A. (2018). Research note—multiple time point course evaluation and student learning outcomes in an MSW course. J. Soc. Work Educ. 54, 715–723. doi: 10.1080/10437797.2018.1474151

Nicol, D., and Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: a model and seven principles of good feedback practice. Stud. High. Educ. 31, 199–218. doi: 10.1080/03075070600572090

Nielsen, K. (2014). Self-assessment methods in writing instruction: a conceptual framework, successful practices and essential strategies. J. Res. Read. 37, 1–16. doi: 10.1111/j.1467-9817.2012.01533.x

Nowell, C., and Alston, R. M. (2007). I thought I got an A! Overconfidence across the economics curriculum. J. Econ. Educ. 38, 131–142. doi: 10.3200/JECE.38.2.131-142

Nugteren, M. L., Jarodzka, H., Kester, L., and Van Merriënboer, J. J. G. (2018). Self-regulation of secondary school students: self-assessments are inaccurate and insufficiently used for learning-task selection. Instruc. Sci. 46, 357–381. doi: 10.1007/s11251-018-9448-2

Panadero, E., and Alonso-Tapia, J. (2013). Self-assessment: theoretical and practical connotations. When it happens, how is it acquired and what to do to develop it in our students. Electron. J. Res. Educ. Psychol. 11, 551–576. doi: 10.14204/ejrep.30.12200

Panadero, E., Alonso-Tapia, J., and Huertas, J. A. (2012). Rubrics and self-assessment scripts effects on self-regulation, learning and self-efficacy in secondary education. Learn. Individ. Differ. 22, 806–813. doi: 10.1016/j.lindif.2012.04.007

Panadero, E., Alonso-Tapia, J., and Huertas, J. A. (2014). Rubrics vs. self-assessment scripts: effects on first year university students' self-regulation and performance. J. Study Educ. Dev. 3, 149–183. doi: 10.1080/02103702.2014.881655

Panadero, E., Alonso-Tapia, J., and Reche, E. (2013). Rubrics vs. self-assessment scripts effect on self-regulation, performance and self-efficacy in pre-service teachers. Stud. Educ. Evalu. 39, 125–132. doi: 10.1016/j.stueduc.2013.04.001

Panadero, E., Brown, G. L., and Strijbos, J.-W. (2016a). The future of student self-assessment: a review of known unknowns and potential directions. Educ. Psychol. Rev. 28, 803–830. doi: 10.1007/s10648-015-9350-2

Panadero, E., Jonsson, A., and Botella, J. (2017). Effects of self-assessment on self-regulated learning and self-efficacy: four meta-analyses. Educ. Res. Rev. 22, 74–98. doi: 10.1016/j.edurev.2017.08.004

Panadero, E., Jonsson, A., and Strijbos, J. W. (2016b). “Scaffolding self-regulated learning through self-assessment and peer assessment: guidelines for classroom implementation,” in Assessment for Learning: Meeting the Challenge of Implementation , eds D. Laveault and L. Allal (New York, NY: Springer), 311–326. doi: 10.1007/978-3-319-39211-0_18

Panadero, E., and Romero, M. (2014). To rubric or not to rubric? The effects of self-assessment on self-regulation, performance and self-efficacy. Assess. Educ. 21, 133–148. doi: 10.1080/0969594X.2013.877872

Papanthymou, A., and Darra, M. (2018). Student self-assessment in higher education: The international experience and the Greek example. World J. Educ. 8, 130–146. doi: 10.5430/wje.v8n6p130

Punhagui, G. C., and de Souza, N. A. (2013). Self-regulation in the learning process: actions through self-assessment activities with Brazilian students. Int. Educ. Stud. 6, 47–62. doi: 10.5539/ies.v6n10p47

Raaijmakers, S. F., Baars, M., Paas, F., van Merriënboer, J. J. G., and van Gog, T. (2019). Metacognition and Learning , 1–22. doi: 10.1007/s11409-019-09189-5

Raaijmakers, S. F., Baars, M., Schapp, L., Paas, F., van Merrienboer, J., and van Gog, T. (2017). Training self-regulated learning with video modeling examples: do task-selection skills transfer? Instr. Sci. 46, 273–290. doi: 10.1007/s11251-017-9434-0

Ratminingsih, N. M., Marhaeni, A. A. I. N., and Vigayanti, L. P. D. (2018). Self-assessment: the effect on students' independence and writing competence. Int. J. Instruc. 11, 277–290. doi: 10.12973/iji.2018.11320a

Ross, J. A., Rolheiser, C., and Hogaboam-Gray, A. (1998). “Impact of self-evaluation training on mathematics achievement in a cooperative learning environment,” Paper presented at the annual meeting of the American Educational Research Association (San Diego, CA).

Ross, J. A., and Starling, M. (2008). Self-assessment in a technology-supported environment: the case of grade 9 geography. Assess. Educ. 15, 183–199. doi: 10.1080/09695940802164218

Samaie, M., Nejad, A. M., and Qaracholloo, M. (2018). An inquiry into the efficiency of whatsapp for self- and peer-assessments of oral language proficiency. Br. J. Educ. Technol. 49, 111–126. doi: 10.1111/bjet.12519

Sanchez, C. E., Atkinson, K. M., Koenka, A. C., Moshontz, H., and Cooper, H. (2017). Self-grading and peer-grading for formative and summative assessments in 3rd through 12th grade classrooms: a meta-analysis. J. Educ. Psychol. 109, 1049–1066. doi: 10.1037/edu0000190

Sargeant, J. (2008). Toward a common understanding of self-assessment. J. Contin. Educ. Health Prof. 28, 1–4. doi: 10.1002/chp.148

Sargeant, J., Mann, K., van der Vleuten, C., and Metsemakers, J. (2008). “Directed” self-assessment: practice and feedback within a social context. J. Contin. Educ. Health Prof. 28, 47–54. doi: 10.1002/chp.155

Shute, V. (2008). Focus on formative feedback. Rev. Educ. Res. 78, 153–189. doi: 10.3102/0034654307313795

Silver, I., Campbell, C., Marlow, B., and Sargeant, J. (2008). Self-assessment and continuing professional development: the Canadian perspective. J. Contin. Educ. Health Prof. 28, 25–31. doi: 10.1002/chp.152

Siow, L.-F. (2015). Students' perceptions on self- and peer-assessment in enhancing learning experience. Malaysian Online J. Educ. Sci. 3, 21–35.

Son, L. K., and Metcalfe, J. (2000). Metacognitive and control strategies in study-time allocation. J. Exp. Psychol. 26, 204–221. doi: 10.1037/0278-7393.26.1.204

Tan, K. (2004). Does student self-assessment empower or discipline students? Assess. Evalu. Higher Educ. 29, 651–662. doi: 10.1080/0260293042000227209

Tan, K. (2009). Meanings and practices of power in academics' conceptions of student self-assessment. Teach. High. Educ. 14, 361–373. doi: 10.1080/13562510903050111

Taras, M. (2008). Issues of power and equity in two models of self-assessment. Teach. High. Educ. 13, 81–92. doi: 10.1080/13562510701794076

Tejeiro, R. A., Gomez-Vallecillo, J. L., Romero, A. F., Pelegrina, M., Wallace, A., and Emberley, E. (2012). Summative self-assessment in higher education: implications of its counting towards the final mark. Electron. J. Res. Educ. Psychol. 10, 789–812.

Thawabieh, A. M. (2017). A comparison between students' self-assessment and teachers' assessment. J. Curri. Teach. 6, 14–20. doi: 10.5430/jct.v6n1p14

Tulgar, A. T. (2017). Selfie@ssessment as an alternative form of self-assessment at undergraduate level in higher education. J. Langu. Linguis. Stud. 13, 321–335.

van Helvoort, A. A. J. (2012). How adult students in information studies use a scoring rubric for the development of their information literacy skills. J. Acad. Librarian. 38, 165–171. doi: 10.1016/j.acalib.2012.03.016

van Loon, M. H., de Bruin, A. B. H., van Gog, T., van Merriënboer, J. J. G., and Dunlosky, J. (2014). Can students evaluate their understanding of cause-and-effect relations? The effects of diagram completion on monitoring accuracy. Acta Psychol. 151, 143–154. doi: 10.1016/j.actpsy.2014.06.007

van Reybroeck, M., Penneman, J., Vidick, C., and Galand, B. (2017). Progressive treatment and self-assessment: Effects on students' automatisation of grammatical spelling and self-efficacy beliefs. Read. Writing 30, 1965–1985. doi: 10.1007/s11145-017-9761-1

Wang, W. (2017). Using rubrics in student self-assessment: student perceptions in the English as a foreign language writing context. Assess. Evalu. High. Educ. 42, 1280–1292. doi: 10.1080/02602938.2016.1261993

Wollenschläger, M., Hattie, J., Machts, N., Möller, J., and Harms, U. (2016). What makes rubrics effective in teacher-feedback? Transparency of learning goals is not enough. Contemp. Educ. Psychol. 44–45, 1–11. doi: 10.1016/j.cedpsych.2015.11.003

Yan, Z., and Brown, G. T. L. (2017). A cyclical self-assessment process: towards a model of how students engage in self-assessment. Assess. Evalu. High. Educ. 42, 1247–1262. doi: 10.1080/02602938.2016.1260091

Yilmaz, F. N. (2017). Reliability of scores obtained from self-, peer-, and teacher-assessments on teaching materials prepared by teacher candidates. Educ. Sci. 17, 395–409. doi: 10.12738/estp.2017.2.0098

Zimmerman, B. J. (2000). Self-efficacy: an essential motive to learn. Contemp. Educ. Psychol. 25, 82–91. doi: 10.1006/ceps.1999.1016

Zimmerman, B. J., and Schunk, D. H. (2011). “Self-regulated learning and performance: an introduction and overview,” in Handbook of Self-Regulation of Learning and Performance , eds B. J. Zimmerman and D. H. Schunk (New York, NY: Routledge), 1–14.

Keywords: self-assessment, self-evaluation, self-grading, formative assessment, classroom assessment, self-regulated learning (SRL)

Citation: Andrade HL (2019) A Critical Review of Research on Student Self-Assessment. Front. Educ. 4:87. doi: 10.3389/feduc.2019.00087

Received: 27 April 2019; Accepted: 02 August 2019; Published: 27 August 2019.

Copyright © 2019 Andrade. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Heidi L. Andrade, handrade@albany.edu

This article is part of the Research Topic Advances in Classroom Assessment Theory and Practice.

Open access | Published: 05 March 2019

Research governance and the future(s) of research assessment

Alis Oancea

Palgrave Communications volume 5, Article number: 27 (2019)


Subjects: Science, technology and society

This paper explores recent public debates around research assessment and its future as part of a dynamic landscape of governance discourses and practices, and organisational, professional and disciplinary cultures. Drawing reflectively on data from RAE 2001, RAE 2008 and REF 2014 (reported elsewhere), the paper highlights how recent debates around research assessment echo longer-term changes in research governance. The following changes, and several critiques of their implications, are discussed: shifts in the principles for governing research and the rise of multi-purpose assessment; the spread of performance-based funding and external accountability for research; the use of metrics and indicators in research assessment; the boundary work taking place in defining and classifying units or fields for assessment; the emphasis on research impact as a component of research value; organisational recalibration across the sector; and the specialisation of blended professional practice. These changes are underpinned by persistent tensions around accountability; evaluation; measurement; demarcation; legitimation; agency; and identity in research. Overall, such trends and the discursive shifts that made them possible have challenged established principles of funding and governance and have pushed assessment technologies into a pivot position in the political dynamics of renegotiating the relationships between universities and the state. Jointly, the directions of travel identified in this paper describe a widespread and persistent regime of research governance and policy that has become embedded in institutional and individual practices.


Introduction

In 2021, sector-wide performance-based research funding in UK academia—and arguably worldwide—will be turning 35, an anniversary also to be marked by the next iteration of the national exercise for the assessment of research performance in higher education. The Research Selectivity Exercise (RSE), conducted in 1986, was the first full-scale national exercise that aimed to base funding decisions on a wide-ranging assessment of the quality of research carried out in university departments. The RSE was criticised at the time on almost every aspect, and many of these critiques led to changes in the design and procedures of its descendant—the Research Assessment Exercise (RAE) in 1989. This pattern of opposition, critique, consultation and amendments is recognisable across all cycles of the exercise, from the RAEs (1989, 1992, 1996, 2001, 2008) to the current Research Excellence Framework (2014, 2021). It is also seen in the striking dynamic of assessment hyperactivity and consultation fatigue that seems to keep academics and administrators too busy to act when yet another spinning plate may be added to their daily jobs. What is also striking is the persistence of debate around the key elements of the exercise, including: the criteria for quality; the indicators of performance and assessment procedures, such as the balance between peer review and metrics; the consultative mechanisms feeding into the design of each round; the treatment of different disciplines and of interdisciplinary work; the extent to which the procedures are sufficiently transparent at all levels; and the impacts on different institutions, fields, modes of research, and categories of staff. The arguments summarised in papers such as Phillimore ( 1989 ) and Bence and Oppenheim ( 2002 ) are as alive now as they were then, despite much research having been conducted since on the workings and outcomes of successive exercises. According to a great swathe of the literature, the 'dragon of evaluation' (Minogue, 1986 )—or the 'Frankenstein' of research assessment (Attwood, 2010 )—seems to have grown no more and no less fierce over the years.

How is it, then, that (collective) learning from the very public debates and direct realities of three decades of national exercises is yet to enable academics, policy makers, administrators, 'users' and investors in research to reach agreement on ways to address these recurrent issues satisfactorily?

Part of the answer must be down to the politics and micropolitics of research, higher education, and the ‘knowledge economy’. But part of the answer may also be that some of these problems are intractable: not in the sense of strategic stand-offs between the parties concerned, but in the more fundamental sense of the philosophical and sociological tensions that underpin the vocabulary and procedures of measuring, rewarding and influencing research ‘performance’ or ‘quality’.

This paper explores some of these tensions, grouped under seven headings and framed as persistent ‘problems’ in research assessment as a technology of governance in the shape currently practiced in the UK and to a great extent elsewhere: the accountability; evaluation; measurement; demarcation; legitimation; agency; and identity problems. These problems are philosophical, as well as empirical. Their value as analytic devices derives from the fact that they offer a frame for questioning current and emergent practices and the incentives arising from them. The grounds for this selection are both a priori , as themes identified through conceptual and theoretical inquiry into the notions of research quality, performance and value/ evaluation; and a posteriori , as categories developed from empirical studies of RAE and REF submission data and of interview and survey data spanning four national exercises: RAE 1996, 2001 and 2008, and REF 2014. While the theoretical arguments and the findings of these studies are reported elsewhere (e.g., Oancea, 2008 , 2010 , 2014 ; Oancea et al., 2018 ), this paper attempts to draw them together reflectively in an exploration of ongoing trends and discursive threads underpinning recent public debates around research assessment and its future in the UK and beyond.

The accountability problem and the rise of multi-purpose assessment

The past few decades of research policy have seen the ascension of principles, such as formal accountability, marketization, and competition, in the governing of research at international, national and organisational levels. A symptom of this move is the growing reliance on performance-driven assessment technologies not only to inform public investment in research, but also to steer research activity itself towards aims such as global competitiveness and measurable contribution to the ‘knowledge economy’ (BIS, 2016 ). This regime of governmentality re-describes the place of research in the world in terms of solutions to externally-defined, global challenges and priorities. In creating these solutions, research is expected to be both academically significant and practically (including politically and economically) astute—qualities which, it is further expected, may be proxied by an ever increasing range of measures and indicators of research quality and impact. I suggest that this project has profound ethical and epistemological implications for the textured, overlapping sets of practices that it attempts to frame and shape. For example, an excessive focus on technical measures of research performance within institutions may influence researchers’ perceived freedoms to enact epistemic virtues such as integrity, openness, modesty, circumspection or criticality (see Kerr, 1993 and Battaly, 2013 ), as well as potentially corralling the moral sense of academic responsibility into performative compliance with managerial and other role responsibilities. Indicators and metrics are not ethically and epistemologically neutral, but the very processes of their creation, use, rejection and renewal may marginalise and displace parts of the research community with lower access to resources and academic capital (Sugimoto and Larivière, 2018 ).

The changes, over the years, in the explicit purposes of the UK RAE/REF (as stated in the exercises’ official guidance documents) illustrate a shift in emphasis. For example, the official guidance for submissions to RAE 2001 stated that the purposes of the exercise were ‘to produce ratings of research quality which [would] be used by the higher education funding bodies in determining the main grant for research to the institutions they fund’, and to ‘inform policy development’ (RAE, 1999 ); the exercise thus was intended to influence public funding bodies and governmental policy bodies. By 2014, this statement had morphed into a range of purposes, including ‘to inform the selective allocation of [funding councils’] grant for research’; to ‘provide (…) accountability for public investment in research and produce (…) evidence of the benefits of this investment’; and to ‘provide benchmarking information and establish reputational yardsticks, for use within the higher education (HE) sector and for public information’ (REF, 2011a , 2011b ). The legitimate reach of the exercise was thus extended into organisational governance, while its reputational impact was explicitly put on a par with its financial outcomes. Most recently, the Stern consultation about the future of the REF saw the exercise mainly as a tool for the allocation of public funds for research, but accepted that other purposes were also relevant, such as informing institutional strategy and supporting governmental and funding bodies in ‘driving research excellence and productivity’ (BEIS, 2016a , b ).

A key linguistic change between pre-2014 and post-2014 exercises is the explicit mention of accountability as a key purpose of the REF. No mention of accountability was made in the 2001 statement of purpose; by 2008, accountability had made an appearance in the description of the principles underpinning the conduct of the exercise; but in the 2014 exercise, it became one of its three core purposes: ‘The assessment provides accountability for public investment in research and produces evidence of the benefits of this investment.’ This is not ‘just’ semantics. Strong critiques of the RAE and the REF as assessment technologies have revolved around the particular notion of accountability that is assumed to be at their heart, sometimes summed up as ‘competitive accountability’ (Watermeyer, 2019 ) or ‘performative accountability’ (Oancea, 2008 ), and contrasted to forms of accountability deemed to be better attuned to the values and generative energies of research and researchers. To simplify, two conceptual constellations seem to epitomise this tension: on the one hand, accountability is a formal, mandatory mechanism that is largely vertical (hierarchical) and adversarial, and revolves around (bureaucratic) surveillance, answerability and enforceability. On the other hand, it is conceived as a formative practice, which is horizontal and voluntary, and emphasizes democratic dialogue, communal and collaborative practice, and professional responsibility. Overall, critics and supporters of the REF tend to find affinities with the language associated with one or the other of these contrasting constellations; but note that these concepts do not form a dichotomous choice, but rather create a space populated by a wide range of hybrids, and hence are subject to continued debate. It is this inherent ambiguity that I refer to as the accountability problem.

The role played by research assessment in the project of governance sketched above is expressed through a versatile balance of powerful practices, including new rituals and routines that affect academic life: 'mock' REFs and 'dry-runs', user panels for internal allocation of funds at HEI level, REF strategy groups and project boards, benchmarking, and so on. As these practices have become more common, the discourses that legitimise them have also become, in turn, less likely to be questioned. While different notions of accountability may be used by those designing and interpreting the purposes of research assessment, in applying the guidance to the exercise, academics may also implicitly shift towards more formalised notions of accountability. This is a soft and pervasive cultural change, working through ambivalent technologies of discipline and self-discipline (in Foucault's sense of the word, as argued in Oancea, 2014 ), including techniques of cooption and hospitability—for example, via consultations, award ceremonies, nominations and representation on committees and boards, expert panel surveys, consultancies, and stakeholder events. As a result, performance-based funding, the selective distribution of resources and of research capacity, and institutionalised accountability for academic and non-academic impact have become conditions of professional autonomy and self-regulation in higher education.

The emerging sense of consensus around the legitimacy of research assessment as a key mechanism for research funding allocation, accountability, and steering is, however, at best a grudging consensus—more of a truce than a concert. Looking towards the future of research assessment, it continues to be important to question this truce and constantly re-assess the principles of governance underpinning it, including the tensions surrounding different notions of accountability and the impact of policy-driven definitions of research and research quality on the dual support mechanism.

The evaluation problem and performance-based research funding

The past two decades have seen an unprecedented spread of performance-based research funding across the world. Although many countries now use performance agreements, which are based on expectations of future performance (including, in some cases, research performance) and provision for reporting it, there is also widespread use of ex-post performance funding systems, which base grant allocations on assessments of past performance (see Hicks, 2012 , Jonkers and Zacharewicz, 2016 ). Among the latter, the UK large-scale system has been highly influential.

The REF and RAE’s international appeal as models of performance-based research assessment for funding purposes arises partly from their system-wide scale and their long history, together with a halo effect from the success in international rankings of the UK research system itself. There are also elements of the design of these exercises that help explain their durability and influence. In particular, there have been repeated expressions of confidence in the quality and fairness of the expert review at the core of the exercises, appreciation of the procedural transparency of the assessment, support for the profile-based aggregate focus (as opposed to individual single ratings), and valuing of the perceived contribution of the exercises to legitimising research as part of academic practice in different types of higher education institutions (HEIs) and disciplines, as well as to increased mutual understanding between the stakeholders involved (see e.g., Coryn, 2007 ; Hill, 2016 ).

However, vast swathes of the literature, including some of the literature referenced above, also bring out disadvantages and undesirable consequences of the RAE and REF in particular, and of performance-based research funding systems in general; for example, in terms of the administrative burden involved in running the national exercises, or of the deficit model of academic practice arguably underpinning performance-based steering more generally. This literature points out that the demands set by these exercises through their guidance documents and structures of government and accountability, as they filter through risk-averse layers of organisational management, may generate negative impacts on organisational cultures, on diversity in research and higher education, on the balance between teaching and research, and on individual staff morale and careers (see the diverse positive and negative perspectives reviewed in Oancea, 2010 , 2014 ).

A particularly powerful objection to assessment models underpinning performance-based funding is that they may stimulate ‘gaming’ (Lucas, 2006 ) by perversely incentivising a focus on compliance rather than quality, and reliance on agile gamesmanship (Baird and Elliott, 2018 ) rather than on in-depth experience in developing generative research environments. For example, Lucas ( 2006 ) draws on Bourdieu’s ( 1988 ) idea of academic ‘capital’ and on Slaughter and Leslie’s ( 1997 ) analysis of ‘academic capitalism’ to argue that the ‘status and positioning afforded by success in the (RAE) game’ have become the raison d’être for research in universities, thus cutting to the core values of academic life. Miller ( 2001 , p. 392) argued that ‘calculative practices’ were ‘intrinsic to and constitutive of’ the social relations between agents and institutions shaped by technologies of ‘governing by numbers’, such as costing, standardisation, benchmarking, and performance measurement. Sidhu ( 2008 ) draws on this idea to note the seductive power of savvy compliance with audit technologies like the RAE, and the insidious ways in which the calculative practices incentivized by them conspire to re-mould individual and organisational academic identities. Building on Power’s ( 1997 ) analysis of audits as ‘rationalised rituals of inspection’ (p. 96), Strathern ( 2000 , pp. 313–314) argues that audit technologies premised on the assumption of measurable, visible performance, such as the RAE, prioritise transparency and ‘verification’ over the ‘real’ workings of an institution and its research creativity, thus contributing to ‘a leaking away of trust’ in expert systems. Overly complex formal accountability ‘juggernauts’, of which the RAE and REF are seen as examples, may ‘create perverse incentives’ in the name of transparency and as a result ‘are often a source rather than a remedy for mistrust’ (O’Neill, 2013 : 10, 12; see also Pirrie et al., 2010 ), thus potentially contributing to the rise of anti-expertise sentiments in public life (Nichols, 2017 ).

Such criticisms may reflect the fact that the high stakes involved in the assessment of research for funding allocation exacerbate a heuristic tension that is at the heart of any evaluation as practical judgement (De Munck and Zimmermann, 2015 ): that between valuing something (e.g., instances of original and rigorous research); and the courses of action to be taken as a result of a specific, situated evaluation process in pursuit of purposes that transcend it (e.g., increased competitiveness or productivity, different kinds of institutional and individual behaviours etc.). In Dewey’s ( 1939 ) terms, this points to the tension between ‘prizing’ (‘holding dear’ or esteeming) and ‘appraising’ (assigning comparative value relative to other objects in the same category). The former may engender commitment; the latter, compliance. This tension may also be expressed through differing takes on the forms and sources of knowledge and experience required in order to conduct the evaluation itself.

Table 1 illustrates how in the past two decades the assessment of research has not only become more specialised (as indicated by the range of methods and measures now part of formal assessment mechanisms), but also more stratified. Different actors and approaches have clustered around different levels of interest and organisation: from sub-organisational (such as grant proposal evaluations, or staff performance appraisals) to supra-organisational levels (such as sector-wide assessment exercises). The notion of expertise engaged across these strata is not homogenous, but rather split between in-depth topical and methodological understanding of the field of research being assessed and/or of the areas of application relevant to it (i.e., substantive expertise, akin to Collins and Evans’, 2002 , ‘contributory expertise’) and detailed technical knowledge of the systems of rules, mechanisms, and formalised expectations involved in performing the assessment itself (i.e., procedural expertise, which is distinct from the ‘interactional’ and ‘referred’ expertise noted by Collins and Evans, as it falls in a different area of technical expertise, bounded by the structures and norms of the exercise itself). Note, for example, that most job advertisements for REF-related appointments (such as REF directors, managers, officers etc.) in universities include no or little mention of topical or methodological expertise in a particular field or cluster of fields, but expect instead a clear track record of procedural expertise pertaining to the specific details of running the REF. The opposite seems to be true of adverts for most academic positions (including leadership positions) without specific contractual REF responsibilities.

Both forms of expertise are always present in actual assessment (I make no claims about ideal evaluation situations), but in the current governance landscape the balance between the two shifts across different assessment contexts. Arguably, as suggested by the two arrows on the side of Table 1 , the closer an instance of research assessment is to individual research projects, ‘outputs’ and researchers, the stronger its dependence may be on the depth and quality of substantive expertise; and the closer an instance of research assessment is to the other end of the spectrum, the more dependent its actual conduct may be on procedural expertise. Testing these hypotheses may help understand why assessments based exclusively in substantive expertise are often accused of conservatism or self-serving bias; while those heavily dependent on procedural expertise are accused of misinterpretation, intellectual co-option and dulling of critical scrutiny of the assessment itself.

With increased pressures on limited resources from a growing number of organisations (higher education, non-profit institutes, think tanks, for-profit organisations) come incentives to tighten such assessments even further. Research assessment thus balances the policy appetite for rational allocation of resources (which in the recent decades has been interpreted as selectivity and concentration based on performance) with the academic orientation towards intellectually defensible allocation of research prestige (which customarily translates into the outcomes of various forms of peer review).

The measurement problem and the use of metrics and indicators

The third persistent problem identified in this paper is the ‘measurement problem’, or the problem of whether it is possible to avoid prioritising the metric (or, in semiotic terms, the signifier) over the actual quality (or the signified) in evaluating research (Allan Hanson, 2000 , p. 68). In Baudrillard’s ( 1994 , p. 2) words, this is a situation where the ‘map’ produced for evaluation purposes may no longer be a representation or re-imagining of a ‘territory’ that it purports to describe, but would instead ‘precede’ it: the ‘territory’ becomes purely operational, being ‘produced from miniaturised units, from matrices, memory banks and command models’—or, one may add, from objects such as units of data, metrics, templates and forms, and codes of practice.

Given the diversity of methods and of measures illustrated in Table 1 , the meaning of ‘doing well’ in research assessment is not straightforward. For example, the official profiles or scores awarded in national assessment exercises are mediated by internal governance factors but also by external referents, and in particular by the position of a research unit relative to any number of comparators. A high score for a research unit in the UK REF does not automatically translate into internal recognition, allocation of resources, or strategic commitment to specific research values within the host HE institution. The result is challenge and flux, with numerous proposals for new metrics vying for primacy to legitimise potential shifts in hierarchies (though note that many of these proposals do not start from a conceptualisation of research value or performance and a theory of how it may be measured, but rather from observing or constructing an aspect of research that is amenable to counting).

Table 2 illustrates how the performative vocabulary that has grown around research metrics varies by scope, from the individual to the field level (and beyond–though the national and international levels are not included in the table); and by level of aggregation, from specific measures of research performance (micro-metrics) to global assessments of research success (macro-indicators). None of the blue boxes in the table is a full list of key metrics and indicators in current use; rather, they offer examples of some of the references to metrics and indicators that I have heard mentioned in the interviews, surveys and workshops I conducted over the past decade. This collection illustrates how what is often called ‘metrics’ in everyday institutional language may in fact pertain to a range of different categories and may relate only loosely, if at all, to a theory of measurement.

Micro-metrics of research are what is more commonly meant by the term ‘metric’. They are measurements of the degree to which research inputs, outputs or outcomes display a particular characteristic. They are usually concrete, quantifiable, time-defined, and narrow in scope. Often, they are co-opted by organisations to function as micro-indicators of performance, in which case their legitimacy is inferred from meso-indicators, macro-indicators, and meta-indicators in an attempt to compensate for their own inherent lack of contextual information and normative self-awareness. As snapshots of a particular moment in time, such functional micro-indicators are of limited direct use in summative judgements of research, despite the bewitchingly normative terminology surrounding their use, such as ‘success rate’ or ‘grant value’. Their transient nature means that they are often the object of constant institutional monitoring over time, despite the doubtful meaningfulness of the resulting reports of quarterly and annual figures, and the seriously damaging implications of their misunderstanding or misuse.

Meso-metrics and indicators are also based on measurable quantities, usually through cumulative measurements of single micro-metrics over time, and with variable degrees of validity and reliability. Meso-indicators play a dual role when used in evaluations: first, they may be drafted in to function as targets for micro-performance (see, for example, the use of publication productivity indicators in academic review and promotion procedures); and second, they may be extrapolated to signal, separately or combined, aspects of performance against macro-criteria—see, for example, the intended use by eleven sub-panels in REF 2021 of citation data as a 'potential' (Panel A), 'part(ial)' (Panel B) or 'supplementary' (sub-panel 16 in Panel C) 'indicator of academic significance' (REF, 2018b , pp. 59–60). The assumption underpinning the latter phrase (which is part of the generic guidance but is toned down in the panel criteria) is that of a (stable) relationship between the frequency of indexed citations and the relationships of esteem in academic communities, and further, between these relationships and shared understandings of quality. Citation theorists such as Wouters ( 2016 ) warn that calculated indicators, such as those based on aggregated citation counts, are based on decontextualised information, where meaning is stripped off and then constituted anew in the move from the reference embedded in the original text to the reference in the bibliographic list, and then again to the indexed citation, which may be subsequently recontextualised for evaluation purposes.
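To make the idea of a meso-indicator built by accumulating micro-metrics concrete, the sketch below computes one familiar example, the h-index, from per-output citation counts. It is offered only as an illustration of aggregation under stated assumptions: the h-index is not one of the indicators named in the REF guidance discussed above, and the citation counts and function name are invented for this example.

```python
# Minimal sketch: aggregating a micro-metric (citations per output) into a
# researcher-level meso-indicator (the h-index). Citation counts are invented.

def h_index(citation_counts):
    """Largest h such that at least h outputs have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break  # counts are descending, so the condition cannot hold again
    return h

print(h_index([25, 8, 5, 3, 3, 1, 0]))  # -> 3 (three outputs have at least 3 citations each)
```

The point of the sketch is the decontextualisation the paragraph describes: the indicator retains nothing about what was cited, by whom, or why.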

Macro-indicators are global, composite criteria, usually defined at national, disciplinary or international level. Their nature, scope and legitimacy may be subject to continued contestation as fields and modes of research develop. As a result, their assessment requires high levels of substantive expertise and trust and is often largely holistic and qualitative, though it may also be informed by the refinement and integration of collections of meso-metrics and indicators.

Finally, meta-descriptors are artifacts of the assessment exercise itself and of the high reputational stakes it raises. They are either post-factum calculations used to create various league tables out of the results of the RAE/REF (e.g., ‘Grade Point Average’), or normative terms used in internal management talk as shorthand for predicted performance in formal assessments (e.g., the ‘REF-ability’ of publications and of examples of impact, or the ‘4 by 4’-ness of individual researchers, i.e., researchers with four potentially 4* publications at a particular moment in time—usually a REF dry-run or a recruitment or retention decision). As Keane (2003, p. 413) notes, ‘signs give rise to new signs, in an unending process of signification’; the temptation is great to ignore the ‘variable symbolic significance’ of these new signs and treat them as ‘quasi-objective indicators of quality, impact and esteem’ (Cronin, 2000, p. 450). Many of these terms have entered everyday language and material practices in higher education, administrative organisations, and the media and social media, often with damaging consequences for research cultures and individual morale. These performative byproducts of assessment continue to thrive in management vernacular inside and outside the HE system, despite growing expressions of organisational commitment to responsible uses of metrics and/or indicators in response to exhortations such as the San Francisco Declaration on Research Assessment in the US (https://sfdora.org/), the Leiden Manifesto for Research Metrics in continental Europe (http://www.leidenmanifesto.org/), or the UK Forum for Responsible Research Metrics (http://www.universitiesuk.ac.uk/policy-and-analysis/Pages/forum-for-responsible-research-metrics.aspx).
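
To make the arithmetic behind such meta-descriptors concrete, the sketch below shows how a post-REF ‘Grade Point Average’ is commonly derived from a unit’s published quality profile. It is an illustrative assumption rather than an official formula: the 4/3/2/1/0 weighting and the example profile are my own, chosen to match the convention typically used in league tables.

```python
# Illustrative sketch (not an official formula): deriving a 'Grade Point Average'
# meta-descriptor from a unit's REF quality profile, i.e. the percentage of
# submitted activity judged at each star level.

def grade_point_average(profile: dict[str, float]) -> float:
    """Weighted average of star levels, assuming the common 4/3/2/1/0 weighting."""
    weights = {"4*": 4, "3*": 3, "2*": 2, "1*": 1, "unclassified": 0}
    assert abs(sum(profile.values()) - 100.0) < 1e-6, "profile percentages should sum to 100"
    return sum(weights[level] * share for level, share in profile.items()) / 100.0

# Hypothetical quality profile for a single unit of assessment
example_profile = {"4*": 30.0, "3*": 45.0, "2*": 20.0, "1*": 5.0, "unclassified": 0.0}
print(round(grade_point_average(example_profile), 2))  # -> 3.0
```

The point is not the precision of the resulting number, but how easily a multi-dimensional quality profile collapses into a single figure that can then be ranked and marketed.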

The use of metrics in the assessment of research has been subject to heated debate (Rijcke et al., 2016). Much of the debate has been about the technical credibility and the fitness for purpose of metrics in research assessment, and revolves largely around terminology and method (see Andras, 2011, for a summary). There is something seductive about fine-grained technical arguments about the robustness, accuracy, standardisation, normalisation, validity, and reliability of particular quantitative measures of individual and aggregated research performance, however they may be defined; as well as about arguments on the quality, integration/interoperability, openness and cost-effectiveness of systems and procedures for calibrating, recording and organising them. The literature also explores how, in the recent climate for research assessment, metrics and the organisational world they purport to measure may be mutually constitutive. Kelly and Burrows (2011, p. 130) label this process the ‘performative metricisation’ of academic practice, whereby technologies such as the use of metrics ‘recursively defin[e] the practices and subjects of university life’. With a nod to Dickens’ Hard Times, Donovan (2009) describes excessive reliance on metrics over expertise and interpretation as ‘Gradgrinding’ research activity: over-simplifying the scope and aspirations of research through faith in the objectivity of ‘facts’ and the effectiveness of regulation.

A HEFCE-commissioned review of metrics (Wilsdon, 2016) attempted to chart a middle course between supporters and critics of the use of metrics in research (see Footnote 1). It found that, particularly in the assessment of research impact and of the originality and robustness of outputs, current metrics were neither robust nor a like-for-like replacement for peer review. On this basis, it argued against the wholesale use of metrics for funding, accountability, personnel, strategic and benchmarking purposes, and took a measured view of the benefits of an increased use of metrics to support peer review in the next REF. Instead, the review group recommended the responsible yet restrained use of indicators of aspects of research input, significance and environment, the design, use and interpretation of which must be contextualised to institutional and (inter)disciplinary characteristics and needs, as well as to diverse purposes, levels and scales of assessment. When specific information becomes an indicator of particular ‘qualities’ of research, it takes on both a reference and a purpose; in other words, it becomes relational and contextual. Acknowledgement of these inherent attributes of indicators ought, in the steering group’s view, to stimulate reflexivity, deliberation, a sense of humility, and transparency in the use of metrics by key actors (government, leaders, managers and administrators of higher education institutions (HEIs), funders, publishers, service and infrastructure providers, researchers) in support of scholarly, institutional and career diversity in research.

Although the recommendations made by this review and by other initiatives for the responsible use of metrics or indicators are worth heeding, the emphasis on the transparent use and distributed understanding of metrics, particularly if they are not coupled with devolved decision-making and bottom-up influence, may lead to protracted and widespread investment by institutions in refining metrics for top-down capture and quantification of increasingly detailed information. This investment can in itself build commitment and thus become an incentive for wider use of metrics in academia, paving the way for more data-driven governance in the future. Along the way, the nuances attached to the concept of ‘indicators’, favoured by the HEFCE review and other initiatives, may become blurred (see Wouters, 2016 ), and organisational practice may gravitate towards more straightforward reporting of quantitative metrics.

This soft and pervasive change tensions academic identities and their political agency. Academic ‘metrics-natives’, whose formative years as academics coincided with the rise of performance monitoring and performance-based funding in research, are pressured (for example, through recruitment and promotion expectations) to assimilate it to their academic habitus from the start of their careers, alongside the outputs-impact-environment and rigour-significance-originality triads of the RAE/REF. Non-natives (either by length of career or by geography) are expected to update and adapt their academic selves, often as a precondition of performing strategic and management roles in their institutions. Some embrace metrics, hoping to make assessment less onerous and more equitable, and to make data about and from research more open. Others oppose them as a threat to quality, diversity and professional judgement, and see their use as out of tune with academic norms of scholarly argumentation, criticality and intellectual integrity. Some go with the tide, while acknowledging that they felt pressured to ‘play safe’ for research assessment in REF 2014 by sticking to the more easily measurable and demonstrable (see interview data reported in Oancea et al., 2018 ), rather than making wider claims for, for example, discursive or cultural contributions from research. Many exercise domesticated resistance while part of the performance management system, and relieved disdain when they no longer need comply.

And so the use of metrics, like that of other assessment technologies, is beset by tensions about what individuals and institutions are trying to get at, how they go about it, from which structural and discursive positions, and to what purpose and effect. That is because, when integrated in particular performance regimes, metrics and indicators become multiply ambivalent technologies (a term I explained in detail in Oancea, 2014 ). These rankings, criteria, metrics and indicators are not meaningful on their own, but are ascribed meaning as part of wider narratives, institutional practices and flows of power at different levels and for different entities and purposes. They play out in distinctive ways in governance processes. The issue is not just technical—which metrics and indicators to throw in the basket and how to fine-tune them—but also substantive and normative: what do they mean, to whom and in what structural conditions, why are they seen to matter, whose view takes precedence, and for what purposes and in what context are they mobilised? The reason behind this ambivalence of metrics, however responsibly used, is that they are inevitably drafted into an ongoing renegotiation of the principles underpinning the relationships between universities and the state, mediated through public funding arrangements. Excessive focus on technical issues of measurement may distract from more fundamental debates around the ways in which highly formalised, complex performance assessment systems may affect these principles.

The demarcation problem and boundary work in research assessment

Sector-wide machineries for research assessment, like the UK REF, are engaged in bounding and curating judgements about epistemic objects such as research outputs and bodies of research. They rely heavily on peer review by academics with substantive expertise in the specific fields or subfields covered by each component unit of the exercise (reflected in the definition of panels and sub-panels), sometimes supplemented by field-relevant metrics and indicators. The mechanics of the REF (see Derrick, 2018 ) tie both the peer review and the use of metrics into definitions and classifications of research fields. The exercise entails decisions about what substantive and methodological content is to be assessed, what expertise that assessment requires, what the yardsticks and comparators ought to be, and what is to be passed on to other experts as not clearly within the remit of a sub-panel. Inevitably, the history of these decisions becomes constitutive of how research is valued, selected and prioritised by HEIs in preparation for submission: the boundaries determined by the assessment machinery through mechanisms of definition and classification are ultimately interpreted, internalised and policed through selection decisions made at research unit level. I have titled this problem the ‘demarcation problem’ as a nod towards the long-standing debates in the philosophy of science about the grounds for distinguishing between science and pseudoscience; but my interest here is to highlight the sociological rather than epistemological implications of classificatory practices concerning epistemic objects.

Examples of such boundary work occurring early in any given REF cycle include the funding councils’ initial decisions about which Units of Assessment are to be evaluated. This initial classificatory work often engages academic voices through a consultation process about what boundaries had or had not worked in a previous round; see, for example, the responses to bringing together Geography, Environmental Studies and Archaeology in a single REF 2014 sub-panel, and their separation again in 2021. Further, individual ‘areas of expertise’ are taken into account in the appointment of panel and sub-panel members; a pre-definitional process that no longer directly engages academic voices (except for the appointment of the panel chair), as HEIs are not able to contribute directly to the nomination process and this task falls instead to learned societies, professional bodies and other agents. The initial selection of these areas of expertise pre-dates the sub-panel’s work on scoping the field, and has gained more weight in the preparation for REF 2021, as only a small sub-set of the final sub-panel members will be contributing to defining the scope, criteria and ways of working of the sub-panels.

Sub-panels’ work on defining the scope of the unit of assessment is probably the clearest example of definitional boundary work in the REF, as the succession of RAEs/REFs has produced definitions of fields and sub-fields of research enshrined in official guidance documents. Most definitions include lists of sub-fields and approaches that are ‘included’ in a particular sub-panel’s remit (REF, 2012). The operation of the actual evaluation, in particular the decisions about allocation of outputs, cross-referrals between panels, the moderation and calibration of assessment, the addition of further sub-panel members or assessors in the assessment phase only, and, in the forthcoming REF 2021, the output ‘flagging’ mechanism (‘interdisciplinary identifier’) and the input from interdisciplinary panel and sub-panel advisors (REF, 2018a, 2018b), pulls this boundary work in different directions, through competing pressures both to rigidify and to loosen disciplinary boundaries. Although interdisciplinary or multidisciplinary research has always been eligible for submission, there is some evidence that a broadly interdisciplinary submission to a REF 2014 sub-panel that had already reached a consensus view on what its field of assessment encompassed may have been seen as high-risk by strategic institutional leaders (Technopolis/SPRU, 2016; BEIS, 2016a). The outcomes of the more detailed procedures for the assessment of interdisciplinary research in REF 2021 remain to be seen.

The use of metrics and indicators to inform peer review in some panels is another space for boundary work prior to and during the evaluation. As indicated by the bibliometrics pilot prior to the REF 2014, citation indicators are not seen as technically suitable unless they are field-normalised and also contextualised in relation to protected characteristics. This raises the question of what counts as a ‘field’ of research. The definition of bibliometric fields in the REF is usually constrained by technical decisions already made by database providers in creating the data infrastructure that makes citation indicators possible in the first place—pre-defined subject categories, research areas and/or research fields are used by commercial databases of research publications to classify journals and papers (particularly in the case of papers published in journals deemed multidisciplinary). Often these fields correspond to disciplines and sub-disciplines, perhaps in line with subject classifications in other library and information data systems; at other times they cluster information about the citation network of a paper. ‘Multidisciplinary’ research may also form a category in its own right (for example, to be used for generalist journals that cover a range of sciences), but in the bibliometrics pilot for the 2014 REF, papers published in such journals that had not already been reclassified by the database provider were reassigned to ‘more appropriate categories’ (p. 13). Such reassignment fits the above definition of boundary work.
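
As a rough illustration of the normalisation at stake, the sketch below computes a field-normalised citation score by dividing a paper’s citation count by a field-and-year baseline. The baseline values and field labels are hypothetical, and real providers derive such baselines from their own subject classifications, which is precisely where the boundary work described above takes place.

```python
# Minimal sketch of a field-normalised citation score, in the spirit of the
# normalisation discussed above. The baselines below are hypothetical; providers
# compute them per field, publication year and document type.

field_year_baseline = {
    ("Education", 2016): 4.2,   # hypothetical mean citations per paper
    ("Physics", 2016): 11.8,
}

def normalised_citation_score(citations: int, field: str, year: int) -> float:
    """Citations divided by the expected (mean) citations for the paper's field and year."""
    baseline = field_year_baseline[(field, year)]
    return citations / baseline

# A paper cited 6 times in Education scores higher than one cited 9 times in Physics,
# illustrating why raw counts are not comparable across fields.
print(round(normalised_citation_score(6, "Education", 2016), 2))  # -> 1.43
print(round(normalised_citation_score(9, "Physics", 2016), 2))    # -> 0.76
```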

At HEI level, preparation for REF submission also involves boundary work. Within HEIs, the allocation of staff to units of assessment entails boundary negotiations relative to both the definitions of units of assessment and eligibility criteria in the REF guidance, and internal clusterings of areas of research and teaching subjects. Moreover, in some disciplines the expectation to submit impact case studies has also generated further boundary work. For example, the impact case study form separates research from its impact, which may pose particular challenges in art-related fields where research and impact may be organically embedded in creative practice or practice-as-research (see examples in Oancea et al., 2018 ; Adams and McDougall, 2015 ).

REF-related boundary work continues after the end of the evaluation phase, too. As a final example, the funding formulae underpinning the post-REF allocation of resources use multipliers based on the assignment of disciplines to three different cost bands, with clinical and laboratory subjects, including psychology, classified in the top cost band and most social sciences and humanities in the lowest one. It is unclear whether the funding formulae reward the participation of humanities and social science units in interdisciplinary work across subjects with different cost weightings.
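
A minimal sketch of how such a formula might combine quality, volume and cost weightings is given below. The quality and cost weights are assumptions for illustration only, not the funding bodies’ actual parameters, but they show how two units with identical quality profiles and staff volume can attract different funding depending on their cost band.

```python
# Illustrative sketch only: a QR-style allocation combining a unit's quality
# profile, volume (research-active FTE) and a discipline cost weighting.
# All weights below are assumed for illustration, not official parameters.

quality_weights = {"4*": 4.0, "3*": 1.0, "2*": 0.0, "1*": 0.0}                      # assumed selective weighting
cost_weights = {"high-cost lab/clinical": 1.6, "intermediate": 1.3, "other": 1.0}   # assumed cost bands

def relative_funding_share(profile: dict[str, float], fte: float, cost_band: str) -> float:
    """Unnormalised funding 'share': quality-weighted volume scaled by the cost band."""
    quality_volume = fte * sum(quality_weights[s] * profile.get(s, 0.0) / 100.0
                               for s in quality_weights)
    return quality_volume * cost_weights[cost_band]

# Two units with identical profiles and FTE, but different cost bands:
profile = {"4*": 30.0, "3*": 50.0, "2*": 15.0, "1*": 5.0}
print(round(relative_funding_share(profile, fte=20, cost_band="high-cost lab/clinical"), 1))  # -> 54.4
print(round(relative_funding_share(profile, fte=20, cost_band="other"), 1))                   # -> 34.0
```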

The outcomes of the boundary work illustrated by these examples become constitutive of everyday understandings and strategic priorities in research units across the sector. Paradoxically, as the mechanics of the exercise depend on mechanisms of differentiation such as disciplinary definitions and classifications, they can also lead to a false perception of comparability across units of assessment, with post-REF calculations of aggregate scores and meta-descriptors being used for marketing or for internal allocation of resources in apparently discipline-neutral ways.

The legitimation problem and the rise of impact as a component of research value

The institutional legitimacy of the REF—the extent to which it is accepted as ‘authoritative, binding or valid’ (Gellner, 1974, p. 24) in underpinning decisions—depends on both the scientific and the political legitimacy of (partly or fully) publicly-resourced research; hence the necessary reliance on peer review in the conduct of the exercise, and on political processes in effecting its outcomes. Under the cumulative discursive weight of successive assessment exercises, research itself, as it is understood in the public space, has been reframed and re-defined, from research ‘understood as original investigation undertaken in order to gain knowledge and understanding’ in both RAE 2001 and 2008 (RAE, 1999; RAE, 2005), to research ‘defined as “a process of investigation leading to new insights effectively shared”’ (REF, 2011a, 2011b; REF, 2012; REF, 2017). Arguably, this definitional change opened a place for knowledge sharing, exchange and impact right at the heart of policy understandings of the nature and value of research activity.

The introduction of impact in the assessment framework may thus be seen as a mechanism for indirect legitimation of the regulatory framework itself, through re-arranging the discursive construction of research excellence in ways that are rooted in both scientific and political epistemic communities (see Filippakou, 2017). This shift reflects wider and longer-term policy discourses about the relationships between higher education and industry, the connections between academic and non-academic contexts, the relevance of research to users, and the wider role of research in the so-called knowledge and innovation society/economy (as evidenced by a long succession of white papers, reports and other policy documents—see Lockett et al., 2015)—with added strength drawn from discourses rooted in professional cultures about evidence-based or evidence-informed practice (and more recently, policy) in professions such as medicine or education (see Nunan et al., 2017). Impact had also been for some time a priority along both arms of the UK ‘dual support’ system for research—however challenged the principles behind the system itself might be in the face of ongoing structural, political and financial pressures for the two arms to align with each other. The Royal Charters of the Research Councils and their strategic frameworks, as they stood until the 2017 HE bill, already drew direct links between good research and social, cultural, health, economic and environmental impacts. The Councils were interested in impact largely prospectively, in terms of plans and potential benefits, but also retrospectively, with ever closer scrutiny and reporting of impact post-award and after the end of the award. In some ways, the REF’s falling into step with impact in 2011 amplified an agenda that was already pervasive.

For the purposes of the REF 2014, impact was defined as ‘an effect on, change or benefit to the economy, society, culture, public policy or services, health, the environment or quality of life, beyond academia’ (REF, 2011, 2012 ) and was assessed by academic and user reviewers on the basis of standard-format case studies and unit-level strategic statements, using the twin criteria of ‘reach’ (or breadth) and ‘significance’ (or depth) of impact. The definition (plus some further explication) and the criteria have remained the same for REF 2021. At 20% of a unit’s final ‘quality profile’ in the REF 2014 (25% in REF 2021), impact has become a weighty element of the financial and reputational hierarchies at stake. Its introduction as one of the three domains for the assessment of research in the UK Research Excellence Framework in 2014 has had mixed, but lively, responses (see Chubb and Reed, 2018 , for a review of the different positions and Collini, 2012 , for a critique).

It may be argued that the addition of impact to system-wide research evaluation is a further form of boundary work, potentially leading to epistemological problems. After all, an impactful department will be valued more highly and receive higher reward than another department with exactly the same ratings for outputs and environment (Battaly, 2013). According to Battaly (2013), this situation may indicate ‘epistemic insensibility’, in the sense of implicitly signalling that research with less evident or direct impact may be less valuable to institutions in constructing their narrative of success, and, by extension, allowing a sense that such research is of lower epistemic value to percolate, decreasing the propensity to engage in the appropriate epistemic practices associated with it. While this is only one of several strands of criticism associated with the assessment of research impact, it stands in telling contrast with views of research assessment as epistemically neutral—in other words, as being concerned only with the pragmatic allocation and justification of resources, rather than with apportioning epistemic value.

Beyond the practicalities of the assessment exercise, the emphasis on impact can be seen as both a driver and an outcome of the public renegotiation of the values that underpin the case for public investment in research. In recent years, as professing mistrust in expertise, truth, facts or academic rigour became more politically fashionable, impact has grown in discursive importance, although particularly in instrumental guises that may fit a range of normative frames and may be at odds with the wider aims of impact assessment, including those stated in the context of the REF (REF, 2012; REF, 2018a, 2018b; see McCowan, 2018, for an extension of this argument). Hence the increasing political emphasis on research impact as a component of research value has come in tandem with academic critiques of the instrumentalisation and monetisation of research.

The agency problem and organisational recalibration

Research assessment, be it for performance-based funding such as the REF or in the light of the requirements of key research funders, such as the Research Councils, government departments, or the European Commission, has been one of the drivers of the rise of research and knowledge exchange as part of the institutional mission of HEIs in the UK over the past few decades. Research on the impacts of the RAE/REF suggests that the exercises may have contributed to strengthening research cultures and the volume and quality of research and research communication in many institutions, but that they may also have affected the nexus between teaching and research, in particular in undergraduate provision, and may have increased the likelihood of a more pressurised and inequitable climate in a range of institutions (see e.g., Oancea, 2014).

The high-stakes status (reputationally and financially) of the REF outcomes has had major implications for the everyday work of HEIs, amounting to wide-ranging organisational recalibration across the sector. HEIs have flexed, stretched or contracted to accommodate the ever-evolving definitions of performance. Some of these changes have directly affected the capacity for research in institutions, for example through recruitment drives, changes to the contractual arrangements of staff (leading in some cases to a defined separation between the workloads of teaching-only and research-active staff), or through the inclusion of outputs and, now, of impact among the criteria for the recruitment and promotion of staff, particularly to senior positions. As evidenced by the submissions to the REF (see e.g., Mills, Oancea and Robson, 2017), the workload models in many institutions have been adjusted to make space for impact activity—including ‘pathways’ to impact such as managing relationships of partnership, knowledge exchange, dissemination, or public engagement with research activities. New senior academic responsibilities have emerged: Impact Champions, Directors and Deans for Impact, Knowledge Exchange Leads, together with further appointments of Professors of Public Understanding of Science (and cognate titles), and so on.

The introduction of impact in the REF 2014 shaped strategic decisions in HEIs to invest differentially in areas of research, to restructure the organisational basis for the provision, validation and sharing of research, and may have contributed to re-directing parts of research activity towards shorter horizons of contribution to political priorities and societal challenges (as documented in Oancea, 2013 ). It also influenced decisions about the size and shape of the REF submissions themselves. Within the logic of the assessment exercise, in the lead up to 2014 the need to submit around one impact case study per ten FTE ‘research active’ staff in a unit prompted a lot of effort to generate, corroborate and write viable case studies, but also tactical decisions among units to have more or less inclusive submissions. For example, Kerridge ( 2015 ) notes a ‘spike’ in submissions just under each of the FTE thresholds beyond which a further impact case study would have been required.
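
The sketch below illustrates the step-function logic that produces such clustering. It assumes a simplified version of the rule described above (roughly one case study per ten FTE, with a minimum of two), not the verbatim REF 2014 requirement.

```python
# Sketch of the step-function logic discussed above: the required number of impact
# case studies rises in jumps with submitted FTE, so a unit just above a threshold
# needs an extra case study. The rule here is a simplification for illustration.
import math

def required_case_studies(fte: float) -> int:
    """Assumed rule: minimum of two case studies, roughly one per ten FTE."""
    return max(2, math.ceil(fte / 10.0))

for fte in (9.5, 19.9, 20.1, 29.9, 30.1):
    print(f"{fte:5.1f} FTE -> {required_case_studies(fte)} case studies")

# Each jump (e.g., 19.9 -> 20.1 FTE) shows why submissions tended to cluster
# just under the thresholds noted by Kerridge (2015).
```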

The strategic leadership, management and governance of research in universities have also been recalibrated. The environment and impact statements submitted to the REF 2014 (see e.g., those analysed in Mills, Oancea and Robson, 2017) show that institutional managers think strategically about, and monitor and scrutinise closely, the research activity in their unit(s). Research strategies encompass, for example, incentive structures for research engagement and productivity at different stages of career; specific steering towards unit-level (rather than individually determined) substantive and methodological foci; collective output and publication plans; as well as tactics for attracting external research income. The more recent addition, in view of REF 2021, of open access requirements has generated an unprecedented level of monitoring of publication cycles, which has been embedded in institutions with much more ease than many other changes, possibly because it taps into shared values of fairness, freedom and visibility of research knowledge.

The outcomes of the exercise, actual or anticipated, have also led to recalibration. Reputational outcomes may open or close possibilities for organisational growth, partnerships, or student recruitment; while financial outcomes may sustain or damage the vitality of established research environments and research capacity (for example, in universities with a history of significant quality-related funding), but they may also be the impetus for (or a dampener of) emergent growth.

Overall, both the process and the outcomes of performance-based research funding have tensioned the organisational ethos of institutions, as they internalised their localised interpretations of the funders’ requirements. Many institutions have made difficult choices in the light of these interpretations, for example between distributed (but fragmented and slow) and hierarchical (but instrumentally efficient) governance structures; between potentially divisive (but sharp) and more cohesive (but generic) strategic priorities and mission statements; and between transparent (but endlessly redressed) and opaque (but contentious) mechanisms for the management and administration of research and of research funding. In this way, what is being measured and monitored and what matters to researchers and their communities have subtly morphed into each other. Both principled resistance and pragmatic compliance may ultimately be co-opted by the strategic institutionalisation of research assessment in organisations faced with the rigours of performance-based allocations of funding and recognition. It may seem easy to dismiss these decisions as defensive routines within organisations. However, as responsibility for these transformations is regularly passed backwards and forwards along structural and political lines (for example, between government agencies, funding bodies, and different layers of institutional management), the problem remains of who ultimately owns this agenda and who has the agency to introduce or reverse change in organisational practices and trans-organisational networks.

The identity problem and the growth in blended professional practice

In the UK, a clear area of specialisation has formed around the RAE/REF, with many UK higher education institutions, as well as the bodies tasked with allocating their core funding, appointing senior REF directors, project managers or administrators, and other dedicated staff for the different aspects of academic performance recognised in the REF. For example, in response to the addition of impact assessment to the REF in the 2010 pilot and 2011 guidance, most institutions have added impact-related tasks to existing roles, including those of directors, deans and other senior research management staff, and have created impact task forces, project boards, and delivery and oversight groups. They have created new roles or reframed existing ones, such as impact officers, KE officers, professional impact writers, case study copy-editors, and public engagement managers (see Manville et al., 2015). They have also employed a large number of casual workers (many of whom are postgraduate students) to collect, input and clean data on impact and on different metrics. While a large proportion of such posts created prior to REF 2014 were temporary, and there has been vast restructuring and mobility in these areas since, many were not, and have since become established parts of organisational structures, with many impact and assessment professionals appointed during the previous cycle now line-managing new colleagues or entire units. Even in institutions where the pre-2014 appointments had been fixed-term, the model remained inscribed in their REF planning documents, and in many cases it is being revived as preparations get underway for the next exercise.

As a way of tooling the new impact- and research monitoring-related practices, some institutions have bought into the thriving market of commercial packages for monitoring and recording research and impact activities (or have created their own packages), and have invested in the training and allocation of staff time necessary to operate them. Further investments into the growing ‘para-academic’ (Macfarlane, 2011 ) industry associated with performance-based assessment include the buying in of experience in the form of expert advisors and external reviewers for the running of ‘mock’ REF exercises and the decoding of REF guidelines.

In this context, impact-related staff—with the exception of many of the precarious workers drafted into supporting the basic rungs of running the exercise—have strengthened their professional identity and sense of community over recent years, perhaps echoing the way in which research management became a fully recognised area of professional HE practice in the past few decades, supported by the stronger voice of professional organisations such as ARMA (the Association of Research Managers and Administrators, incorporated in 2006; its predecessor was created in 1991).

In addition, the ongoing arrangements for performance based research funding have stimulated the increasing professionalisation of other specialist ‘blended’ (Whitchurch, 2009 ) or ‘third space’ (Whitchurch, 2012 ) practice and practitioners, such as industry partnership, entrepreneurship, or commercialisation managers. These ‘dedicated appointments spanning professional and academic domains’ (Whitchurch, 2009 , p. 408) have developed widely in different sectors (including public, commercial and third sector research), across different aspects of institutional missions (beyond research), and in different international contexts. ‘Braided’ careers that alternate between, or otherwise combine, work in academia and other sectors are becoming more common. Secondments to and from other sectors, and various visiting positions, internships and practitioner or industry fellowships spanning the boundaries between HEIs and other types of organisation, are also used to facilitate the ‘brokering’ of new research networks and quasi-formal relationships with the potential to generate collaborative research and impact.

Arguably, the growth of such relationships contributes to ‘unbundling’ (Macfarlane, 2011 , Locke, 2014 ) current constructions of academic identities and careers, and to introducing further (and welcome) differentiation. At the same time, it may lead to miscommunication and territorialism, as the spaces occupied by research professionals and professional researchers get renegotiated; or to new forms of inequality, as the gaps between specialist career tracks and precarious academic work may widen.

Coupled with wider ‘soft’ practices in the governance of research, including co-option and hospitability, assessment technologies like those discussed in this paper are Janus-like, operating in a transient equilibrium that is highly sensitive to changes in the domestic and international research economy. I argued before that technologies like the REF are ‘multiply ambivalent’: they place individual, institutional and trans-institutional forms of participation, responses, and consequences in a versatile balance of overlapping tensions (Oancea, 2014). These ambivalences are not only down to the pragmatic details of how the REF is practised in everyday institutional contexts, but are also traceable to persistent sociological and philosophical problems and systemic structural issues that underpin high-stakes, large-scale assessments of research performance. In this paper, I highlighted seven such problems, and connected them with a consideration of the directions of travel in research assessment that I detected in my empirical research on research. Collective learning from the experience of several decades of formalised sector-wide research assessment in the UK, particularly through full consideration of insights from relevant research on research, may help ground these debates. A large body of empirical, theoretical and critical literature has already been developed around research policy and assessment; government departments and funding councils have also commissioned a range of evaluations and reports, including the Stern review and the Metric Tide report (BEIS, 2016b; Wilsdon, 2016). This paper has drawn reflectively on findings from past research to make a contribution to this collective learning project.

Overall, the directions of travel identified in this paper, and the discursive and political shifts that enabled them, have challenged established principles of funding and governance, including the dual support system, and have pushed assessment technologies into a pivotal position in the political dynamics of renegotiating the relationships between universities and the state. Transformative change of the research governance regime discussed in this paper, and of its implications, while possible, would be a major undertaking: it could not rest on simply removing any one particular element, but would need to involve changing both the structural conditions that underpin the regime and the cultural and normative premises that legitimise it. As far as glimpses into the future go, the UK seems to have placed its bets on performance-based resource allocation and funding-based incentivisation of organisational and individual behaviour; complex and formal accountability systems; and an emphasis on the extra-academic definition of research agendas and valuation of their outcomes. In the light of current geo-political changes and regional power re-configurations, these mechanisms are seen as key means for sustaining the capacity for, and quality of, research in the UK. To achieve this goal, however, balanced funding policies and a diverse portfolio of funding opportunities would need to be coupled with a determined stance on enabling healthy governance in the research and higher education system. The pre-conditions for such governance include intellectual freedom in research; structural conditions for insightful, dialogical, equitable and responsible decision-making; support and recognition for a truly diverse and critical academic agora; and commitment to the public funding of diverse modes of higher education research (including research that is critical, theoretical and conceptual, expressive or interpretive, and that goes beyond short-term political agendas).

At the same time, a swell of generative energies from across all strata of the research communities is now pushing for an active and more radical re-imagining of the organisation of research and research assessment, of its structures and mechanisms, and of its norms and values. Arguments are bubbling up for re-balancing intrinsic and extrinsic interpretations of value, for fully recognising and structurally supporting the epistemic value of diversity and a richer sense of equity, and for nurturing the symbiotic relationship between freedom and responsibility. These are not escapist or ‘alternative’ voices to be othered or dismissed, but principled movements towards re-claiming the moral and intellectual strengths of academic research. A strong research-on-research base, genuine dialogue and courageous leadership would be necessary in order to re-imagine research assessment as a formative, communicative, epistemically sound and morally defensible enterprise.

Data availability

Data sharing is not applicable to this paper as no new datasets were generated or analysed.

Notes

Footnote 1: Parts of this section are adapted, with permission, from a piece published originally in Research Fortnight (Oancea, 2015).

References

Adams J, McDougall J (2015) Revisiting the evidence: practice submissions to the REF. J Media Pract 16(2):97–107. https://doi.org/10.1080/14682753.2015.1041803


Allan Hanson F (2000) How tests create what they are intended to measure. In: Filer A (ed) Assessment: social practice and social product. Routledge Falmer, London and New York


Andras P (2011) Research: metrics, quality, and management implications. Res Eval 20(2):90–106

Attwood R (2010, September 9) ‘Frankenstein’ assessment is out of control. Times Higher Education

Baird JA, Elliott V (2018) Metrics in education—control and corruption. Oxford Review of Education 44(5):533–544. https://doi.org/10.1080/03054985.2018.1504858

Battaly H (2013) Detecting epistemic vice in higher education policy: epistemic insensibility in the Seven Solutions and the REF. J Philos Educ 47(2):263–280

Baudrillard J (1994) Simulacra and simulations. The University of Michigan Press, Ann Arbor

Bence V, Oppenheim C (2002) The evolution of the UK’s Research Assessment Exercise: publications, performance and perceptions. J Educ Adm Hist 37(2):137–155

BIS (2016) Success as a knowledge economy: teaching excellence, social mobility and student choice. White Paper, Cm 925, BIS/16/265. Department for Business, Innovation and Skills, London

BEIS (2016a) Lord Stern’s review of the Research Excellence Framework Call for evidence. https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/500114/ind-16-1-ref-review-call-for-evidence.pdf

BEIS (2016b) Research Excellence Framework (REF) review: building on success and learning from experience (the Stern review). Department for Business, Energy and Industrial Strategy, London

Bourdieu P (1988) Homo academicus. Polity Press, Cambridge

Chubb J, Reed MS (2018) The politics of research impact: academic perceptions of the implications for research funding, motivation and quality. Br Polit 1–17, https://doi.org/10.1057/s41293-018-0077-9

Collini S (2012) What are universities for? Penguin, Harmondsworth

Collins HM, Evans R (2002) The third wave of science studies: studies of expertise and experience. Social Stud Sci 32(2):235–296. https://doi.org/10.1177/0306312702032002003

Coryn CLS (2007) Evaluation of researchers and their research: Toward making the implicit explicit. Doctoral dissertation, Western Michigan University, Kalamazoo

Cronin B (2000) Semiotics and evaluative bibliometrics. J Doc 56(4):440–453

De Munck J, Zimmermann B (2015) Evaluation as practical judgment. Human Stud 38:113. https://doi.org/10.1007/s10746-014-9325-1

Derrick G (2018) The evaluators’ eye: impact assessment and academic peer review. Palgrave Macmillan, London


Dewey J (1939) Theory of valuation. In: Boydston JA (ed) The later works of John Dewey, vol 13. Southern Illinois University Press, Carbondale, pp 189–251

Donovan C (2009) Gradgrinding the social sciences: the politics of metrics of political science. Political Stud Rev 7:73–83

Filippakou O (2017) The evolution of the quality agenda in higher education: the politics of legitimation. J Educ Adm Hist 49(1):37–52. https://doi.org/10.1080/00220620.2017.1252738

Gellner E (1974) Legitimation of belief. Cambridge University Press, London

Pring R, Thomas G (eds) (2004) Evidence-based practice in education. Open University Press, Maidenhead

Hicks D (2012) Performance-based university research funding systems. Research Policy 41(2):251–261. https://doi.org/10.1016/j.respol.2011.09.007

Hill S (2016) Assessing (for) impact: future assessment of the societal impact of research. Pal Comm 2, https://doi.org/10.1057/palcomms.2016.73

Jonkers K, Zacharewicz T (2016) Research performance based funding systems: A comparative assessment. Brussels: European Commission, EUR 27837 EN. https://doi.org/10.2791/659483

Keane W (2003) Semiotics and the social analysis of material things. Lang Commun 23(3):409–425


Kelly A, Burrows R (2011) Measuring the value of sociology? Some notes on performative metricization in the contemporary academy. Sociol Rev 59(2s):130–150

Kerr C (1993) Higher education cannot escape history: issues for the twenty-first century. State University of New York Press, Albany

Kerridge S (2015, February 11) How thresholds for case studies shaped REF submissions. Research Fortnight

Locke W (2014) Shifting academic careers: implications for enhancing professionalism in teaching and supporting learning. Higher Education Academy, London

Lockett A, Wright M, Wild A (2015) The institutionalization of third stream activities in uk higher education: the role of discourse and metrics. Br J Manag 26:78–92

Lucas L (2006) The research game in academic life. Open University, Maidenhead

Macfarlane B (2011) The morphing of academic practice: unbundling and the rise of the paraacademic. High Educ Q 65(1):59–73

Manville C, Morgan Jones M, Frearson M, Castle-Clarke S, Henham ML, Gunashekar S, Grant J (2015) Preparing impact submissions for REF 2014: an evaluation: findings and observations. HEFCE, London

McCowan T (2018) Five perils of the impact agenda in higher education. Lond Rev Educ 16(2):279–295. https://doi.org/10.18546/LRE.16.2.08

Mills D, Oancea A, Robson J (2017) The Capacity and Impact of Education Research in the UK. Report to the Royal Society and British Academy Joint Enquiry on Educational Research. London: RS/BA

Miller P (2001) Governing by numbers: why calculative practices matters. Social Res 68(2):379–396

Minogue K (1986) Political science and the gross intellectual product. Gov Oppos 21:396–405

Nichols T (2017) The death of expertise. Oxford University Press, Oxford

Nunan D, O’Sullivan J, Heneghan C, Pluddemann A, Aronson J, Mahtani K (2017) Ten essential papers for the practice of evidence-based medicine. BMJ Evid-Based Med 22:202–204

Oancea A (2009) Standardisation and versatility in research assessment. In: Besley A (ed) Assessing the quality of research in higher education. Sense, Rotterdam

Oancea A (2008) Performative accountability and the UK Research Assessment Exercise. ACCESS: Critical Perspectives on Communication, Cultural & Policy Studies, 27(1/2): 153–173

Oancea A (2010) The Impacts of RAE 2008 on Education Research in UK Higher Education Institutions. Macclesfield: UCET/BERA

Oancea A (2013) Interpretations of research impact in seven disciplines. Eur Educ Res J 12(2):242–250. https://doi.org/10.2304/eerj.2013.12.2.242

Oancea A (2014) Research assessment as governance technology in the United Kingdom: findings from a survey of RAE 2008 impacts. Z Fur Erzieh 17:83–110. https://doi.org/10.1007/s11618-014-0575-5

Oancea A (2015) Metrics debate must be about ethics as well as techniques. Research Fortnight

Oancea A, Florez-Petour T, Atkinson J (2018) The ecologies and economy of cultural value from research. Int J Cult Policy 24(1):1–24. https://doi.org/10.1080/10286632.2015.1128418

Oancea A (2016) Challenging the grudging consensus behind the REF. Times Higher Education, 25 March

O’Neill O (2013) Intelligent accountability in education. Oxf Rev Educ 39(1):4–16

Phillimore AJ (1989) University research performance indicators in practice: The University Grants Committee’s evaluation of British universities, 1985-86. Res Policy 18:255–271

Pirrie A, Adamson K, Humes W (2010) Flexing academic identities: speaking truth to power. Power Educ 2(1):97–106

Power M (1997) The audit society: rituals of verification. Oxford University Press, Oxford

Sidhu R (2008) Risky custodians of trust: Instruments of quality in higher education. Int Educ J 9(1):59–71

Slaughter S, Leslie L (1997) Academic capitalism: politics, policies and the entrepreneurial university. The John Hopkins University Press, Baltimore

Strathern M (2000) The tyranny of transparency. Br Educ Res J 26(3):309–321

Sugimoto CB, Larivière V (2018) Measuring Research. What everyone needs to know. Oxford University Press, Oxford

Technopolis/SPRU (Science Policy Research Unit, University of Sussex) (2016) Landscape Review of Interdisciplinary Research in the UK. Report to HEFCE and RCUK. London: HEFCE

RAE (1999) Guidance on Submissions RAE 2/99. HEFCE, London

RAE (2005) Guidance on Submissions RAE 03/2005. HEFCE, London

REF (2011a) Decisions on assessing research impact. REF 01.2011. HEFCE, London

REF (2011b) Assessment framework and guidance on submissions. HEFCE, London, REF02.2011, July

REF (2012) Panel criteria and working methods. HEFCE, London, REF01.2012, Jan

REF (2017) REF 2021 Decisions on staff and outputs. HEFCE, London, November

REF (2018a) Draft guidance on submissions (2018/01). Research England, London, 23 July

REF (2018b) Consultation on the panel criteria and working methods (2018/02). Research England, London, 23 July

de Rijcke S, Wouters PF, Rushforth AD, Franssen TP, Hammarfelt B (2016) Evaluation practices and effects of indicator use—a literature review. Res Eval 25(2):161–169

Watermeyer R (2019) Competitive accountability in academic life: the struggle for social impact and public legitimacy. Edward Elgar, Cheltenham, forthcoming

Whitchurch C (2009) The rise of the blended professional in higher education: a comparison between the UK, Australia and the United States. High Educ 58(3):407–418

Whitchurch C (2012) Reconstructing Identities in HE: the Rise of the ‘Third Space’ professionals. Routledge, London

Wilsdon J (chair) (2016) The metric tide: independent review of the role of metrics in research assessment and management. SAGE, London

Wouters P (2016) Semiotics and citations. In: Sugimoto CR (ed) Theories of informetrics and scholarly communication: a festschrift in honour of Blaise Cronin. De Gruyter, Berlin, pp 72–92


Acknowledgements

The studies that provided the empirical background for this paper were funded by several grants from the Arts and Humanities Research Council; HEIF; British Educational Research Association; and the University of Oxford. Some parts of the text are adapted with permission from Oancea ( 2015 ) and Oancea ( 2016 ).

Author information

Authors and affiliations

University of Oxford, Oxford, UK

Alis Oancea


Corresponding author

Correspondence to Alis Oancea .

Ethics declarations

Competing interests

After the acceptance of this paper, the author became a member of the Research England Advisory Group on the ‘Future of Research Assessment’ (2019). The author is REF2021 coordinator for Unit of Assessment 23 at the University of Oxford.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Oancea, A. Research governance and the future(s) of research assessment. Palgrave Commun 5 , 27 (2019). https://doi.org/10.1057/s41599-018-0213-6


Received : 29 March 2018

Accepted : 13 December 2018

Published : 05 March 2019

DOI : https://doi.org/10.1057/s41599-018-0213-6


Psychological Assessment

  • Reference work entry
  • First Online: 01 January 2022
  • pp 4023–4030


Sofia von Humboldt, Joana Rolo & Isabel Leal


Synonyms: Psychodiagnostic assessment; Psychological battery; Psychological evaluation; Psychological testing

Psychological assessment is a testing method that uses a number of techniques to find hypotheses about individuals and their behavior, abilities, and personality (Framingham 2016 ). Psychological testing or psychological assessment is also referred to as conducting a battery of psychological tests on subjects. For different researchers, psychological assessment is a process, in which a psychologist aims: to achieve an accurate description of an individual’s functioning; to identify the person’s clinical needs (e.g., which interventions are more suitable); to make a differential diagnosis of mental disorders of all sorts; and to keep track of the progress made when an intervention is taking place (Meyer et al. 2001 ). A rigorous psychological assessment implies that health professionals also carry out a complete medical examination, to exclude the possibility of a...


American Psychiatric Association (2013) Diagnostic and statistical manual of mental disorders. Author, Washington, DC


Amieva H, Mokri H, Le Goff M et al (2014) Compensatory mechanisms in higher-educated subjects with Alzheimer’s disease: a study of 20 years of cognitive decline. Brain 137:1167–1175. https://doi.org/10.1093/brain/awu035


Arean PA, Alvidrez J, Barrera A et al (2002) Would older medical patients use psychological services? Gerontologist 42:392–398. https://doi.org/10.1093/geront/42.3.392

Australian Institute of Health and Welfare [AIHW] (2002) Australia’s health 2002. Australian Institute of Health and Welfare, Canberra

Australian Institute of Health and Welfare [AIHW] (2013) Australian hospital statistics 2011–12. Australian Institute of Health and Welfare, Canberra

Bartels SJ, Blow FC, Brockmann LM, Van Citters AD (2005) Substance abuse and mental health care among older Americans: the state of the knowledge and future directions. WESTAT, Rockville

Bertera EM, Bertera RL (2008) Fear of falling and activity avoidance in a national sample of older adults in the United States. Health Soc Work 33:54–62. https://doi.org/10.1093/hsw/33.1.54

Carr D, Luth EA (2017) Advance care planning: contemporary issues and future directions. Innov Aging 1. https://doi.org/10.1093/geroni/igx012

Depp CA, Loughran C, Vahia I, Molinari V (2010) Assessing psychosis in acute and chronic mentally ill older adults. In: Lichtenberg PA (ed) Handbook of assessment in clinical gerontology, 2nd edn. Elsevier, New York, pp 123–154

Edelstein BA, Woodhead EL, Segal DL et al (2007) Older adult psychological assessment: current instrument status and related considerations. Clin Gerontol 31:1–35. https://doi.org/10.1080/07317110802072108

Edelstein BA, Drozdick LW, Ciliberti CM (2010) Assessment of depression and bereavement in older adults. In: Lichtenberg PA (ed) Handbook of assessment in clinical gerontology, 2nd edn. Elsevier, New York, pp 123–154

Framingham J (2016) What is psychological assessment? Psych Cent. https://psychcentral.com/lib/what-is-psychological-assessment/ . Accessed 16 Oct 2018

Gatz M, Smyer M (2001) Mental health and aging at the outset of the twenty-first century. In: Birren JE, Schaie KW (eds) Handbook of the psychology of aging, 5th edn. Academic Press, San Diego, pp 523–544

Gatz M, Smyer MA, DiGilio DA (2016) Psychology’s contribution to the well-being of older Americans. Am Psychol 71:257–267. https://doi.org/10.1037/a0040251

Karel MJ, Gatz M, Smyer MA (2012) Aging and mental health in the decade ahead: what psychologists need to know. Am Psychol 67:184–198. https://doi.org/10.1037/a0025393

Knight BG, Pachana NA (2015) Psychological assessment and therapy with older adults. Oxford University Press, Oxford

Koh S, Blank K, Cohen CI et al (2010) Public’s view of mental health services for the elderly: responses to dear Abby. Psychiatr Serv 61:1146–1149. https://doi.org/10.1176/ps.2010.61.11.1146

Kollack-Walker S, Liu CY, Fleisher AS (2017) The role of neuroimaging in the assessment of the cognitively impaired elderly. Neurol Clin 35:231–262. https://doi.org/10.1016/j.ncl.2017.01.010

La Rue A, Watson J (1998) Psychological assessment of older adults. Prof Psychol Res Pract 29:5–14. https://doi.org/10.1037/0735-7028.29.1.5

Landry GJ, Best JR, Liu-Ambrose T (2015) Measuring sleep quality in older adults: a comparison using subjective and objective methods. Front Aging Neurosci 7:1–10. https://doi.org/10.3389/fnagi.2015.00166

Lawton EM, Shields AJ, Oltmanns TF (2011) Five-factor model personality disorder prototypes in a community sample: self- and informant-reports predicting interview-based dsm diagnoses. Personal Disord Theory Res Treat 2:279–292. https://doi.org/10.1037/a0022617

Maruish ME (2013) Outcomes assessment in health settings. In: Geisinger KF (ed) APA handbook of testing and assessment in psychology, Testing and assessment in clinical and counseling psychology, vol 2. American Psychological Association, Washington, DC, pp 303–324

Mattar S, Khan F (2017) Personality disorders in older adults: diagnosis and management. Prog Neurol Psychiatry 21:22–27. https://doi.org/10.1002/pnp.467

McVay JC, Kane MJ, Kwapil TR (2009) Tracking the train of thought from the laboratory into everyday life: an experience-sampling study of mind wandering across controlled and ecological contexts. Psychon Bull Rev 16:857–863. https://doi.org/10.3758/PBR.16.5.857

Meeks S, Van Haitsma K, Schoenbachler B, Looney SW (2015) BE-ACTIV for depression in nursing homes: primary outcomes of a randomized clinical trial. J Gerontol Ser B 70:13–23. https://doi.org/10.1093/geronb/gbu026

Meyer GJ, Finn SE, Eyde LD et al (2001) Psychological testing and psychological assessment: a review of evidence and issues. Am Psychol 56:128–165. https://doi.org/10.1037/0003-066X.56.2.128

Moore RC, Depp CA, Wetherell JL, Lenze EJ (2016) Ecological momentary assessment versus standard assessment instruments for measuring mindfulness, depressed mood, and anxiety among older adults. J Psychiatr Res 75:116–123. https://doi.org/10.1016/j.jpsychires.2016.01.011

Neikrug AB, Ancoli-Israel S (2010) Sleep disorders in the older adult – a mini-review. Gerontology 56:181–189. https://doi.org/10.1159/000236900

Pryor R (2012) Contemporary issues in the use of psychological assessment for recruitment and selection. InPsych 34:10–13

Qualls SH, Segal DL, Norman S et al (2002) Psychologists in practice with older adults: current patterns, sources of training, and need for continuing education. Prof Psychol Res Pract 33:435–442. https://doi.org/10.1037/0735-7028.33.5.435

Ramsey AT, Wetherell JL, Depp C et al (2016) Feasibility and acceptability of smartphone assessment in older adults with cognitive and emotional difficulties. J Technol Hum Serv 34:209–223. https://doi.org/10.1080/15228835.2016.1170649

Rossi G, Van Den Broeck J, Dierckx E et al (2014) Personality assessment among older adults: the value of personality questionnaires unraveled. Aging Ment Health 18:936–940. https://doi.org/10.1080/13607863.2014.924089

Rossi G, Videler A, van Alphen SPJ (2018) Challenges and developments in the assessment of (mal)adaptive personality and pathological states in older adults. Assessment 25:279–284. https://doi.org/10.1177/1073191116685810

Skelton F, Kunik ME, Regev T, Naik AD (2010) Determining if an older adult can make and execute decisions to live safely at home: a capacity assessment and intervention model. Arch Gerontol Geriatr 50:300–305. https://doi.org/10.1016/j.archger.2009.04.016

Spangenberg L, Forkmann T, Brahler E, Glaesmer H (2011) The association of depression and multimorbidity in the elderly: implications for the assessment of depression. Psychogeriatrics 11:227–234. https://doi.org/10.1111/j.1479-8301.2011.00375.x

Vacha-Haase T (2013) Psychological assessment with older adults. In: APA handbook of testing and assessment in psychology, vol. 2: testing and assessment in clinical and counseling psychology. American Psychological Association, Washington, DC, pp 555–568

Wahl H, Schnabel E (2019) Geropsychology. In: Gu D, Dupre ME (eds) Encyclopedia of gerontology and population aging. Springer, Singapore

Willis TA, Yearall SM, Gregory AM (2011) Self-reported sleep quality and cognitive style in older adults. Cogn Ther Res 35:1–10. https://doi.org/10.1007/s10608-009-9270-x

Wolitzky-Taylor KB, Castriotta N, Lenze EJ et al (2010) Anxiety disorders in older adults: a comprehensive review. Depress Anxiety 27:190–211. https://doi.org/10.1002/da.20653

World Health Organization (2017) Mental health of older adults. http://www.who.int/mediacentre/factsheets/fs381/en/ . Accessed 19 Jun 2018


von Humboldt, S., Rolo, J., Leal, I. (2021). Psychological Assessment. In: Gu, D., Dupre, M.E. (eds) Encyclopedia of Gerontology and Population Aging. Springer, Cham. https://doi.org/10.1007/978-3-030-22009-9_84

Assessment, evaluations, and definitions of research impact: A review


Teresa Penfield, Matthew J. Baker, Rosa Scoble, Michael C. Wykes, Assessment, evaluations, and definitions of research impact: A review, Research Evaluation , Volume 23, Issue 1, January 2014, Pages 21–32, https://doi.org/10.1093/reseval/rvt021


This article aims to explore what is understood by the term 'research impact' and to provide a comprehensive assimilation of available literature and information, drawing on global experiences to understand the potential for methods and frameworks of impact assessment to be implemented for UK impact assessment. We then take a more focused look at the impact component of the UK Research Excellence Framework (REF) taking place in 2014, some of the challenges of evaluating impact, the role that systems might play in the future in capturing the links between research and impact, and the requirements we have for these systems.

When considering the impact that is generated as a result of research, a number of authors and government recommendations have advised that a clear definition of impact is required ( Duryea, Hochman, and Parfitt 2007 ; Grant et al. 2009 ; Russell Group 2009 ). From the outset, we note that the understanding of the term impact differs between users and audiences. There is a distinction between ‘academic impact’ understood as the intellectual contribution to one’s field of study within academia and ‘external socio-economic impact’ beyond academia. In the UK, evaluation of academic and broader socio-economic impact takes place separately. ‘Impact’ has become the term of choice in the UK for research influence beyond academia. This distinction is not so clear in impact assessments outside of the UK, where academic outputs and socio-economic impacts are often viewed as one, to give an overall assessment of value and change created through research.

For the purposes of the REF, impact is defined as 'an effect on, change or benefit to the economy, society, culture, public policy or services, health, the environment or quality of life, beyond academia'.

Impact is assessed alongside research outputs and environment to provide an evaluation of research taking place within an institution. As such, research outputs (for example, knowledge generated and publications) can be translated into outcomes (for example, new products and services) and impacts or added value (Duryea et al. 2007). Although some might find the distinction somewhat marginal or even confusing, this differentiation between outputs, outcomes, and impacts is important, and has been highlighted not only for the impacts derived from university research (Kelly and McNicoll 2011) but also for work done in the charitable sector (Ebrahim and Rangan 2010; Berg and Månsson 2011; Kelly and McNicoll 2011). The Social Return on Investment (SROI) guide (The SROI Network 2012) suggests that 'The language varies "impact", "returns", "benefits", "value" but the questions around what sort of difference and how much of a difference we are making are the same'. It is perhaps assumed here that a positive or beneficial effect will be considered as an impact, but what about changes that are perceived to be negative? Wooding et al. (2007) adapted the terminology of the Payback Framework, developed for the health and biomedical sciences, from 'benefit' to 'impact' when modifying the framework for the social sciences, arguing that whether a change is positive or negative is subjective and can also change with time. This is commonly illustrated with the drug thalidomide, which was introduced in the 1950s to help with, among other things, morning sickness, but was withdrawn in the early 1960s because of teratogenic effects that resulted in birth defects. Thalidomide has since been found to have beneficial effects in the treatment of certain types of cancer. Clearly the impact of thalidomide would have been viewed very differently in the 1950s compared with the 1960s or today.

In viewing impact evaluations, it is important to consider not only who has evaluated the work but also the purpose of the evaluation, in order to determine the limits and relevance of an assessment exercise. In this article, we draw on a broad range of examples with a focus on methods of evaluation for research impact within Higher Education Institutions (HEIs). As part of this review, we aim to explore the following questions:

What are the reasons behind trying to understand and evaluate research impact?

What are the methodologies and frameworks that have been employed globally to assess research impact and how do these compare?

What are the challenges associated with understanding and evaluating research impact?

What indicators, evidence, and impacts need to be captured within developing systems?

What are the reasons behind trying to understand and evaluate research impact?

Throughout history, the activities of a university have been to provide both education and research, but the fundamental purpose of a university was perhaps described in the writings of mathematician and philosopher Alfred North Whitehead (1929) .

‘The justification for a university is that it preserves the connection between knowledge and the zest of life, by uniting the young and the old in the imaginative consideration of learning. The university imparts information, but it imparts it imaginatively. At least, this is the function which it should perform for society. A university which fails in this respect has no reason for existence. This atmosphere of excitement, arising from imaginative consideration transforms knowledge.’

In undertaking excellent research, we anticipate that great things will come, and one of the fundamental reasons for undertaking research is therefore that we will generate and transform knowledge that benefits society as a whole.

One might consider that by funding excellent research, impacts (including those that are unforeseen) will follow, and traditionally, assessment of university research focused on academic quality and productivity. Aspects of impact, such as value of Intellectual Property, are currently recorded by universities in the UK through their Higher Education Business and Community Interaction Survey return to Higher Education Statistics Agency; however, as with other public and charitable sector organizations, showcasing impact is an important part of attracting and retaining donors and support ( Kelly and McNicoll 2011 ).

The reasoning behind the move towards assessing research impact is undoubtedly complex, involving both political and socio-economic factors, but, nevertheless, we can differentiate between four primary purposes.

HEIs overview. To enable research organizations including HEIs to monitor and manage their performance and understand and disseminate the contribution that they are making to local, national, and international communities.

Accountability. To demonstrate to government, stakeholders, and the wider public the value of research. There has been a drive from the UK government, through the Higher Education Funding Council for England (HEFCE) and the Research Councils (HM Treasury 2004), to account for the spending of public money by demonstrating the value of research to tax payers, voters, and the public in terms of socio-economic benefits (European Science Foundation 2009), in effect justifying this expenditure (Davies, Nutley, and Walter 2005; Hanney and González-Block 2011).

Inform funding. To understand the socio-economic value of research and subsequently inform funding decisions. By evaluating the contribution that research makes to society and the economy, future funding can be allocated where it is perceived to bring about the desired impact. As Donovan (2011) comments, ‘Impact is a strong weapon for making an evidence based case to governments for enhanced research support’.

Understand. To understand the methods and routes by which research leads to impacts, to make the most of research findings, and to develop better ways of delivering impact.

The growing trend for accountability within the university system is not limited to research and is mirrored in assessments of teaching quality, which now feed into evaluation of universities to ensure fee-paying students’ satisfaction. In demonstrating research impact, we can provide accountability upwards to funders and downwards to users on a project and strategic basis ( Kelly and McNicoll 2011 ). Organizations may be interested in reviewing and assessing research impact for one or more of the aforementioned purposes and this will influence the way in which evaluation is approached.

It is important to emphasize that 'Not everyone within the higher education sector itself is convinced that evaluation of higher education activity is a worthwhile task' (Kelly and McNicoll 2011). Once plans for the new assessment of university research were released, the University and College Union (2011) organized a petition calling on the UK funding councils to withdraw the inclusion of impact assessment from the REF proposals. This petition was signed by 17,570 academics (52,409 academics were returned to the 2008 Research Assessment Exercise), including Nobel laureates and Fellows of the Royal Society (University and College Union 2011). Impact assessments raise concerns that research will be steered towards disciplines and topics in which impact is more easily evidenced and which provide economic impacts, which could subsequently lead to a devaluation of 'blue skies' research. Johnston (1995) notes that by developing relationships between researchers and industry, new research strategies can be developed. This raises the questions of whether UK business and industry should not themselves invest in the research that will deliver them impacts, and of who will fund basic research if not the government. Donovan (2011) asserts that there should be no disincentive for conducting basic research. By asking academics to consider the impact of the research they undertake, and by reviewing and funding them accordingly, the result may be to compromise research by steering it away from the imaginative and creative quest for knowledge. Professor James Ladyman, at the University of Bristol, a vocal adversary of awarding funding based on the assessment of research impact, has been quoted as saying that '…inclusion of impact in the REF will create "selection pressure," promoting academic research that has "more direct economic impact" or which is easier to explain to the public' (Corbyn 2009).

Despite the concerns raised, the broader socio-economic impacts of research will be included and will count for 20% of the overall research assessment as part of the REF in 2014. From an international perspective, this represents a step change in the comprehensiveness with which impact will be assessed within universities and research institutes, incorporating impact from across all research disciplines. Understanding what impact looks like across the various strands of research, and the variety of indicators and proxies used to evidence impact, will be important to developing a meaningful assessment.

What are the methodologies and frameworks that have been employed globally to evaluate research impact and how do these compare?

The traditional form of evaluation of university research in the UK was based on measuring academic impact and quality through a process of peer review (Grant 2006). Evidence of academic impact may be derived through various bibliometric methods, one example of which is the h-index, which incorporates factors such as the number of publications and citations. These metrics may be used in the UK to understand the benefits of research within academia and are often incorporated into the broader perspective of impact seen internationally, for example within Excellence in Research for Australia and using STAR Metrics in the USA, in which quantitative measures are used to assess impact, for example publications, citations, and research income. These 'traditional' bibliometric techniques can be regarded as giving only a partial picture of full impact (Bornmann and Marx 2013), with no link to causality. Standard approaches actively used in programme evaluation, such as surveys, case studies, bibliometrics, econometrics and statistical analyses, content analysis, and expert judgment, are each considered by some (Vonortas and Link 2012) to have shortcomings when used to measure impacts.
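To make the bibliometric example concrete, the short sketch below computes an h-index from a list of citation counts; the counts and function name are invented for illustration, and, as noted above, such a metric gives only a partial picture of impact.

```python
def h_index(citation_counts):
    """Return the h-index: the largest h such that at least h papers
    have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(counts, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for nine papers.
print(h_index([25, 19, 12, 8, 7, 4, 3, 1, 0]))  # prints 5
```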

Assessment of the wider socio-economic impact of research initially relied on metrics-based indicators such as Intellectual Property registered and commercial income generated (Australian Research Council 2008). In the UK, more sophisticated assessments of impact incorporating wider socio-economic benefits were first investigated within the fields of Biomedical and Health Sciences (Grant 2006), an area of research that needed to justify the significant investment it received. Frameworks for assessing impact have been designed and are employed at an organizational level, addressing the specific requirements of the organization and its stakeholders. As a result, numerous and widely varying models and frameworks for assessing impact exist. Here we outline a few of the most notable models that demonstrate the contrast in approaches available.

The Payback Framework is possibly the most widely used and adapted model for impact assessment ( Wooding et al. 2007 ; Nason et al. 2008 ), developed during the mid-1990s by Buxton and Hanney, working at Brunel University. It incorporates both academic outputs and wider societal benefits ( Donovan and Hanney 2011 ) to assess outcomes of health sciences research. The Payback Framework systematically links research with the associated benefits ( Scoble et al. 2010 ; Hanney and González-Block 2011 ) and can be thought of in two parts: a model that allows the research and subsequent dissemination process to be broken into specific components within which the benefits of research can be studied, and second, a multi-dimensional classification scheme into which the various outputs, outcomes, and impacts can be placed ( Hanney and Gonzalez Block 2011 ). The Payback Framework has been adopted internationally, largely within the health sector, by organizations such as the Canadian Institute of Health Research, the Dutch Public Health Authority, the Australian National Health and Medical Research Council, and the Welfare Bureau in Hong Kong ( Bernstein et al. 2006 ; Nason et al. 2008 ; CAHS 2009; Spaapen et al. n.d. ). The Payback Framework enables health and medical research and impact to be linked and the process by which impact occurs to be traced. For more extensive reviews of the Payback Framework, see Davies et al. (2005) , Wooding et al. (2007) , Nason et al. (2008) , and Hanney and González-Block (2011) .

A very different approach, known as Social Impact Assessment Methods for research and funding instruments through the study of Productive Interactions (SIAMPI), was developed from the Dutch project Evaluating Research in Context and has as a central theme the capture of 'productive interactions' between researchers and stakeholders by analysing the networks that evolve during research programmes (Spaapen and Drooge 2011; Spaapen et al. n.d.). SIAMPI is based on the widely held assumption that interactions between researchers and stakeholders are an important pre-requisite to achieving impact (Donovan 2011; Hughes and Martin 2012; Spaapen et al. n.d.). This framework is intended to be used as a learning tool to develop a better understanding of how research interactions lead to social impact, rather than as an assessment tool for judging, showcasing, or even linking impact to a specific piece of research. SIAMPI has been used within the Netherlands Institute for Health Services Research (SIAMPI n.d.). 'Productive interactions', which can perhaps be viewed as instances of knowledge exchange, are widely valued and supported internationally as mechanisms for enabling impact; for example, Canada's Social Sciences and Humanities Research Council provides financial support for knowledge exchange with a view to enabling long-term impact. In the UK, the Department for Business, Innovation and Skills provided funding of £150 million for knowledge exchange in 2011–12 to 'help universities and colleges support the economic recovery and growth, and contribute to wider society' (Department for Business, Innovation and Skills 2012). While valuing and supporting knowledge exchange is important, SIAMPI perhaps takes this a step further in enabling these exchange events to be captured and analysed. One of the advantages of this method is that less input is required compared with capturing the full route from research to impact. A comprehensive assessment of impact itself is not undertaken with SIAMPI, which makes it a less suitable method where showcasing the benefits of research is desirable or where justification of funding based on impact is required.

The first attempt globally to comprehensively capture the socio-economic impact of research across all disciplines was undertaken for the Australian Research Quality Framework (RQF), using a case study approach. The RQF was developed to demonstrate and justify public expenditure on research, and as part of this framework, a pilot assessment was undertaken by the Australian Technology Network. Researchers were asked to evidence the economic, societal, environmental, and cultural impact of their research within broad categories, which were then verified by an expert panel, who concluded that the researchers and case studies could provide enough qualitative and quantitative evidence for reviewers to assess the impact arising from their research (Duryea et al. 2007). To evaluate impact, case studies were interrogated and verifiable indicators assessed to determine whether research had led to reciprocal engagement, adoption of research findings, or public value. The RQF pioneered the case study approach to assessing research impact; however, with a change in government in 2007, this framework was never implemented in Australia, although it has since been taken up and adapted for the UK REF.

In developing the UK REF, HEFCE commissioned a report, in 2009, from RAND to review international practice for assessing research impact and provide recommendations to inform the development of the REF. RAND selected four frameworks to represent the international arena ( Grant et al. 2009 ). One of these, the RQF, they identified as providing a ‘promising basis for developing an impact approach for the REF’ using the case study approach. HEFCE developed an initial methodology that was then tested through a pilot exercise. The case study approach, recommended by the RQF, was combined with ‘significance’ and ‘reach’ as criteria for assessment. The criteria for assessment were also supported by a model developed by Brunel for ‘measurement’ of impact that used similar measures defined as depth and spread. In the Brunel model, depth refers to the degree to which the research has influenced or caused change, whereas spread refers to the extent to which the change has occurred and influenced end users. Evaluation of impact in terms of reach and significance allows all disciplines of research and types of impact to be assessed side-by-side ( Scoble et al. 2010 ).

The range and diversity of frameworks developed reflect the variation in purpose of evaluation including the stakeholders for whom the assessment takes place, along with the type of impact and evidence anticipated. The most appropriate type of evaluation will vary according to the stakeholder whom we are wishing to inform. Studies ( Buxton, Hanney and Jones 2004 ) into the economic gains from biomedical and health sciences determined that different methodologies provide different ways of considering economic benefits. A discussion on the benefits and drawbacks of a range of evaluation tools (bibliometrics, economic rate of return, peer review, case study, logic modelling, and benchmarking) can be found in the article by Grant (2006) .

Evaluation of impact is becoming increasingly important, both within the UK and internationally, and research and development into impact evaluation continues, for example, researchers at Brunel have developed the concept of depth and spread further into the Brunel Impact Device for Evaluation, which also assesses the degree of separation between research and impact ( Scoble et al. working paper ).

Although based on the RQF, the REF did not adopt all of the suggestions held within, for example, the option of allowing research groups to opt out of impact assessment should the nature or stage of research deem it unsuitable ( Donovan 2008 ). In 2009–10, the REF team conducted a pilot study for the REF involving 29 institutions, submitting case studies to one of five units of assessment (in clinical medicine, physics, earth systems and environmental sciences, social work and social policy, and English language and literature) ( REF2014 2010 ). These case studies were reviewed by expert panels and, as with the RQF, they found that it was possible to assess impact and develop ‘impact profiles’ using the case study approach ( REF2014 2010 ).

From 2014, research within UK universities and institutions will be assessed through the REF; this will replace the Research Assessment Exercise, which has been used to assess UK research since the 1980s. Differences between these two assessments include the removal of indicators of esteem and the addition of assessment of socio-economic research impact. The REF will therefore assess three aspects of research:

Outputs

Impact

Environment

Research impact is assessed in two formats, first, through an impact template that describes the approach to enabling impact within a unit of assessment, and second, using impact case studies that describe the impact taking place following excellent research within a unit of assessment ( REF2014 2011a ). HEFCE indicated that impact should merit a 25% weighting within the REF ( REF2014 2011b ); however, this has been reduced for the 2014 REF to 20%, perhaps as a result of feedback and lobbying, for example, from the Russell Group and Million + group of Universities who called for impact to count for 15% ( Russell Group 2009 ; Jump 2011 ) and following guidance from the expert panels undertaking the pilot exercise who suggested that during the 2014 REF, impact assessment would be in a developmental phase and that a lower weighting for impact would be appropriate with the expectation that this would be increased in subsequent assessments ( REF2014 2010 ).
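As a rough illustration of what a 20% weighting means in practice, the sketch below combines hypothetical sub-profile scores into a weighted overall score. The 65% weighting for outputs and 15% for environment are stated here as an assumption based on the published REF 2014 scheme, not taken from the text above, and the scores themselves are invented.

```python
# Assumed REF 2014 weights: outputs 65%, impact 20%, environment 15%.
WEIGHTS = {"outputs": 0.65, "impact": 0.20, "environment": 0.15}

def overall_quality(sub_profiles):
    """Combine sub-profile scores (e.g., average star ratings) into a
    weighted overall score."""
    return sum(WEIGHTS[name] * score for name, score in sub_profiles.items())

# Hypothetical sub-profile scores on a 0-4 star scale.
print(overall_quality({"outputs": 3.1, "impact": 2.8, "environment": 3.5}))
```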

The quality and reliability of impact indicators will vary according to the impact we are trying to describe and link to research. In the UK, evidence and research impacts will be assessed for the REF within research disciplines. Although it can be envisaged that the range of impacts derived from research in different disciplines is likely to vary, one might question whether it makes sense to compare impacts within disciplines when the range of impact can vary enormously, for example from business development to cultural change or saving lives. An alternative approach was suggested for the RQF in Australia, where it was proposed that types of impact be compared rather than impact from specific disciplines.

Providing advice and guidance within specific disciplines is undoubtedly helpful. It can be seen from the panel guidance produced by HEFCE to illustrate impacts and evidence that impact and evidence are expected to vary according to discipline (REF2014 2012). Why should this be the case? Two areas of research impact, health and biomedical sciences and the social sciences, have received particular attention in the literature by comparison with, for example, the arts. Reviews and guidance on developing and evidencing impact in particular disciplines include the London School of Economics (LSE) Public Policy Group's impact handbook (LSE n.d.), a review of the social and economic impacts arising from the arts produced by Reeves (2002), and a review by Kuruvilla et al. (2006) on the impact arising from health research. Perhaps it is time for a generic guide based on types of impact rather than research discipline?

What are the challenges associated with understanding and evaluating research impact?

In endeavouring to assess or evaluate impact, a number of difficulties emerge and these may be specific to certain types of impact. Given that the type of impact we might expect varies according to research discipline, impact-specific challenges present us with the problem that an evaluation mechanism may not fairly compare impact between research disciplines.

5.1 Time lag

The time lag between research and impact varies enormously. For example, the development of a spin-out can take place in a very short period, whereas it took around 30 years from the discovery of DNA before technology was developed to enable DNA fingerprinting. In developing the RQF, The Allen Consulting Group (2005) highlighted that defining a time lag between research and impact was difficult. In the UK, the Russell Group Universities responded to the REF consultation by recommending that no time lag be put on the delivery of impact from a piece of research, citing examples such as the development of cardiovascular disease treatments, which take between 10 and 25 years from research to impact (Russell Group 2009). To be considered for inclusion within the REF, impact must be underpinned by research that took place between 1 January 1993 and 31 December 2013, with impact occurring during an assessment window from 1 January 2008 to 31 July 2013. However, there has been recognition that this time window may be insufficient in some instances, with architecture being granted an additional 5-year period (REF2014 2012); why only architecture has been granted this dispensation is not clear, when similar cases could be made for medicine, physics, or even English literature. Recommendations from the REF pilot were that the panels should be able to extend the time frame where appropriate; this, however, poses a difficult decision when submitting a case study to the REF, as the institution cannot know in advance what the view of the panel will be and whether, if the extension is deemed inappropriate, the case study will be rendered 'unclassified'.
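The eligibility windows quoted above can be expressed as a simple date check. The sketch below is illustrative only; the function name and example dates are assumptions, and real REF eligibility involves further rules not modelled here.

```python
from datetime import date

# Windows quoted in the text for REF 2014 case studies.
RESEARCH_WINDOW = (date(1993, 1, 1), date(2013, 12, 31))
IMPACT_WINDOW = (date(2008, 1, 1), date(2013, 7, 31))

def within_ref_windows(research_date, impact_date):
    """Check that the underpinning research and the claimed impact both
    fall inside the stated assessment windows."""
    in_research = RESEARCH_WINDOW[0] <= research_date <= RESEARCH_WINDOW[1]
    in_impact = IMPACT_WINDOW[0] <= impact_date <= IMPACT_WINDOW[1]
    return in_research and in_impact

# Hypothetical example: research published in 1995, impact realized in 2012.
print(within_ref_windows(date(1995, 6, 1), date(2012, 3, 15)))  # True
```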

5.2 The developmental nature of impact

Impact is not static: it will develop and change over time, and this development may be an increase or a decrease in the current degree of impact. Impact can be temporary or long-lasting. The point at which assessment takes place will therefore influence the degree and significance of that impact. For example, following the discovery of a new potential drug, preclinical work is required, followed by Phase 1, 2, and 3 trials, and then regulatory approval is granted before the drug is used to deliver potential health benefits. Clearly there is the possibility that the potential new drug will fail at any one of these phases, but each phase can be classed as an interim impact of the original discovery work en route to the delivery of health benefits; the time at which an impact assessment takes place will therefore influence the degree of impact observed. If impact is short-lived and has come and gone within an assessment period, how will it be viewed and considered? Again, the objective and perspective of the individuals and organizations assessing impact will be key to understanding how temporary and dissipated impact will be valued in comparison with longer-term impact.

5.3 Attribution

Impact is derived not only from targeted research but also from serendipitous findings, good fortune, and complex networks interacting and translating knowledge and research. The exploitation of research to provide impact occurs through a complex variety of processes, individuals, and organizations, and therefore attributing the contribution made by a specific individual, piece of research, funding, strategy, or organization to an impact is not straightforward. Husbands-Fealing suggests that, to assist the identification of causality for impact assessment, it is useful to develop a theoretical framework to map the actors, activities, linkages, outputs, and impacts within the system under evaluation, showing how later phases result from earlier ones. Such a framework should be not linear but recursive, including elements from contextual environments that influence and/or interact with various aspects of the system. Impact is often the culmination of work spanning research communities (Duryea et al. 2007). Concerns over how to attribute impacts have been raised many times (The Allen Consulting Group 2005; Duryea et al. 2007; Grant et al. 2009), and differentiating between the various major and minor contributions that lead to impact is a significant challenge.

Figure 1, replicated from Hughes and Martin (2012), illustrates how the ease with which impact can be attributed decreases with time, whereas the impact, or the effect of complementary assets, increases. This highlights the problem that it may take a considerable amount of time for the full impact of a piece of research to develop, but that, because of this time and the increasing complexity of the networks involved in translating the research and interim impacts, it becomes more difficult to attribute the impact and link it back to a contributing piece of research.

Figure 1. Time, attribution, impact. Replicated from Hughes and Martin (2012).

This presents particular difficulties in research disciplines conducting basic research, such as pure mathematics, where the impact of research is unlikely to be foreseen. Research findings will be taken up in other branches of research and developed further before socio-economic impact occurs, by which point attribution becomes a huge challenge. If this research is to be assessed alongside more applied research, it is important that we are able to at least determine the contribution of basic research. It has long been acknowledged that outstanding leaps forward in knowledge and understanding come from immersion in a background of intellectual thinking: one is able to see further 'by standing on the shoulders of giants'.

5.4 Knowledge creep

It is acknowledged that one of the outcomes of developing new knowledge through research can be ‘knowledge creep’ where new data or information becomes accepted and gets absorbed over time. This is particularly recognized in the development of new government policy where findings can influence policy debate and policy change, without recognition of the contributing research ( Davies et al. 2005 ; Wooding et al. 2007 ). This is recognized as being particularly problematic within the social sciences where informing policy is a likely impact of research. In putting together evidence for the REF, impact can be attributed to a specific piece of research if it made a ‘distinctive contribution’ ( REF2014 2011a ). The difficulty then is how to determine what the contribution has been in the absence of adequate evidence and how we ensure that research that results in impacts that cannot be evidenced is valued and supported.

5.5 Gathering evidence

Gathering evidence of the links between research and impact is a challenge, and not only where that evidence is lacking. The introduction of impact assessments, with the requirement to collate evidence retrospectively, poses difficulties because evidence, measurements, and baselines have in many cases not been collected and may no longer be available. Looking forward we will be able to reduce this problem, but identifying, capturing, and storing the evidence in such a way that it can be used in the decades to come is a difficulty that we will need to tackle.

Collating the evidence and indicators of impact is a significant task that is being undertaken within universities and institutions globally. Decker et al. (2007) surveyed researchers at top US research institutions during 2005; the survey of more than 6,000 researchers found that, on average, more than 40% of their time was spent on administrative tasks. It is desirable that the assignment of administrative tasks to researchers is limited, and therefore, to assist the tracking and collating of impact data, systems are being developed through numerous projects and initiatives internationally, including STAR Metrics in the USA, the ERC (European Research Council) Research Information System, and Lattes in Brazil (Lane 2010; Mugabushaka and Papazoglou 2012).

Ideally, systems within universities internationally would be able to share data allowing direct comparisons, accurate storage of information developed in collaborations, and transfer of comparable data as researchers move between institutions. To achieve compatible systems, a shared language is required. CERIF (Common European Research Information Format) was developed for this purpose, first released in 1991; a number of projects and systems across Europe such as the ERC Research Information System ( Mugabushaka and Papazoglou 2012 ) are being developed as CERIF-compatible.
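To illustrate the idea of a shared research-information format, the sketch below serializes a simplified, hypothetical record to JSON. It is not the CERIF schema; the field names and identifiers are invented purely to show how agreeing on a common structure lets institutional systems exchange and compare records.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ResearchRecord:
    """Simplified, hypothetical exchange record (not the CERIF schema)."""
    record_id: str
    title: str
    year: int
    institution: str
    outputs: list = field(default_factory=list)          # e.g., DOIs
    related_impacts: list = field(default_factory=list)  # linked impact record IDs

record = ResearchRecord(
    record_id="REC-0001",
    title="Hypothetical study",
    year=2010,
    institution="Example University",
    outputs=["10.1234/example.doi"],
    related_impacts=["IMP-0042"],
)

# Serialising to a shared format is what allows different institutional
# systems to exchange records and transfer data as researchers move.
print(json.dumps(asdict(record), indent=2))
```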

In the UK, there have been several Jisc-funded projects in recent years to develop systems capable of storing research information, for example MICE (Measuring Impacts Under CERIF), the UK Research Information Shared Service, and the Integrated Research Input and Output System, all based on the CERIF standard. To allow comparisons between institutions, identifying a comprehensive taxonomy of impact, and of the evidence for it, that can be used universally is seen to be very valuable. However, the Achilles heel of any such attempt, as critics suggest, is the creation of a system that rewards what it can measure and codify, with the knock-on effect of directing research projects to deliver within the measures and categories that are rewarded.

Attempts have been made to categorize impact evidence and data; for example, the aim of the MICE Project was to develop a set of impact indicators to enable impact to be fed into a CERIF-based system. Indicators were identified from documents produced for the REF, by Research Councils UK, in unpublished draft case studies undertaken at King's College London, or outlined in relevant publications (MICE Project n.d.). A taxonomy of impact categories was then produced onto which impact could be mapped. What emerged on testing the MICE taxonomy (Cooke and Nadim 2011), by mapping impacts from case studies, was that detailed categorization of impact was too prescriptive. Every piece of research results in a unique tapestry of impact, and despite the MICE taxonomy having more than 100 indicators, these did not suffice. It is perhaps worth noting that the expert panels who assessed the pilot exercise for the REF commented that the evidence provided by research institutes to demonstrate impact was 'a unique collection'. Where quantitative data were available, for example audience numbers or book sales, these numbers rarely reflected the degree of impact, as no context or baseline was available. Cooke and Nadim (2011) also noted that using a linear-style taxonomy did not reflect the complex networks of impacts that are generally found. The Goldsmith report (Cooke and Nadim 2011) recommended making indicators 'value free', enabling the value or quality to be established in an impact descriptor that could be assessed by expert panels. The report concluded that general categories of evidence would be more useful, such that indicators could encompass dissemination and circulation, re-use and influence, collaboration and boundary work, and innovation and invention.

While defining the terminology used to describe impact and its indicators will enable comparable data to be stored and shared between organizations, we would recommend that any categorization of impacts be flexible, such that impacts arising from non-standard routes can still be placed. It is worth considering the degree to which indicators are tightly defined, and where appropriate providing broader definitions with greater flexibility.
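A hedged sketch of what flexible, 'value-free' indicators might look like in a system: the broad category names follow the general headings suggested in the Goldsmith report as quoted above, while the function, descriptors, and data are hypothetical.

```python
BROAD_CATEGORIES = {
    "dissemination_and_circulation",
    "reuse_and_influence",
    "collaboration_and_boundary_work",
    "innovation_and_invention",
}

def make_indicator(category, descriptor, evidence=None):
    """Record an indicator under a broad category, leaving the value or
    quality judgement to the free-text descriptor."""
    if category not in BROAD_CATEGORIES:
        # Keep the taxonomy flexible: unknown categories are allowed but flagged.
        category = "uncategorised"
    return {"category": category, "descriptor": descriptor, "evidence": evidence or []}

indicator = make_indicator(
    "reuse_and_influence",
    "Findings cited in a hypothetical 2013 national policy consultation",
    evidence=["policy_document.pdf"],
)
print(indicator)
```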

It is possible to incorporate both metrics and narratives within systems, for example, within the Research Outcomes System and Researchfish, currently used by several of the UK research councils to allow impacts to be recorded; although recording narratives has the advantage of allowing some context to be documented, it may make the evidence less flexible for use by different stakeholder groups (which include government, funding bodies, research assessment agencies, research providers, and user communities) for whom the purpose of analysis may vary ( Davies et al. 2005 ). Any tool for impact evaluation needs to be flexible, such that it enables access to impact data for a variety of purposes (Scoble et al. n.d.). Systems need to be able to capture links between and evidence of the full pathway from research to impact, including knowledge exchange, outputs, outcomes, and interim impacts, to allow the route to impact to be traced. This database of evidence needs to establish both where impact can be directly attributed to a piece of research as well as various contributions to impact made during the pathway.

Baselines and controls need to be captured alongside change to demonstrate the degree of impact. In many instances, controls are not feasible as we cannot look at what impact would have occurred if a piece of research had not taken place; however, indications of the picture before and after impact are valuable and worth collecting for impact that can be predicted.

It is now possible to use data-mining tools to extract specific data from narratives or unstructured data (Mugabushaka and Papazoglou 2012). This is being done for the collation of academic impact and outputs, for example by Research Portfolio Online Reporting Tools, which uses PubMed and text mining to cluster research projects, and by STAR Metrics in the USA, which uses administrative records and research outputs and is also being implemented by the ERC using data in the public domain (Mugabushaka and Papazoglou 2012). These techniques have the potential to transform data capture and impact assessment (Jones and Grant 2013). It is acknowledged in the article by Mugabushaka and Papazoglou (2012) that it will take years to fully incorporate the impacts of ERC funding. For systems to be able to capture a full range of impacts, definitions and categories of impact need to be determined that can be incorporated into system development. To adequately capture interactions taking place between researchers, institutions, and stakeholders, the introduction of tools to enable this would be very valuable. If knowledge exchange events could be captured, for example electronically as they occur, or automatically if flagged from an electronic calendar or a diary, then far more of these events could be recorded with relative ease. Capturing knowledge exchange events would greatly assist the linking of research with impact.
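Purely as a hypothetical sketch of the idea of flagging knowledge-exchange events from an electronic calendar, the code below scans a list of invented calendar entries and keeps those that are tagged as knowledge exchange or that involve an external attendee; the data structure and domain check are assumptions, not a description of any existing system.

```python
calendar = [
    {"title": "Project meeting", "attendees": ["pi@uni.example"], "tags": []},
    {"title": "Briefing for policy unit",
     "attendees": ["pi@uni.example", "analyst@gov.example"],
     "tags": ["knowledge-exchange"]},
    {"title": "Industry workshop",
     "attendees": ["pi@uni.example", "rd@company.example"], "tags": []},
]

def is_knowledge_exchange(event, internal_domain="uni.example"):
    """Flag events explicitly tagged, or with at least one external attendee."""
    if "knowledge-exchange" in event["tags"]:
        return True
    return any(not a.endswith("@" + internal_domain) for a in event["attendees"])

exchanges = [e["title"] for e in calendar if is_knowledge_exchange(e)]
print(exchanges)  # ['Briefing for policy unit', 'Industry workshop']
```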

The transition to routine capture of impact data requires not only the development of tools and systems to help with implementation but also a cultural change, so that practices currently undertaken by a few become standard behaviour among researchers and universities.

What indicators, evidence, and impacts need to be captured within developing systems?

There is a great deal of interest in collating terms for impact and indicators of impact. The Consortia Advancing Standards in Research Administration Information (CASRAI), for example, has put together a data dictionary with the aim of setting the standards for terminology used to describe impact and indicators, which can be incorporated into systems internationally, and seems to be building a certain momentum in this area. A variety of types of indicators can be captured within systems; however, it is important that these are universally understood. Here we address the types of evidence that need to be captured to enable an overview of impact to be developed. In the majority of cases, a number of types of evidence will be required to provide an overview of impact.

7.1 Metrics

Metrics have commonly been used as a measure of impact, for example, in terms of profit made, number of jobs provided, number of trained personnel recruited, number of visitors to an exhibition, number of items purchased, and so on. Metrics in themselves cannot convey the full impact; however, they are often viewed as powerful and unequivocal forms of evidence. If metrics are available as impact evidence, they should, where possible, also capture any baseline or control data. Any information on the context of the data will be valuable to understanding the degree to which impact has taken place.
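A minimal sketch of why baselines matter when reporting metrics: the same raw count reads very differently once expressed as change against a baseline. All figures below are invented for illustration.

```python
def relative_change(after, baseline):
    """Express a metric as change relative to its baseline."""
    if baseline == 0:
        return None  # no meaningful baseline available
    return (after - baseline) / baseline

# Hypothetical exhibition attendance before and after a research-informed redesign.
baseline_visitors = 12000
visitors = 18000
print(f"{relative_change(visitors, baseline_visitors):.0%} increase on baseline")
```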

SROI perhaps indicates the desire of some organizations to be able to demonstrate the monetary value of investment and impact. SROI aims to provide a valuation of the broader social, environmental, and economic impacts, providing a metric that can be used to demonstrate worth. This is a metric that has been used within the charitable sector (Berg and Månsson 2011) and also features as evidence in the REF guidance for panel D (REF2014 2012). More details on SROI can be found in 'A guide to Social Return on Investment' produced by The SROI Network (2012).
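A simplified, hypothetical sketch of the ratio at the heart of SROI, namely discounted social value created divided by the investment made. Real SROI analyses also involve outcome valuation, deadweight, and attribution adjustments that are not modelled here; the figures and discount rate are invented.

```python
def net_present_value(yearly_values, discount_rate):
    """Discount a list of yearly values (year 1 onwards) back to the present."""
    return sum(value / (1 + discount_rate) ** year
               for year, value in enumerate(yearly_values, start=1))

investment = 100_000                               # hypothetical project cost
social_value_per_year = [40_000, 45_000, 50_000]   # hypothetical valued outcomes

sroi_ratio = net_present_value(social_value_per_year, 0.035) / investment
print(f"SROI ratio: {sroi_ratio:.2f} : 1")
```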

Although metrics can provide evidence of quantitative changes or impacts from our research, they are unable to adequately provide evidence of the qualitative impacts that take place and hence are not suitable for all of the impact we will encounter. The main risks associated with the use of standardized metrics are that

The full impact will not be realized, as we focus on easily quantifiable indicators

We will focus attention towards generating results that enable boxes to be ticked rather than delivering real value for money and innovative research.

They risk being monetized or converted into a lowest common denominator in an attempt to compare the cost of a new theatre against that of a hospital.

7.2 Narratives

Narratives can be used to describe impact; the use of narratives enables a story to be told and the impact to be placed in context and can make good use of qualitative information. They are often written with a reader from a particular stakeholder group in mind and will present a view of impact from a particular perspective. The risk of relying on narratives to assess impact is that they often lack the evidence required to judge whether the research and impact are linked appropriately. Where narratives are used in conjunction with metrics, a complete picture of impact can be developed, again from a particular perspective but with the evidence available to corroborate the claims made. Table 1 summarizes some of the advantages and disadvantages of the case study approach.

Table 1. The advantages and disadvantages of the case study approach.

By allowing impact to be placed in context, we answer the ‘so what?’ question that can result from quantitative data analyses, but is there a risk that the full picture may not be presented to demonstrate impact in a positive light? Case studies are ideal for showcasing impact, but should they be used to critically evaluate impact?

7.3 Surveys and testimonies

One way in which change of opinion and user perceptions can be evidenced is by gathering of stakeholder and user testimonies or undertaking surveys. This might describe support for and development of research with end users, public engagement and evidence of knowledge exchange, or a demonstration of change in public opinion as a result of research. Collecting this type of evidence is time-consuming, and again, it can be difficult to gather the required evidence retrospectively when, for example, the appropriate user group might have dispersed.

The ability to record and log these types of data is important for enabling the path from research to impact to be established, and the development of systems that can capture this would be very valuable.

7.4 Citations (outside of academia) and documentation

Citations (outside of academia) and documentation can be used as evidence to demonstrate the use of research findings in developing new ideas and products, for example. This might include the citation of a piece of research in policy documents or reference to a piece of research within the media. A collation of several indicators of impact may be enough to convince an assessor that an impact has taken place. Even where we can evidence changes and benefits linked to our research, understanding the causal relationship may be difficult. Media coverage is a useful means of disseminating our research and ideas and may be considered alongside other evidence as contributing to, or as an indicator of, impact.

The fast-moving developments in the field of altmetrics (or alternative metrics) are providing a richer understanding of how research is being used, viewed, and moved. The transfer of information electronically can be traced and reviewed to provide data on where and to whom research findings are going.

The understanding of the term impact varies considerably and as such the objectives of an impact assessment need to be thoroughly understood before evidence is collated.

While aspects of impact can be adequately interpreted using metrics, narratives, and other evidence, the mixed-method case study approach is an excellent means of pulling all available information, data, and evidence together, allowing a comprehensive summary of the impact within context. While the case study is a useful way of showcasing impact, its limitations must be understood if we are to use it for evaluation purposes. The case study presents evidence from a particular perspective and may need to be adapted for use with different stakeholders. It is time-intensive to both assimilate and review case studies, and we therefore need to ensure that the resources required for this type of evaluation are justified by the knowledge gained. The ability to write a persuasive, well-evidenced case study may itself influence the assessment of impact. Over the past year, a number of new posts have been created within universities dedicated to writing impact case studies, and a number of companies now offer this as a contract service. A key concern here is that universities which can afford to employ either consultants or impact 'administrators' may generate the best case studies.

The development of tools and systems for assisting with impact evaluation would be very valuable. We suggest that developing systems that focus on recording impact information alone will not provide all that is required to link research to ensuing events and impacts; systems require the capacity to capture any interactions between researchers, the institution, and external stakeholders and to link these with research findings and outputs or interim impacts to provide a network of data. In designing systems and tools for collating data related to impact, it is important to consider who will populate the database and to ensure that the time and capability required for the capture of information are considered. Capturing data, interactions, and indicators as they emerge increases the chance of capturing all relevant information, and tools that enable researchers to capture much of this would be valuable. However, it must be remembered that, in the case of the UK REF, the only impact considered is that based on research that has taken place within the institution submitting the case study. It is therefore in an institution's interest to have a process by which all the necessary information is captured, to enable a story to be developed in the absence of a researcher who may have left the employment of the institution. Figure 2 demonstrates the information that systems will need to capture and link; a minimal sketch of how such linked records might be represented follows the figure.

Research findings including outputs (e.g., presentations and publications)

Communications and interactions with stakeholders and the wider public (emails, visits, workshops, media publicity, etc)

Feedback from stakeholders and communication summaries (e.g., testimonials and altmetrics)

Research developments (based on stakeholder input and discussions)

Outcomes (e.g., commercial and cultural, citations)

Impacts (changes, e.g., behavioural and economic)

Figure 2. Overview of the types of information that systems need to capture and link.
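As noted above, here is a minimal sketch of how the items in Figure 2 might be stored as linked records so that the route from research to impact can be traced. The node types follow the categories in the figure; the identifiers, link names, and traversal function are assumptions for illustration only.

```python
from collections import defaultdict

# Nodes keyed by identifier; types follow the categories listed in Figure 2.
nodes = {
    "OUT-1": {"type": "research_output", "label": "Journal article"},
    "INT-1": {"type": "interaction", "label": "Workshop with practitioners"},
    "OUTC-1": {"type": "outcome", "label": "Revised clinical guideline"},
    "IMP-1": {"type": "impact", "label": "Change in patient care"},
}

# Directed, typed links forming the pathway from research to impact.
links = [
    ("OUT-1", "informed", "INT-1"),
    ("INT-1", "led_to", "OUTC-1"),
    ("OUTC-1", "resulted_in", "IMP-1"),
]

def trace_pathway(start, links):
    """Follow links outwards from a node to reconstruct the route to impact."""
    adjacency = defaultdict(list)
    for source, relation, target in links:
        adjacency[source].append((relation, target))
    path, current = [start], start
    while adjacency[current]:
        relation, target = adjacency[current][0]  # simple chain for illustration
        path.extend([relation, target])
        current = target
    return path

print(" -> ".join(trace_pathway("OUT-1", links)))
```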

Attempting to evaluate impact to justify expenditure, showcase our work, and inform future funding decisions will only prove to be a valuable use of time and resources if we can take measures to ensure that assessment attempts will not ultimately have a negative influence on the impact of our research. There are areas of basic research where the impacts are so far removed from the research or are impractical to demonstrate; in these cases, it might be prudent to accept the limitations of impact assessment, and provide the potential for exclusion in appropriate circumstances.

This work was supported by Jisc [DIINN10].


Electrical Engineering and Systems Science > Image and Video Processing

Title: RMT-BVQA: Recurrent Memory Transformer-based Blind Video Quality Assessment for Enhanced Video Content

Abstract: With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artefacts and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimised through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
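
As a rough illustration of the evaluation pipeline sketched in the abstract, the code below maps per-video quality representations to a single video-level quality index with linear regression and measures rank correlation (SROCC) against subjective scores under five-fold cross-validation. The feature vectors and opinion scores are random stand-ins: the RMT-based encoder, the 13K-patch training database, and the VDPVE labels are not reproduced here, so this is an assumed workflow rather than the authors' implementation, and the correlation it prints is meaningless for the random data.

```python
# Sketch of the assumed evaluation workflow: quality representations -> linear
# regression -> video-level quality index -> rank correlation with subjective scores,
# under five-fold cross-validation. Features and scores below are random stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_videos, feat_dim = 120, 256
features = rng.normal(size=(n_videos, feat_dim))   # stand-in per-video quality representations
mos = rng.uniform(1.0, 5.0, size=n_videos)         # stand-in mean opinion scores (subjective labels)

srocc_per_fold = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
    reg = LinearRegression().fit(features[train_idx], mos[train_idx])
    predicted = reg.predict(features[test_idx])     # video-level quality indices for the held-out fold
    rho, _ = spearmanr(predicted, mos[test_idx])    # rank correlation with the subjective scores
    srocc_per_fold.append(rho)

print(f"Mean SROCC over 5 folds: {np.mean(srocc_per_fold):.3f}")
```

In the paper's setting the representations would come from the trained RMT-based encoder and the resulting correlations would be compared with those of the ten existing no-reference metrics; VQA studies commonly report Pearson correlation alongside Spearman, which is omitted here for brevity.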

