AI: Can You Help Address This Issue?

IF 2.7 | CAS Q4 (Education) | JCR Q1 (Education & Educational Research)
Deborah J. Harris
{"title":"AI: Can You Help Address This Issue?","authors":"Deborah J. Harris","doi":"10.1111/emip.12655","DOIUrl":null,"url":null,"abstract":"<p>Linking across test forms or pools of items is necessary to ensure scores that are reported across different administrations are comparable and lead to consistent decisions for examinees whose abilities are the same, but who were administered different items. Most of these linkages consist of equating test forms or scaling calibrated items or pools to be on the same theta scale. The typical methodology to accomplish this linking makes use of common examinees or common items, where common examinees are understood to be groups of examinees of comparable ability, whether obtained through a single group (where the same examinees are administered multiple assessments) or a random groups design, where random assignment or pseudo random assignment is done (such as spiraling the test forms, say 1, 2, 3, 4, 5, and distributing them such that every 5th examinee receives the same form). Common item methodology is usually implemented by having identical items in multiple forms and using those items to link across forms or pools. These common items may be scored or unscored in terms of whether they are treated as internal or external anchors (i.e., whether they are contributing to the examinee's score).</p><p>There are situations where it is not practical to have either common examinees nor common items. Typically, these are high-stakes settings, where the security of the assessment questions would likely be at risk if any were repeated. This would include scenarios where the entire assessment is released after administration to promote transparency. In some countries, a single form of a national test may be administered to all examinees during a single administration time. While in some cases a student who does not do as well as they had hoped may retest the following year, this may be a small sample and these students would not be considered representative of the entire body of test-takers. In addition, it is presumed they would have spent the intervening year studying for the exam, and so they could not really be considered common examinees across years and assessment forms.</p><p>Although the decisions (such as university admissions) based on the assessment scores are comparable within the year, because all examinees are administered the same set of items on the same date, it is difficult to monitor trends over time as there is no linkage between forms across years. Although the general populations may be similar (e.g., 2024 secondary school graduates versus 2023 secondary school graduates), there is no evidence that the groups are strictly equivalent across years. Similarly, comparing how examinees perform across years (e.g., highest scores, average raw score, and so on) is challenging as there is no adjustment for yearly fluctuations in form difficulty across years.</p><p>There have been variations of both common item and common examinee linking, such as using similar items, rather than identical items, including where perhaps these similar items are clones of each other, and using various types of matching techniques in an attempt to achieve common examinees by creating equivalent subgroups across forms. Cloning items or generating items from a template has had some success in terms of creating items of identical-ish difficulty. 
However, whether items that are clones of released items would still maintain their integrity and properties sufficiently to serve as linking items would need to be studied.</p><p>I, and many others, have been involved in several studies trying to accomplish linking for comparability where neither common items nor common examinees have been available. This short section provides a glimpse of some of that research.</p><p>Harris and Fang (<span>2015</span>) considered multiple options to address comparing assessment scores across years where there were no common items and no common examinees. Two of these options involved making an assumption, and the others involved making an adjustment. In the first instance, the assumption was made to treat the groups of examinees in different years as equivalent. This was solely an assumption, with no confirming evidence that the assumption was reasonable. Once the assumption was made, a random groups equating across years was conducted. The second method was to assume the test forms were built to be equivalent in difficulty, again with no evidence this was indeed the case. Because it was assumed the test forms were equivalent in difficulty, equating was not necessary (recall equating across test forms only adjusts for small differences in form difficulty; if there is no difference in difficulty, equating is unnecessary), and scores from the different forms in different years could be directly compared. The third option was to create subgroups of examinees to hopefully imitate equivalent groups of examinees being administered each of the forms, and then to conduct random groups equating. One of these subgroup options was created by using the middle 80% of the distributions of examinee scores for each year, and another used self-reported information the examinees provided, such as courses they had taken and course grades received, to create comparable subgroups across years.</p><p>Huh et al. (<span>2016, 2017</span>) expanded on Harris and Fang (<span>2015</span>), again examining alternative ways to utilize equating methodology in a context where items cannot be readministered, and the examinee groups being administered different test forms cannot be assumed to be equivalent groups. The authors referred to the methods they studied as “pseudo-equating,” as equating methodology was used, but the data assumptions associated with actual equating, such as truly randomly equivalent groups of examinees or common items, were absent. They included the two assumption-based methods Harris and Fang looked at, as well as the two ways of adjusting the examinee groups. Assuming the two forms were built to the same difficulty specifications and assuming the two samples of examinees were equivalent did not work as well as making an adjustment to try to form comparable subgroups. The two adjustments used in Harris and Fang were replicated: the middle 80% of each examinee distribution were used, the basis again being that perhaps the examinee groups would differ more in the tails than in the center, and matching group distributions based on additional information. When classical test theory equipercentile with post smoothing equating methodology was implemented, the score distribution of the subsequent year's group was adjusted based on weighting to match the initial group distribution based on variables thought to be important such as self-reported subject grades and extracurricular activities. 
When IRT true score equating methodology was used, comparable samples for the two groups were created for calibrations by matching the proportions within each stratum as defined by the related variables. In general, the attempts to match the groups performed better than simply making assumptions about the groups or form difficulty equivalences. Wu et al. (<span>2017</span>) also looked at trying to create matched sample distributions across two dispirit examinee groups, with similar results.</p><p>Kim and Walker (<span>2021</span>) investigated creating groups of similar ability using subgroup weighting, augmented by a small number of common items. Propensity score weighting, including variants such as coarsened exact matching, has also been used in an attempt to create equivalent groups of examinees (see, for example, Cho et al., <span>2024</span>; Kapoor et al., <span>2024</span>; Li et al., <span>2024</span>; Woods et al., <span>2024</span> who all looked at creating equivalent samples of test takers in the context of mode studies, where one group of examinees tested on a device and the other group tested on paper).</p><p>Propensity scores are “the conditional probability of assignment to a particular treatment given a vector of observed covariates” (Rosenbaum &amp; Rubin, <span>1983</span>, p. 41). In our scenario that translates to an examinee testing on one assessment form instead of the other. One key step in implementing propensity score matching is identifying the covariates to include. In our scenario that would involve those variables which would be available for both populations and be appropriately related to the variable of interest, and to end up with equivalent samples of examinees being administered the different test forms, to allow us to appropriately conduct a random groups equating across the two forms. For example, number of math classes taken, names of the specific math classes taken, grades in the individual math classes, overall grade point average for math classes, and so on are all possible covariates that could be included. What data is collected from examinees as well as at what level of granularity are decisions that need to be made. Whether the data on variables is self-reported or provided from a trusted source (e.g., self-reported course grades versus data from a transcript) also are considerations. How many covariates to include, whether exact matching is required, how to deal with missing data, deciding what matching algorithm to use, and so on are further decisions that need to be made.</p><p>Cloning items, generating items from a template, or other ways of finding “matching items” from a subsequent form to substitute as common items with the form one wants to link to has also been studied in a variety of settings. In our scenario the issue is whether the characteristics of the items that impact the assumptions of the common item equating method being used “match” the item it is being substituted for. Item content and item position should be fairly easy to assess. However, item statistics such as IRT parameters and classical difficulty and discrimination would be computed using the responses by the test takers administered the subsequent form, and therefore not directly comparable. That is, trying to match an item with a particular p-value and point-biserial from a form administered to one group using an item administered on a different form to a different group brings us full circle. 
The item statistics would be comparable across the two forms if they were computed on equivalent groups of test-takers (in which case we would not need common items, we could rely on common examinees to link). If cloned items or auto-generated items were shown to have comparable statistics, they could be considered common items when placed in different forms and administered to different groups, at least in theory.</p><p>Common items and common examinees are the two vehicles used in obtaining comparable scores across different forms of an assessment. Common items can be mimicked by having items of equivalent characteristics or by having unique items that have their item parameter estimates on a common theta scale. Research on using artificial intelligence to assist in estimating item parameters already exists, and one assumes is continuing to expand (Hao, et al., <span>2024</span>). Some of these initiatives have been around augmenting small sample sizes to reduce the sampling requirements for pretesting and calibrating new items as they are introduced. Examples include McCarthy et al. (<span>2021</span>) who used a “multi-task generalized linear model with BERT features” to provide initial estimates that are then refined as empirical data are collected; their methodology can also be used to calculate “new item difficulty estimates without piloting them first” (p. 883). Belov et al. (<span>2024</span>) used item pool information and trained a neural network to interpolate the relationship between item parameter estimates and response patterns. Zhang and Chen (<span>2024</span>) presented a “cross estimation network” consisting of a person network and an item network, finding that their approach produced accurate parameter estimates for items or persons, providing the other parameters were given (known). In a different approach, Maeda (<span>2024</span>) used “examinees” generated with AI to calibrate items; he determined the methodology had promise, but was not as accurate as having actual human responses to the items. AI has also been applied in the attempt to form equivalent subgroups in various settings. Monlezun et al. (<span>2022</span>) integrated AI and propensity scoring by using a propensity score adjustment, augmented by machine learning. Collier et al. (<span>2022</span>) demonstrated the application of artificial neural networks in a multilevel setting in propensity score estimation, stating “AI can be helpful in propensity score estimation because it can identify underlying patterns between treatments and confounding variables using machine learning without being explicitly programmed. Many classification algorithms in machine learning can outperform the classical methods for propensity score estimation, mainly when processing the data with many covariates…” (p. 3). The authors note that not all neural networks are created equal and that using different training methods and hyperparameters on the same data can yield different results. They also mention several computer packages and programming languages that are available to implement neural networks for propensity score estimation, including Python, R, and SAS.</p><p>What I would like to suggest is a concerted effort to use AI to assist linking test forms with no common examinees and no common items to allow comparing scores, trends, form difficulty, examinee ability, and so on over time in these situations. 
There are a variety of ways this could proceed, and obviously there would need to be multiple settings studied if there were to be any generalizations about what may and may not work in practice. However, I am going to focus on one scenario. I would like an assessment to be identified fulfilling these conditions: there are many items, the items have been or can be publicly released, and a large sample of examinees have been administered the assessment and their scored item responses and some relevant demographic/auxiliary information is available for the majority of examinees. These later variables might be related to previous achievement, such as earlier test scores or course grades, or demographics such as zip code and age. This assessment data would be used to assess how accurately an AI linkage might be, where we have a real, if contrived, result to serve as the “true” linkage.</p><p>The original assessment is then divided into two sub forms that should be similar in terms of content specifications and difficulty but should differ to the degree alternate forms comprised of items that cannot be pretested or tried out in advance typically are. For simplicity in this paper, I am going to assume odd numbered questions comprise the Odd form and even numbered questions make up the Even form. The examinee group is also divided such that the two subgroups show what would be considered a reasonable difference in composition in terms of ability, demographics, sample size, and so on, when two populations are tested in different years (e.g., 2024 secondary school graduates versus 2023 secondary school graduates).</p><p>The task for AI would be to adjust scores on the Odd form to be comparable to scores on the Even form. This could be done by estimating item characteristics for the Odd form items on the Even form scale and conducting an IRT equating to obtain comparable scores. Or creating equivalent samples from the subgroups administered the separate forms and running a random groups equating, or some other adjustment on items, test takers, or a combination of both. Because all examinees actually were “administered” both the Odd and Even forms, and because all items were administered together and could be calibrated together, there are multiple ways a criterion to evaluate the AI solutions could be created. (Personally, I would compare the AI results to many of them, as any of them could be considered reasonable operationally and if AI results were somewhere in the mix of these other results, it would seem a more reasonable evaluation than requiring the AI results to match any single criterion.)</p><p>AI would be trained using secure items and secure data, and secure equating results to learn the features of items and response patterns that correspond to different item parameter estimates, what different covariates look like in random subgroups of examinees, and so on in this particular context. That is, the training could occur on secure items and data that would not be released. AI could then incorporate item characteristics, examinee responses, examinee demographic variables in arriving at one (or multiple) adjustments to put the scores from the Odd and Even forms on the same scale. AI would likely be able to uncover patterns researchers have not been able to because of the way machine learning works. And there could be multiple ways to divide the original assessment and original sample into subgroups. 
If the items and the data were able to be made publicly available, I think this exercise has the potential to move us forward in trying to address this, and other, linking issues, as one could observe what characteristics of the items, as well as the response data and demographics, turned out to be important in this particular context. Plus, it would just be really cool to see how well AI might be able to address the issues of comparable scores without common items or common examinees.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"9-12"},"PeriodicalIF":2.7000,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12655","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Measurement-Issues and Practice","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/emip.12655","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}

Abstract

Linking across test forms or pools of items is necessary to ensure that scores reported across different administrations are comparable and lead to consistent decisions for examinees whose abilities are the same but who were administered different items. Most of these linkages consist of equating test forms or scaling calibrated items or pools to be on the same theta scale. The typical methodology to accomplish this linking makes use of common examinees or common items, where common examinees are understood to be groups of examinees of comparable ability, whether obtained through a single group design (where the same examinees are administered multiple assessments) or a random groups design, where random or pseudo-random assignment is done (such as spiraling the test forms, say 1, 2, 3, 4, 5, and distributing them such that every 5th examinee receives the same form). Common item methodology is usually implemented by having identical items in multiple forms and using those items to link across forms or pools. These common items may be scored or unscored, that is, treated as internal or external anchors (contributing or not contributing to the examinee's score).
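
To make the random groups design concrete, here is a minimal sketch (my illustration, not from any operational program) of spiraling five forms so that every 5th examinee receives the same form; the form labels and counts are arbitrary.

```python
# Minimal sketch of a random groups spiraling design. Forms 1-5 are handed out
# in sequence, so every 5th examinee receives the same form; the resulting five
# groups are treated as randomly equivalent, which is what licenses a random
# groups equating. All numbers here are illustrative.
forms = [1, 2, 3, 4, 5]
n_examinees = 23

# Examinee i (0-indexed) receives form (i mod 5) + 1.
assignment = [forms[i % len(forms)] for i in range(n_examinees)]
print(assignment)  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...]
```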

There are situations where it is not practical to have either common examinees or common items. Typically, these are high-stakes settings, where the security of the assessment questions would likely be at risk if any were repeated. This would include scenarios where the entire assessment is released after administration to promote transparency. In some countries, a single form of a national test may be administered to all examinees during a single administration time. While in some cases a student who does not do as well as they had hoped may retest the following year, this may be a small sample, and these students would not be considered representative of the entire body of test-takers. In addition, it is presumed they would have spent the intervening year studying for the exam, and so they could not really be considered common examinees across years and assessment forms.

Because all examinees are administered the same set of items on the same date, the decisions based on the assessment scores (such as university admissions) are comparable within a year. However, it is difficult to monitor trends over time, as there is no linkage between forms across years. Although the general populations may be similar (e.g., 2024 secondary school graduates versus 2023 secondary school graduates), there is no evidence that the groups are strictly equivalent across years. Similarly, comparing how examinees perform across years (e.g., highest scores, average raw score, and so on) is challenging, as there is no adjustment for fluctuations in form difficulty across years.

There have been variations on both common item and common examinee linking, such as using similar items rather than identical items (including, perhaps, items that are clones of one another), and using various types of matching techniques in an attempt to achieve common examinees by creating equivalent subgroups across forms. Cloning items or generating items from a template has had some success in creating items of nearly identical difficulty. However, whether items that are clones of released items would still maintain their integrity and properties sufficiently to serve as linking items would need to be studied.

I, and many others, have been involved in several studies trying to accomplish linking for comparability where neither common items nor common examinees have been available. This short section provides a glimpse of some of that research.

Harris and Fang (2015) considered multiple options to address comparing assessment scores across years where there were no common items and no common examinees. Two of these options involved making an assumption, and the others involved making an adjustment. In the first instance, the assumption was made to treat the groups of examinees in different years as equivalent. This was solely an assumption, with no confirming evidence that the assumption was reasonable. Once the assumption was made, a random groups equating across years was conducted. The second method was to assume the test forms were built to be equivalent in difficulty, again with no evidence this was indeed the case. Because it was assumed the test forms were equivalent in difficulty, equating was not necessary (recall equating across test forms only adjusts for small differences in form difficulty; if there is no difference in difficulty, equating is unnecessary), and scores from the different forms in different years could be directly compared. The third option was to create subgroups of examinees to hopefully imitate equivalent groups of examinees being administered each of the forms, and then to conduct random groups equating. One of these subgroup options was created by using the middle 80% of the distributions of examinee scores for each year, and another used self-reported information the examinees provided, such as courses they had taken and course grades received, to create comparable subgroups across years.
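
As a rough illustration of two of these options (not the authors' code, and with simulated rather than real scores), the sketch below trims each year's score distribution to its middle 80% to mimic equivalent groups and then applies a simple unsmoothed equipercentile conversion; all names and data-generating choices are assumptions made for the example.

```python
# Minimal sketch: middle-80% trimming followed by a basic random groups
# equipercentile conversion. Scores are simulated; nothing here is operational.
import numpy as np

rng = np.random.default_rng(1)
scores_year1 = rng.binomial(n=60, p=0.55, size=5000)   # Form A, year 1
scores_year2 = rng.binomial(n=60, p=0.60, size=5000)   # Form B, year 2

def middle_80(scores):
    lo, hi = np.percentile(scores, [10, 90])
    return scores[(scores >= lo) & (scores <= hi)]

trimmed1 = middle_80(scores_year1)
trimmed2 = middle_80(scores_year2)

def equipercentile(x_scores, y_scores, max_score=60):
    """Map each year-1 score to the year-2 score with the same percentile rank."""
    xs = np.arange(max_score + 1)
    pr = np.array([np.mean(x_scores <= x) for x in xs])   # percentile ranks, year 1
    return np.quantile(y_scores, pr)                       # invert year-2 distribution

conversion = equipercentile(trimmed1, trimmed2)
print(np.round(conversion[:10], 2))   # year-2 equivalents of year-1 scores 0..9
```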

Huh et al. (2016, 2017) expanded on Harris and Fang (2015), again examining alternative ways to utilize equating methodology in a context where items cannot be readministered and the examinee groups administered different test forms cannot be assumed to be equivalent. The authors referred to the methods they studied as “pseudo-equating,” as equating methodology was used, but the data assumptions associated with actual equating, such as truly randomly equivalent groups of examinees or common items, were absent. They included the two assumption-based methods Harris and Fang looked at, as well as the two ways of adjusting the examinee groups. Assuming the two forms were built to the same difficulty specifications and assuming the two samples of examinees were equivalent did not work as well as making an adjustment to try to form comparable subgroups. The two adjustments used in Harris and Fang were replicated: using the middle 80% of each examinee distribution, the rationale again being that the examinee groups would perhaps differ more in the tails than in the center, and matching group distributions based on additional information. When classical equipercentile equating with postsmoothing was implemented, the score distribution of the subsequent year's group was adjusted by weighting to match the initial group's distribution on variables thought to be important, such as self-reported subject grades and extracurricular activities. When IRT true score equating was used, comparable samples for the two groups were created for calibration by matching the proportions within each stratum as defined by the related variables. In general, the attempts to match the groups performed better than simply making assumptions about group or form difficulty equivalence. Wu et al. (2017) also looked at trying to create matched sample distributions across two disparate examinee groups, with similar results.
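
The weighting idea can be illustrated with a small sketch, again on simulated data and with hypothetical variable names: the subsequent year's examinees are reweighted so that the distribution of a stratifying variable (self-reported grade band here) matches the initial year's, before any equating step is applied.

```python
# Minimal sketch of stratum weighting so the year-2 group matches the year-1
# group's composition on a single stratifying variable. Data are simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
year1 = pd.DataFrame({
    "grade_band": rng.choice(["A", "B", "C"], size=4000, p=[0.3, 0.5, 0.2]),
    "score": rng.binomial(60, 0.55, size=4000),
})
year2 = pd.DataFrame({
    "grade_band": rng.choice(["A", "B", "C"], size=4000, p=[0.4, 0.4, 0.2]),
    "score": rng.binomial(60, 0.60, size=4000),
})

# Weight each year-2 stratum by the ratio of year-1 to year-2 stratum proportions.
p1 = year1["grade_band"].value_counts(normalize=True)
p2 = year2["grade_band"].value_counts(normalize=True)
year2["weight"] = year2["grade_band"].map(p1 / p2)

# The weighted year-2 score distribution now reflects a group whose grade-band
# composition matches year 1; the weighted frequencies would feed the
# equipercentile (or IRT calibration) step in place of raw counts.
weighted_mean = np.average(year2["score"], weights=year2["weight"])
print(round(float(weighted_mean), 2), round(float(year2["score"].mean()), 2))
```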

Kim and Walker (2021) investigated creating groups of similar ability using subgroup weighting, augmented by a small number of common items. Propensity score weighting, including variants such as coarsened exact matching, has also been used in an attempt to create equivalent groups of examinees (see, for example, Cho et al., 2024; Kapoor et al., 2024; Li et al., 2024; Woods et al., 2024, all of whom looked at creating equivalent samples of test takers in the context of mode studies, where one group of examinees tested on a device and the other group tested on paper).

Propensity scores are “the conditional probability of assignment to a particular treatment given a vector of observed covariates” (Rosenbaum & Rubin, 1983, p. 41). In our scenario, the “treatment” translates to an examinee testing on one assessment form rather than the other. One key step in implementing propensity score matching is identifying the covariates to include. In our scenario, that would involve variables that are available for both populations and are appropriately related to the variable of interest, with the goal of ending up with equivalent samples of examinees administered the different test forms, which would allow us to appropriately conduct a random groups equating across the two forms. For example, the number of math classes taken, the names of the specific math classes taken, grades in the individual math classes, overall grade point average for math classes, and so on are all possible covariates that could be included. What data are collected from examinees, and at what level of granularity, are decisions that need to be made. Whether the data on variables are self-reported or provided from a trusted source (e.g., self-reported course grades versus data from a transcript) is also a consideration. How many covariates to include, whether exact matching is required, how to deal with missing data, what matching algorithm to use, and so on are further decisions that need to be made.
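
A minimal sketch of propensity score weighting in this scenario, using simulated data and hypothetical covariates (number of math classes and math GPA), might look like the following: a logistic regression estimates each examinee's probability of having taken Form B rather than Form A, and inverse-probability weights are then used to create covariate-balanced pseudo-groups on which a random groups equating could be run.

```python
# Minimal sketch of propensity score weighting with simulated covariates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 6000
X = np.column_stack([
    rng.integers(1, 6, size=n),        # hypothetical: number of math classes taken
    rng.normal(3.0, 0.5, size=n),      # hypothetical: math GPA
])
# Form assignment depends (weakly) on the covariates, so the raw groups differ.
form_b = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * X[:, 0] + 0.5 * X[:, 1] - 2.5))))

# Propensity score: estimated probability of taking Form B given the covariates.
ps = LogisticRegression().fit(X, form_b).predict_proba(X)[:, 1]

# Inverse-probability weights: Form B examinees get 1/ps, Form A examinees get
# 1/(1 - ps), yielding covariate-balanced pseudo-groups for a random groups equating.
weights = np.where(form_b == 1, 1 / ps, 1 / (1 - ps))
for j, name in enumerate(["math classes", "math GPA"]):
    diff = (np.average(X[:, j], weights=weights * (form_b == 1))
            - np.average(X[:, j], weights=weights * (form_b == 0)))
    print(f"weighted mean difference, {name}: {diff:.3f}")
```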

Cloning items, generating items from a template, or other ways of finding “matching items” from a subsequent form to substitute as common items with the form one wants to link to have also been studied in a variety of settings. In our scenario, the issue is whether a substitute item “matches” the item it replaces on the characteristics that affect the assumptions of the common item equating method being used. Item content and item position should be fairly easy to assess. However, item statistics such as IRT parameters and classical difficulty and discrimination would be computed using the responses of the test takers administered the subsequent form, and would therefore not be directly comparable. That is, trying to match an item with a particular p-value and point-biserial from a form administered to one group using an item administered on a different form to a different group brings us full circle. The item statistics would be comparable across the two forms only if they were computed on equivalent groups of test-takers (in which case we would not need common items; we could rely on common examinees to link). If cloned items or auto-generated items were shown to have comparable statistics, they could be considered common items when placed in different forms and administered to different groups, at least in theory.
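
For concreteness, the classical statistics mentioned here can be computed from a scored response matrix as in the sketch below (simulated data; rest-score corrections and other refinements omitted). The caveat in the paragraph above still applies: these values depend on the group that took the form, so they are not directly comparable across non-equivalent groups.

```python
# Minimal sketch: classical difficulty (p-value) and item-total point-biserial
# from a 0/1 response matrix. The response matrix is simulated.
import numpy as np

rng = np.random.default_rng(4)
responses = rng.binomial(1, 0.6, size=(2000, 40))   # examinees x items

p_values = responses.mean(axis=0)                   # classical difficulty
total = responses.sum(axis=1)                       # total scores

def point_biserial(item, total_scores):
    # Correlation of the 0/1 item score with the total score
    # (rest-score correction omitted in this sketch).
    return np.corrcoef(item, total_scores)[0, 1]

discrimination = np.array(
    [point_biserial(responses[:, j], total) for j in range(responses.shape[1])]
)
print(np.round(p_values[:5], 3), np.round(discrimination[:5], 3))
```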

Common items and common examinees are the two vehicles used in obtaining comparable scores across different forms of an assessment. Common items can be mimicked by having items of equivalent characteristics or by having unique items whose item parameter estimates are on a common theta scale. Research on using artificial intelligence to assist in estimating item parameters already exists, and one assumes it is continuing to expand (Hao et al., 2024). Some of these initiatives have focused on augmenting small sample sizes to reduce the sampling requirements for pretesting and calibrating new items as they are introduced. Examples include McCarthy et al. (2021), who used a “multi-task generalized linear model with BERT features” to provide initial estimates that are then refined as empirical data are collected; their methodology can also be used to calculate “new item difficulty estimates without piloting them first” (p. 883). Belov et al. (2024) used item pool information and trained a neural network to interpolate the relationship between item parameter estimates and response patterns. Zhang and Chen (2024) presented a “cross estimation network” consisting of a person network and an item network, finding that their approach produced accurate parameter estimates for items or persons, provided the other parameters were given (known). In a different approach, Maeda (2024) used “examinees” generated with AI to calibrate items; he determined the methodology had promise, but was not as accurate as having actual human responses to the items. AI has also been applied in the attempt to form equivalent subgroups in various settings. Monlezun et al. (2022) integrated AI and propensity scoring by using a propensity score adjustment augmented by machine learning. Collier et al. (2022) demonstrated the application of artificial neural networks to propensity score estimation in a multilevel setting, stating “AI can be helpful in propensity score estimation because it can identify underlying patterns between treatments and confounding variables using machine learning without being explicitly programmed. Many classification algorithms in machine learning can outperform the classical methods for propensity score estimation, mainly when processing the data with many covariates…” (p. 3). The authors note that not all neural networks are created equal and that using different training methods and hyperparameters on the same data can yield different results. They also mention several computer packages and programming languages that are available to implement neural networks for propensity score estimation, including Python, R, and SAS.
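
None of the cited models is reproduced here, but the general flavor of predicting item parameters from item features can be sketched as follows: a regressor is trained to map item features (a stand-in for BERT-style or content features) to empirical difficulty estimates, and is then used to produce initial difficulty estimates for items with no response data. Everything in the sketch is simulated and illustrative.

```python
# Minimal sketch: predict item difficulty from item features for unpiloted items.
# Features, difficulties, and the model choice are all stand-ins.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n_items, n_features = 500, 20
item_features = rng.normal(size=(n_items, n_features))     # stand-in for text-derived features
true_b = item_features @ rng.normal(size=n_features) * 0.2 + rng.normal(0, 0.3, n_items)

X_train, X_new, b_train, b_new = train_test_split(
    item_features, true_b, test_size=0.2, random_state=0
)
model = GradientBoostingRegressor().fit(X_train, b_train)
pred_b = model.predict(X_new)   # initial difficulty estimates for "unpiloted" items
print(round(float(np.corrcoef(pred_b, b_new)[0, 1]), 3))
```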

What I would like to suggest is a concerted effort to use AI to assist in linking test forms with no common examinees and no common items, to allow comparing scores, trends, form difficulty, examinee ability, and so on over time in these situations. There are a variety of ways this could proceed, and obviously multiple settings would need to be studied before any generalizations could be made about what may and may not work in practice. However, I am going to focus on one scenario. I would like an assessment to be identified fulfilling these conditions: there are many items, the items have been or can be publicly released, a large sample of examinees has been administered the assessment, and scored item responses and some relevant demographic/auxiliary information are available for the majority of examinees. These latter variables might be related to previous achievement, such as earlier test scores or course grades, or to demographics such as zip code and age. This assessment data would be used to assess how accurate an AI linkage might be, because we would have a real, if contrived, result to serve as the “true” linkage.

The original assessment is then divided into two subforms that should be similar in terms of content specifications and difficulty, but should differ to the degree that alternate forms built from items that cannot be pretested or tried out in advance typically do. For simplicity in this paper, I am going to assume odd-numbered questions comprise the Odd form and even-numbered questions make up the Even form. The examinee group is also divided such that the two subgroups show what would be considered a reasonable difference in composition in terms of ability, demographics, sample size, and so on, when two populations are tested in different years (e.g., 2024 secondary school graduates versus 2023 secondary school graduates).
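
Under stated assumptions (a simulated 60-item assessment following a simple Rasch-like model), the construction of the Odd and Even forms and of two deliberately non-equivalent examinee subgroups might be sketched like this:

```python
# Minimal sketch of the proposed study design on simulated data: split items
# into Odd/Even subforms and split examinees into two non-equivalent subgroups.
import numpy as np

rng = np.random.default_rng(6)
n_examinees, n_items = 10000, 60
ability = rng.normal(size=n_examinees)
difficulty = rng.normal(size=n_items)
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = rng.binomial(1, prob)            # full assessment, everyone answers everything

odd_form = responses[:, 0::2]                # items 1, 3, 5, ...
even_form = responses[:, 1::2]               # items 2, 4, 6, ...

# Non-equivalent subgroups: membership in group 2 becomes more likely as ability
# increases, mimicking a modest year-to-year shift in population composition.
p_group2 = 1 / (1 + np.exp(-0.5 * ability))
in_group2 = rng.binomial(1, p_group2).astype(bool)

odd_scores_g1 = odd_form[~in_group2].sum(axis=1)    # group 1 "takes" the Odd form
even_scores_g2 = even_form[in_group2].sum(axis=1)   # group 2 "takes" the Even form
print(round(float(odd_scores_g1.mean()), 2), round(float(even_scores_g2.mean()), 2))
```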

The task for AI would be to adjust scores on the Odd form to be comparable to scores on the Even form. This could be done by estimating item characteristics for the Odd form items on the Even form scale and conducting an IRT equating to obtain comparable scores, by creating equivalent samples from the subgroups administered the separate forms and running a random groups equating, or by some other adjustment to items, test takers, or a combination of both. Because all examinees actually were “administered” both the Odd and Even forms, and because all items were administered together and could be calibrated together, there are multiple ways a criterion to evaluate the AI solutions could be created. (Personally, I would compare the AI results to many of them, as any of them could be considered operationally reasonable, and if the AI results were somewhere in the mix of these other results, that would seem a more reasonable evaluation than requiring the AI results to match any single criterion.)
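
One such criterion can be sketched on simulated data: because every simulated examinee has both an Odd and an Even score, a single group equipercentile conversion on the full sample can serve as one “true” linking, and any candidate conversion (the output of an AI-assisted linking, say) can be compared to it score point by score point. The identity conversion below is purely a placeholder for such a candidate.

```python
# Minimal sketch of building a single group criterion and scoring a candidate
# conversion against it. All data are simulated; the candidate is a placeholder.
import numpy as np

rng = np.random.default_rng(6)
n_examinees, n_items = 10000, 60
ability = rng.normal(size=n_examinees)
difficulty = rng.normal(size=n_items)
responses = rng.binomial(1, 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :]))))
odd_total = responses[:, 0::2].sum(axis=1)    # every examinee's Odd form score
even_total = responses[:, 1::2].sum(axis=1)   # every examinee's Even form score

def equipercentile_conversion(x_scores, y_scores, max_score):
    xs = np.arange(max_score + 1)
    pr = np.array([np.mean(x_scores <= x) for x in xs])
    return np.quantile(y_scores, pr)

# Single group criterion: Odd-to-Even conversion using the full sample.
criterion = equipercentile_conversion(odd_total, even_total, max_score=30)

# A conversion produced by some AI-assisted linking would replace this
# placeholder; the identity conversion stands in purely for illustration.
candidate = np.arange(31, dtype=float)
rmse = np.sqrt(np.mean((candidate - criterion) ** 2))
print(round(float(rmse), 3))
```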

AI would be trained using secure items, secure data, and secure equating results to learn the features of items and response patterns that correspond to different item parameter estimates, what different covariates look like in random subgroups of examinees, and so on in this particular context. That is, the training could occur on secure items and data that would not be released. AI could then incorporate item characteristics, examinee responses, and examinee demographic variables in arriving at one (or multiple) adjustments to put the scores from the Odd and Even forms on the same scale. AI would likely be able to uncover patterns researchers have not been able to, because of the way machine learning works. And there could be multiple ways to divide the original assessment and original sample into subgroups. If the items and the data could be made publicly available, I think this exercise has the potential to move us forward in trying to address this, and other, linking issues, as one could observe which characteristics of the items, as well as of the response data and demographics, turned out to be important in this particular context. Plus, it would just be really cool to see how well AI might be able to address the issue of comparable scores without common items or common examinees.
