Intra- and inter-rater reliability of a manual codification system for footwear impressions: first lessons learned from the development of a footwear database for forensic intelligence purposes
Vincent Mousseau, Maralee Tapps, Romain Volery, Jean Brazeau
Canadian Society of Forensic Science Journal, published 2023-11-06. DOI: 10.1080/00085030.2023.2278911 (https://doi.org/10.1080/00085030.2023.2278911)
Citations: 0
Abstract
To generate forensic intelligence from footwear impressions and link crime scenes, most law enforcement agencies and forensic laboratories rely on a manual codification system based on pattern recognition and classification by human analysts. Although such systems are commonly used in practice, little is known to date about their reliability. Taking advantage of the development of a footwear database for forensic intelligence purposes at the Laboratoire de sciences judiciaires et de médecine légale in Quebec (Canada), this study makes a preliminary assessment of the intra- and inter-rater reliability (i.e., the level of repeatability over time and the level of consensus between analysts) of the proposed codification system. To do so, three forensic intelligence analysts classified a set of 27 crime scene impressions and test impressions at two different times (two weeks apart). Percent agreement, Cohen’s Kappa, and Light’s Kappa were then calculated. Results show that two out of three analysts reached an almost perfect level of intra-rater agreement, while the third reached a substantial level of intra-rater agreement, and that all analysts reached a substantial level of inter-rater agreement. Findings suggest that, although a few patterns may have lower levels of agreement, the developed codification system overall presents a satisfactory level of reliability. This preliminary study thus suggests that, contrary to what advocates of fully automated systems may sometimes imply, manual codification of footwear impressions may be fairly appropriate for intelligence purposes. It calls for further evaluative research in the field.

Résumé (translated from French)
To generate forensic intelligence from footwear marks and thereby place a single crime scene within a crime series, most police forces rely on a manual codification system based on the recognition and classification of certain shapes or patterns by trained analysts. Although these systems are commonly used in daily practice, few studies have so far attempted to determine their reliability. Taking advantage of the development of the footwear mark and impression profiling service for forensic intelligence purposes at the Laboratoire de sciences judiciaires et de médecine légale in Quebec (Canada), this study seeks to carry out a preliminary assessment of the intra- and inter-rater reliability (i.e., the level of repeatability over time and the level of consensus between analysts) of the manual codification system developed. To do so, three members of the Laboratoire’s forensic intelligence service codified the same set of 27 footwear marks and impressions twice, two weeks apart. Percent agreement, Cohen’s Kappa, and Light’s Kappa were then calculated from the data collected. The results show that two out of three analysts reached a level of intra-rater agreement (repeatability) considered almost perfect, and that all reached a level of inter-rater agreement (consensus) considered substantial. The results also suggest that, although a few patterns and shapes show lower levels of agreement, the codification system developed shows a satisfactory level of reliability overall. This preliminary study therefore suggests that, despite the growing attention devoted to automated codification systems, manual codification of footwear marks and impressions remains a method that can be appropriate for generating forensic intelligence, and it calls for further evaluative research in this field.

Keywords: Forensic intelligence; reliability; repeatability; reproducibility; data acquisition; footwear impression
Mots-clés (translated from French): Forensic intelligence; reliability; repeatability; reproducibility; footwear marks; database

Acknowledgements
The authors would like to thank Caroline Mireault from the Chemistry Department of the Laboratoire de sciences judiciaires et de médecine légale for her considerable support in initiating and continuing the implementation of this project in its early years.

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes
1. Our translation.
2. Austin Hicklin et al. [1, p. 15] wrote: “even if two examiners observe the same features in correspondence/non-correspondence, they may assign different strengths to these observations based upon factors such as their training and experience,” highlighting the discrepancies that may exist between codifications and comparison conclusions and, consequently, the need to study both independently.
3. The authors acknowledge that a sample of three participants is small, but the three analysts were the only practitioners in the Forensic Intelligence Service at the time of the study who were responsible for footwear impression analysis; the other members of the Service specialized in toxicology and drug intelligence.
4. Although they do not perform traditional forensic footwear examination and comparison, they all have the minimum education proposed by the Scientific Working Group for Shoeprint and Tire Tread Evidence (SWGTREAD) to do so [40].
5. Cohen’s Kappa and Light’s Kappa cannot be calculated for constant patterns (i.e., “0” for all codified entries). Some minimal variation is needed to observe when raters do and do not recognize a pattern, and to evaluate intra-rater reliability for as many patterns of the codification system as possible. This explains why more metrics were computed for the inter-rater reliability test (with 27 traces and prints) than for the intra-rater reliability test (with 20 traces and prints).
6. Those seven photographs were selected according to their initial classification in routine work.
7. The alphanumeric code for each pattern (see Results and Appendices) therefore reads as follows: the letters and numbers before the parentheses correspond to the pattern, while the letter in parentheses (P, A or T) corresponds to the section of the outsole where the pattern is observed.

Additional information
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
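
The three agreement statistics named in the abstract, and the constant-pattern limitation described in note 5, can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the binary 0/1 coding of a single pattern, and the example data are all assumptions made for illustration. Light's Kappa is taken here in its usual sense, as the mean of Cohen's Kappa over all pairs of raters.

```python
from itertools import combinations
from math import isnan, nan


def percent_agreement(a, b):
    """Share of impressions on which two codings are identical."""
    if len(a) != len(b) or not a:
        raise ValueError("codings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two codings of the same impressions.

    Returns nan when expected chance agreement equals 1, which happens
    when both codings are constant and identical (e.g. all "0").
    """
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    categories = set(a) | set(b)
    # Chance agreement from each coding's marginal proportions.
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if p_e == 1.0:
        return nan  # denominator would be zero: kappa is undefined
    return (p_o - p_e) / (1.0 - p_e)


def light_kappa(codings):
    """Light's kappa: mean of Cohen's kappa over all rater pairs."""
    pairs = [cohen_kappa(a, b) for a, b in combinations(codings, 2)]
    if any(isnan(k) for k in pairs):
        return nan
    return sum(pairs) / len(pairs)


# Hypothetical data: three raters code one pattern (1 = observed,
# 0 = not observed) on ten impressions. Not taken from the study.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
rater_2 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
rater_3 = [1, 0, 0, 0, 1, 0, 1, 0, 0, 1]

print("Percent agreement (1 vs 2):", percent_agreement(rater_1, rater_2))
print("Cohen's kappa (1 vs 2):", cohen_kappa(rater_1, rater_2))
print("Light's kappa (all raters):", light_kappa([rater_1, rater_2, rater_3]))

# A pattern never observed by either rater: kappa is undefined (nan).
print("Constant pattern:", cohen_kappa([0] * 10, [0] * 10))
```

Intra-rater reliability would use the same `cohen_kappa` function, with the two codings being one analyst's first and second sessions instead of two different analysts. Returning `nan` for a constant pattern, instead of raising an error, is one possible design choice; it lets such patterns be skipped when summarizing results across a codification system.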