{"title":"Measuring document similarity with weighted averages of word embeddings","authors":"Bryan Seegmiller , Dimitris Papanikolaou , Lawrence D.W. Schmidt","doi":"10.1016/j.eeh.2022.101494","DOIUrl":"https://doi.org/10.1016/j.eeh.2022.101494","url":null,"abstract":"<div><p>We detail a methodology for estimating the textual similarity between two documents while accounting for the possibility that two different words can have a similar meaning. We illustrate the method’s usefulness in facilitating comparisons between documents with very different formats and vocabularies by textually linking occupation task and industry output descriptions with related technologies as described in patent texts; we also examine economic applications of the resultant document similarity measures. In a final application we demonstrate that the method also works well relative to alternatives for comparing documents within the same domain by showing that pairwise textual similarity between occupations’ task descriptions strongly predicts the probability that a given worker will transition from one occupation to another. Finally, we offer some suggestions on other potential uses and guidance in implementing the method.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"87 ","pages":"Article 101494"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49857314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perks and pitfalls of city directories as a micro-geographic data source","authors":"Thilo N.H. Albers, Kalle Kappner","doi":"10.1016/j.eeh.2022.101476","DOIUrl":"https://doi.org/10.1016/j.eeh.2022.101476","url":null,"abstract":"<div><p>Historical city directories are rich sources of micro-geographic data. They provide information on the location of households and firms and their occupations and industries, respectively. We develop a generic algorithmic work flow that converts scans of them into geo- and status-referenced household-level data sets. Applying the work flow to our case study, the Berlin 1880 directory, adds idiosyncratic challenges that should make automation less attractive. Yet, employing an administrative benchmark data set on household counts, incomes, and income distributions across more than 200 census tracts, we show that semi-automatic referencing yields results very similar to those from labour-intensive manual referencing. Finally, we discuss how to scale the work flow to other years and cities as well as potential applications in economic history and beyond.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"87 ","pages":"Article 101476"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49857303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The census place project: A method for geolocating unstructured place names","authors":"Enrico Berkes , Ezra Karger , Peter Nencka","doi":"10.1016/j.eeh.2022.101477","DOIUrl":"https://doi.org/10.1016/j.eeh.2022.101477","url":null,"abstract":"<div><p>Researchers use microdata to study the economic development of the United States and the causal effects of historical policies. Much of this research focuses on county- and state-level patterns and policies because comprehensive sub-county data is not consistently available. We describe a new method that geocodes and standardizes the towns and cities of residence for individuals and households in decennial census microdata from 1790–1940. We release public crosswalks linking individuals and households to consistently-defined place names, longitude-latitude pairs, counties, and states. Our method dramatically increases the number of individuals and households assigned to a sub-county location relative to standard publicly available data: we geocode an average of 83% of the individuals and households in 1790–1940 census microdata, compared to 23% in widely-used crosswalks. In years with individual-level microdata (1850–1940), our average match rate is 94% relative to 33% in widely-used crosswalks. To illustrate the value of our crosswalks, we measure place-level population growth across the United States between 1870 and 1940 at a sub-county level, confirming predictions of Zipf’s Law and Gibrat’s Law for large cities but rejecting similar predictions for small towns. We describe how our approach can be used to accurately geocode other historical datasets.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"87 ","pages":"Article 101477"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49899235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HANA: A handwritten name database for offline handwritten text recognition","authors":"Christian M. Dahl , Torben S.D. Johansen , Emil N. Sørensen , Simon Wittrock","doi":"10.1016/j.eeh.2022.101473","DOIUrl":"https://doi.org/10.1016/j.eeh.2022.101473","url":null,"abstract":"<div><p>Methods for linking individuals across historical data sets, typically in combination with AI based transcription models, are developing rapidly. Perhaps the single most important identifier for linking is personal names. However, personal names are prone to enumeration and transcription errors and although modern linking methods are designed to handle such challenges, these sources of errors are critical and should be minimized. For this purpose, improved transcription methods and large-scale databases are crucial components. This paper describes and provides documentation for HANA, a newly constructed large-scale database which consists of more than 3.3 million names. The database contains more than 105 thousand unique names with a total of more than 1.1 million images of personal names, which proves useful for transfer learning to other settings. We provide three examples hereof, obtaining significantly improved transcription accuracy on both Danish and US census data. In addition, we present benchmark results for deep learning models automatically transcribing the personal names from the scanned documents. Through making more challenging large-scale databases publicly available we hope to foster more sophisticated, accurate, and robust models for handwritten text recognition.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"87 ","pages":"Article 101473"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49857304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reading the ransom: Methodological advancements in extracting the Swedish Wealth Tax of 1571","authors":"Christopher Blomqvist , Kerstin Enflo , Andreas Jakobsson , Kalle Åström","doi":"10.1016/j.eeh.2022.101470","DOIUrl":"https://doi.org/10.1016/j.eeh.2022.101470","url":null,"abstract":"<div><p>We describe a deep learning method to read hand-written records from the 16th century. The method consists of a combination of a segmentation module and a Handwritten Text Recognition (HTR) module. The transformer-based HTR module exploits both language and image features in reading, classifying and extracting the position of each word on the page. The method is demonstrated on a unique historical document: The Swedish Wealth Tax of 1571. Results suggest that the segmentation module performs significantly better than the layout analysis implemented in state-of-the-art programs, enabling us to trace many more text blocks correctly on each page. The HTR module has a low character error rate (CER), in addition to being able to classify words and help organize them into tabular formats. By demonstrating an automated process to transform loosely structured handwritten information from the 16th century into organized tables, our method should interest economic historians seeking to digitize and organize quantitative material from pre-industrial periods.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"87 ","pages":"Article 101470"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49857306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Record linkage for character-based surnames: Evidence from Chinese exclusion","authors":"Hannah M. Postel","doi":"10.1016/j.eeh.2022.101493","DOIUrl":"10.1016/j.eeh.2022.101493","url":null,"abstract":"<div><p>This paper proposes a novel pre-processing technique to improve record linkage for historical Chinese populations. Current matching approaches are relatively ineffective due to Chinese-specific naming conventions and enumeration errors. This paper develops a three-step process that both triples the match rate over baseline and improves match accuracy. The procedures developed in this paper can be applied in part or in full to other sources of historical data, and/or modified for use with other character-based languages such as Japanese. More broadly, this approach suggests the promise of language-specific linkage procedures to boost match rates for ethnic minority groups.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"87 ","pages":"Article 101493"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9854273/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10604380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Breathing new life into death certificates: Extracting handwritten cause of death in the LIFE-M project","authors":"Martha J. Bailey , Susan H. Leonard , Joseph Price , Evan Roberts , Logan Spector , Mengying Zhang","doi":"10.1016/j.eeh.2022.101474","DOIUrl":"10.1016/j.eeh.2022.101474","url":null,"abstract":"<div><p>The demographic and epidemiological transitions of the past 200 years are well documented at an aggregate level. Understanding differences in individual and group risks for mortality during these transitions requires linkage between demographic data and detailed individual cause of death information. This paper describes the digitization of almost 185,000 causes of death for Ohio to supplement demographic information in the Longitudinal, Intergenerational Family Electronic Micro-database (LIFE-M). To extract causes of death, our methodology combines handwriting recognition, extensive data cleaning algorithms, and the semi-automated classification of causes of death into International Classification of Diseases (ICD) codes. Our procedures are adaptable to other collections of handwritten data, which require both handwriting recognition and semi-automated coding of the information extracted.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"87 ","pages":"Article 101474"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9912950/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10826426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Digitizing historical balance sheet data: A practitioner’s guide","authors":"Sergio Correia , Stephan Luck","doi":"10.1016/j.eeh.2022.101475","DOIUrl":"https://doi.org/10.1016/j.eeh.2022.101475","url":null,"abstract":"<div><p>This paper discusses how to successfully digitize large-scale historical micro-data by augmenting optical character recognition (OCR) engines with pre- and post-processing methods. Although OCR software has improved dramatically in recent years due to improvements in machine learning, off-the-shelf OCR applications still present high error rates which limit their applications for accurate extraction of structured information. Complementing OCR with additional methods can however dramatically increase its success rate, making it a powerful and cost-efficient tool for economic historians. This paper showcases these methods and explains why they are useful. We apply them against two large balance sheet datasets and introduce quipucamayoc, a Python package containing these methods in a unified framework.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"87 ","pages":"Article 101475"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49857307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Digitization and data frames for card index records","authors":"Someswar Amujala , Angela Vossmeyer , Sanjiv R. Das","doi":"10.1016/j.eeh.2022.101469","DOIUrl":"https://doi.org/10.1016/j.eeh.2022.101469","url":null,"abstract":"<div><p>We develop a methodology for converting card index archival records into usable data frames for statistical and textual analyses. Leveraging machine learning and natural-language processing tools from Amazon Web Services (AWS), we overcome hurdles associated with character recognition, inconsistent data reporting, column misalignment, and irregular naming. In this article, we detail the step-by-step conversion process and discuss remedies for common problems and edge cases, using historical records from the Reconstruction Finance Corporation.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"87 ","pages":"Article 101469"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49857305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The mortality risk of being overweight in the twentieth century: Evidence from two cohorts of New Zealand men","authors":"Kris Inwood , Les Oxley , Evan Roberts","doi":"10.1016/j.eeh.2022.101472","DOIUrl":"10.1016/j.eeh.2022.101472","url":null,"abstract":"<div><p>How have health and social mortality risks changed over time? Evidence from pre-1945 cohorts is sparse, mostly from the United States, and evidence is mixed on long-term changes in the risk of being overweight. We develop a dataset of men entering the NZ army in the two world wars, with objectively measured height and weight, and socioeconomic status in early adulthood. Our sample includes significant numbers of indigenous Māori, providing estimates of weight and mortality risk in an indigenous population. We follow men from war's end until death, with data on more than 12,000 men from each war. Overweight and obesity were important risk factors for mortality, and associated with shorter life expectancy. However, the reduction in life expectancy associated with being overweight declined from 5 to 3 years between the two cohorts, consistent with the hypothesis that being overweight became less risky during the twentieth century.</p></div>","PeriodicalId":47413,"journal":{"name":"Explorations in Economic History","volume":"86 ","pages":"Article 101472"},"PeriodicalIF":2.3,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10164007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"历史学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}