Methods

The sheer volume of the listes (about 15 million images from 1836 to 1936, corresponding to 700 million individual records) and their geographic dispersion (they are kept in nearly one hundred archive repositories) have so far limited their use. The Socface project intends to overcome this limitation by using recent advances in machine learning.

Taking advantage of the regularity of the source over time, we will develop automatic models that process all the images in order to detect rows and columns, perform text recognition, and identify entities within the text (name, age, hamlet, etc.). Throughout this processing chain, we will carry out various tests to evaluate the consistency and quality of the results. To do so, we will draw on the knowledge of the listes held by the archivists, historians and demographers involved in the project. In return, computer scientists are not mere data purveyors for the social scientists: they will explain to them how text recognition and document processing work, so that social scientists can code the information extracted from the documents and use it in their research with full knowledge of its characteristics and limitations.
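As an illustration only, the last steps of such a chain (mapping recognized cells to entities, then checking consistency) could be sketched as follows. This is a minimal sketch and not the project's actual implementation: it assumes that row and column detection and text recognition have already produced rows of cell strings, and the names Record, to_record, is_consistent and process_page, the column labels, and the maximum plausible age of 110 are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """One individual entry of a liste (hypothetical schema)."""
    name: str
    age: Optional[int]
    hamlet: str


def to_record(cells: list[str], columns: list[str]) -> Record:
    """Map the recognized cells of one row to named entities."""
    fields = dict(zip(columns, (cell.strip() for cell in cells)))
    raw_age = fields.get("age", "")
    # Keep the age only if recognition produced a plain integer.
    age = int(raw_age) if raw_age.isascii() and raw_age.isdigit() else None
    return Record(fields.get("name", ""), age, fields.get("hamlet", ""))


def is_consistent(record: Record, max_age: int = 110) -> bool:
    """Basic plausibility test; max_age is an assumed threshold."""
    return (
        bool(record.name)
        and record.age is not None
        and 0 <= record.age <= max_age
    )


def process_page(rows: list[list[str]], columns: list[str]):
    """Return all records of a page and those failing the test."""
    records = [to_record(row, columns) for row in rows]
    flagged = [r for r in records if not is_consistent(r)]
    return records, flagged


# Toy input standing in for the output of text recognition.
rows = [
    ["Martin Jean", "42", "Le Bourg"],
    ["Durand Marie", "4z", "La Combe"],  # simulated recognition error
]
records, flagged = process_page(rows, ["name", "age", "hamlet"])
```

In this toy run, the second row is flagged because its age cell is not a valid number. Flagged records of this kind are the sort of output that archivists, historians and demographers could review to judge the quality of the results.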

Socface is thus a genuine co-production of a database on a unique scale, supporting frontier research in both computer science and the social sciences.