Handwritten Text Recognition and the Notebooks of Jean-Henri Polier de Vernand (1715-1791)
Master's Thesis Project: Digital Transcription of 26,300 Pages Through HTR Technology (2023)
Les Cahiers Polier
Master's Research in Digital Humanities & History, University of Lausanne (2023)
The transcription requires manual correction for scholarly precision and serves as a foundation for computational analysis.
This repository presents the digital transcription of Jean-Henri Polier de Vernand's personal notebooks, produced with Handwritten Text Recognition (HTR). As lieutenant baillival of Lausanne from 1754 to 1791, Polier systematically documented daily life across 26,300 manuscript pages, creating one of the most significant historical records of 18th-century Lausanne society.
The digital transcription process employed a multi-stage approach combining manual ground truth generation with automated recognition systems:
- Ground Truth Generation: Strategic selection and manual transcription of representative pages using the Transkribus platform
- Layout Analysis: Automated text line detection and baseline correction through computer vision algorithms
- Model Training: HTR-Flor++ implementation with TensorFlow, enhanced by Bentham dataset pre-training
- Mass Inference: Application of the trained model to the entire corpus with GPU acceleration via Google Colab
- Post-Processing: Conversion of predictions to structured JSON format maintaining page-level organization
- Quality Validation: Statistical analysis and sample verification against ground truth standards
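
The quality validation step can be illustrated with a character error rate (CER) computation, a common metric for HTR output. The source does not state which statistics were actually used, so the choice of metric, the function names, and the sample strings below are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (each edit costs 1)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # drop a character from a
                current[j - 1] + 1,      # add a character from b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, prediction: str) -> float:
    """Edit distance normalised by the length of the ground truth line."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, prediction) / len(reference)


# Invented sample line, not taken from the notebooks.
reference = "le temps est beau"
prediction = "le tems est bau"
cer = character_error_rate(reference, prediction)
print(f"CER: {cer:.3f}")
```

Averaging this value over a held-out sample of ground truth lines gives a single figure that can be tracked between training runs.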
The training dataset was strategically constructed to represent the diversity of Polier's writing across different periods, contexts, and content types:
| Notebook Range | Pages Selected | Selection Criteria | Content Characteristics |
|---|---|---|---|
| 001-020 | 15 pages | Early period documentation | Initial writing patterns, varied layouts |
| 040-080 | 12 pages | Regular interval sampling | Administrative content, numerical data |
| 100-160 | 10 pages | Middle period diversity | Mixed content types, layout variations |
| 185 | 3 pages | Late period examples | Mature writing style, complex layouts |
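
To put the selection in proportion, the table can be written as a small data structure and totalled. This is a sketch that assumes the four rows above list the complete ground truth selection; the `SamplingBand` class and the variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingBand:
    notebooks: str   # notebook range as written in the table
    pages: int       # pages selected for manual transcription
    criterion: str   # selection criterion from the table


PLAN = [
    SamplingBand("001-020", 15, "Early period documentation"),
    SamplingBand("040-080", 12, "Regular interval sampling"),
    SamplingBand("100-160", 10, "Middle period diversity"),
    SamplingBand("185", 3, "Late period examples"),
]
CORPUS_PAGES = 26_300  # total manuscript pages stated in the project description

total_pages = sum(band.pages for band in PLAN)
share = total_pages / CORPUS_PAGES
print(f"{total_pages} ground truth pages, {share:.2%} of the corpus")
```

Under that assumption, the manually transcribed sample covers only a small fraction of the corpus, which is consistent with the model being pre-trained on the Bentham dataset.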
Research Foundation
This digital corpus provides a searchable foundation for computational analysis, with manual correction workflows established for critical passages requiring scholarly precision.
- Python environment with OpenCV, TensorFlow, and NumPy dependencies
- Access to Google Colab or equivalent GPU computing environment
- Transkribus account for layout analysis and ground truth generation
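
A quick way to confirm that the Python dependencies listed above are available before opening the notebooks is sketched below. The helper name is hypothetical, and the check only tests whether each package can be found, not whether its version matches what the notebooks expect.

```python
from importlib.util import find_spec

# Import names for the listed packages (OpenCV is imported as cv2).
REQUIRED_MODULES = {"OpenCV": "cv2", "TensorFlow": "tensorflow", "NumPy": "numpy"}


def missing_dependencies(modules: dict[str, str]) -> list[str]:
    """Return the labels of packages whose import name cannot be located."""
    return [label for label, name in modules.items() if find_spec(name) is None]


missing = missing_dependencies(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```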
- Execute `From_Transkribus_to_HTR_Flor.ipynb` for training data preparation
- Train model using HTR-Flor++ framework on prepared dataset
- Apply trained model to complete manuscript collection
- Process predictions using `From_HTR_Flor_to_JSON.ipynb`
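
The final conversion step can be pictured as grouping line-level predictions by page and serialising the result. The sketch below is not the notebook's actual code: the record layout (notebook id, page number, line index, text), the output keys, and the function name are all assumptions.

```python
import json
from collections import defaultdict

# Hypothetical prediction records: (notebook id, page number, line index, text).
predictions = [
    ("001", 2, 0, "first line of page two"),
    ("001", 1, 1, "second line of page one"),
    ("001", 1, 0, "first line of page one"),
]


def group_by_page(records):
    """Collect line predictions per page, in notebook, page, and line order."""
    pages = defaultdict(list)
    for notebook, page, line_index, text in records:
        pages[(notebook, page)].append((line_index, text))
    return [
        {
            "notebook": notebook,
            "page": page,
            "lines": [text for _, text in sorted(lines)],
        }
        for (notebook, page), lines in sorted(pages.items())
    ]


corpus = group_by_page(predictions)
# ensure_ascii=False keeps accented French characters readable in the output.
print(json.dumps(corpus, ensure_ascii=False, indent=2))
```

Keeping one JSON object per page preserves the page-level organisation described above, so a transcription can be traced back to its manuscript image for manual correction.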
Ongoing research includes enhanced error correction algorithms, multilingual recognition capabilities for Polier's occasional English passages, and development of specialized annotation tools for historical document analysis. The established methodology provides a replicable framework for similar digital humanities projects involving historical manuscript collections.