Handwritten Text Recognition and the Notebooks of Jean-Henri Polier de Vernand (1715-1791)

Master's Thesis Project: Digital Transcription of 26,300 Pages Through HTR Technology (2023)

HTR Polier de Vernand - Digital Transcription Repository

Les Cahiers Polier

Archives cantonales vaudoises (ACV) × EPFL Collège des Humanités Digitales (CDH)
Master's Research in Digital Humanities & History, University of Lausanne (2023)
8.78%
Character Error Rate (2023)
Acceptable accuracy for initial digitization of historical documents
Requires manual correction for scholarly precision; serves as foundation for computational analysis
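The Character Error Rate above is the Levenshtein (edit) distance between the model's output and a ground-truth transcription, divided by the length of the reference. A minimal sketch of that computation in pure Python — illustrative only, not the project's actual evaluation code:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance between two strings,
    # keeping only the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_error_rate(reference: str, hypothesis: str) -> float:
    # CER = edit distance / number of reference characters.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

At 8.78%, roughly one character in eleven differs from the reference, which is why the corpus is usable for search and computational analysis but still needs manual correction for scholarly quotation.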
Project Overview

This repository presents the comprehensive digital transcription of Jean-Henri Polier de Vernand's personal notebooks, accomplished through state-of-the-art Handwritten Text Recognition (HTR) technologies. As lieutenant baillival of Lausanne from 1754 to 1791, Polier systematically documented daily life across 26,300 manuscript pages, creating one of the most significant historical records of 18th-century Lausanne society.

Historical Significance: Jean-Henri Polier de Vernand was one of the most important figures in the Lausanne society of his time, holding positions on multiple councils and courts. His meticulous documentation provides unprecedented insight into the social, economic, and political fabric of 18th-century Swiss urban life.
Quantitative Analysis
26,300
Manuscript Pages
Complete digitization of Polier's personal notebooks spanning his entire career as lieutenant baillival
37 years
Temporal Coverage
Continuous documentation from 1754 until Polier's death in 1791
40
Training Pages
Manually transcribed pages representing diverse layouts, vocabulary, and writing styles
JSON
Structured Output
Machine-readable format enabling computational analysis and digital humanities research
Methodological Framework
Technical Pipeline

The digital transcription process employed a multi-stage approach combining manual ground truth generation with automated recognition systems:

  1. Ground Truth Generation: Strategic selection and manual transcription of representative pages using Transkribus platform
  2. Layout Analysis: Automated text line detection and baseline correction through computer vision algorithms
  3. Model Training: HTR-Flor++ implementation with TensorFlow, enhanced by Bentham dataset pre-training
  4. Mass Inference: Application of trained model to entire corpus with GPU acceleration via Google Colab
  5. Post-Processing: Conversion of predictions to structured JSON format maintaining page-level organization
  6. Quality Validation: Statistical analysis and sample verification against ground truth standards
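Step 5 above — converting line-level predictions into page-level JSON — can be sketched as follows. The `(page_id, line_index, text)` tuple shape and the page-to-lines output schema are illustrative assumptions, not the repository's actual format:

```python
import json
from collections import defaultdict

def predictions_to_json(line_predictions, out_path):
    """Group line-level HTR predictions into page-level records.

    `line_predictions` is an iterable of (page_id, line_index, text)
    tuples; the identifiers here are hypothetical placeholders.
    """
    pages = defaultdict(list)
    for page_id, line_index, text in line_predictions:
        pages[page_id].append((line_index, text))

    # Sort lines within each page by their index to preserve
    # reading order, then keep only the transcribed text.
    corpus = {
        page_id: [text for _, text in sorted(lines)]
        for page_id, lines in pages.items()
    }
    with open(out_path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps French accented characters readable.
        json.dump(corpus, fh, ensure_ascii=False, indent=2)
    return corpus
```

Keeping one JSON entry per page preserves the notebooks' physical organization, so downstream analyses can always be traced back to a specific manuscript page.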
Training Dataset Composition

The training dataset was strategically constructed to represent the diversity of Polier's writing across different periods, contexts, and content types:

| Notebook Range | Pages Selected | Selection Criteria | Content Characteristics |
|---|---|---|---|
| 001-020 | 15 pages | Early period documentation | Initial writing patterns, varied layouts |
| 040-080 | 12 pages | Regular interval sampling | Administrative content, numerical data |
| 100-160 | 10 pages | Middle period diversity | Mixed content types, layout variations |
| 185 | 3 pages | Late period examples | Mature writing style, complex layouts |
Technical Implementation
Core Technologies
HTR Framework:
HTR-Flor++ with TensorFlow backend
Pre-training Dataset:
tranScriptorium Bentham collection
Layout Analysis:
Transkribus XML export with baseline detection
Computing Environment:
Google Colab with GPU acceleration
Output Format:
Structured JSON with page-level organization
Character Encoding:
Unicode UTF-8 with accent normalization
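Accent normalization matters because the same French text can be encoded two ways in Unicode: a precomposed "é" or an "e" followed by a combining accent. Left unnormalized, these would compare as different characters and inflate error rates. A minimal sketch using Python's standard unicodedata module:

```python
import unicodedata

def normalize_text(text: str) -> str:
    # NFC collapses decomposed sequences (e.g. 'e' + U+0301 COMBINING
    # ACUTE ACCENT) into single precomposed code points, so that every
    # "é" in predictions and ground truth compares equal.
    return unicodedata.normalize("NFC", text)
```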

Research Foundation

This digital corpus provides a searchable foundation for computational analysis, with manual correction workflows established for critical passages requiring scholarly precision.
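A full-text search over the transcribed corpus can then be as simple as loading the JSON and scanning each page's lines. The page-to-lines schema assumed here is illustrative, not the repository's documented format:

```python
import json

def search_corpus(json_path: str, term: str):
    """Return (page_id, line) pairs whose text contains `term`.

    Assumes the corpus JSON maps page identifiers to lists of
    transcribed lines; case-insensitive substring match.
    """
    with open(json_path, encoding="utf-8") as fh:
        corpus = json.load(fh)
    term = term.lower()
    return [(page, line)
            for page, lines in corpus.items()
            for line in lines
            if term in line.lower()]
```

Because recognition errors remain in the text (CER 8.78%), exact-match search will miss some occurrences; fuzzy matching is one option for critical queries.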

Repository Access
Implementation Considerations
Prerequisites
  • Python environment with OpenCV, TensorFlow, and NumPy dependencies
  • Access to Google Colab or an equivalent GPU computing environment
  • Transkribus account for layout analysis and ground truth generation
Replication Workflow
  • Execute From_Transkribus_to_HTR_Flor.ipynb for training data preparation
  • Train the model using the HTR-Flor++ framework on the prepared dataset
  • Apply the trained model to the complete manuscript collection
  • Process predictions using From_HTR_Flor_to_JSON.ipynb
Future Developments

Ongoing research includes enhanced error correction algorithms, multilingual recognition capabilities for Polier's occasional English passages, and development of specialized annotation tools for historical document analysis. The established methodology provides a replicable framework for similar digital humanities projects involving historical manuscript collections.