Handwritten Text Recognition and the Notebooks of Jean-Henri Polier de Vernand (1715-1791)

Master's Thesis Project: Digital Transcription of 26,300 Pages Through HTR Technology (2023)

HTR Polier de Vernand - Digital Transcription Repository

Les Cahiers Polier

Archives cantonales vaudoises (ACV) × EPFL Collège des Humanités Digitales (CDH)
Master's Research in Digital Humanities & History, University of Lausanne (2023)
8.78%
Character Error Rate (2023)
Acceptable accuracy for initial digitization of historical documents
Requires manual correction for scholarly precision; serves as foundation for computational analysis
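The Character Error Rate above is the Levenshtein (edit) distance between the model's output and a ground-truth transcription, divided by the length of the reference. A minimal sketch of that computation in pure Python — illustrative only, not the project's actual evaluation code:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance between two strings,
    # keeping only the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_error_rate(reference: str, hypothesis: str) -> float:
    # CER = edit distance / number of reference characters.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

At 8.78%, roughly one character in eleven differs from the reference, which is why the corpus is usable for search and computational analysis but still needs manual correction for scholarly quotation.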
Project Overview

This repository presents the comprehensive digital transcription of Jean-Henri Polier de Vernand's personal notebooks, accomplished through state-of-the-art Handwritten Text Recognition (HTR) technologies. As lieutenant baillival of Lausanne from 1754 to 1791, Polier systematically documented daily life across 26,300 manuscript pages, creating one of the most significant historical records of 18th-century Lausanne society.

Historical Significance: Jean-Henri Polier de Vernand was one of the most important figures in the Lausanne society of his time, holding positions on multiple councils and courts. His meticulous documentation provides unprecedented insight into the social, economic, and political fabric of 18th-century Swiss urban life.
Quantitative Analysis
26,300
Manuscript Pages
Complete digitization of Polier's personal notebooks spanning his entire career as lieutenant baillival
37 years
Temporal Coverage
Continuous documentation from 1754 until Polier's death in 1791
40
Training Pages
Manually transcribed pages representing diverse layouts, vocabulary, and writing styles
JSON
Structured Output
Machine-readable format enabling computational analysis and digital humanities research
Methodological Framework
Technical Pipeline

The digital transcription process employed a multi-stage approach combining manual ground truth generation with automated recognition systems:

  1. Ground Truth Generation: Strategic selection and manual transcription of representative pages using Transkribus platform
  2. Layout Analysis: Automated text line detection and baseline correction through computer vision algorithms
  3. Model Training: HTR-Flor++ implementation with TensorFlow, enhanced by Bentham dataset pre-training
  4. Mass Inference: Application of trained model to entire corpus with GPU acceleration via Google Colab
  5. Post-Processing: Conversion of predictions to structured JSON format maintaining page-level organization
  6. Quality Validation: Statistical analysis and sample verification against ground truth standards
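Step 5 above — converting line-level predictions into page-level JSON — can be sketched as follows. The `(page_id, line_index, text)` tuple shape and the page-to-lines output schema are illustrative assumptions, not the repository's actual format:

```python
import json
from collections import defaultdict

def predictions_to_json(line_predictions, out_path):
    """Group line-level HTR predictions into page-level records.

    `line_predictions` is an iterable of (page_id, line_index, text)
    tuples; the identifiers here are hypothetical placeholders.
    """
    pages = defaultdict(list)
    for page_id, line_index, text in line_predictions:
        pages[page_id].append((line_index, text))

    # Sort lines within each page by their index to preserve
    # reading order, then keep only the transcribed text.
    corpus = {
        page_id: [text for _, text in sorted(lines)]
        for page_id, lines in pages.items()
    }
    with open(out_path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps French accented characters readable.
        json.dump(corpus, fh, ensure_ascii=False, indent=2)
    return corpus
```

Keeping one JSON entry per page preserves the notebooks' physical organization, so downstream analyses can always be traced back to a specific manuscript page.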
Training Dataset Composition

The training dataset was strategically constructed to represent the diversity of Polier's writing across different periods, contexts, and content types:

| Notebook Range | Pages Selected | Selection Criteria | Content Characteristics |
|---|---|---|---|
| 001-020 | 15 pages | Early period documentation | Initial writing patterns, varied layouts |
| 040-080 | 12 pages | Regular interval sampling | Administrative content, numerical data |
| 100-160 | 10 pages | Middle period diversity | Mixed content types, layout variations |
| 185 | 3 pages | Late period examples | Mature writing style, complex layouts |
Technical Implementation
Core Technologies
HTR Framework:
HTR-Flor++ with TensorFlow backend
Pre-training Dataset:
tranScriptorium Bentham collection
Layout Analysis:
Transkribus XML export with baseline detection
Computing Environment:
Google Colab with GPU acceleration
Output Format:
Structured JSON with page-level organization
Character Encoding:
Unicode UTF-8 with accent normalization
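Accent normalization matters because the same French text can be encoded two ways in Unicode: a precomposed "é" or an "e" followed by a combining accent. Left unnormalized, these would compare as different characters and inflate error rates. A minimal sketch using Python's standard unicodedata module:

```python
import unicodedata

def normalize_text(text: str) -> str:
    # NFC collapses decomposed sequences (e.g. 'e' + U+0301 COMBINING
    # ACUTE ACCENT) into single precomposed code points, so that every
    # "é" in predictions and ground truth compares equal.
    return unicodedata.normalize("NFC", text)
```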

Research Foundation

This digital corpus provides a searchable foundation for computational analysis, with manual correction workflows established for critical passages requiring scholarly precision.
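A full-text search over the transcribed corpus can then be as simple as loading the JSON and scanning each page's lines. The page-to-lines schema assumed here is illustrative, not the repository's documented format:

```python
import json

def search_corpus(json_path: str, term: str):
    """Return (page_id, line) pairs whose text contains `term`.

    Assumes the corpus JSON maps page identifiers to lists of
    transcribed lines; case-insensitive substring match.
    """
    with open(json_path, encoding="utf-8") as fh:
        corpus = json.load(fh)
    term = term.lower()
    return [(page, line)
            for page, lines in corpus.items()
            for line in lines
            if term in line.lower()]
```

Because recognition errors remain in the text (CER 8.78%), exact-match search will miss some occurrences; fuzzy matching is one option for critical queries.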

Repository Access
Implementation Considerations
Prerequisites
  • Python environment with OpenCV, TensorFlow, and NumPy dependencies
  • Access to Google Colab or an equivalent GPU computing environment
  • Transkribus account for layout analysis and ground truth generation
Replication Workflow
  • Execute From_Transkribus_to_HTR_Flor.ipynb for training data preparation
  • Train the model using the HTR-Flor++ framework on the prepared dataset
  • Apply the trained model to the complete manuscript collection
  • Process predictions using From_HTR_Flor_to_JSON.ipynb
Future Developments

Ongoing research includes enhanced error correction algorithms, multilingual recognition capabilities for Polier's occasional English passages, and development of specialized annotation tools for historical document analysis. The established methodology provides a replicable framework for similar digital humanities projects involving historical manuscript collections.