XerOCR — Corpus library

Corpus library

Local corpora and remote catalogues — all your material in one place.

Local corpora

Uploaded, imported from HTR-United / HuggingFace, or pointed from the filesystem.

1 local · 10 pages

Prepare a corpus

Upload a ZIP (drag-and-drop) or import from a remote source.

Drop a .zip here or click to select ZIP · max 500 MB · pairs auto-detected
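
How the pairs are auto-detected is not documented on this page. The sketch below shows one plausible approach, assuming that a document image and its ground truth share the same file stem (for example 0000.jpg and 0000.xml). The function name `detect_pairs` and the two extension sets are hypothetical and are not taken from XerOCR.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed extension sets; the real application may accept others.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
GT_EXTS = {".xml", ".txt"}


def detect_pairs(zip_bytes: bytes) -> dict[str, tuple[str, str]]:
    """Pair each image with the ground-truth file that has the same stem."""
    images: dict[str, str] = {}
    truths: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            path = PurePosixPath(name)
            ext = path.suffix.lower()
            if ext in IMAGE_EXTS:
                images[path.stem] = name
            elif ext in GT_EXTS:
                truths[path.stem] = name
    # Keep only stems that have both an image and a ground truth.
    shared = sorted(images.keys() & truths.keys())
    return {stem: (images[stem], truths[stem]) for stem in shared}
```

Under this assumption, an archive holding ten images and ten matching transcriptions would yield ten pairs, while files without a counterpart would be ignored.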

Stored corpora

bnlgt.zip

ID · a8ff8edd93… 10 documents 10 ground truths

ready for benchmark

documents

0000 0001 0002 +7

Use

Discover

Explore one source at a time, then import only what matters.

HTR-United (127) · HuggingFace (3)

HTR-United

Filter the HTR-United catalogue

TranscriboQuest 2024 Medieval Literary

zenodo.13757440

catalogue

This dataset was created in the context of TranscriboQuest 2024 (Medieval Literary Team) held in Lyon (11/09/2024-13/09/2024). We opted to focus on medieval scientific documents that are damaged, in several different languages. The result is 808 lines transcribed by experts in the field. The dataset contains the images of the manuscripts and ALTO-XMLs.

lat dum fro gmh

Open source

ÖNB, Cod. 3891. Ground Truth

zenodo.7467249

catalogue

The Ground Truth was produced by the participants of the HTR Winter School 2022 in the Late Latin Group (more information: https://www.oeaw.ac.at/imafo/veranstaltungen/detail/introduction-into-handwritten-text-recognition). The Ground Truth includes the following folios: 1-3r, 6-8, 11r, 27 and is still a work in progress. We are adding more pages soon. If you find any errors we kindly ask you to contact Jan Odstrčilík (jan.odstrcilik@oeaw.ac.at). The Supervisors of the Late Latin Group: Jan Odstrčilík PhD, Austrian Academy of Sciences, Daniela Mairhofer PhD, Princeton University, Tobias Hodel PhD, University of Bern.

lat

Open source

Données vérité de terrain HTR+ Annuaire des propriétaires et des propriétés de Paris et du département de la Seine (1898-1923)

nkl.acb724xs

catalogue

Annuaire des propriétaires et des propriétés de Paris et du département de la Seine. Link in the BnF catalogue: https://catalogue.bnf.fr/ark:/12148/cb32697229h. Credits: Bibliothèque nationale de France. Ground truth data resulting from the manual transcription and segmentation of a sample of 169 pages of the directories from the 1898 and 1923 volumes. An HTR+ transcription model was trained on this sample with Transkribus and is publicly available on that platform. The model is suitable for automatically transcribing the 1903 and 1913 volumes and any other printed document set in two columns and using the Latin alphabet, particularly in French. The sample was chosen alphabetically, since that is how the information is organised in this document. The curly braces present in the document were not segmented. 118 pages were used for training and 51 pages for validation. Context and funding: DAHN grant (Dispositif de soutien à l'archivistique et aux humanités numériques) from the MESRI. Teams: Consortium Paris Time Machine - TGIR Humanum EHESS / CNRS / LATTICE / INRIA. Contact if personal names need to be anonymised: carmen.brando@ehess.fr.

fra

Open source

University of Denver Jewish Consumptives Relief Society Medical Records Training and Validation Set

zenodo.4243023

catalogue

Training and validation set. Transcribed records available upon request. The transcribed corpus of records from the Jewish Consumptive Relief Society contains data that include individually identifiable health information, among other sensitive information regarding persons and people. All individuals for whom records are provided have been deceased for at least 70 years, but were they still living today, these records would be recognized as being protected health information under the US Health Insurance Portability and Accountability Act of 1996 (HIPAA). While HIPAA and other privacy laws no longer apply to these individuals, in providing these data the University of Denver wishes to foster research practices that express the utmost respect for the human beings whose lives are represented, at least in some part, in these collections. In addition, we ask that researchers respect the lives of these individuals’ ancestors and their communities. To foster practices that honor patients, staff, nurses and physicians connected with the JCRS Sanitorium, as well as their families, ancestors and communities, we ask that researchers disclose their intended use of the collection for review by our Advisory Board (see reverse). This Board is comprised of ethicists, historians, librarians, attorneys, physicians, and members of the Jewish community. In addition, we ask that researchers agree to conduct their work under the following set of principles: 1. I affirm the role of JCRS patients and staff as data creators and will avoid exploiting and/or dehumanizing them by treating them simply as data. 2. My research will, when possible and appropriate, account for the contexts surrounding the JCRS subjects as data arise. My work will recognize that all data and datasets are shaped by decisions about how histories are recorded, remembered, and valued. 3. If the nature of my work is such that I am sharing the life stories and/or narratives of individuals in these data, and I can do so with no potential harm to their reputation or that of their ancestors, I will honor them by naming them. If the nature of my work is such that I am exploring large-scale patterns in the dataset, and naming individuals serves no specific research purpose, I will anonymize and/or redact names within the data. 4. If I am publishing the results of research conducted with these data, I will, if possible and appropriate, include a note of recognition and/or gratitude in my publication. We suggest a version of: “This work was made possible in part by the patients, staff, nurses, physicians, and community of the Jewish Consumptive Relief Society (JCRS). The people who lived, worked, and died at the JCRS sought to relieve human suffering. I am grateful to them.”

eng

Open source

AMSMB HTR

0VB0MC

catalogue

Dataset for handwritten text recognition on medieval notarial charters written on parchment (1208-1499). The dataset is comprised of 100 digitized manuscripts (3,369 lines), carefully selected to represent the large variation that is present in the sources, encompassing at least 80 distinct hands and various document types (from sales and inventories to last wills and marriage contracts). Written primarily in Medieval Latin with fragments in Medieval Catalan, these manuscripts exhibit varying stages of preservation and degrees of deterioration.

lat cat

Open source

Ground truth data for printed Devanagari

EGOKEI

catalogue

Ground truth (GT) data (JPG and ALTO XML files) for an OCR model that recognizes printed text in Devanagari script. The GT data was used to train a model on Transkribus with the HTR+ engine. The training was performed on approx. 220 pages with approx. 27,000 words. The validation set was 10% of the training set. The training material is comprised of letterpress printings from the Naval Kishore Press (Lakhnau, North India) from the late 19th and early 20th century in the Hindi, Sanskrit, Braj Bhasha and Awadhi languages. Transcription was performed by Nicole Merkel-Hilf (CATS Library / Heidelberg University Library) with support by Daria Peshcherova (CATS Library / Heidelberg University Library).

hin san bra

Open source

Ground Truth data for printed Malayalam

L2KRZO

catalogue

Ground Truth (GT) data (JPG and ALTO XML files) which can be used to train OCR models that recognize printed text in Malayalam script. The training material is gathered from 19th- and 20th-century prints. The GT data was used to train models in Transkribus with the HTR+ and PyLaia engines, with a resulting CER of 2.29% on the validation set with HTR+ and 3.20% with PyLaia. The training was performed on 43 pages with approx. 9,000 words. The validation set consisted of 5 pages (ca. 1,000 words). Transcription was performed by Tübingen University Library; the Ground Truth data was created by Elena Mucciarelli (University of Groningen) with support and model training by Dorothee Huff (Tübingen University Library). (2022-11-02)

mal

Open source

Incunabula Reichenau

zenodo.11046061

catalogue

This data set contains the training data for the following three published Transkribus models: German Incunabula (Reichenau); Latin Incunabula (Reichenau); Latin/German Bilingual Incunabula (Reichenau). This data set represents an excerpt of a collection of incunabula and post-incunabula of the former Reichenau monastery, now held at the Badische Landesbibliothek in Karlsruhe (see https://digital.blb-karlsruhe.de/topic/view/7530707). As, typically, 1-20 pages were drawn from single prints, it reflects a wide range of typefaces used by early printers from the German language area and Northern Italy. The data was created as part of the project Digitalisierung und Volltexterkennung der ehemals Reichenauer Inkunabeln at the Badische Landesbibliothek, which was funded by the Stiftung Kulturgut Baden-Württemberg.

lat deu

Open source

TranscriboQuest_Arabic_team

zenodo.13757236

catalogue

Dataset on an Arabic corpus of Christian-Islamic theology.

ara

Open source

Ground Truth Set for Handwritten Text Recognition (HTR/OCR): Dresdner Hofdiarium 1665 (Mscr.Dresd.K.80) - 17th century Kurrent manuscript

zenodo.14356190

catalogue

This dataset contains ten pages of Ground Truth from the Dresden Court Diaries of Elector Johann Georg II, as PAGE XML, ALTO XML and JPG.

deu

Open source

ÖNB Cod. Syr. 1, Ground Truth from HTR Winter School 2024

zenodo.14714089

catalogue

Ground truth of 140 folios of ÖNB Cod. Syr. 1. This ground truth was produced by participants of the Vienna 2024 HTR Winter School, who used Transkribus to manually correct a preliminary automatic transcription that had been generated using Kraken/eScriptorium.

syr

Open source

Ground Truth Set for Handwritten Text Recognition (HTR/OCR): Dresdner Hofdiarium 1673 (Mscr.Dresd.K.117) - 17th century Kurrent manuscript

zenodo.15303243

catalogue

Twenty pages of Ground Truth from the "Hofdiarium des Kurfürsten Johann Georgs II. 1673" (SLUB Mscr.Dresd.K.117; https://www.wikidata.org/wiki/Q134220291). The handwriting is a typical late 17th century Saxon kurrent ("Kanzleikurrent"), with occasional words written in bastarda or fraktur-like script. This transcription is part of a larger project regarding the Dresden court diaries. Check https://slub-dresden.academia.edu/StefanBeckert for further updates.

deu

Open source

Ground Truth Set for Handwritten Text Recognition (HTR/OCR): Dresdner Hofdiarium 1653-56 (Mscr.Dresd.K.113) - 17th century Kurrent manuscript

zenodo.15303398

catalogue

Twelve pages of Ground Truth from the "Hofdiarium des Kurfürsten Johann Georgs II. 1653-1656" (SLUB Mscr.Dresd.K.113; https://www.wikidata.org/wiki/Q133883726). The handwriting is a typical late 17th century Saxon kurrent ("Kanzleikurrent"), with occasional words written in bastarda or fraktur-like script. This transcription is part of a larger project regarding the Dresden court diaries. Check https://slub-dresden.academia.edu/StefanBeckert for further updates.

deu

Open source

TranscriboQuest 2025 Medieval Vernacular Religious Texts

zenodo.17062963

catalogue

This dataset was created by a collaborative working group with the aim of transcribing medieval vernacular religious texts across a range of European languages. To reflect the linguistic expertise of the group members, the project included Old and Middle French, Old and Middle Irish, Old Castilian, Old Swedish, and Early New High German (Bavarian). Religious texts were chosen as a common thread because of their wide diffusion in the European vernacular tradition, their high survival rate in manuscripts, and their relevance for the study of medieval cultural and textual practices. The dataset is based on manuscripts preserved in France, Spain, Sweden, Germany, and Ireland, dating from the 11th to the 15th centuries, with a particular concentration in the 15th century. All manuscripts belong to the category of medium to highly decorated literary manuscripts. They are written in clearly identifiable scripts, predominantly in one or two columns; two manuscripts also include marginal texts.

fro frm mga lat osp swe bar

Open source

Transcriptions from medieval manuscripts related to the Augustinian canonesses in Inzigkofen

zenodo.17978574

catalogue

In a project of the Staatsbibliothek zu Berlin, we completely transcribed two manuscripts (Ms. germ. oct. 842, Ms. germ. fol. 1045) related to the Augustinian canonesses in Inzigkofen, dating from the 15th century and containing German texts. They were written by three scribes in three different Gothic scripts.

gmh

Open source

MS Jerusalem, Saint Mark’s Monastery 36, Ground Truth from HTR Winter School 2025

zenodo.18157525

catalogue

Ground truth of 133 bifolio images of MS Jerusalem, Saint Mark’s Monastery 36. This ground truth was produced by participants of the Vienna 2025 HTR Winter School, who used Transkribus to manually correct a preliminary automatic transcription that had been generated using Kraken/eScriptorium.

syr

Open source

Jeu de données OCR - Incunables sévillans 1494-1500

zenodo.3643393

catalogue

The data set corresponds to 60 pages printed in 1494 by Estanislao Polono and Meinardo Ungut in Seville. These pages are taken from the Regimiento de los Prínçipes (also known as 'Glosa castellana al Regimiento de prínçipes'), and the exemplar used is INC/901 of the Biblioteca Nacional de España. The type used for this incunabulum is 97G (Martín Abad and Moyano Andrés, Estanislao Polono, 2002, p. 61). This type was used between 1494 and 1500. For other incunabula produced in this period, see op. cit., pp. 112-121.

spa

Open source

Handwritten Text Recognition Ground Truth Set: StABS Ratsbücher O10, Urfehdenbuch X

zenodo.5153263

catalogue

Ground Truth for "Urfehdenbuch X der Stadt Basel (1563-1569)" at Staatsarchiv Basel-Stadt (StABS).

deu

Open source

Charters and Records of Königsfelden Abbey and Bailiwick (1308-1662)

zenodo.5179361

catalogue

The data set is the publication of the data of the scholarly edition "Urkunden und Akten des Klosters und der Hofmeisterei Königsfelden".

lat deu

Open source

GT and HTR of VOC (Dutch East India Company), WIC (Dutch West India Company) and notarial deeds.

zenodo.6414086

catalogue

6000 ground truth of VOC and notarial deeds and 3,000,000 HTR of VOC, WIC and notarial deeds. The National Archives of the Netherlands and Noord-Hollands Archief conducted a project using the Transkribus HTR (Handwritten Text Recognition) platform. The aim was to semi-automatically transcribe 2 million pages of old Dutch texts. The transcribed archives are 17th and 18th century documents from the Dutch East India Company (VOC) and 19th century notarial deeds from Noord-Hollands Archief and other archives in the provinces. In order to train the HTR software, a team produced transcriptions of approximately 6000 scans. The scans were randomly selected from the dataset and contain hundreds of hands. With these transcriptions a model was trained that can recognize more than 90% of the characters correctly. Transkribus transcribed the 2 million scans automatically using the trained model. Later on, 1 million extra scans concerning the Dutch West India Company (WIC) were transcribed automatically without adding extra ground truth or training. These archives are from the 17th and 18th century. The datasets published in Zenodo contain the ground truth (scans in JPG, transcription in PAGE XML) and the HTR results (in PAGE XML and TXT). See the overview on the Zenodo page. A specification of which archives have been transcribed (both GT and HTR) can be found on the Zenodo page. Scans and inventories of the National Archives are available as open data (https://www.nationaalarchief.nl/onderzoeken/open-data/archiefinventarissen-digitale-objecten-en-scans-van-archieven). Disclaimer: due to the variety of languages used and the bad state of the documents, the HTR results of "1.05.21, Dutch series Guyana" can be of poor quality.

nld

Open source

Dataset for late medieval Castilian text recognition

zenodo.7386489

catalogue

HTR/OCR open access gold corpus for Spanish late medieval sources, based on the allographetic transcription of more than 300 pages of several manuscripts of the Regimiento de los Prínçipes, as well as a first set of general transcription models trained with Kraken and out-of-domain test data. See https://doi.org/10.5281/zenodo.7387376 for a full description of the dataset.

spa

Open source

Klosterneuburg, Stiftsbibl., Cod. 48 - Ground Truth: Initial Release

zenodo.7466927

catalogue

This is ground truth for the vast collection of sermons of Nikolaus von Dinkelsbühl (ca. 1360 to 17th March 1433), translated and reorganised by a German redactor. The collection, from the 15th century, has never been edited until now. It consists of 361 folios of parchment and paper. The text speaks about various topics such as fasting and other religious practices. Being one of the leading intellectuals of his time, Nikolaus von Dinkelsbühl also contributed to the development of the University of Vienna. The manuscript was probably produced in the vicinity of Klosterneuburg in Austria and is still kept there today (Shelfmark: Cod. 48). Data collection and ground truth creation: The edition at hand was produced by an international team of researchers from various fields in the context of the Vienna HTR Winter School 2022 with the help of Transkribus Expert Client. We uploaded the images of the manuscript into the Transkribus platform, applied the line recognition tool and manually copied the transcribed text lines into the recognised line boxes. Various models were trained with the ground truth (20% of the entire codex) created by the team. Images of the Klosterneuburg, Augustiner-Chorherrenstift, Cod. 48 are available at: https://manuscripta.at/diglit/AT5000-48/0001

gmh

Open source

GT4HistCommentLayout: Layout Ground Truth for Historical Commentaries

GT-commentaries-OLR

catalogue

This dataset contains layout annotations for ca. 370 pages sampled from 8 public domain classical commentaries, published in the 19th century in English, German and Latin. The commentaries concern Ancient Greek and Latin works from prose and poetry (caveat: Ancient Greek poetry is slightly over-represented). Pages were annotated according to a taxonomy mapped to the SegmOnto controlled vocabulary.

eng deu lat grc

Open source

iForal-Dataset

catalogue

This dataset was designed for training machine learning models in the context of the iForal project (https://iforal.hypotheses.org/), which focuses on transcribing medieval Portuguese texts, specifically forais (charters). It includes images of medieval manuscripts, along with corresponding line-level transcription labels, to facilitate the development of models capable of recognizing and transcribing historical handwriting. The dataset is ideal for OCR/HTR tasks and segmentation tasks within the domain of medieval document transcription. It serves as a critical resource for advancing automated transcription tools for medieval texts, making historical archives more accessible.

lat por

Open source

Fabliaux

catalogue

HTR data sets from medieval manuscripts (13th-14th c.) collecting "fabliaux" funded by Biblissima+

fro

Open source

Liber

catalogue

HTR datasets of medieval manuscripts (14th-15th c.) with Pierre Bersuire’s translation into Old French of the work of Titus Livius and Nicolas Trevet Commentaries

fro lat

Open source

FoNDUE Spanish chapbooks 19th c. Dataset

FoNDUE-Spanish-chapbooks-Dataset

catalogue

Digital editions of the second part of the Genevan Spanish chapbooks collection (19th c.).

cat spa lat

Open source

EHRI Dataset

ehri-dataset

catalogue

Multilingual dataset from various corpora of the EHRI project

eng ces deu slk hun pol dan

Open source

FONDUE-EN-PRINT-20

catalogue

Various prints (academic, archives, novels…)

eng

Open source

FONDUE-ES-PRINT-19

catalogue

Novels written in Spanish

spa

Open source

FONDUE-FR-MSS-18

catalogue

French manuscripts of the 18th century

fra

Open source

FONDUE-FR-PRINT-16

catalogue

Transcriptions of French 16th c. prints

fra

Open source

FONDUE-FR-PRINT-20

catalogue

French novels

eng

Open source

FONDUE-IT-PRINT-20

catalogue

Archives and novels

ita

Open source

FONDUE-MLT-ART

catalogue

Swiss art exhibitions catalogues

deu

Open source

FONDUE-MLT-CAT

catalogue

Ground truth for 19th/20th-century sale and exhibition catalogues, printed mainly, but not only, in France.

por fra ita

Open source

FoNDUE_Kunsthistorisches-UZH_Archivdatenbank

catalogue

HTR data made with the Kunsthistorisches UZH corpus.

deu fra ita

Open source

HTR Front Justice

htr-front-justice.git

catalogue

Some transcriptions of minute books from military court councils during the First World War

fra

Open source

Données HTR manuscrits du 15e siècle

HTR-MSS-15e-Siecle

catalogue

HTR training corpus composed of French manuscripts of the 15th century.

frm fra

Open source

Données imprimés du 16e siècle

HTR-imprime-16e-siecle

catalogue

HTR training corpus made up of 16th-century prints.

frm fra

Open source

Imprimés 17e siècle

HTR-imprime-17e-siecle

catalogue

HTR training corpus composed of French prints of the 17th century.

frm fra

Open source

Données imprimés du 18e siècle

HTR-imprime-18e-siecle

catalogue

HTR training corpus made up of 18th-century prints.

fra

Open source

Données HTR incunables du 15e siècle

HTR-incunable-15e-siecle

catalogue

HTR training corpus composed of French incunabula of the 15th century.

frm fra

Open source

LiDi1.0-project

catalogue

This repository contains all data relating to the LiDi 1.0 project, in particular the HTR GT for the 16th-century antiquarian Pirro Ligorio, used to create the Transkribus public model Ligorio 0.3 PyL.

ita

Open source

Wien ÖNB, Cod. 940, ff. 13r-142v Ground Truth from HTR Winter School 2024

-2024--carolingian-latin.git

catalogue

An anonymous Irish commentary on the Gospel of Matthew (end of the 8th/early 9th c.), from a manuscript from Salzburg/Saint Amand possibly under bishop Arn (785-821). Files: 2. Lines: 7835. Latin, Carolingian minuscule (9th century).

lat

Open source

CREMMA-AN Testament De Poilus

CREMMA-AN-TestamentDePoilus

catalogue

Wills of WWI Poilus, edited by the Archives nationales during the Testaments de Poilus project.

fra

Open source

CREMMA MSS 16

CREMMA-MSS-16

catalogue

Manuscripts of the 16th century

fra

Open source

CREMMA Manuscrits du 17e

CREMMA-MSS-17

catalogue

Various Manuscripts of the 17th century

fra

Open source

CREMMA Manuscrits du 18e

CREMMA-MSS-18

catalogue

Manuscripts of the 18th century

fra

Open source

CREMMA Manuscrits du 19e

CREMMA-MSS-19

catalogue

Manuscripts of the 19th century

fra

Open source

CREMMA Manuscrits du 20e

CREMMA-MSS-20

catalogue

Manuscripts of the 20th century

fra

Open source

CREMMA Medii Aevi

CREMMA-Medieval-LAT

catalogue

Ground truth for medieval Latin manuscripts. Formerly `CREMMA-Medieval-LAT`.

lat

Open source

CREMMA Early Modern Books

cremma-16-17-print

catalogue

Collection of book samples in early print forms, 16th to 17th century, in Latin and pre-orthographic French.

frm lat

Open source

Cremma Medieval

cremma-medieval

catalogue

Transcription corpora for training HTR models for medieval manuscripts from the 12th to the 15th century.

fra fro

Open source

CREMMA WIKIPEDIA

cremma-wikipedia

catalogue

The CREMMA-WIKIPEDIA project aims at creating a collection of ground truth to train HTR models on contemporary French handwriting. Each image represents an excerpt from a randomly selected Wikipedia page, copied by hand by volunteers. We then took care of the alignment between the handwritten portion and the original text, also present on the image.

fra

Open source

DAHN Corpus

dahncorpus

catalogue

OCR ground truth dataset based on French 20th-century typewritten letters

fra

Open source

Notaires de Paris - Bronod

lectaurep-bronod

catalogue

Ground truth for Maître Bronod’s registers, notary in Paris during the 18th century.

fra

Open source

Notaires de Paris - Mariages et Divorces

lectaurep-mariages-et-divorces

catalogue

Ground truth for the Registres des Contrats de Mariages et des Séparations et Divorces in Paris. The documents were written in French during the 19th century and contain many names and addresses. The information is organized in tables spread across two pages. The table headers and the preamble are printed.

fra

Open source

Notaires de Paris - Répertoires

lectaurep-repertoires

catalogue

Ground truth for various Parisian registries of notary deeds written in French during the 19th century. The information is organized in pre-printed tables (with printed headers) and contains many names, addresses, numbers and abbreviations.

fra

Open source

Tapus Corpus

tapuscorpus

catalogue

Ground truth based on a variety of French typewritten documents from the 20th century. Contains excerpts of plays, poems, letters and administrative reports.

fra

Open source

TIMEUS Corpus

timeuscorpus

catalogue

Ground-Truth for French 19th century pre-printed documents created by administrative services.

fra

Open source

HTRomance, Medieval French corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation

medieval-french

catalogue

Ground truth of Old French and Middle French manuscripts. Manuscripts vary in themes, period, etc. Most manuscripts have at most 10 columns transcribed.

fro

Open source

HTRomance, Medieval Italian corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation

medieval-italian

catalogue

Transcription of samples of Medieval Italian manuscripts

ita vec

Open source

HTRomance, Medieval Latin corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation

medieval-latin

catalogue

Ground truth of Latin medieval manuscripts. Manuscripts vary in themes, period, etc. Most manuscripts have at most 10 columns transcribed.

lat

Open source

HTRomance, Medieval Spain corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation

middle-ages-in-spain

catalogue

Ground truth of medieval manuscripts from Spain. Manuscripts vary in themes, period, etc. Most manuscripts have at most 10 columns transcribed.

lat

Open source

Corpus Modern Roman Languages

modern-roman-languages

catalogue

Dataset for modern Romance languages created within the context of the HTRomance project, using manuscripts from the Gallica digital library.

fra

Open source

Jeu de données HTR « Titres nobiliaires 17e-18e siècles »

main

catalogue

This dataset for automatic handwriting recognition consists of a mix of transcriptions of 17th-18th-century documents (marriage records, proofs of nobility, etc.), mostly in French, drawn from series M, title III "Titres nobiliaires" of the Archives nationales de France.

fra

Open source

Antoine Verard extracts

Verard-corpus

catalogue

Parts of Antoine Vérard’s editions princeps of "Tristan", "Merlin" and "Gyron le Courtoys".

frm

Open source

gt_structure_text

catalogue

The OCR-D Ground Truth text and structure corpus was created between 2015 and 2017. In the years since 2017, this corpus has been further curated and supplemented with metadata where appropriate. The corpus includes PAGE XML files with annotations of the text and structure. The data is based on transcription data stored in the German Text Archive (DTA) (https://www.deutschestextarchiv.de/).

eng fra deu heb lat

Open source

De la généalogie des dieux

HN2021-Boccace

catalogue

This repository hosts all the documents, including transcriptions, bibliographical references and introduction, that serve the team Boccace for the validation of the course "Bonnes pratiques du developpement collaboratif : initiation à Git" (prof. Thibault Clérice), in the first semester of the Master Humanités Numériques ENC-PSL 2021-2022. At the same time, it constitutes part of the biannual project "Per un’edizione digitale della Genealogia deorum gentilium di Boccaccio" (dir. F. Duval, M. Maulu). Financed in 2021, this project plans to publish online, in XML format, the unpublished Middle French translation entitled "De la genealogie des dieux".

frm lat

Open source

Chateau de Chavigny

HN2021-ChateauChavigny

catalogue

The document we are working on concerns the Château de Chavigny in Lerné, Touraine. In the 16th century, the château belonged to the family of the Leroy lords. Before 1568, in the midst of the Wars of Religion, François Leroy, of the party of the king and the Catholics, took part in the capture and ransom of the Prince of Condé, of the Protestant party. In 1568, François Leroy, as captain of 50 lances in the king's service, went on campaign with him. The objective is to transcribe five leaves of a manuscript using eScriptorium, with the aim of learning to use Git and GitHub to carry out our first collaborative project.

frm

Open source

Maxime Kovalewsky - Coutume contemporaine et loi ancienne: droit coutumier ossétien

HN2021-Kovalewsky-1893

catalogue

We chose to transcribe the second chapter of Maxime Kovalewsky's work: Coutume contemporaine et loi ancienne : droit coutumier ossétien, éclairé par l’histoire comparée. Paris, L. Larose, 1893.

fra

Open source

Memorials for Jane Lathrop Stanford

HN2021-Memorials_Jane_Lathrop_Stanford

catalogue

The source data were uploaded to the From the page site by the Stanford University Archives, which own them. They were then transcribed by anonymous volunteers; their work served as our basis for correcting our own transcriptions. The source documents chosen are letters by various authors concerning the funeral of Jane Lathrop Stanford. The letters selected were letters 42, 43, 46, 49, 50, 54, 57 to 60, 69, 75, 76 [section 1, transcribed by Perrine MAUREL]; 80 to 93 [section 2, transcribed by Ingrid GUIMARÃES]; 241 to 242 [section 3, transcribed by Yagmur OZTURK].

eng

Open source

OCR Corse

HN2021-OCR-Poesie-Corse

catalogue

The first work, entitled *Pontenôvu*, was written by Petru Rocca and published by the "Stamparia di a Muvra" in 1927. It is a collection of poems in Corsican and French on varied themes. *A Muvra* is a Corsican autonomist newspaper of Maurrassian influence that existed throughout the interwar period. Although it presented itself as a cultural review, its political dimension (embodied by the PCA, or Partitu corsu d'azione) made it a controversial movement. This collection belongs to that context of political struggle and Corsican cultural awakening. The second work is entitled *A nostra Santa Fede - Catechismu Corsu*, written by Ageniu Grimaldi in 1926 under the pseudonym Saveriu Malaspina. Close to Petru Rocca, he was one of the theorists of interwar Corsican autonomism and a faithful Muvrist. The work discusses, in particular, how a true Corsican should conduct himself with regard to his faith in God and his island. Although it is not really a collection of poems, its writing style is particularly interesting: it adopts a style close to that of biblical writings.

cos fra

Open source

Argus des Brevets

TNAH-2021-ArgusDesBrevets

catalogue

The 1910 Argus des brevets takes the form of a contemporary print, organised into sections that group the patents filed in France chronologically and then thematically. This enumeration and brief presentation of the patents is laid out in two columns and uses standardised abbreviations. This contribution guide therefore sets out all the transcription standards adopted during this transcription project, carried out on the eScriptorium platform as part of the Git course of the TNAH master's programme at the ENC.

fra

Open source

DecameronFR

TNAH-2021-DecameronFR

catalogue

The project aims to build ground truth for training HTR models from a French manuscript dating from 1430-1455: manuscript 5070 of the Bibliothèque de l'Arsenal (reproduced on Gallica). This manuscript contains the French translation of Boccaccio's Decameron by Laurent de Premierfait. Our ground truth covers the description of the plague in Florence found in the prologue of the work.

frm

Open source

Projet Exposition universelle de 1878

TNAH-2021-Expositions_Universelles

catalogue

The Congrès international des sciences ethnographiques of 1878 was held on the occasion of the 1878 Exposition universelle in Paris. Published in 1881 by the Imprimerie nationale, the proceedings of this congress were made available by the Conservatoire numérique des Arts et Métiers.

fra

Open source

Projet Correspondance Berlioz

TNAH-2021-Projet-Correspondance-Berlioz

catalogue

We chose to work on Hector Berlioz's outgoing correspondence addressed to his sister Anne-Marguerite "Nanci" Berlioz. The full set of letters addressed to Nanci Berlioz was too large for our project, so for the sake of consistency we selected them in chronological order (see the management table for the exact list of letters transcribed).

fra

Open source

Projet Notre-Dame

TNAH-2021-Projet-Notre-Dame

catalogue

The Projet Notre-Dame is a transcription of the daily logs for the year 1860 (https://mediatheque-patrimoine.culture.gouv.fr/sites/mediatheque/files/jnd_1860.pdf) of the restoration work carried out from 1844 to 1865 at Notre-Dame de Paris cathedral under the direction of Eugène Viollet-le-Duc and Jean-Baptiste Lassus. The transcription was made on eScriptorium from the digitization of the works logs (https://mediatheque-patrimoine.culture.gouv.fr/travaux-de-notre-dame-de-paris-1844-1865) produced by the Médiathèque de l'architecture et du patrimoine.

fra

Open source

FoNDUE-GasparoSardiToponomasia-Dataset

HTR

catalogue

Dataset produced for the project to edit Gasparo Sardi’s Toponomasia from codex 174 of the Burgerbibliothek of Bern. Images are available on request by writing to: pauline.jacsont [ at ] unige.ch.

lat

Open source

Recensement Valaisan (Valais Time Machine)

valais-recensement

catalogue

Set of census forms.

fra deu

Open source

HTR - Araucania manuscript XIX

HTR_Araucania_XIX

catalogue

Ground Truth dataset for Spanish 19th-century typewritten OCR. The archives come from the events of the Occupation of Araucania (1850-1881) in Chile. They are archived in the 'Colección manuscritos' of the Archivo Central Andres Bello - Universidad de Chile.

spa

Open source

HTR-SETAF-Jean-Michel

catalogue

OCR data for the SETAF project, 16th-century French prints in Gothic characters.

fra

Open source

HTR-SETAF-LesFaictzJCH

catalogue

OCR data for the SETAF project, 16th-century French prints in Gothic and Roman characters.

fra

Open source

HTR-SETAF-Pierre-de-Vingle

catalogue

OCR data for the SETAF project, 16th-century French prints in Gothic characters.

fra

Open source

GT Celestine Doniau-Danest

dataset-celestine-doniau-danest

catalogue

Ground truth dataset for automatic text recognition created with eScriptorium during the HNU2000 course at the Université de Montréal in the Fall 2024 semester. The dataset contains pages taken at random from the digitization of the "Journal de Célestine Doniau-Danest sur les débuts de la Guerre 1914-1918" (Diary of Célestine Doniau-Danest on the beginning of the 1914-1918 war), published online by the Archives départementales de la Somme.

fra

Open source

Moonshines

moonshines

catalogue

This dataset is composed of pages of text written in 2023 by a single person, copying texts taken from Guillaume Apollinaire's poems published in Alcools, and taken from Guillaume Apollinaire's Wikipedia page.

fra

Open source

Peraire Ground Truth

peraire-ground-truth

catalogue

Ground Truth for the Digital Peraire project.

fra

Open source

Copiste d’un jour

Copiste-d-un-jour

catalogue

This project draws inspiration from the CREMMA WIKIPEDIA data set, with the objective of creating a ground truth repository of contemporary Québécois handwriting to train HTR models. It is based on a collection of randomly selected Wikipedia summaries. Each text comprises between 125 and 175 words and was copied by hand by volunteers. The texts were ordered so as to prioritize those that presented rare character 1- and 2-grams. Non-French characters were replaced with "-". In general, the copy of one text took between 1 and 2 pages. In total, 267 volunteers copied 265 texts (2 texts were unfortunately copied twice by two different volunteers). We took care of the alignment between the handwritten portion and the original text.

fra

Open source

CHI-KNOW-PO CORPUS

chi-know-po

catalogue

HTR ground-truth of the CHI-KNOW-PO project (Collex-Persée), that aimed to digitize a corpus of belletristic anthologies, scholarly collections, dictionaries and encyclopedias from the Chinese medieval period (ca. 200-1000) and to process them using HTR.

lzh

Open source

RASAM 2

rasam-dataset

catalogue

The dataset is made up of 250 images, with their related ground truth stored in an XML file (PageXML format). Images come from fifteen manuscripts selected among the collections of the BULAC Library (Paris), in Maghrebi Arabic. It extends RASAM 1 by covering a very wide variety of hands, text density, and cursiveness. This dataset is the result of a collaborative transcription. All the participants are credited on the official deposit. With the support of the French Ministry of Higher Education, Research and Innovation, the Research Consortium Middle-East and Muslim Worlds (GIS MOMM), Calfa and the BULAC library.

ara

Open source

TariMa

tarima

catalogue

The dataset has been collated within the frame of the TariMa project (Tarih al-Maghrib. Writing History in the Maghreb in the modern and contemporary era), sponsored by the French agency Collex-Persée and led by Antoine Perrier (CNRS). It comprises different image resolutions and sizes (width from 982px to 8049px), different layouts (double page, multiple columns), and different states of conservation. It also mixes microfilms, scans and lithographs. It presents a very wide variety representative of Maghrebi Arabic production.

ara

Open source

OCR17plus

catalogue

Classical-period printed works

frm

Open source

GenAuto TD Corpus

genauto-td-htr.git

catalogue

150 transcribed images from the "Tables Décennales" of the French Civil Registry. They come from the municipalities of Sermaises and Romilly-sur-Seine.

fra

Open source

Joseph Hooker HTR

JosephHookerHTR.git

catalogue

XML transcriptions and JPEG images exported from Transkribus as ground truth for an eScriptorium-Kraken HTR model (CER 11-12%) trained on the correspondence of Joseph Dalton Hooker (1817-1911), primarily letters to William Turner Thiselton-Dyer (1843-1928) during the late-19th/early-20th century. Many transcriptions in this dataset were generated by a small team of anonymous volunteers as part of the Joseph Hooker Correspondence Project based at Kew Gardens. All images in this dataset are reproduced with the kind permission of the Board of Trustees of the Royal Botanic Gardens Kew (© RBG, Kew). Contact archives@kew.org for more information. HTR Model: Schaefer, John, & Litvine, Alexis. (2023). Joseph Hooker HTR Model. Zenodo. https://doi.org/10.5281/zenodo.8038689

eng

Open source

NuBIS-OCR

catalogue

Ground truth dataset for a selection of printed books from NuBIS, the digital library of the Bibliothèque Interuniversitaire de la Sorbonne.

fra lat

Open source

Eutyches

catalogue

Ground truth for Caroline minuscule of the late 9th century from the grammatical work "de uerbo" of Eutyches.

lat grc

Open source

Burchards Dekret Digital (BDD) Segmentation Data

bdd-segmentation-data

catalogue

This dataset comprises PageXML for training segmentation models in Transkribus and Kraken. It is designed to capture the specific layout of medieval canon law collections. Compiled from several 11th-century manuscripts of the Decretum Burchardi, it supports the ongoing edition project Burchards Dekret Digital. Annotations are tailored to project-specific needs but can be adapted for other use cases. The data was first prepared using Transkribus and then remasked in eScriptorium for usage in Kraken.

lat

Open source

Shakespeare-Scott translations

ocr-data

catalogue

Ground truth data in German and English of Shakespeare and Scott prints in original and different translations.

eng deu

Open source

Paris Bible Project (PBP)

ground_truth

catalogue

The Paris Bible Project aims to understand the production and diffusion of medieval Latin Bibles in Europe. The dataset includes ground truth from Paris Bibles produced in the 13th and 14th centuries. We also provide the most recent version of our list of Paris Bible manuscripts found in the world along with information about them.

lat

Open source

Bullinger HTR Dataset

bullinger-htr

catalogue

This dataset contains 165,673 image and corresponding text line files (.png for images and .txt for the texts) in a random 80/10/10 training, validation and test set split. The source is the extensive correspondence of Swiss reformer Heinrich Bullinger (1504-1575) and his over 800 different correspondents. It therefore contains great variety in handwriting styles. Furthermore, it is multilingual since there are Latin and Early New High German (and sometimes mixed) letters. The data is split into Latin and Early New High German (determined with langid) and put into separate folders (de for Early New High German and la for Latin).

lat deu

Open source

Caroline Minuscule by Rescribe

carolineminuscule-groundtruth

catalogue

This ground truth repository is a work in progress; it currently accounts for a part of our complete Caroline Minuscule training pool of around 70 manuscripts used for our OCRopus Caroline Minuscule model (see ocropus-models repository).

lat

Open source

Éditer la correspondance de Constance de Salm (1767-1845)

verite-terrain

catalogue

The correspondence of Constance de Salm (a French woman of letters) includes various specimens of early 19th-century handwriting. The dataset documents the hands of four different copyists.

fra

Open source

The Sloane Lab HTR Model

HTR-Model

catalogue

This repository contains Handwritten Text Recognition training data (layout segmentation and transcriptions) for the Sloane Lab HTR model. The HTR model is trained on the handwriting of Hans Sloane (1660-1753). Funding: Enlightenment Architectures: Leverhulme Trust Project Grant 2016-21; The Sloane Lab: Towards a National Collection – AHRC AH/W003457/1.

eng

Open source

EpiSearch HTR

episearch-htr

catalogue

Ground Truth for Astori’s letters (see the README.md file for details)

ita

Open source

HPGTR Dataset

hpgtr

catalogue

The HPGT dataset consists of images of Handwritten Paleographic Greek Text, derived from the Bodleian Libraries' Greek manuscript collection, specifically the Barocci collection, which dates from the 8th to the 17th centuries. This dataset is divided into two editions: HPGTR.N, which contains 77 unsegmented images categorized by century from the 10th to the 16th, and HPGTR.S, which features carefully segmented lines from selected images to facilitate machine learning tasks. The dataset captures a range of characteristics, including variations in writing style, page conditions, and manuscript production details. This dataset is part of the following work: Paraskevi Platanou, John Pavlopoulos, and Georgios Papaioannou. 2022. Handwritten Paleographic Greek Text Recognition: A Century-Based Approach. In *Proceedings of the "Thirteenth Language Resources and Evaluation Conference"*, pages 6585–6589, Marseille, France. European Language Resources Association.

grc

Open source

Ground truth for the Palatine Anthology (HTR_CPgr23)

htr_cpgr23

catalogue

Ground Truth dataset for the Codex palatinus graecus 23 (Palatine Anthology), Byzantine writing from the 10th century.

grc

Open source

La Correspondances Jacques Doucet - René Jean

LaCorrespondanceDoucetReneJean

catalogue

Project undertaken within the programme La Bibliothèque d’art et d’archéologie de Jacques Doucet : corpus, savoirs et réseaux of the Institut national d’histoire de l’art, based on a corpus of letters and documents held in the Département des manuscrits of the Bibliothèque nationale de France under shelfmark NAF 13124, one of the main sources on the relationship between Doucet and René Jean, whom he hired as librarian on 2 June 1908.

fra

Open source

Les Papiers Barye

LesPapiersBarye

catalogue

Set of documents about the sculptor Antoine-Louis Barye. Paris, Library of the Institut national d'histoire de l'art, Jacques Doucet collections, Archives 166. National Institute of Art History (INHA).

fra

Open source

Les carnets de fouilles manuscrits de Jean-Jacques Hatt: verité de terrain pour un modèle HTR

nkl.116al580

catalogue

This dataset contains the transcription and segmentation of 107 pages selected from the field notebooks of the archaeologist Jean-Jacques Hatt. These operations were carried out with the eScriptorium software (INRIA instance). The corpus contains text-image pairs in ALTO XML and JPG format. The sample was chosen to favour pages with: - standard text lines, - left-to-right writing, - a minimum of graphic insertions. Some pages nevertheless include graphic zones, identified by manual segmentation. These transcriptions were used to train a specific model adapted to Jean-Jacques Hatt's handwriting. The documents cover the period 1942-1944.

fra

Open source

DISTINGUO : Ground truth for Handwritten Text Recognition (HTR) on Collections of Distinctions (late 13th to late 15th century)

nkl.48ad8b8d

catalogue

This dataset contains normalized transcriptions of collections of distinctions, specifically "Summa de abstinentia" by Nicolas of Biard and "Dictionarium bovis" by Thomas of Pavia. They were prepared as part of the DISTINGUO project, dedicated to the study of distinctiones in medieval Latin preaching and led by Marjorie Burghart in 2019-2024.

lat

Open source

TranscriboQuest 2025: Medieval Latin

zenodo.17062009

catalogue

Dataset from TranscriboQuest 2025, Medieval Latin group. This dataset focuses on layout. All manuscripts are glossed Latin manuscripts with complex layouts. The dataset contains 5000 typed lines, 700 of which have been transcribed.

lat

Open source

Ground truth for Neue Zürcher Zeitung black letter period

3333627#.YhN1G1vMLUQ

catalogue

The Neue Zürcher Zeitung (NZZ) was published in black letter from its very first issue in 1780 until 1947. From this time period, we randomly sampled one frontpage per year, resulting in a total of 167 pages. We chose frontpages because they typically contain highly relevant material and because we wanted to make sure not to sample pages containing exclusively advertisements or stock information. During certain periods, the NZZ was published several times a day, and there were supplements, too. Due to incomplete metadata, the sampling included frontpages from supplements. We then manually corrected the pages, so they can be used as ground truth to improve the OCR of black letter in historical newspapers.

deu

Open source

Gwalther Handwriting Ground Truth

4780947#.YhN5pVvMLUQ

catalogue

This is ground truth for Rudolph Gwalther’s (1519-1586) handwriting taken from his book "Lateinische Gedichte", where he accumulated writings between 1540 and 1580. Data collection and ground truth creation: At the time we collected the data, we found 150 images with corresponding transcriptions by Peter Stotz on e-manuscripta (reference: Gwalther, Rudolf: Lateinische Gedichte. Zürich, 1540-1580. Zentralbibliothek Zürich, Ms D 152, https://doi.org/10.7891/e-manuscripta-26750 / Public Domain Mark). We removed 8 images with too many corrections or vertical texts. Next, we uploaded the images into the Transkribus platform, applied the line recognition tool and manually copied the transcribed text lines into the recognised line boxes. During this process, we made some corrections, which were mainly due to inconsistencies in punctuation and capitalised letters.

lat

Open source

BiblIA

5167263

catalogue

This dataset for Handwritten Text Recognition includes layout segmentation (regions, toplines and linepolygons) and unicode-transcriptions in alto 4.2 XML for 202 images of Medieval Hebrew manuscripts from the Bibliothèque nationale de France (BnF, National Library of France) and the Biblioteca Apostolica Vaticana (BAV, Vatican Library) corresponding to the article "BiblIA - a General Model for Medieval Hebrew Manuscripts and an Open Annotated Dataset" by Daniel Stökl Ben Ezra, Bronson Brown-DeVost, Pawel Jablonski, Benjamin Kiessling, Elena Lolli, and Hayim Lapin, published in HIP@ICDAR 2021 held in Lausanne, September 2021.

heb

Open source

The POPP datasets

6581158

catalogue

The POPP datasets are a set of 3 datasets created within the POPP project (Project for the Oceration of the Paris Population Census) for the task of handwritten text recognition. These datasets have been published in "Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census" at DAS 2022. The 3 datasets are called “Generic dataset”, “Belleville”, and “Chaussée d’Antin” and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, thus each page in these datasets corresponds to 30 lines.

fra

Open source

Wien ÖNB Cod. 2160 f. 164-184 Ground Truth from HTR Winter School 2022

7467027#.Y6LRj3bMK3B

catalogue

This is Ground Truth data created during the HTR Winter School 2022 for the Cod. 2160 ÖNB, which contains one version of the so-called Lex Dei.

lat

Open source

Padeřov-Bible-handwriting-ground-truth

7467034#.Y6LQZBWZM2w

catalogue

This is ground truth based on the Padeřov Bible (Vienna, Austrian National Library, shelfmark Cod. 1175, 1432–1435), the bible of the third redaction of the Old Czech Bible translation. The transcription rules were based on semi-diplomatic transcription rules set by PERO OCR and Směrnice pro vydávání starších českých textů set by Jiří Daňhelka (https://vokabular.ujc.cas.cz/moduly/edicnipoznamka.aspx?id=DanhelkaSmernice). Abbreviations were tagged and expanded.

ces

Open source

Belfort

8041668

catalogue

This dataset includes minutes of Belfort municipal council drawn up between 1790 and 1946. Documents include deliberations, lists of councillors, convocations, and agendas. The dataset includes 24,105 text-line images that were automatically detected from pages. Up to four transcriptions are available for each line image: two from human annotators (in `Transcriptions/callico_1/` and `Transcriptions/callico_2/`) and two from automatic models (in `Transcriptions/dan/` and `Transcriptions/pylaia/`).

fra

Open source

ARletta

11191457

catalogue

Open-source handwritten text recognition models for historic Dutch

nld fra

Open source

The Revolutionary City Corpus (1758-1805): Ground Truth for Handwritten Text Recognition (HTR) for 18th Century Documents in English

15776323

catalogue

The Revolutionary City is a partnership between the American Philosophical Society, the Historical Society of Pennsylvania, and the Library Company of Philadelphia to digitize all manuscript material related to Philadelphia and the American Revolution (1763-1804). This dataset is a transcribed subset of the larger digitized corpus, provided in ALTO format with an intended use in training Handwritten Text Recognition (HTR) models. The material is overwhelmingly in English, though a few letters in French have been included. The material contains a mixture of correspondence and journals. The correspondence has been annotated to distinguish between the different parts of a letter (Salutation, Date and Address, Addressee, Address, Closing, Postscript). The transcriptions were produced by staff and interns at the American Philosophical Society. Each document was reviewed at least once by another transcriber. The corpus exhibits a wide variety of hands, handwriting styles, paper quality and levels of damage. The corpus encompasses material from 1758 to 1805, but the majority of the documents fall between the years 1774 and 1783.

eng fra

Open source

EPARCHOS

4095301

catalogue

The dataset originates from a Greek handwritten codex that dates from around 1500-1530. This is the subset of the codex British Museum Addit. 6791, written by two hands, one by Antonius Eparchos and the other by Camillos Zanettus (ff. 104r-174v) and delivers texts by Hierocles (In Aureum carmen), Matthaeus Blastares (Collectio alphabetica) and, notably, texts by Michael Psellos (De omnifaria doctrina). The writing delivers the most important abbreviations, logograms and conjunctions, which are cited in virtually every Greek minuscule handwritten codex from the years of the manuscript transliteration and the prevalence of the minuscule script (9th century) to the post-Byzantine years. This dataset consists of 120 scanned handwritten text pages, containing 9285 lines of text, 18809 words (6787 unique words). For each page, a PageXML is provided containing the following ground truth: 1. Text region polygon coordinates 2. Text line polygon coordinates with the corresponding transcription text 3. Word polygon coordinates with the corresponding transcription text

grc

Open source

Stavronikita Monastery Collection No. 79

5578136

catalogue

It comprises manuscripts made of paper, written in the 16th century and its dimensions are 220x165 mm. The manuscript is embellished with epititles and red initials. Tachygraphical symbols and abbreviations are encountered in the manuscript as well. The dataset of ΧΦ79 consists of 803 lines of text containing 4389 words (2069 unique words) that are distributed over 40 scanned handwritten text pages. For each page, a PageXML is provided containing the following ground truth: 1. Text region polygon coordinates 2. Text line polygon coordinates with the corresponding transcription text 3. Word polygon coordinates with the corresponding transcription text

grc

Open source

Stavronikita Monastery Collection No. 114

5578251

catalogue

It comprises manuscripts made of paper, written at the end of the 15th century and its dimensions are 218x150 mm. In various pages, we find red initials and epititles which enrich the manuscript’s decoration. The dataset of ΧΦ114 consists of 1051 lines of text containing 5467 words (2877 unique words) that are distributed over 44 scanned handwritten text pages. For each page, a PageXML is provided containing the following ground truth: 1. Text region polygon coordinates 2. Text line polygon coordinates with the corresponding transcription text 3. Word polygon coordinates with the corresponding transcription text

grc

Open source

Stavronikita Monastery Collection No. 53

5595669

catalogue

The collection is one of the oldest of the Stavronikita Monastery on Mount Athos. It is a parchment four-gospel manuscript written between 1301 and 1350. It comprises 54 pages with dimensions of approximately 250x185 mm. The script is elegant minuscule and the use of majuscule letters is rare. Tachygraphical symbols and abbreviations are encountered in the manuscript as well. Furthermore, the manuscript is enriched with chrysography, elegant epititles and initials. The dataset of ΧΦ53 consists of 1038 lines of text containing 5592 words (2374 unique words) that are distributed over 54 scanned handwritten text pages.

grc

Open source

Ground-Truthed Data Set of Zenon Papyri for Handwritten Text Recognition

6565706

catalogue

Diplomatic transcription of papyri found in the Zenon archive [see en.wikipedia.org/wiki/Zenon_of_Kaunos]. Manually prepared as PageXML with Transkribus within the D-Scribes project.

grc

Open source

ANR e-NDP Ground Truth

7575693

catalogue

This repository hosts HTR ground truth created within the context of the ANR e-NDP project. This dataset is based on 512 pages from the 26 registers of the Notre-Dame de Paris cathedral chapter. The volumes containing the chapter conclusions were conceived to serve as memorial records, but above all as documents for regular use and consultation in the daily practice of administration and management. The registers were written in a cursive script (ca. late 13th - 16th century) and their content is mainly in Latin, the rest in French. There are no fewer than 18 hands in these pages. The transcriptions were manually completed in two rounds by a group of 12 contributors, including historians and paleographers, over the course of 2021-2022 using eScriptorium.

fra lat

Open source