This dataset was created in the context of TranscriboQuest 2024 (Medieval Literary Team) held in Lyon (11/09/2024-13/09/2024). We opted to focus on medieval scientific documents that are damaged, in several different languages. The result is 808 lines transcribed by experts in the field. The dataset contains the images of the manuscripts and ALTO-XMLs.
The Ground Truth was produced by the participants of the HTR Winter School 2022 in the Late Latin Group (more information: https://www.oeaw.ac.at/imafo/veranstaltungen/detail/introduction-into-handwritten-text-recognition). The Ground Truth includes the following folios: 1-3r, 6-8, 11r, 27 and is still a work in progress; more pages will be added soon. If you find any errors, please contact Jan Odstrčilík (jan.odstrcilik@oeaw.ac.at). Supervisors of the Late Latin Group: Jan Odstrčilík PhD, Austrian Academy of Sciences; Daniela Mairhofer PhD, Princeton University; Tobias Hodel PhD, University of Bern.
Annuaire des propriétaires et des propriétés de Paris et du département de la Seine (directory of property owners and properties of Paris and the Seine department). Link in the BnF catalogue: https://catalogue.bnf.fr/ark:/12148/cb32697229h. Credits: Bibliothèque nationale de France. Ground truth data resulting from the manual transcription and segmentation of a sample of 169 pages from the 1898 and 1923 volumes of the directory. An HTR+ transcription model was trained on this sample with Transkribus and is publicly available on that platform. The model is suitable for automatically transcribing the 1903 and 1913 volumes and any other printed document set in two columns in the Latin alphabet, particularly in French. The sample was selected alphabetically, since that is how the information is organized in this document. The curly braces present in the document were not segmented. 118 pages were used for training and 51 pages for validation. Context and funding: DAHN grant (Dispositif de soutien à l'archivistique et aux humanités numériques) from the MESRI. Teams: Consortium Paris Time Machine - TGIR Humanum EHESS / CNRS / LATTICE / INRIA. Contact for requests to anonymize personal names: carmen.brando@ehess.fr.
Training and validation set. Transcribed records available upon request. The transcribed corpus of records from the Jewish Consumptive Relief Society contains data that include individually identifiable health information, among other sensitive information regarding persons and people. All individuals for whom records are provided have been deceased for at least 70 years, but were they still living today, these records would be recognized as being protected health information under the US Health Insurance Portability and Accountability Act of 1996 (HIPAA). While HIPAA and other privacy laws no longer apply to these individuals, in providing these data the University of Denver wishes to foster research practices that express the utmost respect for the human beings whose lives are represented, at least in some part, in these collections. In addition, we ask that researchers respect the lives of these individuals’ ancestors and their communities. To foster practices that honor patients, staff, nurses and physicians connected with the JCRS Sanatorium, as well as their families, ancestors and communities, we ask that researchers disclose their intended use of the collection for review by our Advisory Board (see reverse). This Board is comprised of ethicists, historians, librarians, attorneys, physicians, and members of the Jewish community. In addition, we ask that researchers agree to conduct their work under the following set of principles: 1. I affirm the role of JCRS patients and staff as data creators and will avoid exploiting and/or dehumanizing them by treating them simply as data. 2. My research will, when possible and appropriate, account for the contexts surrounding the JCRS subjects as data arise. My work will recognize that all data and datasets are shaped by decisions about how histories are recorded, remembered, and valued. 3.
If the nature of my work is such that I am sharing the life stories and/or narratives of individuals in these data, and I can do so with no potential harm to their reputation or that of their ancestors, I will honor them by naming them. If the nature of my work is such that I am exploring large-scale patterns in the dataset, and naming individuals serves no specific research purpose, I will anonymize and/or redact names within the data. 4. If I am publishing the results of research conducted with these data, I will, if possible and appropriate, include a note of recognition and/or gratitude in my publication. We suggest a version of: “This work was made possible in part by the patients, staff, nurses, physicians, and community of the Jewish Consumptive Relief Society (JCRS). The people who lived, worked, and died at the JCRS sought to relieve human suffering. I am grateful to them.”
Dataset for handwritten text recognition on medieval notarial charters written on parchment (1208-1499). The dataset is comprised of 100 digitized manuscripts (3,369 lines), carefully selected to represent the large variation that is present in the sources, encompassing at least 80 distinct hands and various document types (from sales and inventories to last wills and marriage contracts). Written primarily in Medieval Latin with fragments in Medieval Catalan, these manuscripts exhibit varying stages of preservation and degrees of deterioration.
Ground truth (GT) data (JPG and ALTO XML files) for an OCR model that recognizes printed text in Devanagari script. The model was trained on this GT data in Transkribus with the HTR+ engine. The training was performed on approx. 220 pages with approx. 27,000 words. The validation set was 10% of the training set. The training material is comprised of letterpress printings from the Naval Kishore Press (Lakhnau, North India) from the late 19th and early 20th century in the Hindi, Sanskrit, Braj Bhasha and Awadhi languages. Transcription was performed by Nicole Merkel-Hilf (CATS Library / Heidelberg University Library) with support by Daria Peshcherova (CATS Library / Heidelberg University Library).
Ground Truth (GT) data (JPG and ALTO XML files) which can be used to train OCR models that recognize printed text in Malayalam script. The training material is gathered from 19th- and 20th-century prints. Models were trained on the GT data in Transkribus with the HTR+ and PyLaia engines, with a resulting CER on the validation set of 2.29% with HTR+ and 3.20% with PyLaia. The training was performed on 43 pages with approx. 9,000 words. The validation set consisted of 5 pages (ca. 1,000 words). Transcription was performed by Tübingen University Library; the Ground Truth data was created by Elena Mucciarelli (University of Groningen) with support and model training by Dorothee Huff (Tübingen University Library). (2022-11-02)
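The CER values quoted for the Malayalam models are character error rates. As a rough illustration, CER is commonly computed as the Levenshtein edit distance between the reference transcription and the model output, divided by the length of the reference. The sketch below assumes that definition; the function names are hypothetical and this is not the Transkribus implementation, which may normalize text differently.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a CER of 2.29% corresponds to roughly 2 to 3 character edits per 100 reference characters.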
This data set contains the training data for the following three published Transkribus models: German Incunabula (Reichenau), Latin Incunabula (Reichenau), and Latin/German Bilingual Incunabula (Reichenau). This data set represents an excerpt of a collection of incunabula and post-incunabula of the former Reichenau monastery, now held at the Badische Landesbibliothek in Karlsruhe (see https://digital.blb-karlsruhe.de/topic/view/7530707). As, typically, 1-20 pages were drawn from single prints, it reflects a wide range of typefaces used by early printers from the German language area and Northern Italy. The data was created as part of the project Digitalisierung und Volltexterkennung der ehemals Reichenauer Inkunabeln at the Badische Landesbibliothek, which was funded by the Stiftung Kulturgut Baden-Württemberg.
Dataset on an Arabic corpus of Christian-Islamic theology.
This dataset contains ten pages of Ground Truth from the Dresden Court Diaries of Elector Johann Georg II, as PAGE XML, ALTO XML and JPG.
Ground truth of 140 folios of ÖNB Cod. Syr. 1. This ground truth was produced by participants of the Vienna 2024 HTR Winter School, who used Transkribus to manually correct a preliminary automatic transcription that had been generated using Kraken/eScriptorium.
Twenty pages of Ground Truth from the "Hofdiarium des Kurfürsten Johann Georgs II. 1673" (SLUB Mscr.Dresd.K.117; https://www.wikidata.org/wiki/Q134220291). The handwriting is a typical late 17th century Saxon kurrent ("Kanzleikurrent"), with occasional words written in bastarda or fraktur-like script. This transcription is part of a larger project regarding the Dresden court diaries. Check https://slub-dresden.academia.edu/StefanBeckert for further updates.
Twelve pages of Ground Truth from the "Hofdiarium des Kurfürsten Johann Georgs II. 1653-1656" (SLUB Mscr.Dresd.K113; https://www.wikidata.org/wiki/Q133883726). The handwriting is a typical late 17th century Saxon kurrent ("Kanzleikurrent"), with occasional words written in bastarda or fraktur-like script. This transcription is part of a larger project regarding the Dresden court diaries. Check https://slub-dresden.academia.edu/StefanBeckert for further updates.
This dataset was created by a collaborative working group with the aim of transcribing medieval vernacular religious texts across a range of European languages. To reflect the linguistic expertise of the group members, the project included Old and Middle French, Old and Middle Irish, Old Castilian, Old Swedish, and Early New High German (Bavarian). Religious texts were chosen as a common thread because of their wide diffusion in the European vernacular tradition, their high survival rate in manuscripts, and their relevance for the study of medieval cultural and textual practices. The dataset is based on manuscripts preserved in France, Spain, Sweden, Germany, and Ireland, dating from the 11th to the 15th centuries, with a particular concentration in the 15th century. All manuscripts belong to the category of medium to highly decorated literary manuscripts. They are written in clearly identifiable scripts, predominantly in one or two columns; two manuscripts also include marginal texts.
In a project of the Staatsbibliothek zu Berlin, we completely transcribed two manuscripts (Ms. germ. oct. 842, Ms. germ. fol. 1045) related to the Augustinian canonesses in Inzigkofen, dating from the 15th century and containing German texts. They were written by three scribes in three different Gothic scripts.
Ground truth of 133 bifolio images of MS Jerusalem, Saint Mark’s Monastery 36. This ground truth was produced by participants of the Vienna 2025 HTR Winter School, who used Transkribus to manually correct a preliminary automatic transcription that had been generated using Kraken/eScriptorium.
The data set corresponds to 60 pages printed in 1494 by Estanislao Polono and Meinardo Ungut in Seville. These pages are taken from the Regimiento de los Prínçipes (also known as 'Glosa castellana al Regimiento de prínçipes'), and the exemplar used is INC/901 of the Biblioteca Nacional de España. The type used for this incunabulum is 97G (Martín Abad and Moyano Andrés, Estanislao Polono, 2002, p. 61). This type was used between 1494 and 1500. For other incunabula produced in this period, see op. cit., pp. 112-121.
Ground Truth for "Urfehdenbuch X der Stadt Basel (1563-1569)" at Staatsarchiv Basel-Stadt (StABS).
The data set is the publication of the data of the scholarly edition "Urkunden und Akten des Klosters und der Hofmeisterei Königsfelden".
Ground truth for about 6,000 scans of VOC records and notarial deeds, and HTR output for 3,000,000 scans of VOC, WIC and notarial deeds. The National Archives of the Netherlands and Noord-Hollands Archief conducted a project using the Transkribus HTR (Handwritten Text Recognition) platform. The aim was to transcribe 2 million pages of old Dutch texts semi-automatically. The transcribed archives are 17th- and 18th-century documents from the Dutch East India Company (VOC) and 19th-century notarial deeds from Noord-Hollands Archief and other archives in the provinces. To train the HTR software, a team produced transcriptions of approximately 6,000 scans. The scans were randomly selected from the dataset and contain hundreds of hands. With these transcriptions, a model was trained that recognizes more than 90% of the characters correctly. Transkribus transcribed the 2 million scans automatically using the trained model. Later on, 1 million extra scans concerning the West India Company (WIC) were transcribed automatically without adding extra ground truth or training. These archives are from the 17th and 18th century. The datasets published on Zenodo contain the ground truth (scans in JPG, transcription in PAGE XML) and the HTR results (in PAGE XML and TXT). See the overview on the Zenodo page, which also specifies which archives have been transcribed (both GT and HTR). For open data access to scans and inventories of the National Archives, see: https://www.nationaalarchief.nl/onderzoeken/open-data/archiefinventarissen-digitale-objecten-en-scans-van-archieven Disclaimer: due to the variety of languages used and the bad state of the documents, the HTR results of "1.05.21, Dutch series Guyana" can be of poor quality.
HTR/OCR open access gold corpus for Spanish late medieval sources, based on the allographetic transcription of more than 300 pages of several manuscripts of the Regimiento de los Prínçipes, as well as a first set of general transcription models trained with kraken and out-of-domain test data. See https://doi.org/10.5281/zenodo.7387376 for a full description of the dataset.
This is ground truth for the vast collection of sermons of Nikolaus von Dinkelsbühl (ca. 1360 to 17th March 1433), translated and reorganised by a German redactor. The collection dates from the 15th century and has never been edited until now. It consists of 361 folios of parchment and paper. The text speaks about various topics such as fasting and other religious practices. Being one of the leading intellectuals of his time, Nikolaus von Dinkelsbühl also contributed to the development of the University of Vienna. The manuscript was probably produced in the vicinity of Klosterneuburg in Austria and is still kept there today (Shelfmark: Cod. 48). Data collection and ground truth creation: The edition at hand was produced by an international team of researchers from various fields in the context of the Vienna HTR Winter School 2022 with the help of Transkribus Expert Client. We uploaded the images of the manuscript into the Transkribus platform, applied the line recognition tool and manually copied the transcribed text lines into the recognised line boxes. Various models were trained with the ground truth (20% of the entire codex) created by the team. Images of Klosterneuburg, Augustiner-Chorherrenstift, Cod. 48 are available at: https://manuscripta.at/diglit/AT5000-48/0001
This dataset contains layout annotations for ca. 370 pages sampled from 8 public domain classical commentaries, published in the 19th century in English, German and Latin. The commentaries concern Ancient Greek and Latin works from prose and poetry (caveat: Ancient Greek poetry is slightly over-represented). Pages were annotated according to a taxonomy mapped to the SegmOnto controlled vocabulary.
This dataset was designed for training machine learning models in the context of the [iForal project](https://iforal.hypotheses.org/), which focuses on transcribing medieval Portuguese texts, specifically forais (charters). It includes images of medieval manuscripts, along with corresponding line-level transcription labels, to facilitate the development of models capable of recognizing and transcribing historical handwriting. The dataset is ideal for OCR/HTR tasks and segmentation tasks within the domain of medieval document transcription. It serves as a critical resource for advancing automated transcription tools for medieval texts, making historical archives more accessible.
HTR data sets from medieval manuscripts (13th-14th c.) collecting "fabliaux" funded by Biblissima+
HTR datasets of medieval manuscripts (14th-15th c.) with Pierre Bersuire’s translation into Old French of the work of Titus Livius and Nicolas Trevet Commentaries
Digital editions of the second part of the Genevan Spanish chapbooks collection (19th c.).
Multilingual dataset from various corpora of the EHRI project
Various prints (academic, archives, novels…)
Novels written in Spanish
French manuscripts of the 18th century
Transcriptions of French 16th c. prints
French novels
Archives and novels
Swiss art exhibitions catalogues
Ground truth for 19th- and 20th-century sale and exhibition catalogues, printed mainly, but not only, in France.
HTR data made with the Kunsthistorisches UZH corpus.
Some transcriptions of minute books from military court councils during the First World War
HTR training corpus composed of 15th-century French manuscripts.
HTR training corpus composed of 16th-century prints.
HTR training corpus composed of 17th-century French prints.
HTR training corpus composed of 18th-century prints.
HTR training corpus composed of 15th-century French incunabula.
This repository contains all data relating to the LiDi 1.0 project, in particular the HTR GT for the 16th-century antiquarian Pirro Ligorio, used to create the Transkribus public model Ligorio 0.3 PyL.
An anonymous Irish commentary on the Gospel of Matthew (end of the 8th/early 9th c.), from a manuscript from Salzburg/Saint Amand possibly under bishop Arn (785-821). Files: 2. Lines: 7835. Latin, Carolingian minuscule (9th century).
Wills of WWI Poilus, edited by the Archives nationales during the Testaments de Poilus project.
Manuscripts of the 16th century
Various Manuscripts of the 17th century
Manuscripts of the 18th century
Manuscripts of the 19th century
Manuscripts of the 20th century
Ground truth for medieval Latin manuscripts. Formerly `CREMMA-Medieval-LAT`.
Collection of book samples in early print forms, 16th to 17th century, in Latin and pre-orthographic French.
Transcription corpora for training HTR models for medieval manuscripts from the 12th to the 15th century.
The CREMMA-WIKIPEDIA project aims at creating a collection of ground truth to train HTR models on contemporary French handwriting. Each image represents an excerpt from a randomly selected Wikipedia page, copied by hand by volunteers. We then took care of the alignment between the handwritten portion and the original text, also present on the image.
OCR ground truth dataset based on 20th-century French typewritten letters
Ground truth for Maître Bronod’s registers, notary in Paris during the 18th century.
Ground truth for the Registres des Contrats de Mariages et des Séparations et Divorces in Paris. The documents were written in French during the 19th century and contain many names and addresses. The information is organized in tables spread over two pages. The tables’ headers and the preamble are printed.
Ground truth for various Parisian registries of notary deeds written in French during the 19th century. The information is organized following pre-printed tables (with printed headers) and contain many names, addresses, numbers and abbreviations.
Ground truth based on a variety of French typewritten documents from the 20th century. Contains excerpts from plays, poems, letters and administrative reports.
Ground-Truth for French 19th century pre-printed documents created by administrative services.
Ground truth of Old French and Middle French manuscripts. Manuscripts vary in themes, period, etc. Most manuscripts have at most 10 columns transcribed.
Transcription of samples of Medieval Italian manuscripts
Ground truth of Latin medieval manuscripts. Manuscripts vary in themes, period, etc. Most manuscripts have at most 10 columns transcribed.
Ground truth of medieval manuscripts from Spain. Manuscripts vary in themes, period, etc. Most manuscripts have at most 10 columns transcribed.
Dataset for modern Romance languages created within the context of the HTRomance project, using manuscripts from the Gallica digital library.
This dataset for automatic handwriting recognition is composed of a mix of transcriptions of 17th- and 18th-century documents (marriage records, proofs of nobility, etc.), mostly in French, from series M, title III "Titres nobiliaires" of the Archives nationales de France.
Parts of Antoine Vérard’s editions princeps of "Tristan", "Merlin" and "Gyron le Courtoys".
The OCR-D Ground Truth text and structure corpus was created between 2015 and 2017. In the years since 2017, this corpus has been further curated and supplemented with metadata where appropriate. The corpus includes PAGE XML files with annotations of the text and structure. The data is based on transcription data stored in the German Text Archive (DTA) (https://www.deutschestextarchiv.de/).
This repository hosts all the documents, including transcriptions, bibliographical references and an introduction, that the Boccace team produced for the validation of the course "Bonnes pratiques du developpement collaboratif : initiation à Git" (prof. Thibault Clérice), in the first semester of the Master Humanités Numériques ENC-PSL 2021-2022. It also constitutes part of the biannual project "Per un’edizione digitale della Genealogia deorum gentilium di Boccaccio" (dir. F. Duval, M. Maulu). Funded in 2021, this project plans to put online, in XML format, the unpublished Middle French translation entitled "De la genealogie des dieux".
The document we are working on concerns the Château de Chavigny at Lerné in Touraine. In the 16th century, the château belonged to the Leroy family of lords. Before 1568, in the midst of the Wars of Religion, François Leroy, on the side of the king and the Catholics, took part in the capture and ransoming of the Prince of Condé, of the Protestant side. In 1568, François Leroy, as captain of 50 lances in the king's service, went on campaign with him. The objective is to transcribe five leaves of a manuscript using eScriptorium, with the aim of learning to use Git and GitHub to carry out our first collaborative project.
We chose to transcribe the second chapter of Maxime Kovalewsky's book: Coutume contemporaine et loi ancienne : droit coutumier ossétien, éclairé par l’histoire comparée. Paris, L. Larose, 1893.
The source data were uploaded to the From the Page website by the Stanford University Archives, which own them. They were then transcribed by anonymous volunteers; their work served as the basis for correcting our own transcriptions. The source documents selected are letters by various authors concerning the funeral of Jane Lathrop Stanford. The letters selected were: 42, 43, 46, 49, 50, 54, 57 to 60, 69, 75, 76 [section 1, transcribed by Perrine MAUREL]; 80 to 93 [section 2, transcribed by Ingrid GUIMARÃES]; 241 to 242 [section 3, transcribed by Yagmur OZTURK].
The first work, titled *Pontenôvu*, was written by Petru Rocca and published by the "Stamparia di a Muvra" in 1927. It is a collection of poems in Corsican and French on varied themes. *A Muvra* was a Corsican autonomist newspaper of Maurrassian influence that existed throughout the interwar period. Although it presented itself as a cultural review, its political dimension (embodied by the PCA, or Partitu corsu d'azione) made it a controversial movement. This collection belongs to that context of political struggle and Corsican cultural awakening. The second work, titled *A nostra Santa Fede - Catechismu Corsu*, was written by Ageniu Grimaldi in 1926 under the pseudonym Saveriu Malaspina. Close to Petru Rocca, he was one of the theorists of interwar Corsican autonomism and a loyal muvrist. The work discusses in particular how a true Corsican should behave with regard to his faith in God and in his island. Although it is not really a collection of poems, its writing style is particularly interesting, as it comes close to that of biblical writings.
The 1910 Argus des brevets is a contemporary printed work, organized into sections that group the patents filed in France chronologically and then thematically. This list and brief presentation of the patents is set in two columns and uses standardized abbreviations. This contribution guide therefore presents all the transcription standards adopted during this transcription project, carried out on the eScriptorium platform as part of the Git course of the TNAH master's programme at the ENC.
The project aims to build ground truth for training HTR models from a French manuscript of the years 1430-1455: manuscript 5070 of the Bibliothèque de l'Arsenal (reproduced on Gallica). This manuscript contains the French translation of Boccaccio's Decameron by Laurent de Premierfait. Our ground truth covers the description of the plague in Florence found in the prologue of the work.
The Congrès international des sciences ethnographiques of 1878 was held on the occasion of the 1878 Exposition universelle in Paris. Published in 1881 by the Imprimerie nationale, the proceedings of this congress were made available by the Conservatoire numérique des Arts et Métiers.
We chose to work on the outgoing correspondence of Hector Berlioz addressed to his sister Anne-Marguerite "Nanci" Berlioz. The full set of letters addressed to Nanci Berlioz was too large for our project, so, for the sake of consistency, we selected them in chronological order (see the management table for the exact list of letters transcribed).
The Projet Notre-Dame consists of a transcription of the daily journals for the year 1860 (https://mediatheque-patrimoine.culture.gouv.fr/sites/mediatheque/files/jnd_1860.pdf) of the restoration works carried out from 1844 to 1865 at the cathedral of Notre-Dame de Paris under the direction of Eugène Viollet-le-Duc and Jean-Baptiste Lassus. The transcription was done in eScriptorium from the digitization of the works journals (https://mediatheque-patrimoine.culture.gouv.fr/travaux-de-notre-dame-de-paris-1844-1865) produced by the Médiathèque de l'architecture et du patrimoine.
Dataset produced for the project to edit Gasparo Sardi’s Toponomasia from codex 174 of the Burgerbibliothek of Bern. Images are available on request by writing to: pauline.jacsont [ at ] unige.ch.
Set of census forms
Ground truth dataset for OCR of 19th-century Spanish typewritten documents. The archives relate to the events of the Occupation of Araucania (1850-1881) in Chile and are held in the 'Colección manuscritos' of the Archivo Central Andres Bello - Universidad de Chile.
OCR data for the SETAF project, 16th-century French prints in Gothic characters.
OCR data for the SETAF project, 16th-century French prints in Gothic and Roman characters.
OCR data for the SETAF project, 16th-century French prints in Gothic characters.
Ground truth dataset for automatic text recognition created with eScriptorium during the HNU2000 course at the Université de Montréal in the Fall 2024 semester. The dataset contains pages taken randomly from the digitization of the "Journal de Célestine Doniau-Danest sur les débuts de la Guerre 1914-1918" (Diary of Célestine Doniau-Danest on the beginning of the 1914-1918 war), published online by the Archives départementales de la Somme.
This dataset is composed of pages of text written in 2023 by a single person, copying texts taken from Guillaume Apollinaire's poems published in Alcools, and taken from Guillaume Apollinaire's Wikipedia page.
Ground Truth for the Digital Peraire project.
This project draws inspiration from the CREMMA-WIKIPEDIA data set, with the objective of creating a ground truth repository of contemporary Québécois handwriting to train HTR models. It is based on a collection of randomly selected Wikipedia summaries. Each text comprises between 125 and 175 words and was copied by hand by volunteers. The texts were ordered so as to prioritize those containing rare character 1- and 2-grams. Non-French characters were replaced with "-". In general, the copy of one text took between 1 and 2 pages. In total, 267 volunteers copied 265 texts (2 texts were unfortunately copied twice by two different volunteers). We took care of the alignment between the handwritten portion and the original text.
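The description above does not say how rarity of character 1- and 2-grams was scored when ordering the Wikipedia summaries. One possible approach, given here purely as an assumption, is to count every 1- and 2-gram across all candidate texts and rank each text by the mean inverse frequency of the distinct n-grams it contains. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """All contiguous character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def rank_by_rarity(texts: list[str]) -> list[str]:
    """Order texts so those containing rare 1- and 2-grams come first.

    Assumed scoring: mean of 1/count over the distinct n-grams of a text,
    where count is the n-gram's frequency over the whole collection.
    """
    counts: Counter[str] = Counter()
    for text in texts:
        for n in (1, 2):
            counts.update(char_ngrams(text, n))

    def score(text: str) -> float:
        grams = set(char_ngrams(text, 1)) | set(char_ngrams(text, 2))
        if not grams:
            return 0.0
        return sum(1.0 / counts[g] for g in grams) / len(grams)

    return sorted(texts, key=score, reverse=True)
```

With such a ranking, texts handed out first would cover infrequent letters and letter pairs early, even if only part of the collection ends up being copied.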
HTR ground-truth of the CHI-KNOW-PO project (Collex-Persée), that aimed to digitize a corpus of belletristic anthologies, scholarly collections, dictionaries and encyclopedias from the Chinese medieval period (ca. 200-1000) and to process them using HTR.
The dataset is made up of 250 images, with their related ground truth stored in XML files (PAGE XML format). Images come from fifteen manuscripts in Maghrebi Arabic selected from the collections of the BULAC Library (Paris). It extends RASAM 1 by covering a very wide variety of hands, text density, and cursiveness. This dataset is the result of a collaborative transcription. All the participants are credited on the official deposit. With the support of the French Ministry of Higher Education, Research and Innovation, the Research Consortium Middle-East and Muslim Worlds (GIS MOMM), Calfa and the BULAC library.
The dataset has been collated within the frame of the TariMa project (Tarih al-Maghrib. Writing History in the Maghreb in the modern and contemporary era), sponsored by the French agency Collex-Persée and led by Antoine Perrier (CNRS). It comprises images of different resolutions and sizes (width from 982px to 8049px), different layouts (double page, multiple columns), and different states of conservation. It also mixes microfilms, scans and lithographs. It presents a very wide variety, representative of Maghrebi Arabic production.
Classical prints
150 transcribed images from the "Tables Décennales" of the French Civil Registry. They come from the municipalities of Sermaises and Romilly-sur-Seine.
XML transcriptions and JPEG images exported from Transkribus as ground truth for an eScriptorium-Kraken HTR model (CER 11-12%) trained on the correspondence of Joseph Dalton Hooker (1817-1911), primarily letters to William Turner Thiselton-Dyer (1843-1928) during the late-19th/early-20th century. Many transcriptions in this dataset were generated by a small team of anonymous volunteers as part of the Joseph Hooker Correspondence Project based at Kew Gardens. All images in this dataset are reproduced with the kind permission of the Board of Trustees of the Royal Botanic Gardens Kew (© RBG, Kew). Contact archives@kew.org for more information. HTR Model: Schaefer, John, & Litvine, Alexis. (2023). Joseph Hooker HTR Model. Zenodo. https://doi.org/10.5281/zenodo.8038689
Ground truth dataset for a selection of printed books from NuBIS, the digital library of the Bibliothèque Interuniversitaire de la Sorbonne.
Ground truth for Caroline minuscule of the late 9th century, from the grammatical work "de uerbo" by Eutyches.
This dataset comprises PageXML for training segmentation models in Transkribus and Kraken. It is designed to capture the specific layout of medieval canon law collections. Compiled from several 11th-century manuscripts of the Decretum Burchardi, it supports the ongoing edition project Burchards Dekret Digital. Annotations are tailored to project-specific needs but can be adapted for other use cases. The data was first prepared using Transkribus and then remasked in eScriptorium for usage in Kraken.
Ground truth data in German and English for printed editions of Shakespeare and Scott, in the original and in different translations.
The Paris Bible Project aims to understand the production and diffusion of medieval Latin Bibles in Europe. The dataset includes ground truth from Paris Bibles produced in the 13th and 14th centuries. We also provide the most recent version of our list of Paris Bible manuscripts found in the world along with information about them.
This dataset contains 165,673 image and corresponding text line files (.png for images and .txt for the texts) in a random 80/10/10 training, validation and test set split. The source is the extensive correspondence of Swiss reformer Heinrich Bullinger (1504-1575) and his over 800 different correspondents. It therefore contains great variety in handwriting styles. Furthermore, it is multilingual since there are Latin and Early New High German (and sometimes mixed) letters. The data is split into Latin and Early New High German (determined with langid) and put into separate folders (de for Early New High German and la for Latin).
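The random 80/10/10 split mentioned above can be illustrated with a minimal sketch. It is not the procedure the dataset authors used: the function name `split_80_10_10`, the fixed seed, and the line identifiers are all assumptions made for illustration, and language detection with langid is left out.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical line identifiers; each would correspond to one .png/.txt pair.
line_ids = [f"line_{i:06d}" for i in range(1000)]
train, val, test = split_80_10_10(line_ids)
print(len(train), len(val), len(test))
```

Splitting on line identifiers rather than on file paths keeps each image together with its text file in the same subset.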
This ground truth repository is a work in progress; it currently accounts for a part of our complete Caroline Minuscule training pool of around 70 manuscripts used for our OCRopus Caroline Minuscule model (see ocropus-models repository).
The correspondence of Constance de Salm (a French woman of letters) includes various specimens of handwriting from the early 19th century. The dataset documents the hands of four different copyists.
This repository contains Handwritten Text Recognition training data (layout segmentation and transcriptions) for the Sloane Lab HTR model. The HTR model is trained on the handwriting of Hans Sloane (1660-1753). Funding: Enlightenment Architectures: Leverhulme Trust Project Grant 2016-21; The Sloane Lab: Towards a National Collection – AHRC AH/W003457/1.
Ground Truth for Astori’s letters (see the README.md file for details)
The HPGT dataset consists of images of Handwritten Paleographic Greek Text, derived from the Bodleian Libraries' Greek manuscript collection, specifically the Barocci collection, which dates from the 8th to the 17th centuries. This dataset is divided into two editions: HPGTR.N, which contains 77 unsegmented images categorized by century from the 10th to the 16th, and HPGTR.S, which features carefully segmented lines from selected images to facilitate machine learning tasks. The dataset captures a range of characteristics, including variations in writing style, page conditions, and manuscript production details. This dataset is part of the following work: Paraskevi Platanou, John Pavlopoulos, and Georgios Papaioannou. 2022. Handwritten Paleographic Greek Text Recognition: A Century-Based Approach. In *Proceedings of the "Thirteenth Language Resources and Evaluation Conference"*, pages 6585–6589, Marseille, France. European Language Resources Association.
Ground Truth dataset for the Codex palatinus graecus 23 (Palatine Anthology), Byzantine script from the 10th century.
Project undertaken as part of the programme La Bibliothèque d’art et d’archéologie de Jacques Doucet : corpus, savoirs et réseaux of the Institut national d’histoire de l’art, based on a corpus of letters and documents held in the Département des manuscrits of the Bibliothèque nationale de France under the shelfmark NAF 13124, one of the main sources on the relationship between Doucet and René Jean, whom he hired as librarian on 2 June 1908.
Set of documents about the sculptor Antoine-Louis Barye. Paris, Library of the Institut national d’histoire de l’art, Jacques Doucet collections, Archives 166. Institut national d’histoire de l’art (INHA).
This dataset contains the transcription and segmentation of 107 pages selected from the field notebooks of the archaeologist Jean-Jacques Hatt. These operations were carried out with the eScriptorium software (INRIA instance). The corpus contains text-image pairs in ALTO XML and JPG format. The sample was chosen to favour pages with standard text lines, left-to-right writing, and a minimum of graphic insertions. Some pages nevertheless include graphic zones, identified by manual segmentation. These transcriptions were used to train a specific model adapted to Jean-Jacques Hatt's handwriting. The documents cover the period 1942-1944.
This dataset contains normalized transcriptions of collections of distinctions, specifically "Summa de abstinentia" by Nicolas of Biard and "Dictionarium bovis" by Thomas of Pavia. They were prepared as part of the DISTINGUO project, dedicated to the study of distinctiones in medieval Latin preaching and led by Marjorie Burghart in 2019-2024.
Dataset from TranscriboQuest 2025, Medieval Latin group. This dataset focuses on layout. All manuscripts are glossed Latin manuscripts with complex layouts. The dataset contains 5000 typed lines, 700 of which have been transcribed.
The Neue Zürcher Zeitung (NZZ) was published in black letter from its very first issue in 1780 until 1947. From this time period, we randomly sampled one frontpage per year, resulting in a total of 167 pages. We chose frontpages because they typically contain highly relevant material and because we wanted to avoid sampling pages that contain exclusively advertisements or stock information. During certain periods, the NZZ was published several times a day, and there were supplements, too. Due to incomplete metadata, the sampling included frontpages from supplements. We then manually corrected the pages so that they can be used as ground truth to improve the OCR of black letter in historical newspapers.
This is ground truth for Rudolph Gwalther’s (1519-1586) handwriting taken from his book "Lateinische Gedichte", where he accumulated writings between 1540 and 1580. Data collection and ground truth creation: At the time we collected the data, we found 150 images with corresponding transcriptions by Peter Stotz on e-manuscripta (reference: Gwalther, Rudolf: Lateinische Gedichte. Zürich, 1540-1580. Zentralbibliothek Zürich, Ms D 152, https://doi.org/10.7891/e-manuscripta-26750 / Public Domain Mark). We removed 8 images with too many corrections or vertical texts. Next, we uploaded the images into the Transkribus platform, applied the line recognition tool and manually copied the transcribed text lines into the recognised line boxes. During this process, we made some corrections, mainly to resolve inconsistencies in punctuation and capitalisation.
This dataset for Handwritten Text Recognition includes layout segmentation (regions, toplines and line polygons) and Unicode transcriptions in ALTO 4.2 XML for 202 images of Medieval Hebrew manuscripts from the Bibliothèque nationale de France (BnF, National Library of France) and the Biblioteca Apostolica Vaticana (BAV, Vatican Library). It accompanies the article "BiblIA - a General Model for Medieval Hebrew Manuscripts and an Open Annotated Dataset" by Daniel Stökl Ben Ezra, Bronson Brown-DeVost, Pawel Jablonski, Benjamin Kiessling, Elena Lolli, and Hayim Lapin, published in HIP@ICDAR 2021, held in Lausanne in September 2021.
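As a rough illustration of how line-level ground truth of this kind can be read from ALTO XML, the sketch below uses only the Python standard library. The embedded fragment is a made-up placeholder, not taken from the dataset, and it is an assumption that the line geometry sits in the `BASELINE` attribute and a `Shape/Polygon` child, which may differ from the actual files.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment for illustration only (not from the dataset).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine ID="line_1" BASELINE="10 20 200 22">
            <Shape><Polygon POINTS="10 5 200 7 200 30 10 28"/></Shape>
            <String CONTENT="first sample line"/>
          </TextLine>
          <TextLine ID="line_2" BASELINE="10 60 200 62">
            <String CONTENT="second"/>
            <SP/>
            <String CONTENT="line"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def read_alto_lines(xml_text):
    """Return one dict per TextLine with its id, geometry and text."""
    root = ET.fromstring(xml_text)
    lines = []
    for el in root.iter():
        if local(el.tag) != "TextLine":
            continue
        # Join the CONTENT of every String element inside the line.
        words = [s.get("CONTENT", "") for s in el.iter() if local(s.tag) == "String"]
        polygon = next(
            (p.get("POINTS") for p in el.iter() if local(p.tag) == "Polygon"), None
        )
        lines.append(
            {
                "id": el.get("ID"),
                "baseline": el.get("BASELINE"),
                "polygon": polygon,
                "text": " ".join(words),
            }
        )
    return lines


lines = read_alto_lines(SAMPLE_ALTO)
for line in lines:
    print(line["id"], line["text"])
```

Matching on local tag names rather than a fixed namespace URI is a deliberate simplification, so the same reader tolerates files exported with a different ALTO schema version.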
The POPP datasets are a set of three datasets created within the POPP project (Project for the Oceration of the Paris Population Census) for the task of handwritten text recognition. These datasets have been published in "Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census" at DAS 2022. The three datasets are called “Generic dataset”, “Belleville”, and “Chaussée d’Antin” and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, thus each page in these datasets corresponds to 30 lines.
This is Ground Truth data created during the HTR Winter School 2022 for the Cod. 2160 ÖNB, which contains one version of the so-called Lex Dei.
This is ground truth based on the Padeřov Bible (Vienna, Austrian National Library, shelfmark Cod. 1175, 1432–1435), the bible of the third redaction of the Old Czech Bible translation. The transcription rules were based on semi-diplomatic transcription rules set by PERO OCR and Směrnice pro vydávání starších českých textů set by Jiří Daňhelka (https://vokabular.ujc.cas.cz/moduly/edicnipoznamka.aspx?id=DanhelkaSmernice). Abbreviations were tagged and expanded.
This dataset includes minutes of the Belfort municipal council drawn up between 1790 and 1946. Documents include deliberations, lists of councillors, convocations, and agendas. The dataset includes 24,105 text-line images that were automatically detected from pages. Up to four transcriptions are available for each line image: two from human annotators (in `Transcriptions/callico_1/` and `Transcriptions/callico_2/`) and two from automatic models (in `Transcriptions/dan/` and `Transcriptions/pylaia/`).
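The four transcription folders listed above could be gathered per line with a sketch like the following. The folder names come from the description; the file naming (one `<line_id>.txt` per line in each folder), the function name `collect_transcriptions`, and the sample texts are assumptions made for illustration and may not match the actual archive layout.

```python
import tempfile
from pathlib import Path

# Folder names as given in the dataset description.
SOURCES = ("callico_1", "callico_2", "dan", "pylaia")


def collect_transcriptions(root, line_id):
    """Return {source: text} for every source that has a file for line_id.

    Assumes one UTF-8 file named '<line_id>.txt' per source folder,
    which is a hypothetical layout.
    """
    found = {}
    for source in SOURCES:
        path = Path(root) / "Transcriptions" / source / f"{line_id}.txt"
        if path.is_file():
            found[source] = path.read_text(encoding="utf-8").strip()
    return found


# Build a tiny throwaway tree with made-up content to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("callico_1", "Séance du 3 mai"), ("dan", "Seance du 3 mai")]
    for source, text in samples:
        folder = Path(tmp) / "Transcriptions" / source
        folder.mkdir(parents=True)
        (folder / "line_0001.txt").write_text(text, encoding="utf-8")
    demo = collect_transcriptions(tmp, "line_0001")

print(demo)
```

Returning only the sources that exist reflects the "up to four" wording: a line with fewer transcriptions simply yields a smaller dictionary.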
Open-source handwritten text recognition models for historic Dutch
The Revolutionary City is a partnership between the American Philosophical Society, the Historical Society of Pennsylvania, and the Library Company of Philadelphia to digitize all manuscript material related to Philadelphia and the American Revolution (1763-1804). This dataset is a transcribed subset of the larger digitized corpus, provided in ALTO format and intended for training Handwritten Text Recognition (HTR) models. The material is overwhelmingly in English, though a few letters in French have been included. The material contains a mixture of correspondence and journals. The correspondence has been annotated to distinguish between the different parts of a letter (Salutation, Date and Address, Addressee, Address, Closing, Postscript). The transcriptions were produced by staff and interns at the American Philosophical Society. Each document was reviewed at least once by another transcriber. The corpus exhibits a wide variety of hands, handwriting styles, paper quality and levels of damage. The corpus encompasses material from 1758 to 1805, but the majority of the documents fall between the years 1774 and 1783.
The dataset originates from a Greek handwritten codex that dates from around 1500-1530. It is a subset of the codex British Museum Addit. 6791, written by two hands, Antonius Eparchos and Camillos Zanettus (ff. 104r-174v), and transmits texts by Hierocles (In Aureum carmen), Matthaeus Blastares (Collectio alphabetica) and, notably, Michael Psellos (De omnifaria doctrina). The script features the most important abbreviations, logograms and conjunctions, which occur in virtually every Greek minuscule handwritten codex from the years of the manuscript transliteration and the prevalence of the minuscule script (9th century) to the post-Byzantine years. This dataset consists of 120 scanned handwritten text pages, containing 9285 lines of text, 18809 words (6787 unique words). For each page, a PageXML is provided containing the following ground truth: 1. Text region polygon coordinates 2. Text line polygon coordinates with the corresponding transcription text 3. Word polygon coordinates with the corresponding transcription text
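The three levels of PageXML ground truth listed above (regions, lines, words) can be read with a short standard-library sketch. The embedded fragment is a made-up placeholder with unaccented Greek sample words, not data from the codex, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up PageXML fragment for illustration only (not from the dataset).
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 900,10 900,200 10,200"/>
      <TextLine id="r1l1">
        <Coords points="10,10 900,10 900,60 10,60"/>
        <Word id="r1l1w1">
          <Coords points="10,10 120,10 120,60 10,60"/>
          <TextEquiv><Unicode>και</Unicode></TextEquiv>
        </Word>
        <Word id="r1l1w2">
          <Coords points="130,10 300,10 300,60 130,60"/>
          <TextEquiv><Unicode>λογος</Unicode></TextEquiv>
        </Word>
        <TextEquiv><Unicode>και λογος</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def children(el, name):
    """Direct children of el with the given local tag name."""
    return [c for c in el if local(c.tag) == name]


def points(el):
    """Polygon of el as a list of (x, y) integer tuples."""
    coords = children(el, "Coords")
    if not coords:
        return []
    pairs = coords[0].get("points", "").split()
    return [tuple(int(v) for v in pair.split(",")) for pair in pairs]


def text(el):
    """Transcription attached directly to el (not to its descendants)."""
    for equiv in children(el, "TextEquiv"):
        for uni in children(equiv, "Unicode"):
            return uni.text or ""
    return ""


def read_page(xml_text):
    """Return regions, each with its lines, each with its words."""
    root = ET.fromstring(xml_text)
    regions = []
    for region in (e for e in root.iter() if local(e.tag) == "TextRegion"):
        lines = []
        for line in children(region, "TextLine"):
            words = [
                {"polygon": points(w), "text": text(w)}
                for w in children(line, "Word")
            ]
            lines.append({"polygon": points(line), "text": text(line), "words": words})
        regions.append({"polygon": points(region), "lines": lines})
    return regions


regions = read_page(SAMPLE_PAGE)
print(regions[0]["lines"][0]["text"])
```

Reading only direct children for `Coords` and `TextEquiv` matters here, because a line's transcription and its words' transcriptions are nested under the same `TextLine` element.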
The manuscript is made of paper, was written in the 16th century, and measures 220x165 mm. It is embellished with epititles and red initials. Tachygraphical symbols and abbreviations are encountered in the manuscript as well. The dataset of XΦ79 consists of 803 lines of text containing 4389 words (2069 unique words) that are distributed over 40 scanned handwritten text pages. For each page, a PageXML is provided containing the following ground truth: 1. Text region polygon coordinates 2. Text line polygon coordinates with the corresponding transcription text 3. Word polygon coordinates with the corresponding transcription text
The manuscript is made of paper, was written at the end of the 15th century, and measures 218x150 mm. On various pages, red initials and epititles enrich the manuscript’s decoration. The dataset of ΧΦ114 consists of 1051 lines of text containing 5467 words (2877 unique words) that are distributed over 44 scanned handwritten text pages. For each page, a PageXML is provided containing the following ground truth: 1. Text region polygon coordinates 2. Text line polygon coordinates with the corresponding transcription text 3. Word polygon coordinates with the corresponding transcription text
The manuscript is one of the oldest in the collection of the Stavronikita Monastery on Mount Athos. It is a parchment four-gospel manuscript written between 1301 and 1350. It comprises 54 pages with dimensions of approximately 250x185 mm. The script is an elegant minuscule and the use of majuscule letters is rare. Tachygraphical symbols and abbreviations are encountered in the manuscript as well. Furthermore, the manuscript is enriched with chrysography, elegant epititles and initials. The dataset of ΧΦ53 consists of 1038 lines of text containing 5592 words (2374 unique words) that are distributed over 54 scanned handwritten text pages.
Diplomatic transcription of papyri found in the Zenon archive (see en.wikipedia.org/wiki/Zenon_of_Kaunos). Manually prepared as PageXML with Transkribus within the D-Scribes project.
This repository hosts HTR ground truth created within the context of the ANR e-NDP project. This dataset is based on 512 pages from the 26 registers of the Notre-Dame de Paris cathedral chapter. The volumes containing the chapter conclusions were conceived to serve as memorial records, but above all as documents for regular use and consultation in the daily practice of administration and management. The registers were written in a cursive script (ca. late 13th to 16th century) and their content is mainly in Latin, the rest in French. There are no fewer than 18 hands in these pages. The transcriptions were manually completed in two rounds by a group of 12 contributors, including historians and paleographers, over the course of 2021-2022 using eScriptorium.
Norwegian handwriting.
English handwritten reference.
Digitized 19th-century newspapers (OCR + images).