Focus On: In Codice Ratio
30 Jan
Focus-on is our exclusive series of articles describing cutting-edge projects that exploit machine learning in Italy and around the world. Our first installment is devoted to In Codice Ratio, an interdisciplinary research project aimed at analysis and knowledge discovery on historical documents from the collections of the Vatican Secret Archives. Our guest writer is Donatella Firmani, assistant professor at Roma Tre and one of the main contributors to the project, together with Paolo Merialdo (Professor) and Elena Nieddu (Ph.D. student).
The In Codice Ratio (ICR) Project
Historical handwritten documents are an essential source of knowledge about past cultures and societies. Automatic text processing methods promise to give scholars a quantitative, data-driven tool to study culture and society, but their power has been limited by the amount of digitally transcribed sources. Because fully automatic handwriting transcription involves many challenges (such as irregularities in writing, ligatures and abbreviations), many researchers in recent years have focused on solving easier problems, most notably keyword spotting. However, as more and more libraries worldwide digitize their collections, greater effort is being put into the creation of full-fledged transcription systems.
In Codice Ratio is an interdisciplinary project for the automatic transcription of the Vatican Registers, a corpus of more than 18,000 pages held in the Vatican Secret Archives. Our workflow consists of a character recognition phase, featuring a deep convolutional neural network, and a transcription phase proper, relying on statistical language models. The corpus contains official correspondence of the Roman Curia in the 13th century, including letters and opinions on legal questions, addressed from and to kings and sovereigns, as well as to many political and religious institutions throughout Europe. Never having been transcribed in the past, these documents are of unprecedented historical relevance.
The main contribution of the In Codice Ratio project so far is an end-to-end transcription pipeline based on fine-grained segmentation of text elements into characters and symbols. Our pipeline first partitions sentences and words into text segments. Most segments contain actual characters, but some contain only spurious ink strokes. (Perfect segmentation cannot be achieved without transcription. This result is known as Sayre’s Paradox.) Then, the pipeline submits all the segments to a deep convolutional neural network (CNN) for optical character recognition (OCR), and reassembles the resulting noisy labels into words and sentences using language statistics.
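As a rough illustration of the last step, the Python sketch below combines per-segment OCR scores with word frequencies to pick a word. It is a toy under stated assumptions, not the project's actual decoder: the names `decode_word` and `NON_CHAR`, the segment probabilities, and the tiny word-frequency table are all hypothetical, and a word-level frequency table stands in for whatever language statistics the real pipeline uses.

```python
from itertools import product
from math import log

NON_CHAR = "#"  # hypothetical marker for the "non-character" class


def decode_word(segments, word_freq, floor=1e-6):
    """Return the word with the best combined OCR and language score.

    segments:  list of dicts mapping a label (a character or NON_CHAR)
               to a strictly positive OCR probability for that segment.
    word_freq: dict mapping known words to relative frequencies
               (a stand-in for the language statistics).
    floor:     frequency assigned to words missing from word_freq.
    """
    best_word, best_score = None, float("-inf")
    # Enumerate every combination of one label per segment (fine for short words).
    for labels in product(*(seg.keys() for seg in segments)):
        # Segments labeled as non-characters are dropped from the word.
        word = "".join(label for label in labels if label != NON_CHAR)
        ocr_score = sum(log(seg[label]) for seg, label in zip(segments, labels))
        lm_score = log(word_freq.get(word, floor))
        if ocr_score + lm_score > best_score:
            best_word, best_score = word, ocr_score + lm_score
    return best_word


# Hypothetical OCR output for five segments; the third is a spurious stroke
# that the classifier slightly prefers to read as "i".
segments = [
    {"d": 0.9, "o": 0.1},
    {"e": 0.6, "c": 0.4},
    {"i": 0.55, NON_CHAR: 0.45},
    {"u": 0.8, "n": 0.2},
    {"s": 0.9, "f": 0.1},
]
word_freq = {"deus": 0.02, "dens": 0.001}  # hypothetical frequencies

# Taking the top OCR label of each segment keeps the false character.
greedy = "".join(max(seg, key=seg.get) for seg in segments).replace(NON_CHAR, "")
decoded = decode_word(segments, word_freq)
print(greedy, decoded)
```

In this toy case the per-segment best guess keeps the spurious "i", while the combined score drops it because the resulting word is far more frequent. Exhaustive enumeration grows exponentially with the number of segments, so a real system would need a smarter search.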
Our OCR network has a total of 23 classes (including minuscule characters of the Latin alphabet), and its design follows recent progress in deep learning, especially recent neural network models for character-level classification. Its most notable feature is a special "non-character" class, which handles spurious stroke combinations from the segmentation step. Other features include 56 x 56 single-channel input images and 8 adaptable layers, including 3 convolutional layers, each applying 2 x 2 max-pooling with stride 2, and 2 feed-forward layers.
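To see what the pooling does to the input, the short Python sketch below traces the spatial size of the feature maps through the three convolutional blocks. It assumes that each convolution is padded so that it preserves the spatial size, which the article does not state; kernel sizes and filter counts are not given either, so only the effect of pooling is shown. The function names are hypothetical.

```python
def pooled_size(size, window=2, stride=2):
    """Output side length of max-pooling without padding."""
    return (size - window) // stride + 1


def feature_map_sizes(input_size=56, conv_blocks=3):
    """Side length after each conv block (assumption: size-preserving convolutions)."""
    sizes = [input_size]
    for _ in range(conv_blocks):
        sizes.append(pooled_size(sizes[-1]))
    return sizes


print(feature_map_sizes())  # [56, 28, 14, 7]
```

Under this assumption the feed-forward layers would receive 7 x 7 feature maps, flattened across however many filters the last convolutional layer has.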
We trained the network using a custom crowdsourcing procedure. Specifically, we implemented a dedicated crowdsourcing platform and employed more than a hundred high-school students to manually label the dataset. Each student was asked to select, as in a jigsaw puzzle, all the pieces of a word image that visually match a given character symbol, with as few extra strokes as possible. Above we show a screenshot of a sample labeling task. To overcome the difficulty of reading ancient scripts, we provided positive examples of each symbol (in green), and students were told to rely on visual patterns rather than trying to read. After a data augmentation process, the result is a high-quality dataset of 23,000 characters, which is publicly available online.
Our deep CNN trained on this dataset achieves an overall accuracy of 96%, which is one of the highest results reported in the literature so far. We observed that while humans can easily distinguish character symbols from stroke combinations that accidentally resemble writing patterns, this turns out to be a hard task for an automatic classifier. Indeed, on the non-character class our classifier achieves 95% precision but only 74% recall. For this reason, our end-to-end pipeline leverages language statistics to tolerate a certain amount of "false characters" from the OCR step. Our end-to-end system was able to produce a good transcription for almost 80% of the examined words, providing paleographers with a solid basis to speed up the transcription process at a large scale.
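To put the two non-character figures on a single scale, one can combine them into an F1 score, the harmonic mean of precision and recall. The article does not report this number; the small Python sketch below simply derives it from the 95% precision and 74% recall quoted above, and the function name is our own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


non_char_f1 = f1_score(0.95, 0.74)
print(round(non_char_f1, 3))  # 0.832
```

Read the other way around, a recall of 74% means that roughly one in four spurious segments is still labeled as a character, which is the kind of error the language statistics have to absorb.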
Publications
- Serena Ammirati, Donatella Firmani, Marco Maiorino, Paolo Merialdo, Elena Nieddu. In Codice Ratio: Machine Transcription of Medieval Manuscripts. IRCDL 2019: 185-192.
- Donatella Firmani, Marco Maiorino, Paolo Merialdo, Elena Nieddu. Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio - Episode 1: Machine Transcription of the Manuscripts. KDD 2018: 263-272.
- Donatella Firmani, Paolo Merialdo, Elena Nieddu, Simone Scardapane. In Codice Ratio: OCR of Handwritten Latin Documents using Deep Convolutional Networks. AICH@AIIA 2017: 9-16.
- Serena Ammirati, Donatella Firmani, Marco Maiorino, Paolo Merialdo, Elena Nieddu, Andrea Rossi. In Codice Ratio: Scalable Transcription of Historical Handwritten Documents. SEBD 2017: 65.
News
- Artificial Intelligence Is Cracking Open the Vatican's Secret Archives - The Atlantic (Italian version, by Internazionale)
- AI tackles the Vatican’s secrets - MIT Technology Review
- AI Helps Researchers Unlock Mysteries of Vatican Archives - NVIDIA Developer
- Storia segreta svelata - Nova, Il sole 24 ore (in Italian)
- New Tech for Old Texts: How Deep Learning Deciphers Historical Documents - NVIDIA
- Alleanza tra Roma tre e liceo Keplero nel segno dell’alternanza scuola-lavoro - Sole 24 Ore
- In Codice Ratio: Scalable Transcription of Vatican Registers - ERCIM News
- How artificial intelligence is cracking the code of the Vatican Secret Archives - Aleteia