Information for libraries

Artificial intelligence helps to access manuscript heritage

Summary: The study examines the scientific and methodological context of READ, a European basic research project, and the application of its results in Slovakia and the Czech Republic. It forms part of the ongoing applications of the READ project and shows the progress of the research, applications and experiments undertaken since 2019 by the international digital humanities community involved in the READ-COOP association. These activities include SKRIPTOR, a Slovak applied research project planned for 2020-2024. On the basis of a survey and selection of the latest information sources, the study notes the progress made in research and applications in the field of OCR. The core of the study takes a user-centred rather than an IT-based approach to the use of the Transkribus platform for automatic text recognition of historical documents. It describes the experience and knowledge gained in adopting the Transkribus platform, which uses artificial intelligence in its OCR engine and the HTR+ method. The study explains and illustrates the main steps of the experiments, the process of training the machine, the creation of new transcription models, and the results of the automatic transcription of printed Fraktur texts and of manuscripts by Andrej Kmeť. It also presents the first efficient transcription model for the historical printed Slovak Fraktur (Gothic) script in the Transkribus platform. First, it explains a unique experiment with the transcription of printed Slovak and Czech Fraktur texts. This is followed by a description of the advanced experimental transcription of Andrej Kmeť’s handwritten letters. Finally, it presents the possibilities of making transcribed collections and documents available on local networks and on the Internet.

Keywords: digital humanities, OCR, READ‑COOP, artificial intelligence, Transkribus platform, HTR+, SKRIPTOR project, Andrej Kmeť, schwabacher, fraktur, antiqua, read & search

 

Introduction

The most significant progress in research, development and applications of digitisation in the social sciences and humanities, i.e. in digital humanities, has occurred mainly in the last ten years. One subject of professional interest is automatic optical character recognition (OCR). OCR of ordinary printed documents has long been handled well by high-quality OCR tools. In recent years alone, dozens of researchers and experimenters have been working on the more demanding problem of OCR of historical manuscripts and prints using artificial intelligence. Progress was made with the implementation of the READ project, which, as a basic research project, was directly subordinate to the European Commission and was evaluated annually by independent reviewers. Other platforms, applications, and transcription tools are also being developed. The main outcome of the READ project is the usable Transkribus platform and tool, a global innovation focused on the transcription of historical manuscripts and documents. So far, Slovakia has been the only Central and Eastern European country that seeks to build on the initiatives of the European READ basic research, through the SKRIPTOR applied research project.

Digital humanities and project READ

We consider digital humanities to be the common name and summarising methodology for all applications of information and communication technology in the social sciences and humanities, in related fields and disciplines, and in the corresponding practice. This methodology was comprehensively applied in the READ Project, which was implemented under the Horizon 2020 programme. The READ Project was supported by the European Union with a sum of EUR 8.2 million. The financing period ended on June 30th, 2019. Since 2016, the University of Innsbruck has conducted research into the basic technologies of text segmentation, recognition of handwritten text, keyword search in historical documents, and instruments for the publication of results. Teams from the universities in Valencia and Rostock, the Vienna Technical University and other research institutions participated in all areas of research. Cooperation with other partners from Europe and the world has been developed. The research and development activities continue. Thousands of users of the Transkribus platform create new transcription models based on the historical manuscript and printed collections of national institutions, especially libraries and archives. Collaboration with the community of researchers around the Transkribus platform can be useful for the Slovak and Czech expert community in digital humanities.

The common vision of scientists, experts and other users is for publicly available transcription models to gradually become a useful shared tool for the automatic transcription of historical documents. The aim is to reach a level at which separate models no longer need to be created for each collection of manuscripts and printed materials. For users, it should be a kind of "black box", in which artificial intelligence selects the most suitable of the integrated models for the transcription of historical prints, manuscripts, typescripts and other documents that the user wants to study or make available. However, this goal is still a long way off, and many partial models need to be created first.

It is important that Slovak and Czech experts be part of the joint international effort and that the future "black box" be ready to assist everyone in the transcription of historical collections and documents. At this stage of development, it is important to focus on preparing partial transcription models for historical manuscripts and printed materials based on larger collections that contain hundreds or thousands of pages. We recommend focusing on documents in the West Slavic languages: Czech, Slovak, Upper Sorbian, Lower Sorbian, and Polish. The nature of the collections also requires that attention be paid to Latin, German and Hungarian. We ought to create one integrated model for handwritten documents and one for old and rare prints based on our own models. This is a task that no one else will do for us.

Current State of Research and Applications

The existing information resources on OCR relate, on the one hand, to ongoing theoretical research on artificial intelligence itself. The authors of these theoretical works are mainly computer scientists and mathematicians. On the other hand, there are works whose authors come from the social sciences and humanities, i.e. from digital humanities. They approach OCR and HTR from the user's perspective, in terms of the practical applicability of existing OCR tools and platforms. Moreover, both the theoretical papers and the user contributions can be divided into two groups according to whether they concentrate on OCR of printed works or on recognition of handwritten works (HTR).

A comprehensive overview of the READ project is given in a project study (Mühlberger 2016) and in a collective study by READ researchers (Mühlberger et al. 2019), which is the first published overview of the use of HTR+ by a broad expert community and shows the current application of manuscript recognition technology in the cultural heritage sector. The collective study (Mühlberger, et al. 2019b) points to the development of character recognition methods.

Since the mid-20th century, character recognition of printed and handwritten documents has developed along with OCR. At first, scanned images of printed text were converted to machine code and compared with ready-made script templates. Printed documents contain characters from predefined, ready-made character sets, which makes the comparison easier. However, even OCR software for printed characters is capable of further "training".

Unlike printed texts, handwritten texts pose a different problem because of the many differences in handwriting between authors and their hands, changes in handwritten materials over time, numerous glyphs and tokens, personal and linguistic styles, etc. Manuscripts have become a new challenge for computer scientists. In the 1980s, research and development on handwriting recognition first relied on statistical methods. This was followed in the 1990s by research and development of pattern recognition combined with artificial intelligence, and by the development of deep neural networks in the 2000s and 2010s. It was also a period of significant development and growing capacity of information and communication technology.

Mass digitization projects have been implemented in several developed countries, and massive digital repositories and archives of printed and manuscript documents have been created. After mass digitization, the time has come to use the digital content obtained by digitizing manuscripts. To obtain usable, editable text from scanned images of handwritten documents, the advanced Transkribus recognition technology, namely the HTR+ and PyLaia engines, can be used.

The READ project has all the attributes of a digital humanities methodology. In particular, these attributes include: (a) the cooperation of researchers; (b) scientification in the social sciences and humanities; (c) interdisciplinarity; (d) teamwork (interinstitutional, interstate, universities, libraries, archives, galleries, museums); (e) heavy involvement of IT professionals in research, education, and knowledge dissemination; (f) artificial intelligence (artificial neural networks, Hidden Markov Models (HMM)).

Advances in Research

Hodel described the progress of printed text recognition based on optical type recognition (Hodel et al. 2021). He also deals with the most important practical aspect of transcription, namely what accuracy or error rate a transcription achieves. Based on empirical data from the READ research and on the findings of Günter Mühlberger (2019), Hodel lists three error classes. He considers the following to be confirmed and verified: (a) if the character error rate (CER) is less than 10%, i.e. fewer than 10 errors per hundred characters, the transcription result is good and readable, and further editing of the output is possible where appropriate; (b) if the CER is ≤ 5%, the transcription result is very good; (c) if the CER is below 3%, the transcription result can be considered great, and any CER below 2.5% is excellent.
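To make these error classes concrete, the following minimal sketch computes a CER as the Levenshtein (edit) distance between a reference transcription and the machine output, divided by the length of the reference, and then maps the result to the classes listed above. The function names and the label used for results of 10% or more are assumptions made for illustration; Transkribus may count characters and normalise text differently.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_item == hyp_item else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


def quality_class(cer_percent):
    """Map a CER in percent to the classes described in the text."""
    if cer_percent < 2.5:
        return "excellent"
    if cer_percent < 3.0:
        return "great"
    if cer_percent <= 5.0:
        return "very good"
    if cer_percent < 10.0:
        return "good"
    return "needs further work"  # hypothetical label, not from the source
```

Under these assumptions, `quality_class(cer(ground_truth, machine_output))` gives a quick verdict for a single page; a CER averaged over a whole validation set can be classified in the same way.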

Hodel pursues the goal of transcription without training and states that in order to create an optimal universal model of transcription of manuscripts of various hands, styles, scripts, periods, etc., which would not always require the preparation of separate models, it is necessary to have as many excellent models as possible. He believes that such transcription models should probably be developed for various similar classes of manuscripts, such as the cursive Gothic script of the 19th century, which is the subject of his attention.

Strobel and colleagues also contribute to progress in the field of optical character recognition (OCR) (Strobel et al. 2020). Based on an analysis of the effectiveness of several OCR systems on printed historical German newspapers (Fraktur), the authors concluded that 50 newspaper pages are a sufficient training sample (so-called ground truth). They base their findings on a comparison of five OCR systems: 1) ABBYY FineReader XIX10 (FRXIX), version 2005, 2) ABBYY FineReader Server 11 (FRS11) integrated in previous versions into the 3) Transkribus and HTR+ Transkribus systems, 4) Kraken, 5) Tesseract.

Drobac (2020) provides insight into the effectiveness of OCR for historical newspapers and magazines published in Finland. The National Library of Finland has created an OCR corpus of more than 11 million pages of historical text using ABBYY FineReader. The estimated accuracy of the OCR text was between 87% and 92% at the character level, which is quite low for scientific research.

Martinek et al. (2020) present, in a theoretical and experimental study, a system for the segmentation of printed text and OCR. They deal with a set of methods that make it possible to perform OCR of historical German prints based on a small amount of training data, and they describe an OCR system that uses recurrent neural networks. They focus on the partial processes of an OCR system, mainly on page layout analysis, including the segmentation of text blocks and lines, and on the OCR itself. The experiments described are aimed at determining the best way to achieve good OCR results for historical German printed documents. For the experiment, they used digitised archival materials from the Porta fontium project from the Czech-Bavarian border region, specifically 10 pages of the newspaper Ascher Zeitung from the second half of the 19th century, printed in Fraktur. They used 7 pages for training, 1 page for validation and 2 pages for evaluation. An additional 15 pages were used for page template identification and segmentation training. The authors consider the results obtained to be comparable to, or even better than, those of several recent systems. For Fraktur from a German newspaper, they report the following CER values in comparison with other systems: Porta fontium 0.024, Tesseract (deu_frak) 0.053, Tesseract (Fraktur) 0.045, Transkribus 0.027. It is not known whether the Czech experiments, including the Pero OCR application, are intended to compete with or to support the Transkribus platform, or whether they aim at a specific freely available tool for the transcription of historical manuscripts and prints.

In his thesis, Martin Kišš (2018) deals with the recognition of modern texts printed in Fraktur. He based his research on TensorFlow, a tool originally developed by Google and available as an open-source machine learning platform. Part of his approach is a built-in generator of artificial historical texts. Using the generator, he created an artificial data set on which he trained a neural network for line recognition. He tested this neural network on real historical lines of text and, after training, achieved a character accuracy of 89.0%.

Significance and Features of the Transkribus Platform

In addition to basic research, creating the Transkribus research platform was one of the main objectives of the READ project. About 2.5 million of the 8.2 million euros were invested in the development of the research infrastructure. Follow-up projects are now emerging in which basic and applied research continues. Adopting the Transkribus platform can also have significant economic effects.

According to data from the internal documentation of the READ project, the market prices of manual transcription of historical manuscripts range from EUR 10 to EUR 30 or more per page of simple English, German or Latin text, depending on the particular manuscript. Assuming an average cost of EUR 15 per page, the READ Project's researchers generated a monetary value of 4 to 6 million euros. This figure represents added value and a potential source of development for the newly established READ-COOP association. It is also a convincing confirmation of the basic concept of research directed towards new knowledge and, at the same time, towards the commercial use of the tools that result from applying that knowledge. The approximate cost of transcription, including VAT, is shown in the table below.

Tab. 1 Automated transcription pricing

Representatives of digital humanities in Slovakia have various attitudes towards this initiative, ranging from enthusiastic expressions of approval and admiration to very reserved or negative attitudes (such as "it's nothing for us", "we have other worries", "artificial intelligence will not replace us as experts"). These are often reactions that, on the one hand, verbally declare an interest in "digitisation" and "artificial intelligence", but on the other hand show a lack of understanding and knowledge of the issues and possibilities of digitisation and of the use of artificial intelligence. The attitudes suggest a preference for traditional paradigms of work and research rather than an actual effort to seek innovative tools to access and interpret our vast historical written heritage as part of Europe's cultural heritage.

As concerns the transcription of the Slovak language, this language was listed in the final report of the READ project thanks to our initiative, without any support and essentially without any interest from national institutions, archives, libraries, museums, or the academic sector. The project's author has devoted more than 3,000 hours to this work since 2017 and, until 2020, financed it solely from his own resources. The results, know-how and experience achieved led us to try to introduce the revolutionary and innovative Transkribus platform in Slovakia and the Czech Republic, especially into the educational system, as well as into the practice of memory institutions through research and development projects. Of course, we also respect other transcription tools.

The Transkribus platform is free (open source) software with a guarantee of safe use for registered clients of the platform. Anyone can create their own account and then download the Transkribus Expert Client for free, or use the simpler Transkribus Lite tool. An API is available to connect clients' computers or mobile devices to the platform. Most of the software tools are free software that can be obtained from GitHub.

Alternatives to the Transkribus platform

In this study, we focus exclusively on the Transkribus platform and the transcription of manuscript collections and, marginally, on the transcription of printed materials. However, there are a number of other transcription tools. For example, OCR4all was developed to digitize old printed texts. The eScript application is used to transcribe handwritten and printed materials. The Rescribe tool is designed for desktop computers and applies OCR to image files, PDFs, and Google Books. Applicable transcription tools also include Pero.cz. The ABBYY Cloud OCR SDK is a high-quality cloud-based application accessed via a web API. There are also more than 10 alternatives to the ABBYY Cloud OCR SDK. The best alternative is Online OCR, which is free of charge. Other great sites and apps similar to the ABBYY Cloud OCR SDK include Kofax Omnipage, Geekersoft OCR Word Recognition, and i2OCR. Quartex (Adam Matthew Digital 2018) is commercially available. In the future, researchers face the task of developing a meta-analysis with criteria for evaluating the functionality and quality of transcription tools, applications, and platforms. However, the purpose of this study is not to evaluate other transcription systems.

READ COOP

The project was completed on June 30th, 2019. Subsequently, the international association READ-COOP SCE (Societas Cooperativa Europaea, SCE) was established on July 1st, 2019. Its goal is to maintain and further develop the Transkribus platform. Experts and institutions are interested in the continuation and development of the Transkribus service. As of October 2022, more than 90,000 users work with the Transkribus platform.

Fig. 1 Distribution of the Transkribus platform in Europe (Source: readcoop.eu, as of September 2022; Members of READ-COOP SCE, readcoop.eu, as of August 2022)

Project SKRIPTOR

Slovak experts have responded to the new trends in OCR and research of historical documents with the SKRIPTOR project (Katuščák and Nagy, et al. 2019). The project has both European and national dimensions. The SKRIPTOR project is a direct continuation of the European READ Project. The technological and scientific innovations of the READ project are based on the use of artificial intelligence and digital humanities methodology. The task of the SKRIPTOR project researchers is to implement and disseminate the latest technological innovation and knowledge about the effective access of the professional and lay public to the Slovak and foreign written heritage.

The strategic objective of the SKRIPTOR project is to create conditions at the national level for a competent partnership of Slovak researchers with top European research, to establish and then actively engage in multilateral scientific European cooperation. The SKRIPTOR project is implemented in the field of history and archiving. It also spills over into library and information science.

The SKRIPTOR project focuses on modern documents. However, the collections that are subject to investigation and access may also include major recent texts and documents and incunabula, 16th-century printed materials, historical magazines, newspapers, as well as valuable 18th-20th-century materials.

The aim of creating new models using the Transkribus platform is to confirm its effectiveness and achieve in our collections a reduction in the price of transcription from 30 euros for manual transcription of a page to less than one euro per page for automatic transcription of texts.
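The intended saving can be illustrated with simple arithmetic. The sketch below compares manual and automatic transcription costs for a collection. The per-page prices are the figures quoted above (EUR 30 for manual transcription, and EUR 1 taken as an upper bound for automatic transcription); the function name and any page count passed to it are hypothetical, and the costs of scanning and of preparing ground truth are deliberately left out.

```python
def transcription_costs(pages, manual_eur_per_page=30.0, automatic_eur_per_page=1.0):
    """Compare manual and automatic transcription costs for a collection.

    The default prices are the figures quoted in the text; the automatic
    price is an upper bound ("less than one euro per page").
    """
    manual = pages * manual_eur_per_page
    automatic = pages * automatic_eur_per_page
    return {
        "manual": manual,
        "automatic": automatic,
        "saving": manual - automatic,
    }
```

For a hypothetical collection of 1,000 pages, `transcription_costs(1000)` yields EUR 30,000 for manual work against at most EUR 1,000 for automatic transcription.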

In the SKRIPTOR project, we have preliminarily selected the following collections for research and experimental transcription: 1. Slovak and Czech Fraktur (Schwabacher and Antiqua); 2. Andrej Kmeť - personal handwritten letters; 3. Martin Lauček - Collectanea; 4. Postil of Izák Abrahamides Hrochotský from 1600 – 1601; 5. Postil of Juraj Schmideli-Kováčik from 1598 – 1607; 6. Canonical visitations of the Banská Bystrica Diocese from 18th to 19th centuries; 7. Hurban, J. M., handwritten documents; 8. Roman Catholic registries; 9. Forest Land Registers of Theresian Regulation; 10. Plot Records of Stable Land Register; 11. Congregation Records, Regional Records; 12. Other collections of written documents identified during archival research.

Fig. 2 Handwriting of Martin Lauček. From neat to more freeform handwriting

As of 2022, the following outputs and related activities of the SKRIPTOR project are available. Publications: NAGY, I. (2021), TOMEČEK, O. (2021), BÔBOVÁ, M. (2021), KATRENIAK, M. (2022), KATUŠČÁK, D. (2020, 2021), KOVÁČOVÁ, K. (2022). Further outputs are the draft of the HITEXT project in the Czech Republic for TAČR (2020) and NAKI III (2022): KATUŠČÁK, D. (2020 and 2022), participation in a student scientific conference in Opava, and activities in the student grant competition SGS/5/2022 (SGS SU Opava). It is important to learn the functionality of the Transkribus platform and to transfer this knowledge into education in Slovakia and the Czech Republic.

Transcription Workflow

Based on our own experience, we understand transcription as a complex process, which presupposes mainly determination, availability of financial resources, and infrastructure. The main processes include:

Preparation. In particular: archival information research (heuristics), identification of possible collections and documents, resolution of the conditions of availability of collections and documents, quantification and selection of documents for transcription (number of pages and homogeneity of manuscripts), agreement with the owner or custodian of the collection on the place and method of scanning and on rights.

Scanning. In particular: scanning or photographing documents, naming and organising directories and files on a computer, archiving source files (TIFF, RAW) and backing up derived files (JPG, PDF, PNG, etc.).

Fig. 3 A student of librarianship at the Silesian University scanning a manuscript text for her thesis in the archive in Jeseník using ScanTent and DocScan

Installation of Transkribus Expert Client and work with the Transkribus platform. In particular: consulting the Transkribus documentation, choosing the image format for Transkribus, quality control and preparation of images for uploading to Transkribus, choosing the method of uploading files, creating your own collection, uploading selected files to a collection on the Transkribus platform.

Manual transcription. In particular: selection of sample pages for manual transcription according to the specifics of the manuscript, decision on sharing the collection with collaborators and on their roles, manual transcription of the sample for the training set.

Segmentation of pages and metadata. In particular: segmentation of pages or entire sets, quality control and correction of manual transcription and segmentation, document metadata, page metadata, structural metadata, comments, KWS.

Creation of a transcription model in the Transkribus Expert Client. In particular: training the machine to learn the transcription model, checking the quality and efficiency of the model and correcting the training set, restarting the model creation and checking the quality of the model, selecting ground truth for quality pages, using the model to transcribe all segmented pages in the collection.

Access and use of transcription results. In particular: exporting results in different methods and formats, editing and correcting transcription results in Transkribus Lite, using a transcription model, making transcription results available on a local network or publishing transcription results on-line for use via read&search (see below).
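The file handling in the Scanning step above (naming and organising directories, archiving source files and keeping derived files separate) can be sketched as follows. This is only one possible layout under stated assumptions: the folder names `masters` and `derived`, the naming pattern, and the function `organise_scans` are hypothetical and are not prescribed by Transkribus or by the SKRIPTOR project.

```python
from pathlib import Path
import shutil

MASTER_EXTENSIONS = {".tif", ".tiff", ".raw"}            # source files to archive
DERIVED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".pdf"}   # working copies


def organise_scans(source_dir, target_dir, collection):
    """Copy scans into <target>/<collection>/masters or /derived.

    Files are renamed to <collection>_<4-digit number><extension>, numbered
    separately per folder in alphabetical order of the original names.
    Returns a list of (original name, new relative path) pairs.
    """
    source = Path(source_dir)
    target = Path(target_dir) / collection
    counters = {"masters": 0, "derived": 0}
    plan = []
    for path in sorted(p for p in source.iterdir() if p.is_file()):
        extension = path.suffix.lower()
        if extension in MASTER_EXTENSIONS:
            kind = "masters"
        elif extension in DERIVED_EXTENSIONS:
            kind = "derived"
        else:
            continue  # ignore notes, thumbnails and other unrelated files
        counters[kind] += 1
        destination = target / kind / f"{collection}_{counters[kind]:04d}{extension}"
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, destination)  # copy2 preserves file timestamps
        plan.append((path.name, destination.relative_to(target).as_posix()))
    return plan
```

The returned list of name pairs can be saved alongside the collection so that each uploaded page can be traced back to its original scan.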

Experiment with the Collection of Letters of Andrej Kmeť

Automatic transcription of handwritten text is what historians, linguists, archivists, librarians, documentalists and all others who come into contact with handwritten text have dreamt of for decades. Step by step, automatic transcription of manuscripts becomes a reality. In the background, there is massive international basic research in artificial intelligence and thousands of hours of work.

Of course, Transkribus is not a substitute for the professional and scientific erudition of historians and archivists. Their reserved attitude is therefore understandable. Artificial intelligence does not compete with experts. It helps them. Automatic transcription is only one step in the scientific work of historians. It is followed by historical research into the text and the context of the transcribed texts and information, editing of the texts obtained by transcription, identification of entities, and generation of keywords and metadata discovered in the text (dates, names of people, geographic locations, corporations, etc.).

The goal of more extensive transcription using the cutting-edge Transkribus platform is to facilitate reading and to make available unique collections of documents and archival units, which are usually preserved in archives in only one copy. This is the difference between holdings in libraries and in archives. Archives hold unique, authentic original documents, collections and archival units, while libraries hold titles that often exist in hundreds to thousands of copies. Unique archival materials need to be made available. The path to access leads through their transcription.

After the transcription of historical texts and manuscripts, the digital content may be edited, rendered, used and made accessible on a larger scale, including in public information systems and services. In addition, the transcribed original text, for example in Latin, Hungarian, German or another language, can be at least approximately translated into another language automatically. This substantially changes the nature of the work of archivists and historians. The results of our work include transcription models of different quality. An overview of the models is provided in the table.

 

Tab. 2 An overview of experiments with transcription models of Andrej Kmeť's handwritten correspondence

Explanatory notes to the table:

Date: The date the model was created (YYYYMMDD).

Method: Selected handwriting transcription method (HTR+).

ID: The identification number of the model in our collections and among all Transkribus models on the remote server.

Training set: The number of pages and the number of lines that have been manually transcribed and used to train the machine in the Transkribus platform. In total, 211 pages were transcribed for the exercise. Of these, 185 are used for training and 26 for validation. The basic transcription contained 50 pages used to prepare the first model. From the transcription results, more edited pages were added to the training model and further models were created.

Validation set: The number of pages and lines selected from the total number of transcribed pages to verify the training accuracy.

CER accuracy: Percentage of character errors in the input file and in the validation set. For manuscripts, it is practically impossible for manual transcription to be 0.0%.

Number of cycles: The number of cycles (stages) that the machine used for learning (training).

CER/WER: The values reflect the actual, practical accuracy from the user's point of view, i.e. the character error rate (CER) and word error rate (WER), of the six models created in 2019-2021 and owned by the author. We tested all models on a single double page, prepared as precisely as possible in FINAL quality, in collection ID 115514. It is a letter from Andrej Kmeť to Ľ. V. Rizner (document ID 621673).
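The division of the 211 manually transcribed pages into 185 training pages and 26 validation pages, described in the notes above, can be sketched as a simple random split. The text does not say how the validation pages were actually chosen, so the random selection, the seed and the function name below are assumptions made for illustration only.

```python
import random


def split_pages(page_ids, n_validation, seed=0):
    """Randomly set aside n_validation pages; the rest form the training set.

    Both returned lists keep the original page order.
    """
    ids = list(page_ids)
    if not 0 < n_validation < len(ids):
        raise ValueError("n_validation must be between 1 and len(page_ids) - 1")
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    chosen = set(rng.sample(ids, n_validation))
    training = [page for page in ids if page not in chosen]
    validation = [page for page in ids if page in chosen]
    return training, validation


# 211 transcribed pages, 26 of them held out for validation (about 12%).
training, validation = split_pages(range(1, 212), 26)
```

Keeping the validation pages strictly separate from the training pages matters because the validation CER is only meaningful for pages the model has not seen during training.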

The average character error rate calculated across the six models is 5.0%. Five of them were generated from training sets and pages of varying quality, mostly in the In Progress status. For the practical transcription of hundreds of further pages, however, it would be best to use model 36009, created from 185 training pages and 26 validation pages. The lowest CER values on the validation set evidently do not identify the most suitable models: the models in the first five rows of the sixth column of Table 2, which were not created from ground truth pages, are not the best choice for further transcription.

For the final preparation of this model, I used well-prepared pages in ground truth quality. In terms of the accuracy of transcription of other letters by Andrej Kmeť, I consider the results of model 36009, with a CER of 2.48% and a WER of 7.73%, to be the best.

The data in the CER/WER column do not reflect the transcription accuracy reported when the model was created from the pre-prepared training files (1.87%) and validation files (5.79%), but the best values obtained for individual pages. That is why the values differ. The CER/WER of 2.48% and 7.73% are only the best values, and they refer to a single page in the given model; such a page must be selected randomly from the collection and must not have been transcribed in any way in advance. The WER value by itself has little practical meaning. If we use Tools/Compare text version in Transkribus, we find that punctuation and diacritics (for example a length mark, caron or dot) are treated as distinctive parts of a word: if such a mark is added to, or missing from, the transcribed text compared with the ground truth (GT), the machine counts the whole word as erroneous, although the text remains clearly understandable to the user and is no harder to use. WER values are mostly used in mathematical linguistics, e.g. in machine translation.
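The effect described here, where a few missing diacritics or punctuation marks inflate the WER far more than the CER, can be reproduced with a small sketch. Both rates are computed as an edit distance divided by the length of the reference, once over characters and once over whitespace-separated words. The sentence is a made-up Slovak example, not a line from Kmeť's letters, and the exact counting rules used by Transkribus are not assumed to match this sketch.

```python
import unicodedata


def edit_distance(reference, hypothesis):
    """Levenshtein distance over any two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_item == hyp_item else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def error_rates(reference, hypothesis):
    """Return (CER, WER) in percent for a reference and a machine output."""
    # NFC keeps each accented letter as a single character.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    cer = 100.0 * edit_distance(ref, hyp) / len(ref)
    ref_words, hyp_words = ref.split(), hyp.split()
    wer = 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)
    return cer, wer


# Made-up example: two carons and one comma are missing in the output.
reference = "Milý priateľ, ďakujem za list"
hypothesis = "Milý priatel dakujem za list"
cer, wer = error_rates(reference, hypothesis)
print(f"CER {cer:.1f}%  WER {wer:.1f}%")
```

In this example three character edits in a 29-character line give a CER of about 10%, while two of the five words count as wrong, giving a WER of 40%, even though the line remains perfectly readable.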

We continuously organize and publish the results of document transcription on the Internet through read&search, a tool developed by the READ-COOP team. Public access to the documents is possible through the read&search website at https://Transkribus.eu/r/slovakia-state/#/, the interface of which we have translated into Slovak. The word error rate is de facto irrelevant here, because a single erroneous character (e.g. a punctuation mark) in most cases causes the whole word to be counted as an error.

For the final preparation of this model, we used well-prepared ground truth quality pages. From the point of view of the accuracy of transcription of Andrej Kmeť's other letters, we consider the results of Model 36009 with CER values of 2.48% and WER values of 7.73% to be the best. In the future, based on further experience, we will consider providing this model of ours for free use for similar manuscript collections.

Collection selection

A collection of the handwritten correspondence of Andrej Kmeť, mostly in Slovak, kept in the Library of the Slovak National Museum in Martin, was selected for the experiment with the prior kind consent of the Museum’s director. Some of the letters are in Latin or Hungarian, and parts of the letters are also in German and Czech. These are letters written by Andrej Kmeť (1841–1908). For a scientific approach to the correspondence of scholars in modern times in the spirit of digital humanities methodology, the most comprehensive source of knowledge is undoubtedly the international research initiated and led by Howard Hotson in 2014–2018 (Hotson 2019). In this study, we are interested in correspondence only as an extensive set of handwritten materials suitable for experiments with automatic transcription.

Andrej Kmeť and his correspondence are the subject of systematic research by Karol Hollý, who also provides additional resources relating to Kmeť's literary remains (Hollý 2013, 2015).

Scanning

Capture by scanning or, more accurately, by photography took place from 23 to 30 May 2018 in the Library of the Slovak National Museum in Martin. The ScanTent (scanning tent) equipment and the freely available DocScan application were used. ScanTent was chosen in order to verify the entire recommended Transkribus workflow. Many archives have already scanned parts of their collections in more or less good quality; the equipment selected here is useful where collections have not yet been scanned. Researchers and other users are generally not allowed to remove archival material from study rooms, and amateur photography of pages with smartphones or cameras is problematic for larger sets (thousands of pages). ScanTent and DocScan are therefore a potential and affordable choice, although with some practical issues (format, focus, quality). It should be noted, however, that this is photography rather than scanning. In the future, we would definitely use a professional scanner and scan in the highest achievable quality (300 to 600 DPI).

Five full archival boxes were scanned. Some of the letters ran to multiple pages, and there were also incomplete pages, blank pages, etc. One image can contain more than one page of a handwritten document: the scanning step produces images, not actual pages, unless each page is scanned individually. Sometimes it is preferable to scan sheets page by page, because a sheet scanned as a double page requires the pages to be put in the right order in the image post-processing step. However, in the next step, text segmentation, the individual pages can be arranged as blocks of text in the correct order. The pages of Andrej Kmeť's letters did not follow each other on the sheets, so one scanned image showed, for example, pages 3 and 1, and the next pages 2 and 4.
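
The reordering problem described above can be sketched in a few lines. This is only an illustration of the bookkeeping, not a Transkribus function: the file names, the dictionary layout and the left/right labels are assumptions made for the example.

```python
# Each scanned image shows two pages of a letter, in the order they appear
# on the image. File names and page labels are hypothetical.
images = [
    {"file": "img_001.jpg", "pages": [3, 1]},
    {"file": "img_002.jpg", "pages": [2, 4]},
]


def reading_order(images):
    """Return (page, file, side) tuples sorted by page number."""
    entries = []
    for image in images:
        for position, page in enumerate(image["pages"]):
            side = "left" if position == 0 else "right"
            entries.append((page, image["file"], side))
    return sorted(entries)


for page, file, side in reading_order(images):
    print(f"page {page}: {file} ({side} half)")
```

A list of this kind can serve as a checklist when the text regions are put into the correct reading order during segmentation.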

The total scanning time for about 3,000 pages was approximately 15 to 20 hours. Scanning was performed in manual single-page mode, sheet by sheet, not in series mode (automatic capture after a page is turned), as the handwritten material is on separate sheets of different formats. Part of the material comprises original letters, another part photocopies. The original letters in particular are often on brittle paper that requires conservation and preservation measures. For business cards and similar small formats, DocScan required zooming in on the scanned object; this was resolved by placing a blank A4 sheet underneath the item.

Some sheets were damaged (a missing corner, damaged edges of a sheet). In such cases, the system reported "no page found". This was resolved by placing a white sheet as a background under the scanned page and its missing parts, after which DocScan was able to focus.

Some items needed re-scanning because not enough attention had been paid to focusing. DocScan focuses on the sheet's surface at several spots, indicated by red and green markers. When the focus is satisfactory, "OK" appears and the picture can be taken. For taking pictures, we used a Samsung Galaxy 6 mobile phone with the Android operating system, with which DocScan worked at the time; DocScan is now also available for iOS. Initially, there were some issues with downloading data from the Samsung (Android) phone to a MacBook Air (macOS), so in the end a Windows PC was used to download the images. We consider the use of DocScan with the Samsung mobile phone to be an emergency solution, because in our further work, especially during segmentation, we discovered a relatively large number of blurred parts of pages. Where parts of pages were blurred, the segmentation was inaccurate and the text was subsequently not transcribed at all. In the future, we recommend using high-quality professional scanners for large valuable collections and capturing in the highest achievable quality.

 

Fig. 4 Handwriting of Andrej Kmeť. Letter to Ľ. V. Rizner

During scanning, DocScan can be connected directly to the Transkribus server (in Innsbruck or Rostock) so that images are transferred straight to the Transkribus platform. This option was not used because of insufficient connectivity, and because we considered it necessary to check the accuracy and quality of the scans first. Some operations for the Transkribus platform required the use of tools such as Preview, Adobe Acrobat, FileZilla Client v. 3.61.0, ABBYY FineReader PDF 15, Zoner Photo Studio X, and others. We used these tools to adjust the text orientation, eliminate duplicates, arrange pages in the set, merge files, etc.

The scanned digital content (images) was a) prepared for further processing in the DocScan software (content identification, metadata), b) stored without modification on CD-ROM for use by the collection’s owner at the discretion of its management, c) prepared for upload to the Transkribus platform and for further processing in the Transkribus software. Upload to the Transkribus server, segmentation, model generation, and transcription of the handwritten text followed.

The digital content was divided according to the arrangement found in the archival boxes. Five compact discs (CDs) were recorded and handed over to the director of the Ethnographic Museum of the Slovak National Museum in Martin, Dr. Mária Halmová. The collection's custodians can now use and publish the digital content; a CD may also be placed in each box. They can decide whether to provide access to the collection on CD or to work with the relatively brittle original archival sheets. We are gradually making the transcribed content available through the read&search software, which is used as software as a service (SaaS). We are still exploring how best to prepare metadata for documents and collections for publication via read&search.

Uploading Digital Image Files

The scanned images can be processed locally or edited after being imported to the remote Transkribus server. Before importing to the server and using the Transkribus platform, it is necessary to register and download Transkribus Expert Client. It is also possible to work with Transkribus Lite, in which, however, custom transcription models cannot be created. The next step is to create a private collection, which is available only to its creator unless the creator decides to share it with other users. A transcriber can grant students, operators and collaborators access to certain operations in the collection, for example preparing training samples or editing after transcription. Automatic transcription is carried out exclusively on the remote server using Transkribus Expert Client. Locally, one can work with one's own documents and collections as needed.

Before importing files, one needs to create one's own collection for the files to be transcribed. A single upload and import can hold up to 500 megabytes of images. A larger set of images can be divided and uploaded in multiple batches. Larger image files can also be uploaded and imported using an FTP client such as WinSCP, via URL, or via DFG Viewer METS. Images can be imported as PDF, JPG, TIFF and other formats. The collection of images created by scanning the letters of Andrej Kmeť was 11.7 gigabytes in size at 300 DPI resolution.
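
Dividing a large set of images into uploads that respect the 500-megabyte limit can be planned in advance. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the limit is taken as 500 binary megabytes, and the script only plans the batches locally; it does not talk to the Transkribus server.

```python
from pathlib import Path

# The 500 MB limit per upload; treating it as binary megabytes is an assumption.
MAX_BATCH_BYTES = 500 * 1024 * 1024


def plan_batches(sizes, limit=MAX_BATCH_BYTES):
    """Group (name, size_in_bytes) pairs, in order, into batches within the limit."""
    batches, current, current_size = [], [], 0
    for name, size in sizes:
        if size > limit:
            raise ValueError(f"{name} alone exceeds the batch limit")
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def sizes_in_folder(folder, pattern="*.jpg"):
    """Collect (file name, size) pairs for the images in a local folder."""
    return [(p.name, p.stat().st_size) for p in sorted(Path(folder).glob(pattern))]


# Hypothetical example: ten images of 120 MB each.
demo = [(f"scan_{i:03d}.jpg", 120 * 1024 * 1024) for i in range(10)]
for number, batch in enumerate(plan_batches(demo), start=1):
    print(number, batch)
```

Keeping the files in their original order within each batch preserves the arrangement of the archival boxes, which matters later when pages are put into reading order.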

Our experience shows that before importing it is advisable to check the digital images for quality, sharpness, bleed-through, completeness, page orientation, and so on. After gaining some experience, we also imported large PDF files via the faster, simple-to-use WinSCP software.

Segmentation

Once the files have been imported, an automated segmentation process must be run on the server; the client application must be connected to the server for this. Segmentation means that the image of the handwritten text, which is still stored on the server as an image, is automatically divided into blocks, areas, and lines of text. Manual corrections can be made as necessary. These include, for example, arranging, merging and splitting blocks, expanding a polygon, adjusting the baseline below a line, adjusting segment boundaries, and the like. Segmentation is of key importance to the transcription itself. High-quality scanned pages with sharp handwriting are usually segmented flawlessly. Sometimes, however, it is necessary to check carefully, or adjust manually, the order of text regions (TR), the reading order of lines, and the lines and polygons created by the machine (artificial intelligence).

HTR Machine Training

The Transkribus Expert Client machine is first trained on pages selected for the training set. The machine reads each page of the training set repeatedly, e.g. in 50 cycles, and gradually identifies characters that cannot be unambiguously identified or that arose due to incorrect transcription of pages in the ground truth set.

The Transkribus system first creates a model on the pages of the training set. Characters that the machine reads incorrectly on these pages are counted as errors in the training set; in the statistics, this is the CER on the training set. The HTR machine must first be trained for a particular hand. As a rule, a learning machine should "see" 100 examples of each character contained in the document, which usually corresponds to about 50 manually prepared pages of the training set (Mühlberger et al. [2016]).
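
Whether a prepared training set meets the rule of thumb of about 100 examples per character can be checked before training. The sketch below is a hypothetical helper, not part of Transkribus; it assumes the ground-truth lines have been exported as plain text, and the sample lines are invented for the example.

```python
from collections import Counter

MIN_EXAMPLES = 100  # rule of thumb quoted above


def rare_characters(lines, minimum=MIN_EXAMPLES):
    """Return characters seen fewer than `minimum` times, with their counts."""
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    return {ch: n for ch, n in sorted(counts.items()) if n < minimum}


# Hypothetical ground-truth lines; a real check would read the exported transcripts.
sample = ["Velebný pane!", "Ďakujem Vám srdečne za list."] * 60
print(rare_characters(sample))
```

Characters reported by such a check, typically capitals, digits and rare diacritics, indicate where additional ground-truth pages would help most.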

After training the model on the pages selected for the training set, Transkribus Expert Client automatically validates the trained model on the pages selected for the validation (verification) set, which serves as a practical test of the model. Each time, the machine approaches the text in the validation set as if seeing it for the first time and applies the model that it "learned" from the training set. At the end of this process, we have a model for automatic manuscript transcription. The most important value for evaluating the transcription accuracy of the model is the character error rate on the validation set (CER on Validation Set).

Thus, a sample of pages is selected from the imported collection according to a certain algorithm and is then used to train the machine and set up a model for a particular handwriting. The machine must be shown correct examples of text, from which it learns the patterns of letters and words in the training set. If a collection is written in more than one hand, training and test samples of appropriate size must be selected for each hand. Pages can be selected by a certain algorithm or automatically, so that the resulting sample contains about 20,000 words. The training dataset is created directly in Transkribus Expert Client, both locally and on the server. The manuscript must be transcribed in the editor carefully and very precisely, line by line, without correcting anything. The text needs to be transcribed in the language as used at the time of writing, including all grammatical errors, and in accordance with the further instructions and manuals available for this operation. The author of the transcription model should determine the order of text parts, tagging, the selection and editing of keywords, descriptive metadata, and so on. The outcome of transcription can then be viewed and evaluated on a test set. If the outcome is satisfactory, the remaining files or the entire collection can be transcribed automatically. In short, once the machine learning process and the creation of the model are completed, the model is available to its owner, who can use it, share it with other users and apply it to any document. Correct and incorrect reading data become the basis of the model.
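
The selection of pages up to a word target and the division into training and validation sets can be sketched as follows. This is not the algorithm used by Transkribus, where the sets are chosen in the client; the function names, the fixed random seed and the per-page word counts are assumptions made for the illustration.

```python
import random


def pages_for_word_target(word_counts, target_words=20_000):
    """Take pages in the given order until the word target is reached."""
    chosen, total = [], 0
    for page_id, words in word_counts:
        if total >= target_words:
            break
        chosen.append(page_id)
        total += words
    return chosen, total


def split_pages(page_ids, validation_share=0.1, seed=42):
    """Shuffle page identifiers reproducibly and hold out a validation share."""
    pages = list(page_ids)
    random.Random(seed).shuffle(pages)
    n_validation = max(1, round(len(pages) * validation_share))
    return pages[n_validation:], pages[:n_validation]


# Hypothetical collection: 211 pages with 150 words each.
collection = [(page_id, 150) for page_id in range(1, 212)]
chosen, total = pages_for_word_target(collection)
training, validation = split_pages(chosen)
print(len(chosen), total, len(training), len(validation))
```

Holding the validation pages out of training is what makes the CER on the validation set a fair estimate of how the model will behave on pages it has not seen.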

Automatic Transcription

Automatic transcription serves as the basis for scientific editing, in which the text can be modified, corrected, proofread and explicitly enriched with further data: context data, deciphered readings, tags, notes, metadata, annotations, corrections of diacritical marks, abbreviations, uppercase and lowercase letters, palaeographical processing, ligatures, etc. Automatic transcription was carried out after a run of training and testing, using a custom transcription model based on HTR+.

 

Fig. 5 A screen displaying data after automatic conversion using the ID 36009 custom model

Training for the automatic transcription of Andrej Kmeť's handwritten letters resulted in an excellent CER of 1.37% on the training dataset and 1.76% on the test dataset. The training set contained 29,411 words and 4,573 lines. We applied the model to further sheets and corrected them to ground truth quality.

While familiarising ourselves with the Transkribus Expert Client platform, and despite our trials and errors, we improved from 2019 onwards, from an error rate of 22.81% in 2018 to an error rate of 1.76% with the HTR machine in 2021. Transcription effectiveness improved significantly after the HTR+ machine became available. At first, we worked only with training sets that were not of ground truth quality. The basic transcribed training set had 50 pages. We expanded this basic set relatively easily to 185 pages by transcribing additional pages with the older model, correcting them as accurately as possible to ground truth quality, and adding them to the training set.

Finally, from these ground truth quality pages we created the above-mentioned model no. 36009, which can achieve good to excellent transcription results depending on image quality, character sharpness, handwriting quality, and segmentation quality.

Preliminarily, it can be stated that many transcription errors relate to punctuation. A detailed analysis of the causes of inaccuracies will be the subject of further research, as well as research into the correlation between scan quality and segmentation with respect to transcription quality.

Fig. 6 Text segmentation, transcription in the Transkribus editor and the result of automatic transcription

Transcription of Fraktur (Schwabacher)

The experiment concerned the application of artificial intelligence to the automatic transcription of Slovak and Czech Fraktur and Schwabacher (Voit 2006). Fraktur is a Gothic typeface that was widely used from the 15th century onwards in Czech and Slovak books, newspapers and magazines of the modern age and later, practically until the 1950s.

Fig. 7 Transcription of J. N. Bobula's Jánošík (printed) published in read&search (text at the top, overlay at the bottom)

As part of teaching the subject of digitisation at the Institute of Czech Studies and Librarianship of the Silesian University, we used the artificial intelligence tools of Transkribus Expert Client to prepare what is probably the first very successful transcription of Slovak and Czech printed Fraktur text: the historical newspapers Moravské noviny and Opavský besedník and the Slovak publication Jánošík. We prepared transcription models for Slovak and Czech Fraktur scripts (Table 1). On the training set, we achieved a character error rate of 0.39%. However, the higher value of 0.44% achieved on the validation set (CER on validation set) is decisive for the practical use of this model.

 

Tab 3 Transcription of Fraktur

From now on, we are able to transcribe Fraktur in Slovak and Czech historical printed materials with an accuracy of about 99%. In our case, the accuracy is 99.56% and the error rate 0.44%. The transcription results for the Czech Fraktur text are available after logging in to the Transkribus platform in the FRAKTURA_CZ collection (114429, Owner) and on the Internet in the beta version of the read&search browser.

 

Fig. 8 Example of segmentation of the Moravské noviny 1849 (Antiqua and Fraktur)

 

Fig. 9 Example of transcription of the Moravské noviny 1849 using custom model

 

Fig. 10 Cut-out of transcription and display of text over transcribed text in read&search

 

Further research

In further research, it will be appropriate to focus on the following areas:
a) selection and standard description of larger Slovak and Slovak-related manuscript collections of European and national significance,
b) digitisation of selected historical documents according to an experimental plan, to confirm or improve known procedures and values with regard to the subsequent text segmentation and automated transcription (correlation between scanning conditions and quality and transcription quality),
c) thorough analysis and description of text segmentation results,
d) sharing of digital documents with archives and other institutions that will be able to use them at their own discretion as a replacement for paper documents,
e) creation, training and analysis of automatic transcription models for early modern and modern collections and languages (especially Slovak, Czech, Hungarian, Latin, German, Polish),
f) verification and evaluation of the usability of finished, available transcription models from research in the READ project,
g) familiarisation with best practice in the automatic recognition of texts of historical documents in Europe, especially in Germany, Austria, Spain, Hungary, Great Britain, Finland, the Netherlands and Serbia, and use of this information and experience in Slovakia,
h) automatic transcription of a substantial part of Lauček's manuscript collection and its virtualisation, i.e. a single virtual digital presentation of volumes held in geographically different locations (Slovak National Library in Martin, Slovak National Archives in Bratislava, University Library in Bratislava, Országos Széchényi Könyvtár in Budapest),
i) research into the possibilities of increasing the efficiency of recognition of manuscript texts and texts of historical documents through the Transkribus platform and related tools,
j) making transcribed and interpreted collections available to the general public via a digital repository,
k) creating documentation that archives, libraries, academic institutions and individuals can use for automatic transcription of texts,
l) building a digital humanities cabinet with a focus on transcription of historical documents.

Conclusion. Effectiveness of the Transkribus platform

Our experience verified by experiments confirms that handwritten materials can be automatically transcribed, the error rate can be very low and the results are excellent. The transcription results are readable and can be exported in various formats such as DOC, TXT, PDF, TEI, METS, further edited, adjusted, corrected, and used.

In the experiment, the accuracy level of 94.21% was achieved on Andrej Kmeť's handwriting with a character error rate (CER) of 5.79%. In transcription of printed Fraktur, the accuracy level was 99.56% with a character error rate of 0.44%.
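
The accuracy levels quoted here are consistent with taking accuracy as the complement of the character error rate on the validation set. As a short worked check, under that assumption:

```latex
\begin{align*}
\text{accuracy} &= 100\,\% - \text{CER} \\
\text{handwriting:}\quad 100\,\% - 5.79\,\% &= 94.21\,\% \\
\text{printed Fraktur:}\quad 100\,\% - 0.44\,\% &= 99.56\,\%
\end{align*}
```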

In terms of perception, understanding and use of transcribed text in general, the authors of the Transkribus platform hold that a) if word errors are counted strictly and the word error rate is up to 30%, the text is still understandable and usable for humans, b) if character errors are counted strictly and the character error rate is up to 15%, the text is still understandable and usable for humans.
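
The two thresholds can be written down as a simple check and applied to the values reported in this study. The function name is hypothetical, and requiring every supplied rate to pass is an assumption of this sketch; the statement above gives the two criteria separately.

```python
WER_LIMIT = 0.30  # word error rate threshold quoted above
CER_LIMIT = 0.15  # character error rate threshold quoted above


def still_usable(cer=None, wer=None):
    """Check the supplied error rates against the quoted thresholds.

    Rates are fractions (0.0579 means 5.79%). A rate left as None is skipped.
    """
    checks = []
    if cer is not None:
        checks.append(cer <= CER_LIMIT)
    if wer is not None:
        checks.append(wer <= WER_LIMIT)
    if not checks:
        raise ValueError("supply at least one of cer or wer")
    return all(checks)


# Values reported in this study.
print(still_usable(cer=0.0579))              # Kmeť handwriting, validation set
print(still_usable(cer=0.0044))              # printed Fraktur, validation set
print(still_usable(cer=0.0248, wer=0.0773))  # model 36009, best page
```

All three reported results fall well inside both limits.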

The Transkribus platform is an excellent tool for patient and conscientious scholars. While it cannot replace them in any way, they may find it very helpful when fine-tuning a transcription by editing and correcting the results. The platform is not, and hardly ever will be, intended merely for "clickers", i.e. users who are accustomed to "clicking" rather than innovating patiently.

 

List of bibliographic references

KATUŠČÁK, D., I. NAGY, M. BÔBOVÁ, P. KUNC, A. KURHAJCOVÁ, P. MALINIAK, M. MIKUŠKOVÁ, L. NIŽNÍKOVÁ, I. POLÁKOVÁ, B. SNOPKOVÁ and O. TOMEČEK, 2019. SKRIPTOR Projekt APVV-19-NEWPROJECT-17816 (2020–2024). Inovatívne sprístupnenie písomného dedičstva Slovenska prostredníctvom systému automatickej transkripcie historických rukopisov. [Innovative disclosure of written heritage of Slovakia through the automatic transcription of historical manuscripts]. Organisations: Univerzita Mateja Bela v Banskej Bystrici (principal investigator doc. Imrich Nagy, PhD) and Štátna vedecká knižnica v Banskej Bystrici, partner (guarantor prof. PhDr. Dušan Katuščák, PhD).

ADAM MATTHEW DIGITAL, 2018. Handwritten text recognition: artificial intelligence transforms discoverability of handwritten manuscripts. [accessed 2.10.2021]. Available at: www.amdigital.co.uk/products/handwritten-text-recognition.

BÔBOVÁ, M., 2021. Projekt Skriptor, keď stroj sa stáva žiakom. In: Online scientific conference NON SCHOLAE, SED VITAE DISCIMUS, 7 June 2021, held under the auspices of the State Scientific Library (ŠVK) in Prešov.

DROBAC, S., 2020. OCR and post-correction of historical newspapers and journals (Doctoral dissertation). Helsinki: University of Helsinki, 2020. ISBN 978-951-51-6511-4 (paperback), ISBN 978-951-51-6512-1 (PDF). [accessed 10.6.2022]. Available at: https://helda.helsinki.fi/bitstream/handle/10138/319496/OCRandpo.pdf?sequence=1&isAllowed=y.

HODEL, T., D. SCHOCH, C. SCHNEIDER and J. PURCELL, 2021. General Models for Handwritten Text Recognition: Feasibility and State-of-the Art. German Kurrent as an Example. Journal of Open Humanities Data, 7, 13. [accessed 1.10.2022]. Available at: https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.46/.

HOLLÝ, K., 2013. Veda a slovenské národné hnutie: snahy o organizovanie a inštitucionalizovanie vedy v slovenskom národnom hnutí v dokumentoch 1863–1898. Bratislava: Historický ústav SAV v Typoset Print s. r. o., 2013.

HOLLÝ, K., 2015. Andrej Kmeť a slovenské národné hnutie: Sondy do života a kreovanie historickej pamäti do roku 1914. Bratislava: Veda, Historický ústav SAV, 2015. 279 pp. ISBN 978-80-224-1480-7.

HOTSON, H. and T. WALLNIG (eds.), 2019. Reassembling the Republic of Letters in the Digital Age. Göttingen: Göttingen University Press, 2019. 470 pp. [COST Action IS1310; 2014–2018]. ISBN 978-3-86395-403-1. DOI: https://doi.org/10.17875/gup2019-1146. [accessed 1.10.2022]. Available at: https://www.univerlag.uni-goettingen.de/handle/3/isbn-978-3-86395-403-1.

KATRENIAK, M., 2022. Automatická transkripcia rukopisných historických textov na príklade vybraných kanonických vizitácií. Available at: https://opac.crzp.sk/?fn=detailBiblioForm&sid=BDC2D20A28F62792149F199B8B08.

KATUŠČÁK, D., 2008. Súčasný stav formovania stratégie digitalizácie na Slovensku. In: Kolokvium knihovních a informačních pracovníků zemí V4+. 6–8 July 2008, Brno, Czech Republic. Electronic proceedings, pp. 30–46.

KATUŠČÁK, D., 2021. Pochybná hodnota za veľa peňazí? In: Kultúrny kyslík. 2021, no. 2, pp. 14–17. ISSN 1339-6919. [accessed 3.10.2021]. Available at: https://via-cultura.sk/kulturny-kyslik-2-2021/.

KATUŠČÁK, D. and M. KATUŠČÁK, 2011. Základná koncepcia národného projektu digitálna knižnica. In: Knižnica, 2011, 12(2), 6–10. [accessed 2.10.2021]. Available at: https://www.snk.sk/images/snk/casopis_kniznica/2011/februar/06.pdf

KATUŠČÁK, D., 2011a. Digitálna knižnica a digitálny archív. Národný projekt. Operačný program informatizácie spoločnosti OPIS2. Implementácia 2010–2015. Martin: Slovenská národná knižnica, 2011. [Complete project accompanying an application for a non-repayable financial contribution from the structural funds of the European Union, ca. 4000 pp.].

KATUŠČÁK, D., 2011b. Národný projekt digitálna knižnica a digitálny archív. In: Bulletin Slovenskej asociácie knižníc. Bratislava: SAK, 2011. 38 pp. [Project description]. Available at: http://dusan.katuscak.net/2011/12/02/digitalna-kniznica-a-digitalny-archiv-opis2/.

KATUŠČÁK, D., 2011c. Situační zpráva o národním projektu SNK Digitální knihovna a digitální archiv. In: 12. konference Archivy, knihovny, muzea v digitálním světě 2011. Praha: SKIP, 30 November and 1 December 2011, conference hall of the National Archives in Prague, Archivní 4, Praha 4 - Chodovec. [accessed 2.10.2021]. Available at: http://old.skipcr.cz/dokumenty/akm-2011/Katuscak.pdf.

KATUŠČÁK, D., 2021. Progress in making available blackletters typefaces and handwritten written heritage using artificial intelligence. Preprint. Researchgate. 2021, 25 pp.

KOVÁČOVÁ, K., 2022. [bachelor's thesis] Výběr pozoruhodných rukopisných sbírek Jesenicka. [accessed 2.10.2022]. Available at: https://is.slu.cz/th/bum3h/FPF_BP_2022_53474_Kovacova_Klara.pdf.pdf

KIŠŠ, M., 2018. Rozpoznávání historických textů pomocí hlubokých neuronových sítí. Brno, 2018. Master's thesis. Vysoké učení technické v Brně, Fakulta informačních technologií. Supervisor: Ing. Michal Hradiš, Ph.D.

MARTÍNEK, J., L. LENC and P. KRÁL, 2020. Building an efficient OCR system for historical documents with little training data. Neural Computing & Applications 32, 17209–17227 (2020). [accessed 2.10.2021]. Available at: https://doi.org/10.1007/s00521-020-04910-x.

MINISTERSTVO KULTÚRY SLOVENSKEJ REPUBLIKY, 2019. Revízia výdavkov na kultúru. Priebežná správa. Október 2019. Kap. 4.4 Projekt digitalizácie, pp. 75–78. [accessed 2.10.2021]. Available at: https://www.culture.gov.sk/wp-content/uploads/2019/12/Revizia_vydavkov_na_kulturu_priebezna_sprava_compressed.pdf.

MINISTERSTVO KULTÚRY SLOVENSKEJ REPUBLIKY, 2020. Revízia výdavkov na kultúru. Záverečná správa. Júl 2020. Kap. 4.9 Digitalizácia kultúrneho dedičstva, pp. 132–139. [accessed 2.10.2021]. Available at: https://www.culture.gov.sk/wp-content/uploads/2020/10/Revizia_vydavkov_na_kulturu_-_zaverecna_sprava_compressed.pdf.

MÜHLBERGER, G., 2016. READ (Recognition and Enrichment of Archival Documents) – 2016–2019. [Project study]. [accessed 6.10.2021]. Available at: https://www.academia.edu/22653102/H2020_Project_READ_Recognition_and_Enrichment_of_Archival_Documents_-_2016-2019.

MÜHLBERGER, G., L. SEAWARD, M. TERRAS, S. ARES OLIVEIRA, V. BOSCH, M. BRYAN, S. COLUTTO, H. DÉJEAN, M. DIEM, S. FIEL, B. GATOS, A. GREINOECKER, T. GRÜNING, G. HACKL, V. HAUKKOVAARA, G. HEYER, L. HIRVONEN, T. HODEL, M. JOKINEN, P. KAHLE, M. KALLIO, F. KAPLAN, F. KLEBER, R. LABAHN, E.-M. LANG, S. LAUBE, G. LEIFERT, G. LOULOUDIS, R. McNICHOLL, J.-L. MEUNIER, J. MICHAEL, E. MÜHLBAUER, N. PHILIPP, I. PRATIKAKIS, J. PUIGCERVER PÉREZ, H. PUTZ, G. RETSINAS, V. ROMERO, R. SABLATNIG, J.-A. SÁNCHEZ, P. SCHOFIELD, G. SFIKAS, C. SIEBER, N. STAMATOPOULOS, T. STRAUSS, T. TERBUL, A.-H. TOSELLI, B. ULREICH, M. VILLEGAS, E. VIDAL, J. WALCHER, M. WEIDEMANN, H. WURSTER and K. ZAGORIS, 2019. Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study. Journal of Documentation, 75(5), 954–976. Available at: https://doi.org/10.1108/JD-07-2018-0114.

MÜHLBERGER, G., J. ZELGER and D. SAGMEISTER, 2014. User-driven correction of OCR errors: combining crowdsourcing and information retrieval technology. In: ANTONACOPOULOS, A. and K. U. SCHULZ (eds.), DATeCH’14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, Madrid, Spain, 19–20 May 2014 (pp. 53–56). New York, NY: Association for Computing Machinery. Available at: https://doi.org/10.1145/2595188.2595212.

MÜHLBERGER, G., S. COLUTTO and P. KAHLE [2016, Preprint]. Handwritten Text Recognition (HTR) of Historical Documents as a Shared Task for Archivists, Computer Scientists and Humanities Scholars. The Model of a Transcription & Recognition Platform (TRP). Available at: https://www.academia.edu/8601748/Preprint_Handwritten_Text_Recognition_HTR_of_Historical_Documents_as_a_Shared_Task_for_Archivists_Computer_Scientists_and_Humanities_Scholars_The_Model_of_a_Transcription_and_Recognition_Platform_TRP_?bulkDownload=thisPaper-topRelated-sameAuthor-citingThis-citedByThis-secondOrderCitations&from=cover_page

MÜHLBERGER, G., 2002. Digitising instead of mailing or shipping: a new approach to interlibrary loan through customer-related digitisation of monographs. Interlending & Document Supply, 30(2), 66–72. Available at: https://doi.org/10.1108/02641610210430523.

NAGY, I., 2021. Možnosti aplikácie metódy digitálnej transkripcie historických rukopisných textov pri sprístupňovaní archívnych fondov = The Possibilities of application the method of digital transcription of historical manuscript texts in the process of accessing the archival fonds. In: Slovenská archivistika. Bratislava: Ministerstvo vnútra Slovenskej republiky, 2021, 51(2), 53–67. ISSN 0231-6722. Available at: https://www.minv.sk/swift_data/source/verejna_sprava/odbor_archivov_a_registratur/archivnictvo/slovenska_archivistika/SA%202-2021,%20roc.%2051.pdf.

POOLE, A. H., 2017. The Conceptual Ecology of Digital Humanities. In: Journal of Documentation, 2017, 73(1), 91–122. [accessed on 03-10-2021]. Available at: https://www.academia.edu/27862789/The_Conceptual_Ecology_of_Digital_Humanities.

STROBEL, P. B., S. CLEMATIDE and M. VOLK, 2020. How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR. In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 3551–3559. Marseille, 11–16 May 2020. European Language Resources Association (ELRA).

ŠTUDENTSKÁ grantová súťaž SGS/5/2022 (SGS SU Opava). Tvorba modelu automatické transkripce historického rukopisu s využitím umělé inteligence. Investigators: prof. PhDr. Dušan Katuščák, PhD., Ing. I. Kyselová, PhD., and from October 2022 also K. Kováčová. KOVÁČOVÁ, K. and I. KYSELOVÁ, 2022. Robot čte rukopisnou kuchařskou knihu z roku 1667? In: Študentská vedecká konferencia. Slezská univerzita v Opavě, April 5, 2022.

TOMEČEK, O., 2021. Metales Banskej Bystrice z roku 1820. Reambulácia juhozápadného úseku mestských hraníc spoločných so susedným teritóriom rodiny Radvanských = Metales of the town Banská Bystrica from 1820. Perambulation of the southwest part of town borderline common with neighbouring domain of Radvanský family / Oto Tomeček. In Acta historica Neosoliensia : vedecký časopis pre historické vedy. Banská Bystrica: Vydavateľstvo Univerzity Mateja Bela – Belianum, 2021, 24(2), 112–133. ISSN 1336-9148. Available at: https://www.ahn.umb.sk/tomus-24-num-2-tomecek-o-metales-banskej-bystrice-z-roku-1820-reambulacia-juhozapadneho-useku-mestskych-hranic-spolocnych-so-susednym-teritoriom-rodiny-radvanskych/.

VOIT, P., 2006. Encyklopedie knihy: starší knihtisk a příbuzné obory mezi polovinou 15. a počátkem 19. století. Praha, 2006. Entry: Švabach – Encyklopedie knihy. [accessed 2.10.2022]. Available at: https://www.encyklopedieknihy.cz/index.php/%C5%A0vabach.

Acknowledgements

PhDr. Mária Halmová, Mgr. Viera Varínská, and PhDr. Anna Peťová, for their help in scanning the manuscripts of Andrej Kmeť in the Ethnographic Museum in Martin.

Oľga Kuchtová from Banská Štiavnica, for her help in finding information about the life and working conditions of Andrej Kmeť in Prenčov. Mgr. Mária Bôbová, PhD., State Scientific Library in Banská Bystrica, for assistance and cooperation in manual transcription and page segmentation for the training model and transcription of Andrej Kmeť's letters. Lucie Valjentová, a fourth-year student of librarianship at the Institute of Czech Language and Library Science of the Silesian University in Opava, for her help in transcribing the Czech Fraktur texts. Aleš Drahotušský, for providing newspapers from the Digital Library of the State Scientific Library in Ostrava.

Notes

1 ORCID: 0000-0001-7444-1077. Silesian University Opava. Faculty of Philosophy and Science in Opava; Institute of Czech Language and Library Science. State Scientific Library in Banská Bystrica.

2 The study is the output of the project APVV-19-0456 SKRIPTOR – Innovative disclosure of the written heritage of Slovakia through a system of automatic transcription of historical manuscripts.

3 OCR – Optical Character Recognition

4 READ – Recognition and Enrichment of Archival Documents, a project implemented in 2016–2019 under the Horizon 2020 programme. [accessed on 02-10-2021]. Available at: https://cordis.europa.eu/project/id/674943.

5 Dušan Katuščák was one of the three evaluators of the READ project for the European Commission.

6 Transkribus. A comprehensive platform for digitisation, AI-powered text recognition, as well as transcription and retrieval of historical documents – from any location, any time, and in any language. In Transkribus Lite, it is possible to use the Transkribus Expert Client collections in the browser on PCs and smartphones. Many of the features of the Transkribus Expert Client can also be used in Transkribus Lite. The platform integrates tools developed by research groups across Europe, including the Pattern Recognition and Human Language Technology group of the Technical University of Valencia and the CITlab group of the University of Rostock. As of October 2022, Transkribus had more than 94,000 users, 40 million images and 20 million recognized pages. The platform was developed in the context of two EU projects, tranScriptorium (2013–2015) and READ (2016–2019).

7 SKRIPTOR. Project APVV-19-NEW PROJECT-17816 (2020–2024). Inovatívne sprístupnenie písomného dedičstva Slovenska prostredníctvom systému automatickej transkripcie historických rukopisov. [Innovative access to written heritage of Slovakia through the automatic transcription of historical manuscripts]. Research organizations: Matej Bel University in Banská Bystrica (principal investigator doc. Imrich Nagy, PhD); State Scientific Library in Banská Bystrica – partner (guarantor prof. PhDr. Dušan Katuščák, PhD).

8 The research was previously funded as part of the tranScriptorium project. This project has received funding from the European Union's Seventh Framework Programme for Research and Technological Development under grant agreement No 600707.

9 If you are interested in transcribing individual shorter documents, you can try using one of the publicly available transcription models with a similar font, print or handwriting.

10 HTR – Handwritten Text Recognition

11 HTR+ – Handwritten Text Recognition. Transkribus' HTR+ software cannot start automatic transcription immediately; it must first be trained on a specific typeface or handwriting.

12 In Slovakia, this was the national project Digital Library and Digital Archive 2012–2015, a mass digitization and conservation project that was extraordinary and unprecedented in the European context. It was run under the auspices of the Slovak National Library (SNK) in Martin and was initiated and authored by Dušan Katuščák (Katuščák et al. 2008, 2011a, 2011b, 2011c, 2021 and others). The project was partially implemented on the basis of a contract between the SNK and the Office of the Government of the Slovak Republic of 7 March 2012 on the provision of a non-refundable financial contribution of more than EUR 49 million. A unique infrastructure was built: 20 scanners, including 10 digitizing robots and semi-automated machines, a digital archive for the long-term preservation of digital content, and the Slovakiana platform for access to digital documents; 73 new jobs were created. The aim was to digitize about three million documents, in effect the entire Slovak library collections: books, newspapers, magazines, anthologies, etc. The project is unique in integrating mass industrial digitization with industrial preservation of deteriorating acidic paper. After substantial management changes in 2012, only about 10% of the planned volume had been digitized by 2021, and a total of about 60 million euros had been spent at the SNK. Mass deacidification of paper is not carried out, so paper as a carrier continues to degrade irreversibly (an irreversible thermodynamic process). The digital documents are not available online. The state of digitization is described, in part critically, in the analyses of the Ministry of Culture of the Slovak Republic (MKSR, 2019 and MKSR, 2020).

13 PyLaia is a tool for handwritten text recognition that is supported in addition to the CITlab HTR+ engine. The two engines work quite similarly, so their results are usually similar in character error rate (CER). The only difference is that in PyLaia users can set several parameters themselves. The network structure of PyLaia can also be changed, which is an opportunity for those familiar with machine learning. Modifications to the neural network can be made through the GitHub repository. HTR+ will usually give better results with curved or inverted lines, but it is possible that PyLaia will soon be able to keep up. HTR+ is required if the Text to Image tool is needed, as this tool has not yet been implemented in PyLaia. Documents that have been transcribed using a PyLaia model can be searched using the full-text search (Solr) in Transkribus.

14 CER (Character Error Rate) is a measure of character errors. For a given page, it compares the total number of characters (n), including spaces, with the minimum number of insertions (i), substitutions (s) and deletions (d) of characters required to obtain the Ground Truth. These are therefore errors relative to the exact text. The formula for calculating CER is as follows: CER = [(i + s + d) / n] × 100. Every small mistake in the transcription counts statistically as a full error. This means that any missing comma, "u" instead of "v", extra space, or even a capital letter instead of a lowercase letter is included in the CER as an error.
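
As an illustration of the formula in note 14, the following minimal Python sketch computes CER from a Ground Truth string and an automatic transcription. It is not the code used by Transkribus. It assumes that n is the number of characters in the Ground Truth and that i + s + d equals the Levenshtein edit distance between the two strings; the function names edit_distance and cer are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, substitutions and
    deletions (i + s + d) needed to turn hypothesis into reference."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for r_idx, r_char in enumerate(reference, start=1):
        curr = [r_idx]
        for h_idx, h_char in enumerate(hypothesis, start=1):
            cost = 0 if r_char == h_char else 1
            curr.append(min(
                prev[h_idx] + 1,         # character missing in hypothesis
                curr[h_idx - 1] + 1,     # extra character in hypothesis
                prev[h_idx - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """CER = [(i + s + d) / n] * 100, with n taken (as an assumption)
    to be the length of the Ground Truth, spaces included."""
    if not reference:
        raise ValueError("Ground Truth must not be empty")
    return edit_distance(reference, hypothesis) / len(reference) * 100


# A lowercase letter in place of a capital counts as one full error:
print(cer("Kmet", "kmet"))  # 25.0
```

In this sketch a single wrong character in a four-character line already yields a CER of 25%, which shows why short lines with small slips weigh heavily in the statistic.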

15 READ-COOP. [accessed on 01-10-2022] Available at: About us – READ-COOP (readcoop.eu). In October 2022, the association had 113 members from 27 countries. Slovakia was the only member country from Central and Eastern Europe at that time.

16 Manual transcription: 10–15 euros/page; automatic transcription with Transkribus: ca. 0.12–0.14 euros/page. Calculated by: Transkribus Credits & Pricing – READ-COOP (readcoop.eu).

17 In 2017, the author worked with version 1.3.7 of the Transkribus Expert Client. Version 1.22.0 was available in October 2022.

18 HITEXT. In 2022, the Silesian University in Opava prepared a proposal for an applied research project with the acronym HITEXT in the NAKI III program. The project is being assessed in 2022. In addition, we are addressing the issue as part of education and in the student grant competition project in 2022.

19 Project of the Slovak Research and Development Agency - APVV-19-NEWPROJECT-17816 (2020–2024). Inovatívne sprístupnenie písomného dedičstva Slovenska prostredníctvom systému automatickej transkripcie historických rukopisov. [Innovative disclosure of written heritage of Slovakia through the automatic transcription of historical manuscripts].

20 KWS (Keyword Spotting) is a powerful search tool that helps find similar images of words in documents. Its main advantage is that the documents do not need to be definitively transcribed: it is enough to run a text recognition model on them, and they can then be searched immediately. KWS reliably finds words and phrases (variants of text images). The tool shows the pages containing the specified keyword and displays a preview snippet. In addition, it provides a value between 0 and 1 (0 = lowest and 1 = highest) for evaluating the quality of the search results.

21 I remember how much effort and time Pavol Vongrej had to spend in the past to transcribe the 20,400 verses of the manuscript work Matora by Michal Miloslav Hodža, or Viliam Sokolík to transcribe part of the correspondence between A. Kmeť and V. Rizner. In 1991, in cooperation with Ing. Ján Mišík, I tried to use a character recognition system for the automatic transcription of handwritten cataloguing records from the old catalogue of the Slovak National Library (Matica slovenská). The transcription efficiency of IRIS OCR was approximately 35–40%, and the transcription was unusable. I published brief information about working with the Transkribus platform in 2018 in a blog and in a Facebook status, and I was surprised by the interest expressed in this work. This is understandable, because many historians, linguists, librarians and educators are increasingly trained in the use of new technologies and understand that innovations that make their work easier are important.

22 WER – Word Error Rate

23 The transcription states are: New (newly uploaded documents), In Progress (status changed automatically after a page is edited), Done (page transcribed), Final (page transcribed and checked), Ground Truth (100% correctly transcribed page). This means that work on each individual page is recorded, and different states can be assigned to a page version, depending on how much progress has already been made on it.

24 Petr Voit is an excellent expert on historical script and typefaces. His works contain examples of typeface variants in Czech historical prints, which definitely need to be examined from the point of view of transcription.

25 There are several types of Gothic script: for example, the French textura, with a very sharp fracture and a slim structure; the Italian rotunda, wider and rounder, with milder arch breaks; the mixed type, bastarda; and, in Germany, the Schwabacher script, with wider, more oval shapes, and Fraktur, with narrower and more pointed shapes and ornamental features. With the invention of the printing press (in 1450 by Johann Gutenberg), this typeface became very widespread, especially in German-speaking countries.

26 Martin Lauček (12 May 1732 – 9 February 1802) was a Slovak Protestant priest, translator and religious writer. He is the author of the monumental manuscript work Collectanea, comprising about 24 volumes and about 20,000 pages. Collectanea is an invaluable source of knowledge and information on the history of the Protestant Church and of Protestantism. Our goal is to collect all available volumes and create a single, publicly available virtual digital collection. Next, we will analyze the texts and try to have them automatically transcribed and published for everyone.

 

KATUŠČÁK, Dušan. Umelá inteligencia pomáha sprístupňovať písomné dedičstvo. Knihovna: knihovnická revue. 2022, 33(2). ISSN 1801-3252.

Dec 28, 2022