What is this website?
Background
This website is the result of the Decoding Hidden Heritages (DHH) project, run by the University of Edinburgh and Dublin City University from 2021 to 2024 and funded by UKRI-AHRC and the Irish Research Council under the ‘UK-Ireland Collaboration in the Digital Humanities Research Grants Call’ (grant numbers AH/W001934/1 and IRC/W001934/1). As part of this project, a large text corpus of folktales from Scotland and Ireland was created, drawing on the Tale Archive of the School of Scottish Studies Archives and on the Main Manuscript & Schools’ Collections of the Irish National Folklore Collection. This material is currently being analysed and the results will be published in 2026.
Source data
The source data was in manuscript or printed form. The manuscript material comprises transcriptions of oral storytelling collected in Scotland and Ireland between 1792 and 1975. The handwritten transcriptions were usually based on audio recordings. Many of the audio recordings no longer exist, especially the Irish ones. Where the audio recordings do exist, they may be available, or may become so, on Tobar an Dualchais or on Dúchas. The printed material comprises editions of stories that were likewise transmitted orally.
Automatic text recognition
The DHH project used manuscript and print material that had already been scanned, and scanned further such material. Automatic text recognition (ATR) models were then trained, for both printed (OCR) and handwritten (HTR) text, to create digital text representations of the folktales. As part of this process, a portion of the handwritten material (less than 10%) was transcribed manually to produce training data, and the remainder was transcribed using these new AI models. As a result, the transcriptions are often not completely accurate, but with an error rate of less than 5% they are extremely useful for search, discovery and analysis. Note also that the original handwritten transcriptions of the spoken word were often deliberately dialectal in form, so non-standard spellings are not necessarily recognition errors. Most of these transcriptions are being made publicly available for the first time. Users who cannot read Irish or Scottish Gaelic may wish to submit texts that interest them to a machine translation system. The results may be variable, however, and users are advised to consult fluent speakers of these languages to ensure correct interpretation.
The automatic transcription was performed using Transkribus, an AI-powered platform for automated text recognition and transcription of historical documents. The Decoding Hidden Heritages project was one of the winners of the Transkribus 100k Giveaway, celebrating innovative uses of the platform in historical research.
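The error rate quoted above is not tied to a particular metric on this page. One common measure for ATR output is the character error rate (CER): the number of single-character edits needed to turn the automatic transcription into a manually checked one, divided by the length of the checked text. The following is a minimal sketch under that assumption, not the project's own evaluation code. The function names and the sample line are invented for illustration.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length, after Unicode normalisation."""
    # Normalise so that an accented letter counts as one character.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented sample line, not taken from the corpus.
manual = "Bha rìgh ann roimhe seo"
automatic = "Bha righ ann roimhe seo"  # accent lost in recognition
cer = character_error_rate(manual, automatic)
print(f"CER: {cer:.1%}")
```

In this invented example a single lost accent in a 23-character line gives a rate of about 4.3%, which shows the scale of a figure under 5%: roughly one wrong character in every twenty or more.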
Results
The resulting dataset, which is aggregated and searchable on this website, comprises over 5,500 folktales. Many of the tales are categorised by Aarne–Thompson (AT) folktale type, which means that the material can be explored thematically and typologically. The Irish data is retrieved dynamically via the Dúchas API. Irish transcriptions can be corrected via Meitheal Dúchas.ie.
Statistics
Scotland (GD = Scottish Gaelic, EN = English):
- Total tales: 3,846
- GD tales: 2,346
- EN tales: 1,052
- Mixed language tales: 448
- Total pages: 24,135
- GD pages: 15,776
- EN pages: 4,107
- Mixed language pages: 4,252
- Total words: 2,525,222
- GD words: 1,845,542
- EN words: 21,855
- GD or EN words: 657,825
- Total AT types: 310
Ireland (GA = Irish, EN = English):
- Total tales: 2,062
- GA tales: 1,809
- EN tales: 246
- Mixed language tales: 11
- Total pages: 21,039
- GA pages: 20,983
- EN pages: 718
- Mixed language pages: 44
- Total words: 2,594,806
- GA words: 2,529,384
- EN words: 80,780
- GA or EN words: 5,532
- Total AT types: 360
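As a rough illustration of how the figures above can be read, the following sketch computes the share of each language category among the Scottish tales. The counts are copied from the Scotland list; the variable names are chosen for this example only.

```python
# Tale counts copied from the Scotland list above.
scotland_tales = {"GD": 2346, "EN": 1052, "Mixed": 448}

# The three categories add up to the stated total of 3,846 tales.
total = sum(scotland_tales.values())

# Share of each category as a fraction of all Scottish tales.
shares = {label: count / total for label, count in scotland_tales.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

On these counts, roughly three in five Scottish tales are in Gaelic, a little over a quarter are in English, and the rest are mixed.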