Doubly annotated corpora

Doubly annotated corpora: “Nature & State” and Reichstag protocols as resources for innovative automatic metaphor analysis

Duration: January 2020 – July 2021

---

Project team:

Dr. Steffen Eger | FB 20, Computer Science – AIPHES

Prof. Dr. Petra Gehring | FB 2, Philosophy of Technoscience

---

Project description:

Modeling, identifying, and understanding metaphors is one of the most challenging problems in both the humanities and computational contexts, because metaphors are irregular phenomena described by a multitude of sometimes diverging theories. As a consequence, few metaphor datasets exist, particularly for languages other than English. Such datasets are nevertheless necessary both for computational purposes, namely to train machine learning systems capable of detecting metaphors (which can then be used for a deeper understanding of texts), and for digital, hermeneutically oriented metaphor research (which is interested in the types and incidence of metaphors in historically varying text genres, also for contrastive purposes). Here, we aim to fill this gap by (1) providing large annotated metaphor datasets from research-relevant German corpora and (2) training machine learning models capable of identifying metaphors in texts. For contrastive purposes, we annotate metaphors according to an innovative theory versus a more classical approach, and across historical datasets at the heart of the developments leading up to the First and Second World Wars.

As a result, we have corrected OCR errors in over 57,000 lines of our historical corpora, spanning several books and historical parliamentary debates. These corrections were used in automatic methods that reduce OCR errors by over 30% on average in unseen portions of the data. Finally, we have identified several hundred metaphors in our data based on the two diverging theories, using annotations from up to five different annotators with moderate inter-annotator agreement. Following our initial experiments, these annotations will be used to train metaphor identification models based on our two theories.
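The description above does not say how the reduction in OCR errors was measured. A common choice, assumed here purely for illustration, is the character error rate (CER): the edit distance between an OCR output and a hand-corrected reference, divided by the length of the reference. The sketch below computes the relative CER reduction for a toy example; the function names and the sample strings are invented and do not come from the project data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of an OCR hypothesis against a corrected reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original error rate removed by post-correction."""
    if cer_before == 0:
        return 0.0
    return (cer_before - cer_after) / cer_before


# Invented toy strings, not taken from the project corpora.
reference = "Natur und Staat"       # hand-corrected line
raw_ocr = "Natnr und Staal"         # raw OCR output with two character errors
post_corrected = "Natur und Staal"  # automatic correction fixed one of them

before = cer(raw_ocr, reference)
after = cer(post_corrected, reference)
reduction = relative_reduction(before, after)
```

Under this assumed metric, a reported average reduction of over 30% would mean that `relative_reduction` averages above 0.3 across the held-out lines.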