CLARIAH-AT Summer School

Machine Learning for Digital Scholarly Editions

8–12 September 2025

Elisabethstraße 59/III, 8010 Graz, Austria
Department of Digital Humanities

About the Summer School

Machine learning is increasingly shaping research in the Digital Humanities, offering powerful tools for analyzing and enriching textual data. Using the Python library BERTopic, participants will explore various steps of topic modeling. Building upon BERTopic’s modular architecture, students will be introduced to several essential machine learning methods, such as embedding, dimensionality reduction, and clustering. Through practical sessions, students will learn to apply these techniques to historical texts. The aim is to give non-experts a high-level practical overview of how to use the BERTopic library and the essential theory behind its modules.

The school is intended for students and researchers interested in the intersection of digital scholarly editing and machine learning. After attending the school, participants will have a basic understanding of machine learning algorithms and be able to assess their possible applications as well as their strengths and limitations. They will also be able to use BERTopic on their own data.

Meet the Speakers

Clemens Neudecker

Clemens Neudecker studied Philosophy, Computer Science and Political Science at LMU Munich and works as Head of Data Science in the Information and Data Management Department of the Staatsbibliothek zu Berlin - Preußischer Kulturbesitz. The main focus of his work and research lies in Computer Vision, Natural Language Processing and Machine Learning/Artificial Intelligence and their applications in the context of digitization and the Digital Humanities.

Keynote 1: Tuesday, September 9, 6pm (CEST), Elisabethstraße 50b (SR 19.02)

Online link (Unimeet)

Context matters. Opportunities and challenges when working with artificial intelligence and cultural heritage data

The advances made in the field of machine learning/artificial intelligence (ML) offer a range of opportunities for libraries and digital scholarship. In projects such as Mensch.Maschine.Kultur, the Staatsbibliothek zu Berlin - Preußischer Kulturbesitz (SBB) is developing ML technologies for a wide range of applications: from text and layout recognition and image analysis to information extraction, machine-assisted subject indexing and, last but not least, the provision of collections as data and their digital curation. On the other hand, the historical and cultural contexts must always be taken into account when using ML technologies in combination with historical sources and cultural heritage materials. Collections digitized by libraries are heterogeneous in terms of the period covered, the perspectives, places or regions they contain and the cultural contexts in which they must be placed. Historical documents often contain distortions that no longer correspond to today's ethical values. While historians are trained to classify sources and apply source criticism as a methodological tool, AI systems developed by industry are primarily trained on modern texts from the Internet and cannot do this. Using the example of SBB's experience with machine learning and AI, this talk aims to provide insights into practical applications while at the same time raising awareness for a conscious and responsible approach to ML and cultural heritage data.

Ulrike Henny-Krahmer

Ulrike Henny-Krahmer is Junior Professor for Digital Humanities at the University of Rostock. Her research focuses on digital scholarly editing, computational text analysis, and questions on the sustainability and evaluation of digital scholarly outputs. She has been a member of the Institute for Documentology and Scholarly Editing (IDE) since 2012 and one of the managing editors of the journal RIDE - A Review Journal for Digital Editions and Resources since 2019. Since 2025, she has been a member of the Technical Council of the Text Encoding Initiative (TEI).

Keynote 2: Friday, September 12, 1:30pm (CEST) (online)

Online link (Unimeet)

Machine learning and scholarly editing - a contradiction or an exciting partnership?

Traditionally, scholarly editions aim to produce a reliable text based on historical documents that can be used as a basis for further research in the respective subject area(s). Depending on the type of source, this methodologically requires a precise text comparison and a detailed examination of the nature of the underlying documents and their textual contents. How does this fit in with machine learning methods that recognise patterns based on large amounts of data so that we can obtain models with which we can make probability-based predictions for further data? Are these approaches even compatible with each other and how can we resolve the methodological contradictions or seek connections between the methods? The lecture will discuss these questions using the concrete example of letters from the edition of the works of the German writer Uwe Johnson (1934–1984), for which topic models were created. It will also consider how far humanities scholars, digital humanists, and computer scientists can delve into one another's domains in order to understand the respective methods. This understanding not only provides exciting opportunities, it is also a prerequisite for the successful application of machine learning methods in the humanities.

Roman Bleier

Roman Bleier is a postdoctoral researcher at the Department of Digital Humanities at the University of Graz. His research focuses on digital scholarly editing, text encoding, and digital history. He was part of the editorial team for The Imperial Diets of 1576 and is currently co-PI of the FWF-DFG project History as a Visual Concept: Peter of Poitiers' "Compendium historiae". Roman is also a member of the Institute for Documentology and Scholarly Editing.

Lucija Brozić

Lucija Brozić is a PhD student and university assistant at the University of Graz, specializing in Digital Humanities and Natural Language Processing. Her doctoral research examines attitudes towards migration and minority groups in Austrian historical newspapers. She has led a CLARIAH-AT funded small project on sentiment annotation, developing annotation guidelines and training annotators. Her academic interests include machine learning for DH texts, topic-specific corpus building, annotation practices, sentiment analysis and migration studies.

Selina Galka

Selina Galka is a research assistant at the Department of Digital Humanities, University of Graz. Her research focuses on digital editing and data modelling. After completing master's degrees in “German Philology of the Middle Ages and Early Modern Period” and “Digital Humanities”, she is now a PhD candidate in the field of digital humanities.

Bernhard Geiger

Bernhard Geiger received the Dipl.-Ing. degree in electrical engineering in 2009 and the Dr. techn. degree in electrical and information engineering from Graz University of Technology, Austria, in 2014. He was a Senior Scientist and Erwin Schrödinger Fellow at the Institute for Communications Engineering, Technical University of Munich from 2014 to 2017. He is currently Assistant Professor at the Signal Processing and Speech Communication Laboratory, Graz University of Technology and Research Area Leader at the Know Center Research GmbH. His research interests include domain-aware machine learning, information theory for machine learning, and information-theoretic model reduction for Markov chains and hidden Markov models.

Michael Jantscher

Michael Jantscher is a PhD student in Computer Science at Graz University of Technology and a senior researcher at Know-Center Research GmbH. His work focuses on Natural Language Processing in the medical and clinical domain, with a particular emphasis on causal reasoning in healthcare and (neuro)radiology. Currently, he is also exploring the research and implementation of agentic AI systems across both industrial and healthcare sectors.

Sarah Lang

Sarah Lang is Head of Digital Humanities at the Max Planck Institute for the History of Science (Berlin). Previously, she was a Postdoctoral Fellow at the Centre for Information Modelling at the University of Graz. Trained in History and Classics in Graz and Montpellier, she completed a PhD on early modern alchemical literature in 2021, combining Digital Humanities and the history of science, for which she received the Bader Prize of the Austrian Academy of Sciences. As convenor of the Empowerment Working Group of the German Digital Humanities Association (DHd), where she is also on the board of directors, Sarah Lang is interested in issues like (gender) data gaps, data feminism, diversity in DH, decolonizing data, data ethics and related topics. Her research focuses on computational approaches to historical sources, particularly distant reading and viewing of chymical print. A DH professional since 2016, she blogs at latex-ninja.com and serves on the council of the Society for the History of Alchemy and Chemistry.

Martina Scholger

Martina Scholger is a senior scientist at the Department of Digital Humanities, University of Graz, where her research focuses on digital scholarly editing, text encoding, text mining, and LLM applications. She is co-PI of the FWF-DFG Early Manila Hokkien project and contributes to the digital edition of Joseph von Hammer-Purgstall’s correspondence, the Visual Archive Southeastern Europe, and Picturing Migrants' Lives. She has been an elected member of the TEI Technical Council since 2016, a member of the Institute for Documentology and Scholarly Editing since 2012, and is managing editor of RIDE (Review Journal for Scholarly Digital Editions and Resources).

Max Toller

Max Toller is a post-doctoral researcher at Know Center GmbH and a lecturer at Graz University of Technology. His research focuses on data mining and machine learning for time series data, anomaly detection, and similar topics. He was a nominated finalist for the Prize for Excellence in Teaching at Graz University of Technology in 2024, and has received the 'Outstanding Reviewer' award for his services at the 2025 International Conference on Knowledge Discovery and Data Mining.

Gunter Vasold

Gunter Vasold is a research software engineer in the Department of Digital Humanities at the University of Graz. Thirty years ago, he began working on the pioneering and highly ambitious critical digital editions project, Fontes Civitatis Ratisponensis. While he enjoyed working with medieval documents, he discovered an even greater passion for developing software for the project. Since then, he has been involved in numerous research and software initiatives. Currently, his primary focus is on software engineering, research infrastructures, and the long-term preservation of research data. Additionally, Gunter is an award-winning lecturer with over 25 years of experience.

Klara Venglarova

Klara Venglarova is a PhD student in Linguistics and Digital Humanities at Palacký University Olomouc, Czech Republic. She is involved in the FWF-funded project The Making of the Incredibly Differentiated Labor Market: Evidence from Job Offers from Ten Decades at the University of Graz (PI Jörn Kleinert), where she works on layout analysis, OCR, post-correction, information extraction and other NLP and machine-learning tasks.

Elisabeth Raunig

Elisabeth Raunig works as a Project Manager at the Department of Digital Humanities at the University of Graz, where she is also part of the Summer School’s organising team. Having completed her Master’s in Digital Humanities there, she now contributes to project and event management.

Schedule

Time	Monday (Sep 8)	Tuesday (Sep 9)	Wednesday (Sep 10)	Thursday (Sep 11)	Friday (Sep 12)
8:30-9:00	Registration
9:00-10:30	Welcome and setup (Georg Vogeler, Walter Scholger) (Roman Bleier, Martina Scholger)	Embeddings (Michael Jantscher)	Clustering (Max Toller)	Tokenization and weighting (Klara Venglarova)	Experiments (Roman Bleier, Martina Scholger)
10:30-11:00	Coffee break	Coffee break	Coffee break	Coffee break	Coffee break
11:00-12:30	BERTopic: overview and example (Selina Galka)	Embeddings (Michael Jantscher)	Clustering (Max Toller)	Topic finetuning (Lucija Brozić)	Machine learning and DSE wrap up (Sarah Lang)
12:30-13:30	Lunch	Lunch	Poster session	Lunch	Lunch
13:30-15:00	Introduction to Python (Gunter Vasold)	Dimensionality reduction (Bernhard Geiger)	-	Build your BERTopic pipeline (Roman Bleier, Sarah Lang, Martina Scholger)	Keynote Ulrike Henny-Krahmer (online)
15:00-15:30	Coffee break	Coffee break	Excursion "Buschenschank" (back in Graz at 21:30)	Coffee break	Goodbye coffee
15:30-17:00	Prepare a dataset (Roman Bleier, Sarah Lang, Martina Scholger)	Dimensionality reduction (Bernhard Geiger)	-	Experiments (Selina Galka, Michael Otto, Roman Bleier, Martina Scholger)
18:00		Keynote Clemens Neudecker (Elisabethstraße 50b, SR 19.02)

Teaching Material

The teaching material is available at https://github.com/DHGraz/clariah2025-dse-ml/tree/main/materials under the Creative Commons license CC BY-NC. When citing the notebooks and presentations, please credit the author and the GitHub repository.

Find us

Get there:

You can reach Graz by train, car or plane. Graz has its own airport, but you can also fly to Vienna and take a train or FlixBus to Graz.

Stay:

Graz offers many accommodation options: hostels, hotels and Airbnb. Please check Graz Tourism.

Get around:

Graz is a walkable and bike-friendly city. You can reach the Department of Digital Humanities in 20 minutes on foot from the city center. Trams and buses are also available.