A Hands-On Workshop on Parsing Wikitext
Experience Level: intermediate
In this hands-on workshop, you will learn to parse wikitext from beginning to end. Using data from the German Wiktionary, we will fetch the data, parse the XML, and process the wikitext to extract linguistic content such as parts of speech, meanings, and inflections.
- timeslot: Sunday, 6th April 2025, 14:30-16:30, Room W1
- number of participants: 1-10
- intro video: https://youtu.be/CRG4qCL2Z78
- requirements: see below
We will cover the following points:
- Fetching the Data: Learn two ways to retrieve wiki data: using the wiki's Special:Export tool or downloading wiki dump files.
- Parsing the XML Files: Once the data is retrieved in XML format, this section explains how to parse the files to extract the wikitext.
- Parsing the Wikitext: In the final part, we will parse the wikitext and extract elements such as headings, sections, word forms, meanings, inflections, and more.
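As a preview of the pipeline above, here is a minimal sketch in Python. It parses the XML structure that Special:Export produces (the real export from the German Wiktionary has the same page/revision/text layout; the sample XML, its namespace version, and the template names such as {{Sprache}} and {{Wortart}} are simplified stand-ins for illustration) and then extracts section headings from the wikitext with a regular expression:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a Special:Export file.
# A real export (e.g. from de.wiktionary.org) nests the same
# <page>/<revision>/<text> elements inside a MediaWiki namespace.
SAMPLE_XML = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>Wasser</title>
    <revision>
      <text>== Wasser ({{Sprache|Deutsch}}) ==
=== {{Wortart|Substantiv|Deutsch}}, {{n}} ===
{{Bedeutungen}}
:[1] chemische Verbindung aus Wasserstoff und Sauerstoff</text>
    </revision>
  </page>
</mediawiki>"""

def extract_pages(xml_text):
    """Yield (title, wikitext) pairs from an export file."""
    ns = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}
    root = ET.fromstring(xml_text)
    for page in root.findall("mw:page", ns):
        title = page.findtext("mw:title", namespaces=ns)
        wikitext = page.findtext("mw:revision/mw:text", namespaces=ns)
        yield title, wikitext

def headings(wikitext):
    """Return (level, text) for every '== Heading ==' line."""
    return [(len(m.group(1)), m.group(2).strip())
            for m in re.finditer(r"^(={2,6})(.*?)\1\s*$", wikitext, re.M)]

title, text = next(extract_pages(SAMPLE_XML))
print(title)  # Wasser
for level, head in headings(text):
    print(level, head)
```

In the workshop we go further and pull word forms, meanings, and inflection tables out of the templates; the sketch only shows the skeleton of XML-then-wikitext parsing that the rest builds on.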
What You Need
We will use Google Colab, a free, cloud-based platform for running Python code in a Jupyter Notebook environment.
To participate:
- Sign in with your Google account.
- Have a stable Internet connection.
- You can follow along on a tablet, but for editing, a laptop is recommended.
All materials will be available on the workshop website.
If you prefer to run the code on your own machine, the website also includes installation instructions and download links for the source code and data.
Carolina Lennon
Economist by training with a passion for Python programming, deeply grateful to the Python and open-source community for the countless hours of learning and joy.
