What is XML, and should I learn it?

Heads up! This post contains affiliate links, meaning if you click through and buy something, I will earn a commission (at no additional cost to you).

Around this time last year, I was talking to an old friend who’d recently started a graduate program & was beginning to get her feet wet in digital humanities. We were talking about whether, and to what extent, one needed to learn code/to code to do DH, and she told me one of her advisors had told her she “has to learn XML.”

I’ve heard similar sentiments expressed by others, and it always surprises me a little. Not because learning XML isn’t valuable, but because that instruction is painfully vague. By design, there is no one “XML” that you can go learn and usefully apply to a project; what makes XML useful is the specific XML standard you choose to follow.

Saying “go learn XML” is like saying “go learn grammar.” Yes, you can absolutely learn the basics of grammar– what is a noun, what is a verb, what is syntax– but you can’t put that knowledge to good use until you know whether you’re working with French or Chinese.

In this post, I’ll teach you the grammar that is XML using a pretend standard, show why standards exist, and suggest a few of the most important standards to at least know about if you’re doing digital humanities work.

Table of Contents

What is XML?
Sandwiches Syntax
What if we want more than sandwiches?
Standards to know [about] as a digital humanist
- TEI
- HTML
- MODS
Useful tools for XML editing
Limitations of XML formats

What is XML?

XML stands for Extensible Markup Language. Let’s break that down word-by-word: Extensible, able to be extended. Markup, a way of annotating. Language, a set of rules for communicating.

That tells us that XML is a set of rules for annotating something that you can add to based on your needs.

Notice how I defined language: XML isn’t a “programming language” like Python or Java; it is, specifically, a “markup language.” XML can’t automate tasks for you; it’s a way of encoding information so both people and machines can easily interpret it.

So how do we set out encoding?

Sandwiches Syntax

In a moment, we’ll talk about how to encode a sandwich in XML. But first, I want to introduce some vocab.

A few key terms

Content: The thing you’re trying to describe. Usually this is some text. This goes between…
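Tag (aka element): A label for a piece of content. Tags are differentiated from content using angle brackets, like <this>. Tags usually come in pairs, an opening tag and a closing tag, and to distinguish the second from the first, we put a slash in it, like <this></this>. (Strictly speaking, the “element” is the whole package: both tags plus whatever sits between them.)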
Attribute: The name of a property the content has, like “color”
Value: The actual thing an attribute is trying to capture, like “purple.”

A very tall sandwich! Six slices of bread with tomato, cheese, meat, and lettuce between each pair.

Okay, but let’s get to sandwiches

I made you a BLT! 🍞❔🥓🥬🍅 🍞 🥰🍽️

Let’s say you want to describe this emoji recipe in XML. We start out by letting everyone know what we’re making, a sandwich:

<sandwich>🍞❔🥓🥬🍅🍞</sandwich>🥰🍽️

Here, I’ve created a tag called sandwich around our content so you know where the sandwich recipe starts and where it ends. The last two emoji aren’t part of our recipe, so we don’t include them in our <sandwich></sandwich>.

Let’s say we want to make this a little clearer, though. What are all those ingredients in our sandwich?

<sandwich>
    <bread>🍞</bread>
    <sauce>❔</sauce>
    <bacon>🥓</bacon>
    <lettuce>🥬</lettuce>
    <tomato>🍅</tomato>
    <bread>🍞</bread>
</sandwich>

I know what you’re thinking, I picked sandwich as a metaphor and I’m not even “sandwiching” all the ingredients between a <bread></bread> tag. You technically could if that’s how your standard says to describe sandwiches! But, that would sort of suggest that all of the ingredients inside the <bread></bread> tag are bread. Remember, tags label content.

That’s cool, but maybe I want even more information, like what kind of bread?

<sandwich>
    <bread type="rye">🍞</bread>
    <sauce>❔</sauce>
    <bacon>🥓</bacon>
    <lettuce>🥬</lettuce>
    <tomato>🍅</tomato>
    <bread type="rye">🍞</bread>
</sandwich>

Here, I’ve added an attribute called type to each slice of bread, with the value rye. This tells us that both slices are rye bread. If I wanted, I could make the sandwich with two different kinds of bread!

<sandwich>
    <bread type="rye">🍞</bread>
    <sauce>❔</sauce>
    <bacon>🥓</bacon>
    <lettuce>🥬</lettuce>
    <tomato>🍅</tomato>
    <bread type="wheat">🍞</bread>
</sandwich>

Great! Now we have a BLT on one slice of rye and one slice of wheat. But say, what’s in that sauce, anyway?

<sandwich>
    <bread type="rye">🍞</bread>
    <sauce>
        <mayonnaise amount="1 TBSP">⚪</mayonnaise>
        <mustard amount="2 TSP">💛</mustard>
    </sauce>
    <bacon>🥓</bacon>
    <lettuce>🥬</lettuce>
    <tomato>🍅</tomato>
    <bread type="wheat">🍞</bread>
</sandwich>

Turns out our secret sauce is actually a tablespoon of mayonnaise and two teaspoons of mustard mixed together. To describe that, I’ve nested tags within <sauce></sauce> just like I’ve nested the rest of our ingredients inside <sandwich></sandwich>.
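If you’re curious how a program sees those attributes and nested tags, here is a minimal sketch using Python’s built-in xml.etree.ElementTree module. Python is my choice for illustration, not something XML requires, and the names describe, bread_types, and sauce_parts are made up for this example.

```python
import xml.etree.ElementTree as ET

# The rye-and-wheat BLT with the spelled-out sauce, as a string.
sandwich_xml = """
<sandwich>
    <bread type="rye">🍞</bread>
    <sauce>
        <mayonnaise amount="1 TBSP">⚪</mayonnaise>
        <mustard amount="2 TSP">💛</mustard>
    </sauce>
    <bacon>🥓</bacon>
    <lettuce>🥬</lettuce>
    <tomato>🍅</tomato>
    <bread type="wheat">🍞</bread>
</sandwich>
"""

# fromstring() parses the text and hands back the outermost element.
sandwich = ET.fromstring(sandwich_xml)


def describe(element, depth=0):
    """Return one line per tag, indented by nesting depth, with its attributes."""
    lines = ["    " * depth + f"{element.tag} {element.attrib}"]
    for child in element:  # iterating an element yields its direct children
        lines.extend(describe(child, depth + 1))
    return lines


outline = describe(sandwich)
print("\n".join(outline))

# .get("type") reads an attribute's value; findall("bread") finds direct children.
bread_types = [bread.get("type") for bread in sandwich.findall("bread")]

# The tags nested inside <sauce>.
sauce_parts = [part.tag for part in sandwich.find("sauce")]
```

The printed outline mirrors the indentation of the XML itself: each level of nesting in the markup becomes one level of parent and child in the parsed result.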

Let’s see if this sandwich follows sandwich rules

We’ve successfully described our sandwich with XML, but let’s say we want to know whether it conforms to the standard rules of sandwiches. We decide that, in order to be considered a valid sandwich, all sandwiches must follow these rules:

Start and end with a slice of bread
Have at least one protein
Have at least one vegetable
Use at least one condiment, which can be made of other condiments

Going down our sandwich description, we see that yes it starts and ends with bread, but uh-oh… We smart people might know that <bacon> is a protein, but a program trying to check our sandwich might not!

If we want to be able to check any sandwich without having to list every possible sandwich ingredient, we need to generalize! Let’s try rewriting our sandwich in a way that makes the standard rules of sandwiches easy to confirm, without losing information:

<sandwich>
    <bread type="rye">🍞</bread>
    <condiment type="special sauce">
        <condiment type="mayo" amount="1 TBSP">⚪</condiment>
        <condiment type="mustard" amount="2 TSP">💛</condiment>
    </condiment>
    <protein type="bacon">🥓</protein>
    <vegetable type="lettuce">🥬</vegetable>
    <vegetable type="tomato">🍅</vegetable>
    <bread type="wheat">🍞</bread>
</sandwich>

Now, instead of giving the name of the ingredient as each tag, I’ve used one of the categories from our standard rules of sandwiches for each tag and described the ingredient with an attribute-value pair. Now we can easily check whether our sandwich follows the sandwich rules!
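Here is a rough sketch of what “a program trying to check our sandwich” could look like, again in Python with the standard library. The function name check_sandwich and the wording of its messages are my own inventions, and it only looks at the top layer of the sandwich, which is enough for the four rules above.

```python
import xml.etree.ElementTree as ET


def check_sandwich(xml_text):
    """Return a list of broken sandwich rules (an empty list means it passes)."""
    sandwich = ET.fromstring(xml_text)
    layers = [child.tag for child in sandwich]  # top-level tags, in order
    problems = []
    if len(layers) < 2 or layers[0] != "bread" or layers[-1] != "bread":
        problems.append("must start and end with a slice of bread")
    for category in ("protein", "vegetable", "condiment"):
        if category not in layers:
            problems.append(f"needs at least one {category}")
    return problems


# Tags named after categories: the program can check every rule.
generalized = """
<sandwich>
    <bread type="rye">🍞</bread>
    <condiment type="special sauce">
        <condiment type="mayo" amount="1 TBSP">⚪</condiment>
        <condiment type="mustard" amount="2 TSP">💛</condiment>
    </condiment>
    <protein type="bacon">🥓</protein>
    <vegetable type="lettuce">🥬</vegetable>
    <vegetable type="tomato">🍅</vegetable>
    <bread type="wheat">🍞</bread>
</sandwich>
"""

# Tags named after ingredients: the program has no idea bacon is a protein.
by_ingredient = """
<sandwich>
    <bread type="rye">🍞</bread>
    <sauce>❔</sauce>
    <bacon>🥓</bacon>
    <lettuce>🥬</lettuce>
    <tomato>🍅</tomato>
    <bread type="wheat">🍞</bread>
</sandwich>
"""

print(check_sandwich(generalized))
print(check_sandwich(by_ingredient))
```

The second sandwich is the same lunch, but the checker reports a missing protein, vegetable, and condiment because nothing in the markup tells it which category each tag belongs to.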

What if we want more than sandwiches?

What if we want to allow any recipe to be described in our XML standard? Let’s make a standard rule of recipes instead! These are the recipe rules:

Every recipe has a title
Every recipe has one or more ingredients
Ingredients can be grouped into named components within the same recipe (like a sauce)
Every ingredient must have a quantity and a unit (like 1 tablespoon)
Ingredients may (but are not required to) have a special note about preparation (like minced or hot)

Now that we have our rules, let’s re-encode our sandwich:

<recipe title="BLT">
    <ingredient quantity="2" unit="slice">Rye bread</ingredient>
    <ingredient quantity="1" unit="slice">Tomato</ingredient>
    <ingredient quantity="2" unit="leaf">Lettuce</ingredient>
    <ingredient quantity="4" unit="slice" preparation="crispy">Bacon</ingredient>
    <component name="sauce">
        <ingredient quantity="1" unit="tablespoon">Mayonnaise</ingredient>
        <ingredient quantity="2" unit="teaspoon">Mustard</ingredient>
    </component>
</recipe>

Unlike our previous encodings, this time I’ve removed the emojis and used the name of the ingredient as the content inside our ingredient tags. We could write our recipe rules to put the ingredient’s name in an attribute instead, as we did before with “type,” but this is a little more straightforward. This way, we can take any already-written recipe text and just mark it up with tags to bring it into compliance with the XML recipe standard that we just invented.
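One payoff of marking up a recipe this way is that a program can turn it back into a readable list. The sketch below assumes Python’s standard library; ingredient_lines is a hypothetical helper I wrote for this post’s made-up recipe format, not part of any real standard.

```python
import xml.etree.ElementTree as ET

blt_xml = """
<recipe title="BLT">
    <ingredient quantity="2" unit="slice">Rye bread</ingredient>
    <ingredient quantity="1" unit="slice">Tomato</ingredient>
    <ingredient quantity="2" unit="leaf">Lettuce</ingredient>
    <ingredient quantity="4" unit="slice" preparation="crispy">Bacon</ingredient>
    <component name="sauce">
        <ingredient quantity="1" unit="tablespoon">Mayonnaise</ingredient>
        <ingredient quantity="2" unit="teaspoon">Mustard</ingredient>
    </component>
</recipe>
"""


def ingredient_lines(parent, prefix=""):
    """Turn <ingredient> and <component> tags into plain-text lines."""
    lines = []
    for child in parent:
        if child.tag == "ingredient":
            line = f"{prefix}{child.get('quantity')} {child.get('unit')} {child.text}"
            note = child.get("preparation")  # optional, so it may be None
            if note:
                line += f" ({note})"
            lines.append(line)
        elif child.tag == "component":
            lines.append(f"{prefix}{child.get('name')}:")
            # A component holds its own ingredients, one level deeper.
            lines.extend(ingredient_lines(child, prefix + "    "))
    return lines


recipe = ET.fromstring(blt_xml)
lines = ingredient_lines(recipe)
print(recipe.get("title"))
print("\n".join(lines))
```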

Let’s try another recipe, this time for a soft pretzel snack. Here is our non-XML description of the recipe:

Soft pretzels with cheese
Dough:
    1 tablespoon yeast
    2 tablespoons vegetable oil
    2 cups warm milk
    1 1/2 cups warm water 
    8 cups flour
Cheese sauce:
    2 tablespoons butter
    2 tablespoons flour
    1 cup milk
    2 cups sharp cheddar cheese, grated

And here is the same recipe in XML, following our recipe rules:

<recipe title="Soft pretzels with cheese">
	<component name="Dough">
		<ingredient quantity="1" unit="tablespoon">yeast</ingredient>
		<ingredient quantity="2" unit="tablespoons">vegetable oil</ingredient>
		<ingredient quantity="2" unit="cups" preparation="warm">milk</ingredient>
		<ingredient quantity="1.5" unit="cups" preparation="warm">water</ingredient>
		<ingredient quantity="8" unit="cups">flour</ingredient>
	</component>
	<component name="Cheese sauce">
		<ingredient quantity="2" unit="tablespoons">butter</ingredient>
		<ingredient quantity="2" unit="tablespoons">flour</ingredient>
		<ingredient quantity="1" unit="cup">milk</ingredient>
		<ingredient quantity="2" unit="cups" preparation="grated">sharp cheddar cheese</ingredient>
	</component>
</recipe>

Ta-da!

Now we have two standards-compliant recipes! If we were to formalize our recipe rules into an XML schema and share it with others, they too could mark up recipes, and we could develop applications that use these recipes & be confident that (after validating the XML against our schema) our program will be able to use recipes from anyone.
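A real schema would be written in a dedicated schema language (XSD, RELAX NG, and DTD are the common ones), and Python’s built-in XML module can’t validate against those. As a stand-in, here is a hand-written sketch of the same idea: a function that checks a recipe against the recipe rules. The name check_recipe, its messages, and the two tiny test recipes are all invented for illustration.

```python
import xml.etree.ElementTree as ET


def check_recipe(xml_text):
    """Return a list of broken recipe rules (an empty list means it passes)."""
    recipe = ET.fromstring(xml_text)
    problems = []
    if not recipe.get("title"):
        problems.append("recipe has no title")
    # iter() walks the whole tree, so ingredients inside components count too.
    ingredients = list(recipe.iter("ingredient"))
    if not ingredients:
        problems.append("recipe has no ingredients")
    for ingredient in ingredients:
        for required in ("quantity", "unit"):
            if not ingredient.get(required):
                problems.append(f"{ingredient.text}: missing {required}")
    for component in recipe.iter("component"):
        if not component.get("name"):
            problems.append("component has no name")
    # "preparation" is optional, so there is nothing to check for it.
    return problems


good = """
<recipe title="Toast">
    <ingredient quantity="1" unit="slice">bread</ingredient>
</recipe>
"""

bad = """
<recipe>
    <component>
        <ingredient quantity="1">butter</ingredient>
    </component>
</recipe>
"""

print(check_recipe(good))
print(check_recipe(bad))
```

A schema does this job declaratively: you describe what a valid recipe looks like once, and any validating tool can enforce it without custom code like the above.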

As you might imagine from this exercise, there are XML standards for just about everything– if you can think of it, probably someone out there has tried to standardize it (recipes included!)

It’s almost always better to find an existing standard for the kind of work you want to do than to invent your own schema as we did here. They’re called standards for a reason, after all. The more interoperable your data, the better it is for everyone!

Standards to know [about] as a digital humanist

Even if you don’t learn the ins and outs of these standards, at least knowing about these three will get you far in digital humanities circles!

TEI

The Text Encoding Initiative is a community of practice that has been around since the 1980s working on, well, text encoding in digital humanities. They publish a set of XML standards for encoding literary texts like novels, poetry, plays, speeches, etc.

Because the number of different things scholars like to do with texts is huge, TEI is massive— it’s very hard to know all the ins and outs, and can feel super overwhelming to learn. I recommend, as I do with nearly all digital skills, jumping in with a specific project in hand & picking up only what you need for that project.

What TEI can do

TEI is used by lots of institutions and projects, including the Walt Whitman Archive and the US Office of the Historian. Creating XML of documents can be a step in the digitization & preservation of a document– as plaintext, XML formats have a better guarantee of longevity than images or rich text recreations of those same documents.

What’s cooler, though, is using TEI-encoded documents to make critical editions of a text. TEI By Example has a section for that, if it’s something you’re interested in, and the Versioning Machine is a popular tool for displaying critical editions that use TEI.
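To give a flavor of what critical-edition markup looks like, here is a small sketch. The app, lem, and rdg elements and the wit attribute come from TEI; everything else is invented for illustration: the line of text, the two witnesses “A” and “B,” and the variant reading. It is also only a fragment, not a complete TEI document (a real one has a header and more structure around it). The Python around it just pulls the readings out.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this namespace, so searches have to mention it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# One line of verse where two (made-up) witnesses disagree on a single word.
fragment = """
<l xmlns="http://www.tei-c.org/ns/1.0">The quick brown fox
    <app>
        <lem wit="#A">jumps</lem>
        <rdg wit="#B">leaps</rdg>
    </app>
over the lazy dog.</l>
"""

line = ET.fromstring(fragment)

# Collect what each witness says at every point of variation.
readings = {}
for app in line.findall("tei:app", TEI_NS):
    for reading in app:  # the <lem> and <rdg> children
        readings[reading.get("wit")] = reading.text

print(readings)
```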

Where to learn TEI

There are a lot of blog posts and tutorials out there for TEI. I recommend TEI By Example. They have easy-to-follow examples for prose, poetry, and critical editions.

Limitations of TEI

I would be remiss if I didn’t admit that I am not a big TEI fan. As I said above, it’s a very big standard, almost to the point of not being a standard at all (please don’t come for me). It’s difficult to learn, and if you’re interested in doing critical editions, XML is frankly not the best way to encode them (imo) from a philosophical/how do texts work point-of-view (more on this at the very end). But, it is the technology with the best tool support at the moment, so 🤷 If you’re interested in the world of digital scholarly editing, I can’t recommend highly enough the book of the same name by Elena Pierazzo.


There is a new version of this book out which I haven’t had a chance to read yet (it’s arriving this week! I’ll update this when I get to it), but Digital Scholarly Editing: Theories, Models, and Methods is a great look at what’s what in editing these days. The follow-up edited collection Digital Scholarly Editing: Theories and Practice is also stellar (and open access!)

HTML

Okay, okay, technically HTML is not the same as XML. Technically HTML stands for Hypertext Markup Language and is for describing how something looks, whereas XML is for describing what something is.

Technically.

But, I’m including HTML here because that is a distinction in how it’s used, not how it’s written. At its heart, HTML is about describing “what something is,” it’s just based on very basic ideas of what documents can be.

What HTML can do

HTML is the backbone of the internet. It’s what tells your web browser how to display a webpage. In order to do most of the fancy stuff web pages do these days you need CSS and JavaScript, too, and if you’re going to be making a website for your project, a WYSIWYG website builder is usually sufficient. But understanding the basics of HTML can help you customize your content and/or write up simple web pages.

Where to learn HTML

W3Schools has great references for HTML and CSS if you want to learn HTML in a “self-study the documentation” kind of way. If you want more of an interactive tutorial, Codecademy has a fun (free!) course. As a starter project, I recommend making yourself a CV page!

Limitations of HTML

Like I said before, HTML can really only effectively describe the way documents should look, so it’s not a great choice for capturing complex information about a text.

MODS

The Metadata Object Description Schema (MODS) is an XML standard developed by the Library of Congress. It’s designed to carry much of the same information as MARC records (Machine Readable Cataloging). This post has a great short overview of the history of MARC, MARCXML, and MODS. Basically, MODS is a more efficient (computationally) version of MARC, but MARC records are still very much the standard for machine cataloging.

It is super useful to know about MODS if you’re working with materials from a library or archive! You can often ask for copies of the machine-readable metadata they have and save yourself some data processing time.
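As a rough idea of what working with that metadata can look like, here is a sketch that reads a few fields from a MODS record with Python’s standard library. The element names (titleInfo, title, name, namePart, originInfo, dateIssued) and the namespace are MODS’s own, but the record itself is invented for this example; real records from a library will usually carry many more fields.

```python
import xml.etree.ElementTree as ET

# MODS elements live in this namespace, so searches have to mention it.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# A made-up record, much shorter than what a catalog would give you.
record_xml = """
<mods xmlns="http://www.loc.gov/mods/v3">
    <titleInfo>
        <title>An Invented Book of Sandwiches</title>
    </titleInfo>
    <name type="personal">
        <namePart>Example, Author</namePart>
    </name>
    <originInfo>
        <dateIssued>1999</dateIssued>
    </originInfo>
</mods>
"""

record = ET.fromstring(record_xml)

# findtext() returns the text of the first match, or None if nothing matches.
title = record.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
author = record.findtext("mods:name/mods:namePart", namespaces=MODS_NS)
year = record.findtext("mods:originInfo/mods:dateIssued", namespaces=MODS_NS)

print(title, author, year)
```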

The LOC has a good list of other library-related XML (and non-XML) standards that’s worth checking out, too.

Useful tools for XML editing

As a plaintext format, you can write XML just about anywhere you can type. There are some tools that make it easier, though.

By far the most popular XML editor is Oxygen. It’s not free to use, unfortunately (though you can try it for 30 days), but if you’re university-affiliated you may have access through your institution. Oxygen does what you would expect a good IDE to do– it auto-closes tags, suggests completions, validates XML syntax, and can validate against a schema.

As far as I’m aware, there isn’t a free alternative to Oxygen that offers the same bells and whistles. I use my favorite general-purpose text editor, Notepad++. It’s free and lightweight, and it will do syntax highlighting for damn near any language. You can extend its functionality with plugins, including some that will auto-close your XML tags [ah, by the way, “closing a tag” refers to adding the second half of the pair, the one with the slash. So by auto-close, I mean if you type <sandwich> it will automatically become <sandwich></sandwich>].

Limitations of XML formats

I mentioned this a bit in the TEI section, but XML formats are not perfect for encoding everything. XML is strictly hierarchical and, as we humanists know well, the world is not. Things in the real world overlap each other all the time. Take, for example, a sentence like this:

The quick brown fox jumps over the lazy dog.

Suppose “fox jumps over the” is in italics and “over the lazy dog.” is in bold. If we want to preserve both the italics and the bold, we would want to be able to do something like this:

The quick brown <italic>fox jumps <bold>over the</italic> lazy dog.</bold>

But, because the tags are not strictly nested in one another, this is not valid XML. Instead, we have to do something like this:

The quick brown <italic>fox jumps</italic> <italic><bold>over the</bold></italic> <bold>lazy dog.</bold>

That’s a lot of extra tags! Now imagine you’re making a critical edition of a text that changed many times over its publication history. Imagine trying to capture the textual variation from a single sentence that changed a little at a time across three or more editions. It can be a tedious nightmare!
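You don’t have to take my word for it that overlapping tags are off-limits: a parser will refuse them. In this sketch I’ve wrapped each version of the fox sentence in a sentence tag, which is my own addition, because an XML document needs exactly one outermost element. The helper is_well_formed is also made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError as error:
        print("Parser says:", error)
        return False


# <italic> is closed while <bold> is still open: the tags overlap.
overlapping = (
    "<sentence>The quick brown <italic>fox jumps <bold>over the</italic>"
    " lazy dog.</bold></sentence>"
)

# Every tag closes before the tag around it does: strictly nested.
nested = (
    "<sentence>The quick brown <italic>fox jumps</italic> "
    "<italic><bold>over the</bold></italic> <bold>lazy dog.</bold></sentence>"
)

print(is_well_formed(overlapping))
print(is_well_formed(nested))
```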

This is a problem people have been working on resolving for years. One set of solutions involves something called standoff markup, where the text being annotated and the annotations are kept separate. This kind of system means you can refer to locations in the text by character number, making our brown fox example look more like 17-34 italic; 27-44 bold (but, in practice, more complicated than that).
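Here is a toy version of that idea, assuming the character numbers count from 1 and include both endpoints (which is how the 17-34 and 27-44 above line up with the sentence). The list-of-tuples layout and the helper names are my own simplification; real standoff systems are considerably more elaborate.

```python
text = "The quick brown fox jumps over the lazy dog."

# Each annotation is (start, end, label). The text itself is never touched.
annotations = [
    (17, 34, "italic"),
    (27, 44, "bold"),
]


def annotated_span(text, start, end):
    """Return the characters from start to end, counting from 1, inclusive."""
    # Python slices count from 0 and stop before the end, hence start - 1.
    return text[start - 1:end]


def labels_at(position):
    """Return every label that covers the given character position."""
    return [label for start, end, label in annotations if start <= position <= end]


spans = {label: annotated_span(text, start, end) for start, end, label in annotations}
print(spans)

# Character 30 (the "r" in "over") falls inside both ranges at once.
print(labels_at(30))
```

Because the annotations are just ranges sitting next to the text, nothing stops two of them from overlapping, which is exactly what nested tags can’t do.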

In conclusion

If you made it this far, I’d say you “know XML” as much as anyone can know plain-old XML– congratulations! The next time someone asks you to learn XML, you can ask in reply “Which standard?” There are a million of them out there, and odds are there’s one for whatever you’re trying to do. Go forth and encode!