May 31, 2017

Publishing markup mysteries: Clearing up some confusion

I’ve done a lot of writing and speaking about markup over the years, often getting deep into the weeds of angle-bracket land. So it was refreshing when our marketing staff at Apex asked me to write the copy for a one-page infographic that addresses some of the most basic things about markup and file formats that many people are a bit foggy about.

It’s no wonder people get confused about these things!

No geek-speak. No angle brackets. Just a simple explanation of some of the most common and fundamental things publishers depend on in the digital era—things like HTML and XML and CSS and EPUB—that almost everybody knows are important but that often get misunderstood.

I realized that even the concept of “markup” is slippery.

Isn’t EPUB just a form of XML?

For example, people often say “EPUB is XML, right?” Well, yes and no. The content documents in an EPUB (the words you’re reading on your ereader or phone) are XML. But EPUB itself is a file format. It’s a package that contains the many components that make up a publication: not just the content documents, but also the images, media, and other features; the CSS stylesheets and fonts that govern how they look; and the metadata and navigation files that make it all work. All this good stuff is gathered up in a systematic package called an EPUB.

Because its current packaging is a .zip file, an EPUB looks like—and is—a single file. Which leads people to think it’s just a file like an XML file. Nope. It’s way more than that.
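To make the packaging idea concrete, here is a minimal Python sketch that builds a toy EPUB-like package in memory and lists what is inside it. The paths under OEBPS/ and all of the placeholder contents are hypothetical, chosen only for illustration; the mimetype entry and META-INF/container.xml are the only names taken from the EPUB container rules, and the result is not a complete, valid EPUB.

```python
import io
import zipfile

# Hypothetical layout: real EPUBs choose their own paths for these parts.
PARTS = {
    "META-INF/container.xml": "<container/>",  # tells reading systems where the package document is
    "OEBPS/package.opf": "<package/>",         # metadata, manifest, reading order
    "OEBPS/nav.xhtml": "<html/>",              # navigation (table of contents)
    "OEBPS/chapter1.xhtml": "<html/>",         # a content document
    "OEBPS/styles.css": "p { margin: 0; }",    # styling
    "OEBPS/cover.jpg": b"",                    # empty placeholder standing in for an image
}


def build_toy_epub() -> bytes:
    """Zip the placeholder parts into one file, the way an EPUB is packaged."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # The mimetype entry goes first and is stored uncompressed.
        package.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        for name, content in PARTS.items():
            package.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


def list_parts(epub_bytes: bytes) -> list:
    """Return the names of everything inside the package."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as package:
        return package.namelist()


if __name__ == "__main__":
    for name in list_parts(build_toy_epub()):
        print(name)
```

One file on disk, seven entries inside it: that is the sense in which an EPUB both is and is not “a single file.”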

HTML is not the same as XML. Except when it is.

Those XML content documents in an EPUB aren’t just any XML. They’re XML using a very specific vocabulary: HTML5. Or, to say that the other way around, they’re HTML5 using XML syntax. That’s often referred to as XHTML; but it’s not the old XHTML 1.1 of a few years back.

No wonder people are confused!

XML is a markup language (actually, a way to make up a markup language), but it’s not a vocabulary. You can express virtually any arbitrary vocabulary as XML. HTML is a very specific vocabulary (it’s also APIs, but now we’re getting geeky) that can be expressed in what are called serializations: HTML (used by websites and browsers, among other things) and XML (used by EPUB and other environments where a more rigorous—some would say rigid—syntax is required). Same vocabulary, different ways of expressing it.
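For readers who do want a peek at the angle brackets, the following Python sketch shows the “same vocabulary, different syntax” idea using only the standard library. The two sample strings and the helper names (TagCollector, html_tags, is_well_formed_xml) are made up for this illustration. The HTML serialization lets a line break stand unclosed; the XML serialization insists that it be closed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The same paragraph in the two serializations (sample strings for illustration).
HTML_SYNTAX = "<p>First line<br>second line</p>"
XML_SYNTAX = "<p>First line<br/>second line</p>"


class TagCollector(HTMLParser):
    """Records the element names a forgiving HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    """Parse markup with HTML rules and return the element names found."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for label, sample in (("HTML syntax", HTML_SYNTAX), ("XML syntax", XML_SYNTAX)):
        print(label, html_tags(sample), "well-formed XML:", is_well_formed_xml(sample))
```

Both samples use the same vocabulary (a p element containing a br element), and the HTML parser reads either one. Only the second sample satisfies the XML parser, which is the rigor that EPUB content documents are held to.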

XML is for the tags. Unicode is for the characters.

Digging even deeper, there is another standard underpinning XML that is also fundamental: Unicode. Unicode is the system that specifies how all the characters in most of the world’s languages (well, technically, their scripts) are encoded. It also provides standard encoding for symbols, like math and currency symbols, dingbats like bullets and boxes, music and phonetics, and even—yuck!—emojis.


Every character and symbol in an XML file is encoded in Unicode; if it isn’t Unicode it isn’t XML. That’s what makes the intended characters unambiguous, which enables them to be reliably rendered. So why do you get those stupid empty boxes sometimes where a character is supposed to be? Because the font being used doesn’t include a glyph for that character. You need a Unicode font that does.

Keep that in mind when designing your publications: use Unicode fonts!
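As a small illustration of why the character itself is unambiguous even when a font can’t draw it, this Python sketch looks up the Unicode code point, official name, and UTF-8 bytes for a few sample characters. The describe helper and the choice of samples are assumptions made for this example. The characters are written as escape codes, so the script does not depend on any font at all.

```python
import unicodedata

# Sample characters, written as escapes so no font is needed to read the source:
# an accented letter, a currency symbol, a music symbol, and an emoji.
SAMPLES = ["\u00e9", "\u20ac", "\u266a", "\U0001F600"]


def describe(char):
    """Return (code point, Unicode name, UTF-8 bytes in hex) for one character."""
    code_point = f"U+{ord(char):04X}"
    name = unicodedata.name(char, "<unnamed>")
    utf8_bytes = char.encode("utf-8").hex(" ")
    return code_point, name, utf8_bytes


if __name__ == "__main__":
    for char in SAMPLES:
        print(*describe(char), sep=" | ")
```

The identity of each character is fixed by its code point. Whether you see the character or an empty box depends only on whether the font in use has a glyph for that code point.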

Even InDesign can be XML. But be careful.

Which brings me to the last format our new infographic includes: IDML. That’s the format that expresses what Adobe InDesign produces as XML rather than as PDF.

So why don’t we just use IDML in the first place and be done with it? Because IDML is an output from InDesign. It’s almost never an input. It expresses all the lovely spacing and sizing and positioning and the other design aspects that InDesign is so good at—as a result of the designer’s and typesetter’s work.


You can actually import almost any arbitrary XML into InDesign as a source, though. Yes, even XHTML. And you can export EPUB from InDesign, whether or not you’ve used XHTML as an input. (Though be careful: depending on how you’ve set up your InDesign workflow, the resulting EPUBs can be anything from reasonably good EPUBs to terrible EPUBs.)


I hope our new infographic is helpful. If you’re reading this blog, you might already understand most of this; but I guarantee you that many people you work with don’t. Feel free to share it and get in touch if you’d like to learn more.

About Bill Kasdorf

Bill Kasdorf is Principal of Kasdorf & Associates, LLC, a publishing consultancy focusing on accessibility, XML/HTML/EPUB modeling, information infrastructure, and workflow. Bill is active in the W3C Publishing Business Group, Publishing Working Group, and EPUB 3 Community Group; chairs the Content Structure Committee of the Book Industry Study Group and is co-editor of the BISG Guide to Accessible Publishing; and is Past President of the Society for Scholarly Publishing (SSP). He is a recipient of SSP’s Distinguished Service Award, the IDEAlliance/DEER Luminaire Award, and the Book Industry Study Group’s Industry Champion Award. Bill has written and spoken widely for organizations such as SSP, IDPF, BISG, DBW, IPTC, O’Reilly TOC, NISO, AAP, AAUP, ALPSP, and STM. General Editor of The Columbia Guide to Digital Publishing and Guest Editor of the January 2018 issue of the Learned Publishing journal devoted to accessibility, he is the author of the chapter on EPUB metadata and packaging for O’Reilly’s EPUB 3 Best Practices and the chapter on EPUB in the book The Critical Component: Standards in Information Distribution, published by the ALA in collaboration with NISO. He serves on the editorial boards of Learned Publishing and the Journal of Electronic Publishing. In his consulting practice, Bill has served clients globally, including large international publishers such as Pearson, Cengage, Wolters Kluwer, Kaplan, and Sage; scholarly presses and societies such as Harvard, MIT, Toronto, Taylor & Francis, Cambridge, and IEEE; aggregators such as VitalSource; and global publishing and library organizations such as the World Bank, the British Library, the Asian Development Bank, OCLC, and the European Union.