May 31, 2017

Publishing markup mysteries: Clearing up some confusion

I’ve done a lot of writing and speaking about markup over the years, often getting deep into the weeds of angle-bracket land. So it was refreshing when our marketing staff at Apex asked me to write the copy for a one-page infographic that addresses some of the most basic things about markup and file formats that many people are a bit foggy about.


No geek-speak. No angle brackets. Just a simple explanation of some of the most common and fundamental things publishers depend on in the digital era—things like HTML and XML and CSS and EPUB—that almost everybody knows are important but that often get misunderstood.

I realized that even the concept of “markup” is slippery.

Isn’t EPUB just a form of XML?

For example, people often ask, “EPUB is XML, right?” Well, yes and no. The content documents in an EPUB (the words you’re reading on your ereader or phone) are XML. But EPUB itself is a file format: a package that contains all the components that make up a publication. That means not just the content documents but also the images, media, and other features of the publication, the CSS stylesheets and fonts that govern how they look, and the metadata and navigation files that make it all work. All this good stuff is gathered up in a systematic package called an EPUB.

Because its current packaging is a .zip file, an EPUB looks like—and is—a single file. Which leads people to think it’s just a file like an XML file. Nope. It’s way more than that.
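For readers who do want to peek inside, here is a minimal Python sketch of the "package" idea. It builds a tiny EPUB-like .zip in memory and lists its components. Only the `mimetype` entry and `META-INF/container.xml` are names the EPUB format itself calls for; the `OEBPS/...` file names and the placeholder contents are assumptions for illustration, and the result is not a complete, valid EPUB.

```python
import io
import zipfile


def build_sample_epub() -> bytes:
    """Build a tiny, illustrative EPUB-like package in memory (not a valid EPUB)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # EPUB expects an uncompressed "mimetype" entry as the first item.
        package.writestr(
            "mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED
        )
        # Points reading systems at the package document.
        package.writestr("META-INF/container.xml", "<container/>")
        # Hypothetical file names: metadata, navigation, content, and styling.
        package.writestr("OEBPS/content.opf", "<package/>")
        package.writestr("OEBPS/nav.xhtml", "<html/>")
        package.writestr("OEBPS/chapter1.xhtml", "<html/>")
        package.writestr("OEBPS/style.css", "p { margin: 0; }")
    return buffer.getvalue()


def list_components(epub_bytes: bytes) -> list[str]:
    """Return the names of every file stored inside the package."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as package:
        return package.namelist()


components = list_components(build_sample_epub())
for name in components:
    print(name)
```

One file on disk, several components inside: that is the sense in which an EPUB is "way more" than an XML file.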

HTML is not the same as XML. Except when it is.

Those XML content documents in an EPUB aren’t just any XML. They’re XML using a very specific vocabulary: HTML5. Or, to say that the other way around, they’re HTML5 using XML syntax. That’s often referred to as XHTML; but it’s not the old XHTML 1.1 of a few years back.

No wonder people are confused!

XML is a markup language (actually, a way to make up a markup language), but it’s not a vocabulary. You can express virtually any arbitrary vocabulary as XML. HTML is a very specific vocabulary (it’s also APIs, but now we’re getting geeky) that can be expressed in two syntaxes, called serializations: HTML syntax (used by websites and browsers, among other things) and XML syntax (used by EPUB and other environments where a more rigorous, some would say rigid, syntax is required). Same vocabulary, different ways of expressing it.
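A small Python sketch may make "same vocabulary, different serializations" concrete. The two snippets below are made-up examples: both use the HTML elements `p` and `br`, but only the second is well-formed XML. The helper names (`TagCollector`, `html_tags`, `is_well_formed_xml`) are hypothetical, and the check uses only Python's standard library parsers, not a full HTML5 or EPUB validator.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The same paragraph in the two serializations.
HTML_SYNTAX = "<p>Line one<br>Line two</p>"   # fine for a browser
XML_SYNTAX = "<p>Line one<br/>Line two</p>"   # every element explicitly closed


class TagCollector(HTMLParser):
    """Record the name of each element as a lenient HTML parser sees it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    """Return the element names found by the lenient HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(html_tags(HTML_SYNTAX), html_tags(XML_SYNTAX))
print(is_well_formed_xml(HTML_SYNTAX), is_well_formed_xml(XML_SYNTAX))
```

The lenient parser finds the same elements in both snippets, while the strict XML parser rejects the first one because its `br` is never closed.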

XML is for the tags. Unicode is for the characters.

Digging even deeper, there is another standard underpinning XML that is also fundamental: Unicode. Unicode is the system that specifies how all the characters in most of the world’s languages (well, technically, their scripts) are encoded. It also provides standard encoding for symbols, like math and currency symbols, dingbats like bullets and boxes, music and phonetics, and even—yuck!—emojis.


Every character and symbol in an XML file is encoded in Unicode; if it isn’t Unicode it isn’t XML. That’s what makes the intended characters unambiguous, which enables them to be reliably rendered. So why do you get those stupid empty boxes sometimes where a character is supposed to be? Because the font being used has no glyph for that character. The character itself is encoded correctly; you need a Unicode font that includes its glyph.

Keep that in mind when designing your publications: use Unicode fonts!
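To see what "encoded in Unicode" means in practice, here is a short Python sketch using only the standard library. It looks up the code point and official name of a few sample characters (chosen here purely for illustration), then shows that an XML numeric character reference resolves to the same character. The `describe` helper is a hypothetical name, not part of any library.

```python
import unicodedata
import xml.etree.ElementTree as ET


def describe(character: str) -> tuple[str, str]:
    """Return the code point (as U+XXXX) and the Unicode name of one character."""
    code_point = f"U+{ord(character):04X}"
    return code_point, unicodedata.name(character)


# A letter, a Greek letter, a currency symbol, a dingbat, music, and an emoji.
for character in ["A", "π", "€", "•", "♪", "😀"]:
    code_point, name = describe(character)
    print(code_point, name)

# In XML, &#x20AC; refers to code point U+20AC, which is the euro sign.
paragraph = ET.fromstring("<p>&#x20AC;</p>")
euro_from_xml = paragraph.text
print(euro_from_xml == "€")
```

Note that nothing here says whether a given font can draw these characters; the encoding identifies the character, and the font supplies the glyph.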

Even InDesign can be XML. But be careful.

Which brings me to the last format our new infographic includes: IDML (InDesign Markup Language). That’s the format Adobe InDesign uses to express a document as XML rather than as PDF.

So why don’t we just use IDML in the first place and be done with it? Because IDML is an output from InDesign. It’s almost never an input. It expresses all the lovely spacing and sizing and positioning and the other design aspects that InDesign is so good at—as a result of the designer’s and typesetter’s work.


You can actually import almost any arbitrary XML into InDesign as a source, though. Yes, even XHTML. And you can export EPUB from InDesign, whether or not you’ve used XHTML as an input. (Though be careful: depending on how you’ve set up your InDesign workflow, the resulting EPUBs can be anything from reasonably good EPUBs to terrible EPUBs.)

It’s no wonder people get confused about these things!

I hope our new infographic is helpful. If you’re reading this blog, you might already understand most of this; but I guarantee you that many people you work with don’t. Feel free to share it and get in touch if you’d like to learn more.

About Bill Kasdorf

Bill Kasdorf is VP and Principal Consultant of Apex Content and Media Solutions. Past President of SSP, he is a recipient of SSP’s Distinguished Service Award, the IDEAlliance/DEER Luminaire Award, and the BISG Industry Champion Award. Bill serves on the Steering Committee of the W3C Publishing Business Group, on the W3C Publishing Working Group developing the next generation of Web Publications and EPUB, and on the International Press Telecommunications Council; he is Chair of the BISG Content Structure Committee; and he is an active member of the Accessible Books Consortium (ABC), the EDUPUB Alliance, and the IDEAlliance Tech Council. Bill has spoken at many industry events, such as SSP, STM, AAUP, DBW, O’Reilly TOC, NISO, BISG, IDPF, IPTC, Seybold Seminars, and the Library of Congress. He serves on the editorial boards of Learned Publishing and the Journal of Electronic Publishing. In his consulting practice, Bill has served publishers such as Pearson, Wolters Kluwer, Kaplan, Sage, Harvard, Toronto, Taylor & Francis, Cambridge, ASME, and IEEE, and organizations such as the World Bank, the British Library, OCLC, and the European Union.
