
May 2, 2017

Moving past PDF’s rigidity toward HTML’s flexibility

The Portable Document Format – better known as the PDF – has been the go-to standard for digital versions of print documents since the 1990s, and for good reason. The PDF has a lot of desirable features. It’s simple to create, the page layout is comfortable for readers, and the documents can be read offline.

However, PDF has always been tied to print publishing. While extensions have increased its flexibility, it remains too limited to support modern publishing and digital document accessibility needs. To stay profitable, current, and compliant with accessibility standards, publishers need a more flexible file type.

Why modern publishing demands a modern file type

PDF files with high-quality images, embedded fonts, bookmarks, and links can be very large. They consume a lot of hard drive space and can be time-consuming to transfer to other parties.

Trying to compress a PDF to a manageable size creates another set of problems. Figures, tables, and images are often degraded beyond usability. Print quality may also be sacrificed.

Another downside: a PDF is a static document. Once it’s been created, you cannot update, correct, or change it without generating an entirely new document. Because of this, readers have no way to know whether the PDF they’re reading or referencing is the most current version. This can be especially problematic in scholarly publishing, where every copy of a study or paper must match exactly across all platforms. CrossRef.org, a nonprofit organization for scholarly publishers, is striving to overcome this limitation of PDF, but its efforts, as we’ll see, are imperfect.

It can also take considerable time and money to make PDFs compliant with the Web Content Accessibility Guidelines (WCAG), which address the needs of readers with visual, hearing, and other disabilities. WCAG compliance is quickly becoming mandatory, and publishers who ignore it may be prevented from selling in important markets.

There’s a better format for creating documents that address these issues and can adapt to the industry’s online publishing needs: HTML.


Using an XML workflow, publishers can easily create clean HTML content for display via a web browser. HTML tags “describe” every element in the document, from heading to text to references and images. This collective information is the document’s metadata.
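To make this concrete, here is a minimal sketch of the idea, written in Python with only the standard library. The XML element names (article, title, author, section, heading, para) are hypothetical and do not represent any real publishing schema or Apex's actual tooling; a production workflow would more likely use an established schema and a dedicated transformation step.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source markup: element names are illustrative only.
SOURCE_XML = """
<article>
  <title>Sample Study</title>
  <author>A. Researcher</author>
  <section>
    <heading>Methods</heading>
    <para>We measured outcomes in 40 patients.</para>
  </section>
</article>
"""

def article_to_html(xml_text: str) -> str:
    """Turn the hypothetical XML article into a small semantic HTML page."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title", default="")
    author = root.findtext("author", default="")
    parts = [
        "<!DOCTYPE html>",
        '<html lang="en">',
        "<head>",
        f"<title>{escape(title)}</title>",
        # Author metadata travels with the document in the HTML head.
        f'<meta name="author" content="{escape(author, quote=True)}">',
        "</head>",
        "<body>",
        "<article>",
        f"<h1>{escape(title)}</h1>",
    ]
    for section in root.findall("section"):
        heading = section.findtext("heading", default="")
        parts.append("<section>")
        parts.append(f"<h2>{escape(heading)}</h2>")
        for para in section.findall("para"):
            parts.append(f"<p>{escape(para.text or '')}</p>")
        parts.append("</section>")
    parts += ["</article>", "</body>", "</html>"]
    return "\n".join(parts)

html_doc = article_to_html(SOURCE_XML)
print(html_doc)
```

Each structural element in the source maps to an HTML tag that says what the content is (a title, a section heading, a paragraph), which is what lets browsers, search tools, and assistive technology make sense of it.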

The ability to code and use every available piece of metadata in an HTML document benefits publishers, authors, and end-users.

Metadata in use: The benefits of HTML documents

Using the metadata, all references in an HTML document can be embedded in the file at the point of citation so readers can jump directly to the study on PubMed, CrossRef, or another site. Publishers also have the option of using digital object identifiers (DOIs) for references, tables, and figures for ease of access from anywhere in the HTML document.
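As a rough illustration of linking at the point of citation, the Python sketch below builds an in-text citation marker and a reference-list entry whose DOI resolves through doi.org. The helper names (doi_link, cite, reference_entry), the id scheme, and the sample DOI are all assumptions made for this example, not part of any particular publishing system.

```python
from html import escape
from urllib.parse import quote

def doi_link(doi: str, label: str) -> str:
    """Return an HTML link that resolves a DOI through the doi.org resolver."""
    url = "https://doi.org/" + quote(doi, safe="/")
    return f'<a href="{escape(url, quote=True)}">{escape(label)}</a>'

def cite(ref_id: str, number: int) -> str:
    """In-text citation marker that jumps to the matching reference entry."""
    return f'<a href="#{escape(ref_id, quote=True)}">[{number}]</a>'

def reference_entry(ref_id: str, text: str, doi: str) -> str:
    """Reference-list item carrying both an anchor id and a DOI link."""
    ref_attr = escape(ref_id, quote=True)
    return f'<li id="{ref_attr}">{escape(text)} {doi_link(doi, "doi:" + doi)}</li>'

# Hypothetical reference; the DOI below is a placeholder, not a real article.
sentence = f"<p>Earlier work reported similar results {cite('ref-1', 1)}.</p>"
entry = reference_entry("ref-1", "Researcher A. Sample Study. 2017.", "10.1000/example.123")

print(sentence)
print(entry)
```

The citation marker and the reference entry share the same id, so a reader can move from the sentence to the reference and from there straight to the cited work.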

All this richly embedded data in HTML allows for a more interesting and cohesive experience, in which end-users can find everything they need right in the document. There’s less risk of the reader going down a data rabbit hole in search of relevant information.

The dynamic HTML format also allows for easily implemented real-time updates, so end-users will always have the most up-to-date files. Because every platform serves the same current content, publishers can maintain a single document of record, with no discrepancies or outdated information.


PDFs, by contrast, are static container documents with fixed layout and content embedding. A downloaded PDF may be out of sync with the version maintained dynamically online. CrossRef.org is working to address this with its CrossMark initiative, which readers can use to determine if the PDF they hold is truly current. But that’s assuming the PDF’s publisher is a CrossMark participant.

Regardless, CrossMark is an attempt to remedy a basic flaw of PDF documents. HTML documents produced with an XML workflow do not contain the flaw.

HTML documents can also integrate ORCID IDs. These are author-specific digital identifiers that tie a work unambiguously to its author. The combination of HTML and digital author IDs helps guard against misattribution and lost credit.
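For readers curious about how an ORCID ID might be checked before it is embedded, here is a small Python sketch. It assumes the check-digit scheme ORCID documents for its identifiers (ISO 7064 MOD 11-2, with the final character either a digit or X). The function names are hypothetical, and the sample ID is the one ORCID publishes for its fictional test researcher, Josiah Carberry.

```python
import re
from html import escape

# Four hyphen-separated groups of four; the final character may be a digit or X.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")

def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """True if the string is well formed and its check character matches."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]

def author_link(name: str, orcid: str) -> str:
    """HTML byline linking an author name to the author's ORCID record."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"not a valid ORCID ID: {orcid!r}")
    return f'<a href="https://orcid.org/{orcid}">{escape(name)}</a>'

print(author_link("Josiah Carberry", "0000-0002-1825-0097"))
```

Checking the identifier at production time catches typos before a byline points to the wrong researcher or to no one at all.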

Because HTML documents fully utilize all available metadata, publishers can easily create WCAG-compliant digital documents at less expense, and in a shorter amount of time, than they can with PDFs.

Transitioning to HTML documents

Using an XML-first workflow is the easiest, most cost-efficient way to create HTML documents. The XML-first workflow defines all the metadata needed for HTML documents from the author-supplied source documents.

By harnessing all this usable information first, enriched source documents can be transformed into accessible website content, ePubs, PDFs, HTML documents, or print products. The XML-first workflow gives you the widest range of options right from the start. (For more, see Maximize publishing workflows without shocking your culture.)

XML files with HTML documents of record are a path to the future. But making the move from PDFs to HTML documents doesn’t have to be confusing or frustrating.

Apex has the tools and expertise to help you implement workflow changes that keep you current and profitable in these rapidly changing times. Contact us for more information.

About Greg Suprock

Greg is Head of Solutions Architecture at Apex. He has over 20 years of experience in XML workflows, content management, web application development, and prepress. Greg excels at collaborative efforts to achieve project and business goals. He has developed XML workflows for the Public Library of Science, HighWire Press, The Library of Congress, and many more.