August 9, 2017

Managing metadata: The devil’s in the details

The digital age has allowed publishers to make virtually any content available to audiences everywhere. But that’s both a blessing and a curse; the sheer volume of content available creates a monumental challenge for end-users in finding the particular content they need. Making content discoverable and accessible is where the vital role of metadata comes into play.

Often defined as “data that describes data,” metadata can be thought of as the roadmap and signage around content. At its most basic level, metadata is simply information that describes a book, article, or other piece of content, including title, author(s), ISBN (in the case of books), date of publication, high-level content description, etc.
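As a rough illustration, the basic elements listed above could be captured in a small record like the one below. This is a minimal sketch, not a standard schema: the class name, field names, and sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ContentMetadata:
    """Basic descriptive metadata for one piece of content (hypothetical schema)."""
    title: str
    authors: List[str]
    publication_date: str        # ISO 8601 date string, e.g. "2017-08-09"
    description: str             # high-level content description
    isbn: Optional[str] = None   # only applies to books


# Placeholder values used purely for illustration.
record = ContentMetadata(
    title="An Example Article",
    authors=["Jane Smith", "John Doe"],
    publication_date="2017-08-09",
    description="A placeholder description used only for illustration.",
)

# A plain dictionary is easy to serialize or hand to an indexing system.
record_dict = asdict(record)
```

Keeping the ISBN optional reflects the point that it applies only to books, while the other fields describe any piece of content.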

Any directory of content, from Google to PubMed to the Library of Congress, uses metadata to categorize content for users. Modern metadata, however, does much more than that. Publishers that establish smart metadata processes can use it to:

  • Make their content ever more discoverable to the right audiences.
  • Describe constituent parts of content, such as chapters, sections, or images.
  • Advance the interests of science, as well as the careers of researchers, by providing detailed information about authors.
  • Make content more accessible to people with disabilities (especially those who are visually impaired).

For these reasons, establishing a detail-oriented metadata process in line with modern best practices is a crucial task for publishers and libraries alike.

Defining the process

Both the beauty and the burden of metadata lie in the fact that there is no single way to go about it. You can make your own recipe based on your content and your audience, or you can follow a standard recipe, as most journal publishers do. The process typically begins by identifying (and providing) the core metadata elements required by the content repositories where publishers seek to place their content.

After that, the metadata that publishers need to collect depends on how they serve their own content to their end users. An article that would be discoverable via PubMed, for example, might include additional terms or descriptors beyond the high-level meta description. These terms or descriptors can be associated with and linked to other rich sources of information. For example, compounds may be linked to PubChem, proteins to the Protein Data Bank, or genes to GenBank. A publisher might also reference supplementary materials relevant to the article, so users can access those as well.

The key for publishers is to understand how and where they want to present their content to the public, and then devise a metadata rule-set, baked into their production process from an early stage, that accounts for those objectives.

A behind-the-scenes glimpse at the researcher

While metadata has traditionally served the purpose of providing detailed information about a piece of content, an additional application is its ability to provide rich information about authors and other content contributors.

Take, for example, a set of research papers coming from a large academic medical center on, say, melanoma. John Doe may be listed first in one such paper, but Jane Smith, listed fourth, may be responsible for 50 percent of the articles that the academic center is producing on melanoma. With the right metadata process in place, Jane Smith can be identified as a more prominent expert in her field, which helps ensure her other work is a relevant touchstone for users.

That kind of detail is invaluable to researchers. And it’s a big step in the advancement of science.
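The melanoma example can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the article records, the co-author names, and the `author_share` helper are all hypothetical, and a real system would match authors by a persistent identifier rather than by name string.

```python
from collections import Counter

# Hypothetical article records; authors appear in byline order.
articles = [
    {"title": "Paper A", "topic": "melanoma",
     "authors": ["John Doe", "A. Lee", "B. Cho", "Jane Smith"]},
    {"title": "Paper B", "topic": "melanoma", "authors": ["Jane Smith", "C. Diaz"]},
    {"title": "Paper C", "topic": "melanoma", "authors": ["D. Evans"]},
    {"title": "Paper D", "topic": "melanoma", "authors": ["E. Fox"]},
]


def author_share(records, topic):
    """Return each author's share of the articles on a topic, ignoring byline position."""
    on_topic = [r for r in records if r["topic"] == topic]
    if not on_topic:
        return {}
    # Count each author once per article, wherever they appear in the byline.
    counts = Counter(a for r in on_topic for a in set(r["authors"]))
    return {author: n / len(on_topic) for author, n in counts.items()}


shares = author_share(articles, "melanoma")
most_prolific = max(shares, key=shares.get)
```

In this made-up data set, Jane Smith contributes to half of the melanoma articles even though she is listed fourth on one of them, while John Doe, a first author, contributes to a quarter.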

Academic journals are also implementing metadata to distinguish authors from their peers using ORCID, which supports links to other content specific to that researcher, whether it’s an article of theirs that appeared in a different journal, their bio on a university site, or even their own LinkedIn page.

Or suppose an end-user is searching for the latest papers on that same cancer but wants only those that have been downloaded more than 1,000 times. Here, again, metadata allows for this type of filtering.
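A filter of this kind might look like the following sketch. It assumes usage counts are stored alongside the descriptive metadata; the records, field names, and the `popular_papers` function are hypothetical.

```python
# Hypothetical paper records with a download count stored as metadata.
papers = [
    {"title": "Paper A", "topic": "melanoma", "published": "2017-03-01", "downloads": 1500},
    {"title": "Paper B", "topic": "melanoma", "published": "2017-06-15", "downloads": 800},
    {"title": "Paper C", "topic": "melanoma", "published": "2016-11-20", "downloads": 2300},
    {"title": "Paper D", "topic": "lymphoma", "published": "2017-07-01", "downloads": 5000},
]


def popular_papers(records, topic, min_downloads):
    """Return papers on a topic with more than min_downloads downloads, newest first."""
    matches = [
        r for r in records
        if r["topic"] == topic and r["downloads"] > min_downloads
    ]
    # ISO 8601 date strings sort chronologically when compared as text.
    return sorted(matches, key=lambda r: r["published"], reverse=True)


results = popular_papers(papers, "melanoma", 1000)
titles = [r["title"] for r in results]
```

The filter only works if download counts are recorded consistently for every paper, which is one more reason to build such fields into the metadata process early.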

How modern libraries are using metadata

For libraries, there are other types of metadata constructs that allow, through the MARC record, for the inclusion of rich detail about the content in their stacks. The questions a library might answer with metadata include:

  • Where can users physically locate this book?
  • Which pieces of information live in which chapters?
  • Which languages are represented, and/or do translations exist?
  • If there are images, how might they be captioned? How might their captions be described?
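To make a few of these questions concrete, here is a deliberately simplified sketch. It models a record as a plain dictionary keyed by MARC field tag (245 for the title statement, 041 for language codes, 505 for a contents note, 852 for location); it is not a real MARC parser, and the sample values, the call number, and the helper functions are assumptions for illustration only.

```python
# Simplified stand-in for a MARC record: field tag -> {subfield code: value}.
# All values below are invented for illustration.
record = {
    "245": {"a": "An Example Book"},                            # title statement
    "041": {"a": "eng", "h": "fre"},                            # language of text / of original
    "505": {"a": "Chapter 1. Basics -- Chapter 2. Practice"},   # contents note
    "852": {"b": "Main Library", "h": "Z666.7 .E93"},           # location and call number
}


def shelf_location(rec):
    """Where can users physically locate this book?"""
    field = rec.get("852", {})
    return ", ".join(v for v in (field.get("b"), field.get("h")) if v)


def is_translation(rec):
    """Does the record name an original language that differs from the text?"""
    field = rec.get("041", {})
    return "h" in field and field["h"] != field.get("a")


def chapters(rec):
    """Split the contents note into individual chapter entries."""
    note = rec.get("505", {}).get("a", "")
    return [part.strip() for part in note.split("--") if part.strip()]


location = shelf_location(record)
translated = is_translation(record)
chapter_list = chapters(record)
```

A production workflow would read and write these fields with a dedicated MARC library rather than hand-built dictionaries.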

Again, we see how metadata isn’t just adding data on top of data; it’s exploiting data within data.

Enabling visually-impaired access

Metadata’s ability to render content accessible to disabled users, primarily users who are visually impaired, is also crucial. Publishers must not ignore this audience, not only because doing so is unethical, but also because governments are recognizing that content needs to be readily available for all users.

For example, metadata can provide a translation of an image for a user who cannot see it. A caption to an image is a form of metadata (however rudimentary), often supplemented by alt-text. Alt-text is a more detailed description of an image: it tells not only what the image depicts (as a caption might) but also indicates whether it is a photograph, a painting, a flowchart, or a diagram.

Some publishers are going even further by developing long descriptions of images that provide details in text form, suitable for screen readers, that create a mental image of complex figures or charts. A long description of a flowchart might tell the end-user where the flowchart starts, what each step contains in order of appearance, and where the flow branches. Alt-text and long descriptions can also be used to drive discovery.
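One way to picture a long description for a flowchart is to generate it from a structured version of the figure. The sketch below assumes the flowchart is available as a small dictionary of steps; that structure, the step names, and the `long_description` function are hypothetical, and descriptions written by a person will usually read better than generated ones.

```python
from collections import deque

# Hypothetical flowchart: each step has display text and the steps that follow it.
flowchart = {
    "start": "receive",
    "steps": {
        "receive": {"text": "Receive manuscript", "next": ["check"]},
        "check": {"text": "Check required metadata", "next": ["deposit", "return"]},
        "deposit": {"text": "Deposit article", "next": []},
        "return": {"text": "Return to author", "next": []},
    },
}


def long_description(chart):
    """Describe the starting point, each step in order of appearance, and branches."""
    steps = chart["steps"]
    start = chart["start"]
    sentences = ['The flowchart starts at "' + steps[start]["text"] + '".']
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first walk so steps are described in order of appearance
        step = steps[queue.popleft()]
        targets = ['"' + steps[t]["text"] + '"' for t in step["next"]]
        name = '"' + step["text"] + '"'
        if not targets:
            sentences.append(name + " is an end point.")
        elif len(targets) == 1:
            sentences.append(name + " leads to " + targets[0] + ".")
        else:
            sentences.append(name + " branches to " + " or ".join(targets) + ".")
        for t in step["next"]:
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return " ".join(sentences)


description = long_description(flowchart)
```

The resulting text is plain prose that a screen reader can speak, and because it is ordinary text it can also be indexed for search.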

Most publishers today construct core metadata adequately; that is, they adhere to the core requirements of search engines and content repositories. Yet it’s never a closed process, and meeting the core requirements should be the beginning of the process, not the end.

That’s why publishers and libraries need to pause to carefully consider who their audience is, and how that audience will be most effectively served in trying to find the content that the publisher or the library has. Only then can they build the rule-set that will govern an audience-first, detail-driven metadata process to help them get their content found and downloaded, drive sales, and meet their missions.

About Greg Suprock

Greg is Head of Solutions Architecture at Apex. He has over 20 years of experience in XML workflows, content management, web application development, and prepress. Greg excels at collaborative efforts to achieve project and business goals. He has developed XML workflows for the Public Library of Science, HighWire Press, The Library of Congress, and many more.