September 26, 2017

Newspaper digitization: Why metadata matters

Let us suppose for a moment that you are a scholar of American history, and you are conducting research on the history of slavery and runaway slaves in the 1700s and 1800s. To conduct your research, you need access to historical resources, and one of the best research resources from that time period is newspapers. In the 1700s and 1800s it was common for slave owners to purchase runaway-slave notices in local newspapers, offering a reward.

The way you access newspapers has changed considerably in recent years. Not long ago, you would have had to travel to your state or local library and look through its microfilm collections page by page with a microfilm viewer, tediously spending hours searching by hand for relevant information. A needle in a haystack, as they say.

Today, the way to access newspapers has changed … to an extent. Thanks to the National Digital Newspaper Program (NDNP), a Library of Congress-administered project that works to digitize U.S. historical newspapers and make them available online, you can search for some of those ads from anywhere you have an internet connection and see them as they originally appeared (assuming you get a bit lucky with your word searches).

That’s the good news. But there’s a catch. While you’ll save the trip to the library and find the search process easier, you’ll still have to spend a significant amount of time hunting for the specific targets of your query – and you may not be able to find every relevant piece of content within the trove of 8+ million digital NDNP newspaper pages. This is because even though millions of pages of historic newspapers are available online, the way the content of the newspapers is meta-tagged – that is, the way that information is structured through metadata – still makes the content difficult to search and limits its usefulness.

Why metadata matters

Within the NDNP program, metadata is captured at the newspaper issue, page, and column levels. The information contained within specific articles, ads, obituaries, birth announcements, and so on is not explicitly delineated by the metadata. So while such content is viewable, it isn’t fully searchable: a search can match words on a page, but it cannot be limited to a particular article or advertisement, because no explicit article- or advertisement-level metadata exists.
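To make the difference concrete, here is a minimal Python sketch that contrasts the two kinds of search. It is not METS/ALTO and not how any particular library platform works: the `Page` and `Article` classes, their field names, and the sample text and dates are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Article:
    """Hypothetical article-level record; field names are illustrative only."""
    kind: str  # e.g. "news", "advertisement", "obituary"
    text: str


@dataclass
class Page:
    """A digitized page: full OCR text plus optional article-level segments."""
    issue_date: str
    page_number: int
    ocr_text: str
    articles: List[Article] = field(default_factory=list)


def search_pages(pages: List[Page], keyword: str) -> List[Page]:
    """Page-level search: any page whose full text mentions the keyword."""
    kw = keyword.lower()
    return [p for p in pages if kw in p.ocr_text.lower()]


def search_articles(
    pages: List[Page], keyword: str, kind: str
) -> List[Tuple[Page, Article]]:
    """Article-level search: only items of the requested kind that match."""
    kw = keyword.lower()
    return [
        (p, a)
        for p in pages
        for a in p.articles
        if a.kind == kind and kw in a.text.lower()
    ]


# Invented sample data, for illustration only.
news = Article("news", "The council voted to reward the fire brigade.")
ad = Article("advertisement", "Ten dollars reward for the return of a lost horse.")
SAMPLE_PAGES = [
    Page("1850-01-01", 1, news.text, [news]),
    Page("1850-01-01", 3, ad.text, [ad]),
]

if __name__ == "__main__":
    print(len(search_pages(SAMPLE_PAGES, "reward")))
    print(len(search_articles(SAMPLE_PAGES, "reward", "advertisement")))
```

With this sample data, the page-level search for "reward" returns both pages, and the researcher has to read each one to find the ad. The article-level search returns only the advertisement, because the item type is recorded alongside the text.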

Apex has digitized millions of pages of historical newspapers as part of the NDNP program, state-level programs, and similar newspaper projects in other countries. And as someone deeply involved in that process, I can tell you that there is more that can – and should – be done to create excellent digital newspapers.

Let’s take a step back. METS/ALTO, the de facto XML standard for newspaper digitization, was adopted in the early 2000s by the Library of Congress as the encoding standard for digitizing newspapers within the NDNP program. METS/ALTO has since been adopted by hundreds of libraries around the world, and today is considered the gold standard for digitizing historic newspaper collections.

But while the standard is well known among librarians and archivists in the digitization world, what is often overlooked is that METS/ALTO can be taken a step further to provide wider utility and serve new audiences and use cases.

For example, the National Library of Australia (NLA) engaged Apex to digitize hundreds of years of Australia’s historic newspapers. We did so using the METS/ALTO standard. But the NLA team wanted us to do something different – they wanted us to go further. They wanted Apex to capture and encode metadata, granularly, at the article level.

This approach requires marginal extra effort and cost. But it is entirely possible within the METS/ALTO standard – and the results are noteworthy. NLA’s digital newspapers let users search the depths of Australia’s historic newspapers by article type, keyword, author, and other search factors, and better correlate what they find with other information they discover. In effect, this creates a richer online experience and encourages wider use of the content.

It is worth noting that there’s another large obstacle to wider adoption of article-level meta-tagging. Many library newspaper programs, including the Library of Congress, use specific technology platforms to make their newspaper content available to their constituents. Many of these platforms lack support for article-level metadata structures. However, there are flexible newspaper platforms available in the market that support article-level segmentation – and they are worth exploring.

Libraries, consider your options

To summarize: there’s more than one way to digitize a newspaper within the METS/ALTO standard. For libraries with limited budgets (and what public library’s budget is not limited?), it’s important to consider carefully how constituents will use the information that’s being digitized. That means libraries have to know their users.

Those users might include scholars conducting academic research, amateur genealogists researching their family trees, and ordinary history buffs in search of primary-source information about a person or event, among others. All users benefit from rich metadata, of course, but differences in user mixes often lead libraries to make different choices about how they spend their digitization budget dollars.

Recently I’ve noticed a surge in the number of institutions that have taken an interest in article-level meta-tagging within their newspaper digitization programs. That’s great news for history fans and researchers alike, but it leads to plenty of questions from libraries about how to strike the right balance.

At Apex, we excel at helping libraries think through these complex options, and we’re happy to help. Just reach out.
