Digitizing bi-directional Hebrew and Latin text with JSTOR

CASE STUDY

How JSTOR Successfully Digitized Thousands of Pages of Bi-Directional Hebrew and Latin Text

Challenge

JSTOR maintains a digital library of hundreds of thousands of archival and current scholarly publications for individuals and institutions to access online. When JSTOR set out to add over half a million pages of Hebrew journal content dating back to 1922, the project presented several major technical challenges:

Hebrew reads right to left and uses a character set that English-based OCR software is not designed to process
Hebrew characters with diacritics (dots and dashes in, above, and under characters) often come out jumbled, incorrect, or out of order after scanning
Incorporating metadata in both Hebrew and English required integrating two alphabets that run in opposite directions
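
To make the diacritic-ordering and text-direction challenges above concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not a description of how Apex's IZAAC software or its third-party tool works. The helper names (canonical_order, bidi_classes) and the sample strings are hypothetical. The sketch shows that the same pointed Hebrew letter can be stored with its marks in different orders, and that Unicode normalization brings both to one canonical sequence.

```python
import unicodedata

def canonical_order(text):
    """Hypothetical helper: return text in NFC, which puts combining
    marks on each base letter into Unicode canonical order."""
    return unicodedata.normalize("NFC", text)

def bidi_classes(text):
    """Hypothetical helper: list the Unicode bidirectional class of each
    character ('R' = right-to-left, 'L' = left-to-right, 'NSM' = mark)."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Illustrative sample: the letter bet with a patah (vowel) and a dagesh (dot).
BET, PATAH, DAGESH = "\u05D1", "\u05B7", "\u05BC"

ocr_a = BET + DAGESH + PATAH   # marks stored in one order
ocr_b = BET + PATAH + DAGESH   # same marks, opposite order

# The raw strings differ, so a naive comparison or search would miss a match.
raw_equal = ocr_a == ocr_b

# After normalization both strings hold the same code point sequence.
normalized_equal = canonical_order(ocr_a) == canonical_order(ocr_b)

# A Hebrew letter and a Latin letter carry opposite direction classes.
mixed = bidi_classes(BET + PATAH + "A")
```

Normalization only settles the order of marks attached to the same letter. It cannot repair a mark that OCR misread or attached to the wrong letter, which is the kind of case the case study assigns to human language experts.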

JSTOR turned to Apex, a partner of over 20 years with the proven ability to adapt to changing demands and requirements.

Solution

Apex responded by developing a meticulous 40-step workflow that produced a highly accurate digital journal complete with metadata meeting JSTOR’s detailed requirements. Apex integrated its IZAAC conversion software with a third-party software solution to recognize right-to-left text and diacritic markings and convert them accurately. Hebrew language experts based in Israel were recruited to distinguish subtle variations in the text that software alone could not resolve, further increasing accuracy. IZAAC also enabled the creation and zoning of metadata and non-metadata elements in both Hebrew and English. The work was carried out by an international team based in the United States, India, and Israel.

Results

Once the software was ready, JSTOR was able to complete its Hebrew journal project and make its collection of historic and current journals available online for scholarly pursuit. Partnering with Apex led to remarkable results:

500,000 journal pages were successfully converted
All JSTOR quality standards were met or exceeded
XML metadata was created for each article, issue, and journal
Reference XML metadata, OCR files, illustration files, and page image files were provided for each journal
Apex continues to work with JSTOR, delivering over 1.5 million pages of content per year in XML, OCR text, and image formats.
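
As an illustration of what bilingual article metadata in XML can look like, here is a minimal Python sketch using the standard library. The element names (article, title, year), the function build_article_metadata, and the sample titles are all hypothetical placeholders. They are not JSTOR's schema or Apex's output format, neither of which is described in this case study.

```python
import xml.etree.ElementTree as ET

# Namespace-qualified name for the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_article_metadata(title_he, title_en, year):
    """Hypothetical helper: build one article record that carries its
    title in both Hebrew and English, each tagged with xml:lang."""
    article = ET.Element("article")
    for lang, title in (("he", title_he), ("en", title_en)):
        element = ET.SubElement(article, "title")
        element.set(XML_LANG, lang)
        element.text = title
    ET.SubElement(article, "year").text = str(year)
    # encoding="unicode" returns a str and keeps Hebrew characters as-is.
    return ET.tostring(article, encoding="unicode")

# Placeholder values for illustration only.
record = build_article_metadata("כותרת לדוגמה", "Sample title", 1922)

# Round trip: parse the record and read each title back by language.
parsed = ET.fromstring(record)
titles = {t.get(XML_LANG): t.text for t in parsed.findall("title")}
```

Tagging each title with its language lets downstream display software choose the correct text direction per field instead of guessing from the characters.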

Learn more about digitizing your collection with best-in-class metadata. Contact Apex.
