January 6, 2020

Building a digital archive for decaying paper documents

Paper documents are still priceless records of the past, even in a digital world. Primary sources stored in local archives throughout Latin America, for example, describe a centuries-old multiethnic society grappling with questions of race, class and religion.

However, paper archives are vulnerable to flooding, humidity, insects, and rodents, among other threats. Political instability can cut off the money used to maintain archives, and institutional neglect can transform precious records into moldy rubbish.

Working closely with colleagues from around the world, I build digital archives and specialized tools that help us learn from those records, which trace the lives of free and enslaved people of African descent in the Americas from the 1500s to the 1800s. Our effort, the Slave Societies Digital Archive, is one of many humanities projects that have accumulated substantial collections of digital images of paper documents.

The goal is to ensure this information—including some from documents that no longer exist physically—is accessible to future generations.

But preserving history by taking high-resolution photographs of centuries-old documents is only the beginning. Technological advances help scholars and archivists like me do a better job of preserving these records and learning from them, but they don't always make the work easy.

Collecting documents

Since 2003, the Slave Societies Digital Archive has collected more than 700,000 digitized images of historical records documenting the lives of millions of Africans and people of African descent in North and South America.

Members of the core team, from universities in the U.S., Canada, and Brazil, travel to project sites throughout Latin America, where they train local students and archivists to digitize ecclesiastical and government records from their communities. We give these communities the cameras, computers and other hardware they need to digitally preserve documents piled in the corners of 18th-century church basements, or about to be discarded by space-crunched municipal archives.

We also teach them a crucial skill for archiving and retrieval: how to create metadata, the descriptive information to help people find what interests them—like whether a document is a marriage certificate or a baptism record, and what year and town it's from. Good metadata allows visitors to the project website to, for example, search for all baptism records from 17th-century Colombia.
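To make the idea concrete, here is a minimal sketch of how descriptive metadata supports that kind of search. The field names, the `search` helper, and the sample records are all invented for illustration; they are not the project's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRecord:
    """Descriptive metadata for one digitized document (hypothetical fields)."""
    doc_type: str  # e.g. "baptism" or "marriage"
    year: int      # year the document was written
    town: str
    country: str


def search(records, doc_type, country, century):
    """Return records of one type and country written in the given century."""
    first_year = (century - 1) * 100 + 1  # the 17th century starts in 1601
    last_year = century * 100             # and ends in 1700
    return [
        r for r in records
        if r.doc_type == doc_type
        and r.country == country
        and first_year <= r.year <= last_year
    ]


# Invented sample records, for illustration only.
catalog = [
    DocumentRecord("baptism", 1655, "Cartagena", "Colombia"),
    DocumentRecord("marriage", 1660, "Cartagena", "Colombia"),
    DocumentRecord("baptism", 1590, "Havana", "Cuba"),
    DocumentRecord("baptism", 1720, "Mompox", "Colombia"),
]

# All baptism records from 17th-century Colombia.
matches = search(catalog, doc_type="baptism", country="Colombia", century=17)
```

Without the type, year and place recorded for each document, a query like this would have nothing to filter on, which is why the metadata matters as much as the image itself.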

From digitization to preservation

Over time, we've gotten much better at digitizing documents. In older images, it's not uncommon to see the photographer's finger straying in from the side of the frame. Some of those older images are stored as relatively low-resolution JPEG files, a format that compresses the image file size by deleting some data when it's saved. Most of those files are still completely legible even when a viewer zooms in, but some are not and will need to be digitized again in the future.

Our more recent preservation work adheres to the rigorous standards of the British Library, which funds much of our work. Those images are taken at very high resolution and stored in multiple file formats, including TIFF, which remains the archival standard.

Transforming a collection of digitized images into a true digital archive is a time-consuming and detail-oriented effort. Early in this process, we ran into a curious problem involving photographs taken during our first few digitization efforts. Modern software frequently misinterpreted the orientation of these images, giving us pages rotated 90 degrees to the right or left or even completely upside down. In cases where an entire volume was rotated in the same incorrect way, it could be fixed automatically, but others with a range of errors had to be corrected by hand to let researchers work more easily with the material.
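The distinction between volumes that can be fixed automatically and those that need hand correction can be sketched as a simple check. This is only an illustration, not the tooling we actually used: the function name is hypothetical, and it assumes someone has already noted, for each page, how many degrees clockwise the image appears rotated away from upright.

```python
def plan_rotation_fix(page_rotations):
    """Decide how to correct one volume.

    page_rotations lists, for each page, how far the displayed image is
    rotated clockwise from upright: 0, 90, 180 or 270 degrees.
    Returns (plan, correction), where correction is the clockwise rotation
    to apply to every page, or None when pages must be fixed one by one.
    """
    if all(degrees == 0 for degrees in page_rotations):
        return ("ok", 0)
    if len(set(page_rotations)) == 1:
        # Every page is off by the same amount, so one rotation fixes all.
        error = page_rotations[0]
        return ("automatic", (360 - error) % 360)
    # A mix of errors: each page has to be checked and corrected by hand.
    return ("manual", None)


uniform_volume = plan_rotation_fix([90, 90, 90, 90])
mixed_volume = plan_rotation_fix([0, 90, 180, 90])
```

Here `uniform_volume` is `("automatic", 270)`, meaning every page can be turned a further 270 degrees clockwise in one batch, while `mixed_volume` is `("manual", None)`.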

We've also found that data file names can cause problems. Many cameras assign images default names—like DSCN9126.jpg—that aren't useful for figuring out what the pictures are. We have to rename each image in a standard way that indicates how it fits into our collection.

For the time being we've chosen simply to number images sequentially within each volume; another reasonable option would be to prefix each of these numbers with an ID referring to the volume the image comes from.
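Both naming options can be sketched in a few lines. This is a hedged illustration rather than our actual renaming script: the `standard_names` function, the four-digit padding, the underscore separator and the volume ID `vol12` are all assumptions, and the sketch further assumes that sorting the camera's default names reproduces the page order.

```python
import os


def standard_names(camera_files, volume_id=None, digits=4):
    """Map camera default names to sequential names within one volume.

    Assumes that sorting the original names gives the page order.
    If volume_id is given, it is prefixed to each sequence number.
    """
    mapping = {}
    for number, old_name in enumerate(sorted(camera_files), start=1):
        _, extension = os.path.splitext(old_name)  # keeps the leading dot
        stem = f"{number:0{digits}d}"
        if volume_id is not None:
            stem = f"{volume_id}_{stem}"
        mapping[old_name] = stem + extension.lower()
    return mapping


photos = ["DSCN9127.jpg", "DSCN9126.jpg", "DSCN9128.JPG"]

plain = standard_names(photos)
prefixed = standard_names(photos, volume_id="vol12")  # hypothetical volume ID
```

With these inputs, `plain` maps `DSCN9126.jpg` to `0001.jpg`, while `prefixed` maps it to `vol12_0001.jpg`. The prefixed form keeps a file identifiable even if it is copied out of its volume's folder, at the cost of longer names.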

These aren't major hurdles, but they and others along similar lines take some time to figure out and address properly. The effort pays off when people hoping to explore the collection have an easier time finding and using our images.

Where to store them?

Once we've captured the images, we need to save them somewhere.

At present, the Slave Societies Digital Archive collection is close to 20 terabytes—roughly the space needed to store all the text in the Library of Congress.

Few institutions have the resources, personnel or expertise needed to store humanities data at such large scales. Data storage isn't exorbitantly expensive, but it's also not cheap—especially when the data needs to be accessed regularly, as opposed to being stored in a static backup or archival copy.

For many years, the Vanderbilt University Library hosted the data, but we outgrew what that organization could afford. We had been backing up many of our most important records on the Digital Preservation Network, a consortium of universities that pooled resources to fund a reliable digital storage system for scholarly production. But that organization shut down in late 2018 after consulting with each member organization to ensure that no data would be lost.

Our path has led to the cloud: computers in technology companies' massive server warehouses that we access remotely to store and retrieve information. At the moment, multiple copies of our entire dataset are stored on servers on opposite sides of North America. As a result, we're far less likely to lose our data than at any previous point in the project's history.

Opening access

Storing these records in secure systems is only one part of the equation. We also need to make sure that they're accessible to the people who want to see them.

Our documents, typically written in archaic Spanish or Portuguese, are very hard to read. Even native speakers need special training to decipher what they say.

For several years, we've been producing manual transcriptions of some of our most noteworthy records, such as a volume of baptisms from late 16th-century Havana. But that takes 10 to 15 minutes per page—meaning that transcribing our entire collection would take more than 100,000 hours.
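The arithmetic behind that estimate is easy to check. The sketch below assumes, as a simplification, that each of the roughly 700,000 images in the collection is one page to transcribe; the function name is only illustrative.

```python
IMAGES = 700_000  # "more than 700,000 digitized images" in the collection


def transcription_hours(pages, minutes_per_page):
    """Total transcription time in hours for a given pace."""
    return pages * minutes_per_page / 60


low = transcription_hours(IMAGES, 10)   # about 116,667 hours
high = transcription_hours(IMAGES, 15)  # 175,000 hours
```

Even at the faster pace of 10 minutes per page, the total comes to well over 100,000 hours of skilled work.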

Other projects have used volunteers to do similar work, but that approach is less likely to be the solution for our archive because of the linguistic skills required to read our documents.

We are exploring automating the transcription process using handwriting recognition technology. Those systems need more work, particularly when dealing with centuries-old handwriting styles, but some researchers are already making progress.

We are also looking at ways to identify the people and places mentioned in our records, making them searchable and connecting them to other similar datasets.

As we and other researchers connect our work, the stories contained in these old documents will come to life and bring new insight to modern scholars.

Provided by The Conversation

This article is republished from The Conversation under a Creative Commons license.

Citation: Building a digital archive for decaying paper documents (2020, January 6) retrieved 30 June 2024 from https://techxplore.com/news/2020-01-digital-archive-paper-documents.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
