Long-term Data Management: Automating Digital Preservation

Technology | Dr Matthew Addis | February 18, 2020


Dr. Matthew Addis, CTO of Arkivum, explains the many benefits of automating tasks within the digital preservation and data management workflow.

Long-term data management includes stewardship of digital content over decade timescales or longer. Digital curation and digital preservation are key activities: they help ensure that content is properly protected and safeguarded, that it remains findable and discoverable, and that it remains accessible and usable for whoever needs it in the future.

Doing this effectively and in a way that can cope with the ever-growing amounts of digital content that organizations need to retain for the long-term requires three things:

  1. Sufficient and sustainable budgets, i.e. ongoing funding;
  2. Skilled and experienced staff, i.e. people who can do the work;
  3. Software systems and tools that can support the activities involved.

This is sometimes called the three-legged stool of digital preservation. If any of the legs aren’t solid and stable, then the stool becomes wonky and liable to fall over! 

Automation is about using computers or machines to do tasks that people would otherwise do manually.

In long-term data management, this means using software to automate digital preservation activities and workflows. Automation is about doing more without increasing costs; it's about efficiency and productivity. That means being able to deal with common problems such as not having enough skilled staff or funding to handle all the data you need to preserve.

Digital preservation involves a lot of activities, and many of them are labor intensive. In many organizations, a small number of people are expected to do everything. Automation can lower the burden on these small teams and help them focus on a set of core activities rather than trying to do absolutely everything all at once and all the time.

Automation can also reduce costs, especially those associated with staff, but that doesn't mean making staff insignificant or redundant. Quite the opposite: automation allows staff to spend more of their time on the interesting and challenging areas of data management, for example selection and appraisal, curation, helping people find and use content, or dealing with complex content types such as social media, email, geospatial data, websites or software. Automation also has the potential to reduce errors, which matters in digital preservation because it lowers the risk of content being lost or becoming inaccessible.

While automation can sound like a panacea, the challenge is how to put it into practice.

I’d suggest that there are several facets to effective automation:

  • What is the minimal set of activities needed across the board for your digital content, for example safeguarding your files and identifying the file formats that you have? Try automating these activities first. This is sometimes called Minimal Effort Ingest or Parsimonious Preservation. It is about using automation to get a basic level of protection and understanding of your content in place as quickly and easily as possible.
     
  • Are there a relatively small number of well-known formats or types that a lot of your content happens to be in, for example do you have mostly images, documents, videos or email?  Targeting these formats with automation could give you the most ‘bang for your buck’. The more common formats are often the ones that already have good tool support and can be automatically processed more easily, for example metadata extraction or file format conversion.
     
  • What is the bare minimum that users of your content actually need in order to find, access, use and re-use the content you have?  If the focus is on fulfilling this first, then maybe you can make more content accessible sooner, which in turn means more uptake and use, and a stronger business case for continued funding. Sometimes the desire we have to do the best possible job can result in the ‘great becoming the enemy of the good’, with users suffering because the content they want and need is unavailable or has even become lost.
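The first bullet above — safeguarding files and identifying their formats — can be sketched in a few lines. This is an illustrative minimal-effort ingest pass using only the Python standard library; the magic-byte table is a toy stand-in for a proper format registry tool such as DROID or Siegfried, and the function names are my own.

```python
import hashlib
from pathlib import Path

# A few common magic-byte signatures. A real workflow would use a
# format identification tool (DROID, Siegfried) backed by PRONOM.
SIGNATURES = {
    b"%PDF": "application/pdf",
    b"\x89PNG": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"PK\x03\x04": "application/zip",
}

def identify_format(path: Path) -> str:
    """Guess a file's format from its leading bytes."""
    head = path.read_bytes()[:8]
    for magic, mime in SIGNATURES.items():
        if head.startswith(magic):
            return mime
    return "unknown"

def sha256_checksum(path: Path) -> str:
    """Fixity value, used later to detect corruption or loss."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def minimal_ingest(folder: Path) -> list[dict]:
    """Record format and checksum for every file -- nothing more."""
    return [
        {"file": str(p), "format": identify_format(p),
         "sha256": sha256_checksum(p)}
        for p in sorted(folder.rglob("*")) if p.is_file()
    ]
```

The point is how little this does: no validation, no conversion, no descriptive metadata — just enough to know what you have and to detect loss later.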


Maturity models, for example the DPC RAM, help you understand what activities are needed to ensure the basics are in place and then what to target next.

Automating the lower levels of maturity first makes sense and helps staff focus their time on the harder parts where manual activities are often still necessary.

For example, when automating bit preservation, start off simple by automatically making more than one copy of your data and storing them in different geographic locations. Only then move to higher levels of automation such as periodic fixity monitoring and repair.
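A minimal sketch of that bit-preservation pattern, with local directories standing in for the geographically separate locations (the function names and layout are illustrative, not any particular product's API):

```python
import hashlib
import shutil
from pathlib import Path

def _sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate(source: Path, replicas: list[Path]) -> str:
    """Step 1: copy one file to every replica location; return its fixity value."""
    checksum = _sha256(source)
    for target_dir in replicas:
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target_dir / source.name)
    return checksum

def verify_and_repair(name: str, checksum: str, replicas: list[Path]) -> list[str]:
    """Step 2 (periodic): re-hash each copy and repair any bad copy
    from a copy that still matches the recorded checksum."""
    copies = [d / name for d in replicas]
    good = [p for p in copies if p.exists() and _sha256(p) == checksum]
    repaired = []
    if good:
        for p in copies:
            if p not in good:
                shutil.copy2(good[0], p)
                repaired.append(str(p))
    return repaired
```

Scheduling `verify_and_repair` on a timer (cron, a job queue) is what turns simple replication into monitored bit preservation.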

A lot of time can be spent up-front on quality assessment and quality control; these are the labor-intensive and costly functions. An example is checking that all the metadata is present and correct before preserving or storing the files that the metadata describes. Instead, do the basics upfront, like getting content into safe storage and storing whatever metadata you have or can easily extract, without necessarily doing extensive quality assurance and correction at that time.

Sometimes even the basic steps of minimal effort ingest can vary between content types. Thankfully, the bulk of the content that needs minimal effort ingest will often be in a small number of file types. It makes sense to target common file types first as part of automation and then deal with the more obscure or complex file types later. For the more common file types, there are likely to be better tools available, more knowledge in the community on how to deal with them, and more experience of automating their processing and dealing with problems.

If your organization has a wide range of media and file types to deal with or take in, then when it comes to preservation it may be beneficial to separate them into streams based on their content type. You can then implement minimal effort ingest, or minimal viable preservation, for each stream using just the steps and tools that are appropriate for that content type, for example extracting metadata or converting file formats.
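That routing step can be as simple as a lookup table. The stream names and extension lists below are hypothetical examples, and extension-based routing is only a first cut — format identification (as in the earlier ingest sketch) is more reliable:

```python
from pathlib import Path

# Hypothetical per-stream routing table: each content type gets only
# the ingest steps and tools that make sense for it.
STREAMS = {
    "image": [".jpg", ".png", ".tif"],
    "document": [".pdf", ".docx", ".txt"],
    "audio_video": [".wav", ".mp4", ".mov"],
}

def route_to_stream(path: Path) -> str:
    """Pick the preservation stream for a file based on its extension."""
    suffix = path.suffix.lower()
    for stream, extensions in STREAMS.items():
        if suffix in extensions:
            return stream
    return "other"  # obscure/complex types, handled manually later
```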

For automation, certainly start with the basics of bitstream preservation, then deal with metadata and characterizing the content you've got. The question then becomes: can you go further with automation?

Can you convert between file formats in an automated and trustworthy way? Creating long-term preservation versions or short-term access versions is called file format migration or file format normalization. The challenge isn't necessarily in automating the conversion itself, but in knowing that the conversions have worked properly.

Normalization or file format conversion should be done with care, and a lot of testing and validation may be needed afterwards before it can be trusted. The temptation can be to use technical validation tools, e.g. conformance checks of a file against a file format specification such as PDF/A, but that can result in a lot of time and effort going into understanding whether non-compliance issues are really anything major to worry about.

It is better to test whether the output of a file format conversion will “open,” “play” or “render” in common software applications. If it does, then chances are you are fine, even if it doesn’t comply with the file format spec down to the last detail. Spot checks using some quick “eyeballing” then allow automation of file format conversion to be applied with at least some confidence that the results will be “good enough” in the majority of cases. This is very much an engineering approach to long-term preservation and data management, and problems will sometimes slip through the net, but the end result can be that your digital content, as a whole, will be in a better position sooner and at less cost than if you try to validate and check absolutely everything by hand.
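A sketch of that “does it open?” spot-check idea. Here the only real check is the standard library's `zipfile` (which also covers OOXML documents such as `.docx`, since they are zip containers); for images or PDFs you would plug in a renderer such as Pillow or a PDF library. The function names and sampling scheme are illustrative:

```python
import zipfile
from pathlib import Path

def opens_as_zip(path: Path) -> bool:
    """Render check for zip-based formats: can we open it and read it?"""
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None  # None means no corrupt members
    except (zipfile.BadZipFile, OSError):
        return False

# One "does it open?" check per format family.
RENDER_CHECKS = {
    ".zip": opens_as_zip,
    ".docx": opens_as_zip,  # OOXML documents are zip containers
}

def spot_check(paths: list[Path], sample_every: int = 10) -> list[Path]:
    """Run the render check on every Nth converted file; return the
    failures that need human eyeballing."""
    failures = []
    for p in paths[::sample_every]:
        check = RENDER_CHECKS.get(p.suffix.lower())
        if check and not check(p):
            failures.append(p)
    return failures
```

Only the files that fail the automated check come back to a person, which is the efficiency gain the article describes.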

But even before you do that, think critically about whether format conversion is even necessary in the first place. Technical obsolescence of common file formats (images, video, audio, docs etc.) isn’t always as bad as we fear it might be!

Effective automation is fully supported by Perpetua.

As part of that long-term data lifecycle management, Perpetua supports the whole end-to-end process: taking content from a wide range of sources, such as collection management systems, asset management systems and storage services; putting it through checks and validation, metadata extraction, retention management and core digital preservation workflows; indexing it and making it searchable and viewable; and exporting and publishing it to a community of people who want to consume that content in different contexts.

All this can be automated, and this is exactly what Perpetua does. Perpetua automatically geo-replicates digital content into different locations, and that can be private cloud or public cloud, for example, AWS or Azure. The software makes sure there are multiple copies in different locations, with regular fixity checks.

Perpetua automates file format identification and characterization. There's then the option to create additional preservation and access versions, and those versions are stored alongside the original files in packages: Archival Information Packages (AIPs) for long-term archiving and Dissemination Information Packages (DIPs) for access and delivery. Perpetua automates that whole process, including file format conversion for 300 different formats.
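Perpetua's packaging is its own, but the AIP concept it implements comes from the OAIS reference model and can be illustrated with a deliberately simplified sketch — a hypothetical directory layout, not Perpetua's actual format:

```python
import hashlib
import json
import shutil
from pathlib import Path

def build_aip(original: Path, metadata: dict, aip_root: Path) -> Path:
    """Build a minimal Archival Information Package: the original file,
    its metadata, and a checksum manifest, laid out in one directory."""
    aip = aip_root / f"{original.stem}_aip"
    (aip / "data").mkdir(parents=True, exist_ok=True)
    stored = aip / "data" / original.name
    shutil.copy2(original, stored)
    digest = hashlib.sha256(stored.read_bytes()).hexdigest()
    (aip / "metadata.json").write_text(json.dumps(metadata, indent=2))
    (aip / "manifest-sha256.txt").write_text(f"{digest}  data/{original.name}\n")
    return aip
```

Real systems add much more (provenance, preservation metadata, multiple representations), and the manifest layout here loosely echoes the BagIt convention, but the core idea — content plus metadata plus fixity, bundled as one self-describing unit — is what makes packages portable over decades.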

Perpetua’s modular solutions perform different functions as part of the overall workflow and can be embedded in any environment. Perpetua will automatically extract metadata, both technical and descriptive. The result is then surfaced so that it can be easily navigated, searched and viewed, with the ability to integrate and allow content to be discovered through external systems like EBSCO Discovery Service™. Automating data management and preservation with Perpetua allows you to lower costs, get more done, and focus staff on more important tasks.

See how Perpetua can help users preserve research data, heritage collections and administration records.

Learn More

Dr Matthew Addis
CTO and Founder, Arkivum

Matthew is responsible for technical strategy. He previously worked at the University of Southampton IT Innovation Centre. Over the last fifteen years, Matthew has worked with a wide range of organizations in the UK, Europe and US on solving the challenges of long-term data retention and access. His expertise includes digital preservation strategies, system architectures, total cost of ownership, how to mitigate the risk of loss of critical data assets and building business cases for both compliance and asset value scenarios.
