Saving for the Future: Digital Preservation, Libraries and Publishers

Technology is an amazing thing. That was the basic, rather mundane thought I had on the last day of the Pratt E-Publishing course in London. For two weeks, our class learned about the remarkable ways libraries, archives and publishers are putting technology to work. The British Library crowdsources sounds from users around the UK for its Sound Maps. ProQuest, Bodleian Libraries and the Royal Archives digitized all of Queen Victoria’s journals–even her doodles–and made them free to users in the UK. And just about every publisher we met seems to be putting some really good resources behind Open Access journals. It’s incredibly exciting to be entering the library science world at this moment as we test the limits of what technology can give us.

But as I looked at each new online project, I wondered, “Will I still be able to look at this information in five or ten years?” The world continues to fill up with information, and more and more of that information is born digital. As much as we’d like to believe that the internet is forever, we cannot guarantee it. Tech firms go out of business, Google shuts down services with little warning, and a website update erases all the previous versions. While many libraries have digitization projects, not every institution with digital assets has a plan in place to archive them. Without digital preservation procedures, libraries gamble with the long-term viability of the digital projects they worked so hard to put in place.

High Risk of Loss

After the internet and digital technology rose to prominence, the term “digital dark age” was coined. Librarians, archivists, and scientists realized that digital information may be irretrievably lost if it is housed in obsolete file formats, leading to a dark age of modern history where the historical digital record is missing. As technology marches forward, older software, hardware, and file formats become outmoded and inoperable. Computers made today no longer have floppy disk drives. What happens to the boxes of floppy disks that form part of an institution’s archive?

With the passage of time, there is always the risk of information being lost. Until recently, this had only been an issue for archivists and librarians to manage in print media. An archive may have books written in forgotten languages, while a book printed on acidic paper will deteriorate over time. Archivists are familiar with these problems and have developed protocols on how to deal with old languages or fragile materials. But with digital information, the timeline for obsolescence is accelerated and many of the procedures are still in flux. An old manuscript will still exist in a readable form for hundreds of years if it’s placed in a temperature-controlled room. The same cannot be said of a hard drive. It may become unreadable in under a decade, perhaps because the physical components no longer work or because the current technology is unable to parse the older file formats. Because technology changes so quickly, archivists may be unsure what the best digital preservation protocols are. A few years ago, burning information onto DVDs or CDs was considered good practice, but today we know that the lifetime of writable discs is comparatively short and this practice has fallen out of favor.

Another challenge is the huge volume of digital information that exists, including personal data, institutional data, and data on the web. Technology has made it faster, cheaper and easier to create content: emails, videos, datasets, text documents, websites, you name it. Managing all that information is very difficult and expensive. Archives need to decide what information to preserve, since it is impossible to capture it all. How do they determine those selection criteria? When it comes to archiving the web, certain types of websites or content are impossible to archive. Flash, JavaScript, streaming content and the deep web cannot be captured by web crawlers. And once a library has an archiving mechanism in place, it has to find a way to make that archived information usable and searchable. The Library of Congress has billions of tweets archived, but no efficient way to make them accessible to outside researchers.

Underrepresented Groups

Members of underrepresented or marginalized groups are using the Internet to come together through personal blogs, message boards and social media. These groups may already be ignored or outside the scope of most traditional archives, which puts their digital objects at an even greater risk of loss. Lyndon Ormond-Parker and Robyn Sloggett write about the importance of archives in constructing the identity of a community, observing that members of that community make the best stewards.[1] More and more interactions and content are created solely online, and large portions of these communities’ histories may be entirely lost without the right training and resources for digital preservation. The Lesbian Herstory Archives, for example, does not have a digital acquisitions program and does not archive any lesbian websites or prominent social media accounts. In a few years, it will be missing a large chunk of lesbian history from its collections.

Individuals may not consider archiving their personal sites to be an important or even necessary task. But the internet is fragile and prone to decay. There is no guarantee that a website we use every day for a month, three months, or even years will last in its entirety. Likewise, our own computers have a relatively short lifespan compared to physical media. Digital creation without digital preservation is like putting all your books and papers outside and hoping it doesn’t rain in the next ten years.

Preservation Strategies

LOIS: Take my computer?! Everything I’ve ever done or thought about doing is on that computer! All my notes, my contacts, …my novel.

CLARK: Don’t you backup onto floppy disks?

LOIS: Clark, this is no time to discuss your compulsive behavior.

– “Strange Visitor,” Lois & Clark: The New Adventures of Superman

We do so much online and on our computers, but few of us have given really critical thought to archiving our digital lives until the worst happens: a crashed hard drive, a failed backup, or federal agents confiscating our computers in the hunt to find Superman. Libraries and archives face the same problems when it comes to their digital assets, especially regarding the numerous digital objects that are created during digitization projects.

At the Strand Symposium on E-Publishing, Paola Marchionni of JISC discussed the sustainability of funded digitization projects. What happens to digitized items once the funding ends? What if a relationship with a vendor goes sour? Only a third of JISC’s survey respondents had an ongoing budget for digitization, and nearly half used outside vendors for their projects. In one case, the archival group and the web developer had a falling out, and the developer refused to turn over the website files. The archive had to scrape the website off the open web to get as good a copy as it could. After a few years, the website was outdated and difficult to find, rendering the entire digitization effort practically useless. To prevent this, both funders and recipients must commit to maintaining technological infrastructure and ensuring a project’s discoverability long after the grant ends. Libraries and archives should research additional revenue streams and incorporate them into their long-term plans. Furthermore, the project should not remain stagnant; it is not enough to just put up a website and leave it at that. Libraries and archives ought to continue to evaluate their user community. Web analytics and social media are valuable tools to track and enhance discoverability. An online digital project cannot exist in a vacuum. For it to succeed, we must understand how it fits into the larger internet community from the very beginning of the project. Maybe that means linking up with Wikipedia, or finding the librarian community on Twitter. Ensuring ongoing relevance and access is a key part of digital preservation.

Some libraries and archives will decide to go beyond archiving their own digital assets and archive the open web or accept digital donations. During our E-Publishing class site visit, the British Library presented two digital preservation initiatives: Personal Archives and Web Archiving. Personal archiving involves the acquisition of a donor’s digital items, like hard drives, and uses bit-by-bit capture to preserve the data. The Library also uses emulation tools to recreate the original computer environment for user access. The UK Web Archive is a collaborative effort between the British Library, JISC, the Wellcome Library, and the National Library of Wales. The partners automatically crawl the UK web domain and archive sites of national digital heritage, usually grouped around themes such as the London Olympics. This is similar to the web archiving program at the Library of Congress, which also constructs its web archive around national themes. The internet holds a huge amount of significant and fragile information, and many librarians feel it is within the scope of their preservation mandate. Although web archiving seems like a high-tech skill area, libraries and archives should not be scared off. The International Internet Preservation Consortium has a wealth of resources on web archiving, including discussions of legal issues, case studies and other scholarly articles. Many of the tools used by the Library of Congress and the British Library are open source software programs linked on the Consortium’s site.

However, digital preservation and web archiving are no small tasks, and lie outside the archival scope of many institutions. James Careless interviewed three tech librarians at different-sized libraries on their preservation programs.[2] One, the Library of Congress, has a comprehensive digital preservation initiative in place and holds terabytes of archived data. Another, the Las Vegas-Clark County Library District, has a small collection of videos and podcasts. The larger institutions are using cloud-based storage, experimenting with web services like Flickr, and launching digital repositories. However, the county library is still using CDs for some digital storage and has no plan to archive the web. What about smaller institutions with fewer in-house resources?

One solution is the consortium depository model. The consortium has proven to be a successful way for cultural institutions to support each other and develop digital preservation skills in-house. The MetaArchive Cooperative and CLOCKSS are two effective cooperative schemes for digital preservation, but numerous others exist. The cooperative approach distributes the cost burdens, creates redundant backups stored across institutions, and trains staff members across the consortium on digital skills. Different institutions contribute at an appropriate cost level, but each member benefits from the shared LOCKSS network and technological training. Working together, multiple institutions can create a robust technological infrastructure and manage their own digital collections without worrying about the reliability of a third-party provider.
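To make the value of redundant copies concrete, here is a minimal sketch in Python. It compares checksums of the same file held by several member institutions and flags any copy that disagrees with the majority. This is an illustration under stated assumptions, not a description of how the LOCKSS software, the MetaArchive Cooperative, or CLOCKSS actually audit their content; the institution names and the `find_damaged_copies` helper are hypothetical.

```python
import hashlib
from collections import Counter


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of one copy's bytes."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies: dict[str, bytes]) -> list[str]:
    """Return the holders whose copy differs from the majority checksum.

    `copies` maps a (hypothetical) institution name to the bytes it holds.
    """
    if not copies:
        return []
    digests = {holder: sha256_of(data) for holder, data in copies.items()}
    # Assumption: the most common digest represents the intact file.
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(h for h, d in digests.items() if d != majority_digest)


if __name__ == "__main__":
    copies = {
        "library_a": b"annual report 1998",
        "library_b": b"annual report 1998",
        "library_c": b"annual rep0rt 1998",  # simulated corruption
    }
    print(find_damaged_copies(copies))
```

With only one copy there is nothing to compare against, which is one reason spreading copies across several institutions is attractive: a damaged copy can be noticed and repaired from the others.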

Another guest speaker in the E-Publishing program addressed the preservation problem faced by the scientific community. Dr. Graham Parton is a data scientist at the NCAS British Atmospheric Data Centre. He talked about the challenges of preserving digital scientific data. Rather than relying on the cloud or publishers, he recommended the repository model for preserving data. The cloud has issues with persistence and reliability, and data stored there is more difficult to discover. A publisher may archive a scientific journal or an individual article, but it may not choose to archive all the supplementary datasets. A digital repository, however, has a mandate to make data available, to communicate with the user community and to future-proof data through digital preservation. One tool that the Data Centre uses is data profiling. Archivists ask researchers to detail where their information is, how it is stored and accessed, and whether there are any restrictions on use. This helps the Data Centre know what data is out there and how to properly archive it. It is also useful for researchers because it gets them thinking about the longevity of their data, and they may then take steps to ensure it is archived. Data profiling is a valuable method for any digital repository, but it is particularly useful when the repository receives many different types of digital information.
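As a rough sketch of what a data profile might look like in practice, the Python record below captures the questions described above: where the data is, how it is stored and accessed, and any restrictions on use. The field names, the `DataProfile` class, and the sample values are hypothetical; this is not the Data Centre’s actual form or schema.

```python
from dataclasses import dataclass, field


@dataclass
class DataProfile:
    """One researcher's answers to the profiling questions (hypothetical schema)."""

    dataset_name: str
    location: str        # where the data currently lives
    storage_format: str  # how it is stored, e.g. a file format
    access_method: str   # how it is accessed
    use_restrictions: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """List the required answers the researcher left blank."""
        required = {
            "location": self.location,
            "storage_format": self.storage_format,
            "access_method": self.access_method,
        }
        return [name for name, value in required.items() if not value.strip()]


if __name__ == "__main__":
    profile = DataProfile(
        dataset_name="Example field campaign readings",  # invented sample value
        location="departmental file server",
        storage_format="",  # left blank by the researcher
        access_method="shared network drive",
        use_restrictions=["embargoed until publication"],
    )
    print(profile.missing_fields())
```

Even a simple check like `missing_fields` gives an archivist a prompt for a follow-up conversation before the data arrives at the repository.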

Digital archives also have an important role in big data projects. Preserving data sets over many years gives researchers a chance to look at big trends without going through the expense of recreating historical experiments or timeframes. The Oxford Internet Institute is currently working with the UK Web Domain Dataset (1996-2010) to uncover social, political, and government trends over a 14-year period. This research is only possible with ongoing web archiving projects at cultural and scientific institutions.

It is unclear whether entities without preservation mandates, such as publishers, have long-term preservation programs in place. If an online product has enhanced research features like note-taking capabilities, does the publisher have a way to archive user accounts? Will those research notes only last as long as the publisher thinks the product is viable, or is it an even shorter time period? As publishers create and sponsor more digital products for libraries and archives, libraries risk losing valuable digital items if the publisher removes its support.

Publishers and Libraries

A delicate relationship exists between publishers and cultural institutions–just ask any librarian in charge of subscription journals. Both publishers and librarians are still navigating the transition from a print to a print-and-digital world. Take e-books, for example. Some publishers have tried to impose limits on their e-book technology in an effort to make it fit a similar business model to print books (like capping the number of users per e-book at any given time), which continues to be a point of disagreement between libraries and publishers. In this kind of contentious environment, cultural institutions may wonder if for-profit companies are really on their side. But the kinds of resources publishers can put into digital cultural projects also make them good partners.

At the Strand Symposium, Chris Cotton described ProQuest’s public-private partnerships with national UK archives. One recent project was a partnership with the Royal Archives and Bodleian Libraries to digitize Queen Victoria’s journals. The Royal Archives and Bodleian had the archival documents, but by working with ProQuest they were able to create a richer, more expansive digitization project than they would have on their own. As Cotton explained, there are many benefits for a library that partners with a commercial organization. The partnership absorbs the upfront costs by drawing on a commercial firm’s existing digitization infrastructure, without the library having to make an additional investment. Many libraries do not have the equipment or staff to conduct digitization in-house. An archive may be able to take risks or take on bigger projects if it partners with a publisher. A commercial firm will also have expertise in designing user interfaces, negotiating licensing deals, and conducting market research. The end product of ProQuest’s public-private partnership is extremely high-quality and completely free to any user in the UK. Users outside the UK are charged a subscription, which allows the content to support itself and its ongoing maintenance. ProQuest also used its specialized indexing models to augment content and take full advantage of the digital technology tools available. However, not everyone thinks that using a commercial provider is wise, especially if that firm is not a partner and only a service provider.

Tyler O. Walters and Katherine Skinner caution against using vendors for digital preservation: outsourcing the work means outsourcing the development of skills.[3] Relying too heavily on third-party vendors also undermines the fundamental mission of cultural institutions, which is to preserve culture and make it accessible. It takes power and responsibility away from the librarians and archivists and places it in the hands of for-profit companies. Walters and Skinner advocate for the cooperative model, like the MetaArchive, to offset the costs of digital preservation while keeping it in the hands of libraries. I agree with many of Walters and Skinner’s arguments–especially the importance of institutional support for developing technology skills among librarians and archivists–but it seems like there may be some middle ground. The public-private partnerships described by ProQuest could be a viable method for large-scale projects, especially those of national importance. If the publisher takes on the initial costs and sets up an ongoing revenue stream, like subscription fees or ad space, then the library or archive will be able to take on the ongoing preservation authority. In addition, libraries should request permanent archival access to publishers’ products. If a library chooses to end a subscription or if a publisher discontinues a product, access to an archived version must still be available.

Social Media and Internet Services

Publishers aren’t the only for-profit companies that libraries use to create and house digital assets. Many use social media sites to create and promote content, but cannot wholly rely on the service provider to properly archive that content. Some sites, like Facebook, have built-in archive options (although based on personal anecdotes, their archive feature does not capture everything). Others, like Tumblr, have no native archive features. For libraries, the safest course of action would be to use the service provider’s archival methods as a last resort, and independently archive any content that is pushed out into social media channels.

Conclusion

Digital preservation is active preservation. Preservation is access. As more and more digital content is created, we must consider what to preserve and how. Even the smallest archive should take steps to preserve its digital assets, perhaps by joining a consortium. Libraries and archives that use online products from publishers or create content on social media sites should ask who is taking on the burden of preserving that content. We cannot guarantee that a for-profit company will hold itself to the same long-term standards as a digital repository, so libraries and archives must do what they do best: preserve for the future. The impact of digital technology is enormous, and we are still figuring out just how far that impact reaches. If we hope to understand our history and prevent a digital dark age, the time for preservation is now.

 

[1] Ormond-Parker, L., & Sloggett, R. (2012). Local archives and community collecting in the digital age. Archival Science, 12(2), 191-212.

[2] Careless, J. (2013). Archiving web content. Online Searcher, 37(2), 44-46.

[3] Walters, T. O., & Skinner, K. (2010). Economics, sustainability, and the cooperative model in digital preservation. Library Hi Tech, 28(2), 259-272.