Technical Overview


Digital Library Cloud Services focus group - Technical briefing note

Introduction

Two decades of digitisation programmes have produced a wealth of material. Much of this activity has happened in silos, with few common standards for putting these rich resources to use beyond the standalone web sites in which they are first presented. The range of delivery technology is huge, from simple static images to deep zoom canvases annotated with overlays. Much ingenuity and development effort goes into reproducing the same patterns in different ways. The British Library alone has 29 different image or image sequence viewers in use on its various web sites. Each digitisation project brings its own viewer, and its own image request and metadata formats.

This problem was recognised and is being addressed by the International Image Interoperability Framework (IIIF), which defines APIs for images and image sequences that clients and servers can comply with. If a digitisation project provides endpoints that serve IIIF, it can use a growing number of server and viewer frameworks (including the Wellcome Player) to share it with the world. Consumers can point their own IIIF-compatible viewing tools at any IIIF endpoint and consume the image resources. For digital library scenarios this typically means image tiles for deep zoom, and larger images at arbitrary crops.
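
For orientation, the IIIF Image API encodes the region, size, rotation, quality and format of the requested image directly in the URL. A minimal sketch follows, with a hypothetical endpoint and identifier; the path segments themselves follow the Image API specification:

    # Sketch: composing IIIF Image API 2.0 request URLs.
    # BASE and IDENTIFIER are hypothetical; the segments
    # {region}/{size}/{rotation}/{quality}.{format} follow the spec.
    BASE = "https://images.example.org/iiif"
    IDENTIFIER = "b12345678_0001"

    def iiif_url(region="full", size="full", rotation="0",
                 quality="default", fmt="jpg"):
        """Compose a IIIF Image API request URL."""
        return "%s/%s/%s/%s/%s/%s.%s" % (
            BASE, IDENTIFIER, region, size, rotation, quality, fmt)

    # A deep zoom tile: a 256x256 region of the full-resolution image.
    tile = iiif_url(region="1024,2048,256,256", size="256,")

    # An arbitrary crop, scaled to 800 pixels wide.
    crop = iiif_url(region="500,500,2000,1000", size="800,")

    # Technical metadata (dimensions, tile sizes, supported features).
    info = "%s/%s/info.json" % (BASE, IDENTIFIER)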

Digital libraries and other cultural heritage institutions are still faced with the problem of providing these services. Even if IIIF is adopted as the standard for images and their presentation metadata, a small collection still needs its own implementation silo to provide the content. It will need to master and deploy image server technology even if it uses an off-the-shelf viewer. There are also other features that might be desirable but would prove expensive to develop and difficult to maintain for organisations without their own development resources, such as search and annotation support.

In developing the infrastructure to deliver the Wellcome Library’s digitised material we have identified services that could be used by other institutions too. IIIF now gives us a standard for these commodity services to follow. The DLCS is a proposed platform that provides these commodity services for institutions to use as they need, to reduce the development and IT resources for projects involving digitised material. It is envisaged as a cloud platform, for example running on Amazon Web Services. However there may be components of it that some users would wish to run locally. We want to explore possible architecture across a wide range of use cases.

The proposed services

By far the biggest use of the services, by volume of requests, will be image tiles generated on the fly from high resolution source images for display in “deep zoom” viewing applications. The job of the DLCS platform would be to ensure that performance is maintained under extremely variable usage scenarios. Image endpoints are one of the core services:

  • Image serving using IIIF image API
  • Image sequence manifests served using the IIIF Presentation API (a minimal manifest sketch follows this list)
  • APIs and tools to manage the ingest/upload of master images and the storage and management of IIIF manifests
  • APIs to allow institution-specific access control, licensing and permitted operations, which the platform can enforce
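
To make the manifest service concrete, here is the skeleton of a IIIF Presentation API 2.0 manifest for a single-page work, sketched as a Python dict; all URIs are hypothetical placeholders:

    # Skeleton of a IIIF Presentation API 2.0 manifest for a one-page
    # work, expressed as a Python dict. All URIs are hypothetical.
    manifest = {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": "https://dlcs.example.org/iiif/b12345678/manifest",
        "@type": "sc:Manifest",
        "label": "An example digitised work",
        "sequences": [{
            "@type": "sc:Sequence",
            "canvases": [{
                "@id": "https://dlcs.example.org/iiif/b12345678/canvas/c0",
                "@type": "sc:Canvas",
                "label": "page 1",
                "height": 3508,
                "width": 2480,
                "images": [{
                    "@type": "oa:Annotation",
                    "motivation": "sc:painting",
                    "on": "https://dlcs.example.org/iiif/b12345678/canvas/c0",
                    "resource": {
                        "@id": "https://images.example.org/iiif/"
                               "b12345678_0001/full/full/0/default.jpg",
                        "@type": "dctypes:Image",
                        "service": {
                            "@context": "http://iiif.io/api/image/2/context.json",
                            "@id": "https://images.example.org/iiif/b12345678_0001",
                            "profile": "http://iiif.io/api/image/2/level1.json"
                        }
                    }
                }]
            }]
        }]
    }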

An institution like the Wellcome Library could make use of these services, incorporating them into their workflow via platform APIs. A smaller institution might then need:

  • Manifest creation and editing with user-friendly tools (to use alongside or instead of the API). The Bodleian Libraries are currently developing tools along these lines.

So far these services are concerned with hosting IIIF Image API endpoints. The platform can also offer:

  • Annotation storage and delivery engine with APIs for creating and retrieving annotations. The annotations would be on whole images or image regions, with various models: sometimes the annotations would be curated, sometimes user generated and moderated, and sometimes user generated for that user’s own personal consumption. The Open Annotation framework provides the standard for this (a minimal annotation sketch follows this list).
  • API retrieval of annotations, as “framed” JSON-LD for consumption alongside IIIF, or as other RDF representations
  • Possibly a SPARQL endpoint for image and annotation data
  • PDF generation from IIIF manifests (similar to the current PDF download provided by the Wellcome Library, consisting of a cover sheet followed by one image per page)
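
As an illustration of the annotation model, here is a minimal Open Annotation in the form used by IIIF annotation lists, attaching a short comment to a rectangular region of a canvas. The URIs are hypothetical; the target region is addressed with an xywh media fragment:

    # A minimal Open Annotation on an image region, as JSON-LD in a
    # Python dict. URIs are hypothetical; the target region is given
    # as a media fragment (#xywh=x,y,w,h) on the canvas.
    annotation = {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": "https://dlcs.example.org/annotations/anno-1",
        "@type": "oa:Annotation",
        "motivation": "oa:commenting",
        "resource": {
            "@type": "cnt:ContentAsText",
            "format": "text/plain",
            "chars": "A marginal note in a later hand."
        },
        "on": "https://dlcs.example.org/iiif/b12345678/canvas/c0"
              "#xywh=100,200,400,150"
    }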

These services benefit from the platform scale in two ways. Firstly, they enable features that would simply be too expensive to implement in all but the largest projects. Secondly, they (potentially) allow certain on-demand operations to be spread across much more hardware, as long as those operations are the kind that can be divided, distributed across multiple nodes and recombined at the end. Generation of a PDF from multiple source images is probably an example of the latter.

After annotations and PDF representations, there is a set of related services that could be developed using METS-ALTO files as raw material. These would be of particular interest to institutions digitising printed books and archives, rather than (say) artworks or STEM imagery. Some organisations will have printed material amongst their artworks.

Service users could supply their own ALTO, or the DLCS could generate it by performing OCR on the images uploaded. This could be a very significant service for a small collection, removing several technical obstacles in one go. Either way, the raw ALTO is ingested into the DLCS platform’s ALTO server, which offers the following services at an API level:

  • Full-text search within a work, with contextual snippets (an IIIF version of http://bit.ly/1NwZjaf, which is powered by http://bit.ly/1wghJ9u)
  • Autocomplete service for the words within a work (like http://bit.ly/1E3LECa)
  • “Search across” where a text search on digitised content can be scoped to particular sets of works, or all of one institution’s works, or all of the DLCS’s known works (this last scope would be opt-in).
  • Publication of the region information in ALTO files (e.g., block-identified newspaper articles, or other structural elements) as RDF, so that this information can be used in linked data applications
  • Enhanced annotation based on this ALTO block level information (using an ALTO identified block as the target of an annotation). The DLCS would automatically annotate canvases with the ALTO-identified block information, and this would be available with the rest of the RDF. The automatic annotations can be refined, and more information added. Similarly the selection of words, lines and other regions can be aided by the identification of those elements in ALTO; annotations on fragments or runs of text are then aligned to ALTO-identified elements for consistency
  • An HTML transformation of an ALTO file - render ALTO as HTML that mimics the original image appearance. This HTML could be used as a layer in a viewing application for selection and annotation purposes (it would appear as if you were selecting the printed text in the image). It could also be used in many other creative ways (a sketch follows this list).
  • Enhanced PDF generation that uses a similar invisible text layer, so that generated PDFs contain the full text as an overlay and are selectable and searchable
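
A minimal sketch of the HTML transformation mentioned above, assuming ALTO v2 with the usual positional attributes on each String element; a real implementation would also handle lines, blocks, styles and scaling:

    # Sketch: render the words in an ALTO v2 file as invisible,
    # absolutely positioned HTML spans mimicking the page layout.
    # Assumes HPOS/VPOS/WIDTH/HEIGHT are present and pixel-valued.
    import xml.etree.ElementTree as ET
    from html import escape

    NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

    def alto_to_html(path):
        """Emit each ALTO <String> as a transparent positioned span."""
        root = ET.parse(path).getroot()
        spans = []
        for s in root.iterfind(".//alto:String", NS):
            a = s.attrib
            spans.append(
                '<span style="position:absolute;left:%spx;top:%spx;'
                'width:%spx;height:%spx;color:transparent;">%s</span>'
                % (a["HPOS"], a["VPOS"], a["WIDTH"], a["HEIGHT"],
                   escape(a["CONTENT"]))
            )
        return '<div style="position:relative;">%s</div>' % "\n".join(spans)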

Some of these services need further development in IIIF-compatible viewers before they can be used to full advantage.

The ALTO server is also a standalone component that could be implemented locally by institutions that did not want to use the cloud services.

Future Services

Like many institutions the Wellcome Library holds material that cannot be digitised as image sequences – video, audio and born-digital material. The existence of the IIIF standard makes the DLCS platform viable for image resources. At the same time, the JSON-LD wrapper and use of Open Annotation are elements we’d like to adopt consistently for other media. This implies an IIIF superset – “IxIF” – that does not yet exist.

Work towards such a standard has already been mooted. Widespread adoption of IIIF by institutions that have similar needs with other media will help drive this. The platform could then offer:

  • Audio and Video equivalents of IIIF – IxIF – and hosting/transcoding of Audio and Video
  • Integrated Platform, sharing common features like annotation, across multiple media types
  • Preservation services for small collections

This last point is discussed in more detail below.

“Out of the box” reference implementations

The DLCS is not conceived of or offered as a packaged digital library management system, and we do not intend that institutions with significant development budgets and ambitions would use it that way. They would use various services from the list above as they see fit, just as web application developers use various AWS services as core components.

Having said that, we think we should offer a web application (or several) built on top of the DLCS that would allow anyone - including private individuals - to present a collection without having to develop their own website. The platform can host these applications.

These applications serve several purposes:

  • An out of the box discovery platform that can present anything from a single image, or a single work, to a large collection. This allows an individual, or a small collection, or a small project in a large institution to just get on and publish material without significant capital investment or IT involvement.
  • At the individual level it would be like the YouTube of digitisation, encouraging home grown archival projects - say a bundle of WWI letters found in the attic, or something a member of the public thinks worthy of digital preservation, or a school project
  • On a larger scale it would allow a small institution or project within a large institution to present digitised material around a specific theme, invite annotations on the content (crowd-sourcing) and share the results in a branded web site, without significant development effort. They still need to digitise the material, and upload it onto the platform.

The provided discovery platform will be built from the same APIs that are offered for bespoke development. The discovery application(s) can be studied by the developers in larger institutions as an exemplar implementation and adapted to their own purposes.

These reference delivery applications act as drivers of the DLCS platform’s APIs, ensuring they deliver for the expected use cases.

Many institutions (even of the Wellcome Library’s size) have limited access to development resources. The complexity of the servers and frameworks required puts discovery and engagement projects using digitised material out of reach, because of the initial hurdle of building an equivalent Digital Delivery System. The DLCS clears this hurdle for them.

A major library digitisation programme can choose to use a selection of the DLCS services while building a bespoke discovery layer, and spend the bulk of the development budget on unique features rather than the underlying ‘plumbing’. A small collection without much budget, or even an individual, could use the services on offer to host a digital collection or even a single document without any development costs at all if they want to use the default discovery platform.

There are various other Digital Library platform products, but they often introduce a silo of their own, or are designed for specific scenarios such as newspapers. The DLCS does not compete with these platforms. For a client like the Wellcome Library the services are like basic utilities around which the library’s own development unfolds. Image tiles, search results and other metadata may be coming from the Cloud Services platform, but visitors are looking at a custom web application: the Library’s own discovery layer, which isn’t even hosted on the DLCS.

In all these scenarios, someone still needs to produce the images in the first place. These may be professionally sourced, using expensive equipment. Or they might be produced at home on a consumer-level scanner. The Cloud Services address what happens after this point. In the simpler scenarios, a user just uploads the images and uses the web-based tools of the platform to describe and manage them. In more complex scenarios the institution would push metadata and images into the platform via rich APIs, as a part of its digitisation workflow.
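
As a sketch of what pushing content into the platform might look like - the endpoint, authentication scheme and field names below are entirely hypothetical, since the APIs are yet to be designed:

    # Hypothetical sketch of scripted ingest. None of these endpoints
    # or field names exist yet; they illustrate the kind of API envisaged.
    import requests

    API = "https://api.dlcs.example.org"        # hypothetical endpoint
    AUTH = {"Authorization": "Bearer <token>"}  # hypothetical auth scheme

    # Register an image, pointing at its master file and minimal metadata.
    resp = requests.post(
        API + "/images",
        headers=AUTH,
        json={
            "id": "b12345678_0001",
            "origin": "https://library.example.org/masters/b12345678_0001.jp2",
            "metadata": {"label": "page 1"},
        },
    )
    resp.raise_for_status()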

The DLCS as guarantor of digitised resources

There are many possible features and directions for the DLCS, and we want help in making them clear. However, the DLCS won’t get off the ground if it tries to reach all these destinations at once. The IIIF, annotation and ALTO-based services are more clearly defined initial components that are all directly usable by the Wellcome Library, and directly usable by others who are doing work with digitised materials but have no wish to disrupt their catalogue or discovery applications.

From the perspective of the cultural heritage community the DLCS might become a trusted platform for storing and serving their digitised artefacts. In such a model the DLCS could be funded as a trust of public sector partners who would be the custodians and guarantors of the platform. The interoperability of image, annotation and other resources across the platform is very attractive to such institutions.

We want to explore organisational or corporate models for developing the DLCS, and cooperate with other initiatives in this area. The long term storage and maintenance costs for a small digitised collection are low on this platform, especially if the platform is widely adopted. 10m JPEG 2000 files from a large library is the same “size” as 25,000 collections of 400 images each, although the usage patterns would be very different.

What the DLCS is definitely not

  • A general purpose Library Management System
  • A digitisation workflow management system - although such systems might hook into the DLCS to push digitised output onto the platform
  • A web hosting or content management platform for a large institution’s bespoke discovery layer - the DLCS is not a content management system, though we hope that plugins for common CMS platforms would emerge, and the out-of-the-box discovery layer would be an exemplar integration with a popular open source CMS.
  • A replacement for a Library or Museum catalogue - or other authoritative sources of metadata

What the DLCS is probably not

  • An archival service for permanent storage of digitised masters (a DAM or Digital Repository) (although it may end up being used as such for small collections)
  • A replacement for a large institution’s bespoke discovery layer - such a layer uses components of the DLCS
  • A general purpose hosted triple store for publishing arbitrary metadata about any catalogue item, digitised or not.

Questions to consider in preparation for the Focus Group session

  • Which services specified above could my organisation use? Correspondingly, which ones would not be applicable, and why?
  • What considerations are important in devising a corporate model to provide stewardship and governance to the DLCS?
  • Does my organisation fit into the scenarios in Appendix 1? Are there other scenarios we might have missed?
  • For the more technical attendees: are there any technical issues or blockers that we need to consider? Our biggest challenges are, we think:
    • Delivering high resolution source images to image servers at speed without incurring huge storage costs for random access file systems in the cloud (e.g., AWS EBS)
    • Access control for material that requires some form of login, or acceptance of terms. IIIF 2.1 will include a vocabulary to direct viewing applications to credentials providers, but there are some problematic UX concerns to address.

Appendix 1: Example users of the platform

Scenario A: I just have images, but I’m willing to provide ad-hoc description
  Examples: A school project, a small collection without a catalogue, one-off media use, an image gallery, a deep zoom image gallery.
  Platform needs: Tools to modify the image sequence - ordering, addition, deletion, description, provision of structural metadata. As a user I would upload images or image sets, move them around if the default order is wrong, verify or correct any OCR, annotate the images, etc. There is no integration with any catalogue, and I have to provide all the source images up front.

Scenario B: I have images and metadata (maybe in the form of METS files) and I can write code
  Examples: The artefacts from an existing digitisation programme.
  Platform needs: I would script the ingestion of images and metadata to the platform via its API. The platform would attempt OCR if I selected that option in the API settings.

Scenario C: I have images, METS and METS/ALTO
  Platform needs: As above, but the platform doesn’t attempt the OCR. I might have better quality ALTO or have identified some regions specially. The platform API needs to allow power users to influence how it interprets ALTO files. Ultimately the identification of regions is persisted as annotations in some form.

Scenario D: I have a huge number of images, digital library object metadata (METS), ALTO and catalogue records (MARC etc.)
  Examples: Wellcome Library.
  Platform needs: Likely to adopt an alternative scenario where I provide the service with a “strategy” to acquire the image if it doesn’t already have it (Tier 3, below), rather than bulk-load all the images in advance. Some libraries may choose to upload everything anyway, but this may not be the most economical use of the platform. I am responsible for writing the code that transforms metadata from various sources into metadata appropriate for an IIIF manifest.

Scenario E: I have images and a catalogue but nothing else

Scenario F: I have an archival hierarchy (fonds)
  Examples: Wellcome Library, many others.
  Platform needs: The platform needs to accept assertions about the relationships between separate image sequences (archival units within the hierarchy). Typical archival scenarios need to be supported.

Scenario G: I have access control requirements
  Examples: Wellcome Library, many others.
  Platform needs: The default position is that everything is open, but some institutions will need to enforce acceptance of terms and conditions. In these scenarios the platform needs to route requests through a pipeline that can delegate authentication back to the origin library.

Scenario H: I have DRM and a commercial imperative
  Examples: Wellcome Images?

Scenario I: I have a heritage project and want a web platform to facilitate it
  Platform needs: Display images, crowd-sourced identification of image locations, basic web site/discovery layer.

Scenario J: I have everything already but not full text search
  Examples: British Library?
  Platform needs: Use the ALTO server only and integrate calls to it into their viewer. IIIF 2.1 will include vocabulary for describing available search services.

Appendix 2: Some Architectural Questions

The image servers need their source files available locally on the platform, on a fast random access file system, if they are to generate tiles at speed. When IIPImage extracts a tile from a JPEG 2000 file via the Kakadu library, it does not need the whole file, just the header information (which it could cache for the next tile) and the portion of the file required to generate the particular tile.

If the DLCS platform were running on Amazon Web Services, the fast random access file storage would be SSD-based EBS storage. On Microsoft Azure this would be Page Blobs and Disks (or Files). For a collection comprising a small set of images (for example, 200 photographs) the storage costs are low even for EBS, especially when aggregated with many other institutions into general platform storage costs. But for a large scale digitisation programme the cost of storing millions of high resolution source images on fast disk cloud storage would be too high to justify use of the platform. There are also logistical problems in getting the image files onto the platform in the first place (for example if they are not already in some form of AWS storage).

A larger institution could choose to keep image serving (i.e., the IIIF Image API endpoints) local, on its own systems, so that its image servers are close to the source files, while still using other platform features in the cloud. This may be suitable for enormous collections. However, it defeats one of the main objects of the platform: offloading the maintenance, scalability, burst-handling and reliability of image serving infrastructure to a dedicated service.

The current architecture of the Wellcome Library’s Digital Delivery System suggests a possible model for serving images from platform endpoints without having to store all the source image files in the most expensive storage.

The platform manages a working set of the most frequently used images in SSD-based storage. For some institutions this would be a reserved “pool”, e.g., the Wellcome Library could reserve 2TB of EBS space (about 10% of the total storage at present). Source images in this pool are immediately available to the image servers. Call this Tier 1.

The next tier of storage (Tier 2) is HTTP-based object storage such as Amazon S3 or Azure Block Blobs. The platform needs to copy source images from S3 to EBS to make them available for the image servers. It is assumed (but needs to be verified) that a copy from S3 to EBS in the same region is “fast” (figures to follow).

Typically, all of an institution’s source image files would be stored in Tier 2 and a working subset of them would be replicated at any one time in Tier 1. This subset would be maintained by the platform, ensuring that the most used images are in the fastest storage. Some asset management systems can “mirror” content to S3, or use it as primary storage. In other scenarios an institution would incorporate upload of resources to Tier 2 into its digitisation workflow.

There is also a third possible tier: the source image is available at some other HTTP(S) URI - anywhere on the internet - and the platform fetches it from this URI only when required. A large scale digitisation programme will involve images that are required extremely infrequently (perhaps once in a decade).

Before attempting to generate tiles from a source file, the platform checks that it is present in Tier 1. If it isn’t, it will try to copy it from Tier 2. If it isn’t present in Tier 2 either, but the platform knows the “origin” URI of the source image file, it can attempt to fetch it from that URI. This approach means that an institution could reserve a small amount of Tier 1 space, a larger amount of Tier 2 space, and opt to provide the remaining asset source files from its own endpoints on demand. The platform manages the “working set” in Tiers 1 and 2. This caps cloud storage costs, but it has several other potential problems and is probably inappropriate for all but the largest users of the platform. The originating institution is responsible for ensuring that the source image remains available at its origin URI in the long term.
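
In outline, the tier traversal just described might look like the following; the local pool path, the stand-in for object storage and the fetch mechanics are all illustrative:

    # Outline of the tier traversal described above. The paths and
    # lookup tables are placeholders; promotion into Tier 1 would also
    # trigger eviction of less-used images when the reserved pool is full.
    import os
    import shutil
    import urllib.request

    TIER1 = "/mnt/ebs/images"   # fast local pool (illustrative path)

    def resolve_source(image_id, tier2_paths, origin_uris):
        """Ensure a source image is on fast local storage; return its path.

        tier2_paths: image_id -> path in object storage (stand-in for S3)
        origin_uris: image_id -> registered Tier 3 origin URI
        """
        local = os.path.join(TIER1, image_id + ".jp2")
        if os.path.exists(local):                 # Tier 1 hit
            return local
        if image_id in tier2_paths:               # promote from Tier 2
            shutil.copy(tier2_paths[image_id], local)
            return local
        if image_id in origin_uris:               # fetch from Tier 3 origin
            urllib.request.urlretrieve(origin_uris[image_id], local)
            return local
        raise KeyError("unknown image: " + image_id)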

The current Wellcome Library DDS has an architecture that is somewhat like Tier 1 + Tier 3 (i.e., with no Tier 2), although both tiers are within the Wellcome Library network. This works well in practice, although you can tell when the DDS is copying from Tier 3 to Tier 1: there is a delay of about two seconds before tiles start appearing for a given image. Careful cache management with a large Tier 1 working set mitigates this for most users.

Managing the working set in each tier is what the platform does to get the best balance of performance and cost for each user organisation’s budget, preferences and usage patterns. The implementation of this, behind distributed file system abstractions with eviction/scavenging policies, needs a great deal of work and empirical testing.
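
As a toy illustration of the kind of eviction policy involved, here is a least-recently-used working set with a byte budget; a real policy would also weigh request frequency, institutional reservations and the cost of re-fetching from lower tiers:

    from collections import OrderedDict

    # Toy LRU working set with a byte budget, sketching Tier 1
    # management. A production policy would be far richer.
    class WorkingSet:
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.items = OrderedDict()  # image_id -> size in bytes

        def touch(self, image_id, size):
            """Record a use; admit the image, evicting LRU entries as needed."""
            if image_id in self.items:
                self.items.move_to_end(image_id)
                return []
            evicted = []
            while self.used + size > self.capacity and self.items:
                old_id, old_size = self.items.popitem(last=False)
                self.used -= old_size
                evicted.append(old_id)  # caller deletes these from Tier 1
            self.items[image_id] = size
            self.used += size
            return evicted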

Our experience at the Wellcome Library is that the distribution of requests across the entire set of images is extremely uneven:

[Chart: image usage at Wellcome Library - distribution of tile requests across the image set]

The first 18 images form a distinct cluster: they are the winners of the 2014 Wellcome Image Awards.

Almost all tile requests are served from Tier 1 files. But the user experience for material that has to be fetched from Tier 3 must be acceptable. It’s an extremely long tail; the user working on obscure manuscripts can accept a short wait between images (but not between tiles), but no more than a couple of seconds. This is particularly important for viewers that offer thumbnail previews of pages.

Alternative approaches that need to be examined include the performance of file systems that are wrappers over S3 storage, such as s3fs. Under the hood these are still copying from S3 via GET requests, but clever use of byte-range requests could improve performance significantly. However, this is still a long way from the performance of random access SSD-backed storage. An s3fs-style file system that used EBS or local storage as a cache, and had excellent cache management capability, could be a possibility. Note, however, that EBS storage cannot be shared between instances.
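
For example, such a wrapper could fetch just the parts of a JPEG 2000 file it needs using HTTP range requests, which S3 supports. A minimal sketch (the URL is a hypothetical public object; real code would sign requests via an SDK such as boto):

    import requests

    # Sketch: fetch only part of an object with an HTTP Range request.
    # The URL is a hypothetical, publicly readable object.
    url = "https://example-bucket.s3.amazonaws.com/masters/b12345678_0001.jp2"

    def read_range(start, length):
        """GET bytes [start, start+length) of the object."""
        r = requests.get(
            url, headers={"Range": "bytes=%d-%d" % (start, start + length - 1)}
        )
        r.raise_for_status()  # expect 206 Partial Content
        return r.content

    header = read_range(0, 64 * 1024)  # enough for most JP2 headers (a guess)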

Another alternative is to use ephemeral “direct attached” storage as Tier 1, avoiding EBS costs but losing all of Tier 1 whenever an EC2 instance shuts down. The practicality of this depends on whether there are a large number of image server instances running on a large number of low performance VMs (each with its own very small isolated island of ephemeral storage) or a handful of very powerful VMs (maybe running dozens or hundreds of image server processes) that are expected to run uninterrupted long enough to build up a meaningful cache. Large VMs on Azure and AWS come with large ephemeral SSD-backed storage. An S3-backed file system wrapper (or rather, a Block Blob-backed one on Azure) could use most of this as a tuned cache (and some of the RAM as a RAM disk too).

On the DLCS platform, an incoming tile request can be routed by a load balancing component to the image server instance that currently has that image as a local file. If that image server is deemed busy, the least busy image server will be enrolled to also serve tiles from that source file. If no image server has the source file, the least busy image server will be selected and the file copied to it from Tier 2 (or Tier 3). This approach still allows for multiple low power image servers, with each tile request being routed to an image server that has the necessary image data locally; a sketch of this routing rule follows the list below. A very popular image could end up on more than one image server, although tile caching further up the stack should make this quite rare. This kind of load distribution implies that there is an optimal balance of the following variables:

  • Number of image servers
  • Amount of local storage on each image server (ephemeral or EBS)
  • “Performance” of each image server (does it work best if each instance is a micro instance or a huge VM? This question is obviously hiding many other variables)
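
A sketch of the routing rule described above, with cluster state modelled as simple dictionaries; a real implementation would track this state dynamically in the load balancing component:

    # Illustrative sketch of the routing rule described above.
    def route_tile_request(image_id, server_files, server_load, busy=0.8):
        """Pick an image server name for a tile request on image_id.

        server_files: server name -> set of image ids held locally
        server_load:  server name -> current load, 0.0 to 1.0
        """
        holders = [s for s, files in server_files.items() if image_id in files]
        idle = [s for s in holders if server_load[s] < busy]
        if idle:
            return min(idle, key=server_load.get)  # route to a quiet holder
        # No holder, or all holders busy: enrol the least busy server; it
        # must copy the source file from Tier 2 (or Tier 3) before serving.
        target = min(server_load, key=server_load.get)
        server_files.setdefault(target, set()).add(image_id)
        return target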

The shape of the curve shown above could be significantly different for a platform holding hundreds of different collections.

Tom Crane
March 2015