The National Archives Taxonomy in the new Discovery Service

Objectives

To develop user-friendly subject categorisation and filtering functionality for the new Discovery catalogue service, applying subject tags to 11 million low-level catalogue descriptions using innovative technology and limited human resources.

Contact:

Jone Garmendia, Head of Cataloguing, The National Archives, Kew, jone.garmendia@nationalarchives.gsi.gov.uk

Time period:

June 2010 – March 2011 as a project.
April 2011 onwards as business as usual.

Resources

Development took place in-house: four archivists worked on the project whilst carrying out other duties, and an IT developer supported the taxonomy management software and categorisation engine.

Background

We have used the basic definition of taxonomy published by the group Taxonomies in the Public Sector (TIPS) (a system for naming and organising things into groups that show similar characteristics) but with a strong focus on its practical application. We wanted to deliver a taxonomy to support web usability (searching and filtering of results), content re-use and serendipitous findings. The new subject filter sits next to the collection filter which continues to allow users to select records by provenance.

With the help of the lessons learnt after our 2005 site search categorisation exercise, consultation with experts and a series of internal workshops, we came up with a simple taxonomy which is structured as a list of subjects, using a simple header and a single level below.

Methodology and process

We ran a pilot categorisation exercise, which was extremely helpful particularly to:

For example, we decided to use the query method because the quality of its categorisation was high. The alternative was the training method (automated tagging by the search engine after analysing a sample of representative records for each subject category), which was not suitable due to the nature and chronology of our data.

We implemented a new set of business rules within the metadata schema in order to allow subseries and series titles to influence the categorisation of low level file descriptions. This was essential to break the barriers of the strict multilevel structure of our relational database catalogue.
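As a rough illustration of this kind of rule, the sketch below assembles the text used for categorisation from a file description together with the titles of its parent subseries and series. It is not the actual Discovery implementation: the class, the function name and the sample titles are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogueEntry:
    """Hypothetical catalogue record with a link to its parent level."""
    level: str
    title: str
    parent: Optional["CatalogueEntry"] = None


def categorisation_text(entry: CatalogueEntry) -> str:
    """Join the titles of the entry and all its ancestors, top level first."""
    parts = []
    node = entry
    while node is not None:
        parts.append(node.title)
        node = node.parent
    return " ".join(reversed(parts))


# Invented sample hierarchy: series > subseries > file.
series = CatalogueEntry("series", "Race Relations Board: Minutes and Papers")
subseries = CatalogueEntry("subseries", "Conciliation committees", parent=series)
item = CatalogueEntry("file", "Correspondence, 1968", parent=subseries)

print(categorisation_text(item))
```

Here the file title alone ("Correspondence, 1968") carries no subject information, but the inherited series title gives the categorisation engine something to match against.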

Next was the research and query building phase: each archivist was responsible for the development of a group of subjects. We researched sources for each subject within the context of The National Archives holdings, always assessing which records a wide range of users would expect to find under each subject filter. The outcome was a compilation of words and phrases for each subject. These word lists were used to build sophisticated queries at the back end and drive the automated tagging process.
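The step from word lists to queries can be sketched as follows. This is a minimal approximation using regular expressions rather than the search engine's own query language, and the subject names and terms are hypothetical examples, not entries from the real taxonomy.

```python
import re

# Hypothetical word lists, one per subject.
SUBJECT_TERMS = {
    "Railways": ["railway", "locomotive", "rolling stock"],
    "Taxation": ["income tax", "excise duty"],
}


def compile_query(terms):
    """Build one case-insensitive, whole-word pattern from a list of terms."""
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


QUERIES = {subject: compile_query(terms) for subject, terms in SUBJECT_TERMS.items()}


def tag(description):
    """Return the subjects whose query matches the description, sorted by name."""
    return sorted(subject for subject, query in QUERIES.items() if query.search(description))


print(tag("Locomotive repairs and income tax returns"))
```

A record can receive more than one subject tag, and a record that matches no word list is simply left untagged.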

One of the challenges was (and still is) disambiguation: dealing with homonyms. For example, in order to build a successful query to tag records under the subject ‘Race relations’ we had to exclude instances of the term ‘race’ in other contexts (horse races, air races, athletic races, race to the moon, etc.). We also learnt to never underestimate the number of words that might be contained in an English name.
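A simple way to picture this kind of exclusion is shown below. It is only a sketch under assumed rules: the patterns and the function name are illustrative, and the real queries were written for the categorisation engine rather than as Python regular expressions.

```python
import re

# Term that should trigger the subject.
INCLUDE = re.compile(r"\brace\b", re.IGNORECASE)

# Contexts in which 'race' has an unrelated meaning (illustrative list only).
EXCLUDE = re.compile(
    r"\b(?:horse|air|athletic)\s+races?\b|\brace\s+to\s+the\s+moon\b",
    re.IGNORECASE,
)


def is_race_relations(description):
    """Strip the excluded phrases first, then look for the term in what remains."""
    remaining = EXCLUDE.sub(" ", description)
    return bool(INCLUDE.search(remaining))


print(is_race_relations("Race relations legislation: working party"))
print(is_race_relations("Air race at Hendon aerodrome"))
```

Removing the excluded phrases before matching, rather than rejecting any record that contains one, means a description mentioning both a horse race and race relations can still be tagged.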

This diagram illustrates the indexing process:

New data is automatically categorised during the ingestion and indexing process. This is extremely important for us because we release an average of 12,000 catalogue entries per week.

Beta release, testing and tuning

Discovery is still a beta system and is not yet fully synchronised with the live TNA Catalogue data.

Following the beta release we carried out two testing exercises in parallel:

Outcome

  • Taxonomy subject filtering is in operation to serve the new Discovery Service.
  • Taxonomy development work has become business as usual.
  • A team of archivists has embraced the new technology to carry out a different type of descriptive work, putting a curatorial dimension into a search engine.

Future work

  • Taxonomy maintenance.
  • Adjustments for the ingest of other data sources, e.g. DocumentsOnline.

Lessons learnt

  • The importance of writing robust business rules in a language that developers can interpret and transform.
  • Testing and tuning, testing and tuning and testing and tuning again.
  • The advantages of agile development versus PRINCE-related project management methodologies.
  • The fact that software is always imperfect and that we need to accept some undesirable features and work around the issues to achieve a positive outcome, resisting the temptation to hit our heads against the wall.
  • The realisation that difficulties around disambiguation, language and personal names can also make your day much more interesting and fun.
  • Finally, we already knew this from the start but… we kept learning that in the current climate all projects have to be delivered with very limited financial resources.