The National Archives Taxonomy in the new Discovery Service
To develop user-friendly subject categorisation and filtering functionality for the new Discovery catalogue service, creating subject tags deep in the catalogue hierarchy, across 11 million low-level catalogue descriptions, using innovative technology and limited human resources.
Jone Garmendia, Head of Cataloguing, The National Archives, Kew, firstname.lastname@example.org
June 2010 – March 2011 as a project.
April 2011 onwards as business as usual.
Development took place in-house: four archivists worked on the project whilst carrying out other duties, and an IT developer supported the taxonomy management software and categorisation engine.
We have used the basic definition of taxonomy published by the group Taxonomies in the Public Sector (TIPS) (a system for naming and organising things into groups that show similar characteristics) but with a strong focus on its practical application. We wanted to deliver a taxonomy to support web usability (searching and filtering of results), content re-use and serendipitous findings. The new subject filter sits next to the collection filter which continues to allow users to select records by provenance.
With the help of the lessons learnt after our 2005 site search categorisation exercise, consultation with experts and a series of internal workshops, we came up with a simple taxonomy which is structured as a list of subjects, using a simple header and a single level below.
Methodology and process
We ran a pilot categorisation exercise, which was extremely helpful, particularly to:
a) understand the taxonomy management tool, a product linked to our existing search engine
b) make decisions about the best way to proceed within the available technology.
For example, we decided to use the query method because the quality of the categorisation was high. The alternative was the training method (automated tagging by the search engine after analysing a sample of representative records for each subject category), which was not suitable due to the nature and chronology of our data.
We implemented a new set of business rules within the metadata schema in order to allow subseries and series titles to influence the categorisation of low level file descriptions. This was essential to break the barriers of the strict multilevel structure of our relational database catalogue.
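The effect of such a rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the business rules or metadata schema actually used: the `Entry` class, the `categorisation_text` function, and all references and titles below are invented for the example. It assumes each catalogue entry records the reference of its parent, so that a file description can be categorised on its own title together with the titles of the subseries and series above it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    """A hypothetical catalogue entry; field names are assumptions."""
    ref: str
    level: str                    # e.g. "series", "subseries", "file"
    title: str
    parent: Optional[str] = None  # reference of the parent entry, if any


def categorisation_text(ref: str, catalogue: Dict[str, Entry]) -> str:
    """Join an entry's title with the titles of all its ancestors,
    highest level first, so that context from above can influence tagging."""
    parts = []
    current = catalogue.get(ref)
    while current is not None:
        parts.append(current.title)
        current = catalogue.get(current.parent) if current.parent else None
    return " | ".join(reversed(parts))


# Invented sample data, for illustration only.
catalogue = {
    "X 1": Entry("X 1", "series", "Race relations: committee papers"),
    "X 1/A": Entry("X 1/A", "subseries", "Regional committees", parent="X 1"),
    "X 1/A/3": Entry("X 1/A/3", "file", "Correspondence", parent="X 1/A"),
}

print(categorisation_text("X 1/A/3", catalogue))
```

On its own, a file titled "Correspondence" carries no subject information; with the series and subseries titles attached, a subject query has something to match.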
Next was the research and query building phase: each archivist was responsible for the development of a group of subjects. We researched sources for each subject within the context of The National Archives holdings, always assessing which records a wide range of users would expect to find under each subject filter. The outcome was a compilation of words and phrases for each subject. These word lists were used to build sophisticated queries at the back end and drive the automated tagging process.
One of the challenges was (and still is) disambiguation: dealing with homonyms. For example, in order to build a successful query to tag records under the subject ‘Race relations’ we had to exclude instances of the term ‘race’ in other contexts (horse races, air races, athletic races, race to the moon, etc.). We also learnt to never underestimate the number of words that might be contained in an English name.
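The exclusion logic described above can be sketched as follows. This is a minimal illustration in Python using regular expressions, not the query syntax of the search engine or the real word lists: the patterns and the `matches_subject` function are assumptions made for the example. Excluded contexts are removed from the text first, and the subject matches only if one of the subject terms is still present afterwards.

```python
import re
from typing import List

# Hypothetical word lists for the subject 'Race relations'.
INCLUDE = [r"\brace\b", r"\bracial\b"]
EXCLUDE = [
    r"\bhorse races?\b",
    r"\bair races?\b",
    r"\bathletic races?\b",
    r"\brace to the moon\b",
]


def matches_subject(text: str, include: List[str], exclude: List[str]) -> bool:
    """Return True if a subject term survives once excluded contexts are removed."""
    lowered = text.lower()
    for pattern in exclude:
        lowered = re.sub(pattern, " ", lowered)
    return any(re.search(pattern, lowered) for pattern in include)


print(matches_subject("Report on race relations in industry", INCLUDE, EXCLUDE))  # True
print(matches_subject("Air race: prize fund", INCLUDE, EXCLUDE))                  # False
print(matches_subject("The race to the moon", INCLUDE, EXCLUDE))                  # False
```

Removing the excluded phrases rather than rejecting the whole description means that a record mentioning both an air race and race relations is still tagged.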
This diagram illustrates the indexing process:
New data is automatically categorised during the ingestion and indexing process. This is extremely important for us because we release an average of 12,000 catalogue entries per week.
Beta release, testing and tuning
Discovery is still a beta system and is not yet fully synchronised with the live TNA Catalogue data.
Following the beta release we carried out two testing exercises in parallel:
a) Analysis of results for 100 popular searches to establish the precision and relevancy of the categorisation.
b) Analysis of untagged material (records left without a subject) to tackle issues around recall. The whole team learnt a great deal through this type of work. It was not just about the quality of the subject tags that we could see. The key was to look at what we could not see: the records that had slipped through the net. Just before April 2011 there were 5.4 million untagged records. This number has since gone down to 1.5 million and the team continues to work towards the maximum possible coverage.
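As a rough indication of what these figures mean for coverage, the short calculation below assumes a constant total of 11 million descriptions (the figure given in the project aim). That total is an approximation, since new entries are released every week, so the percentages are indicative only; the `coverage` function is introduced purely for this example.

```python
def coverage(total: int, untagged: int) -> float:
    """Percentage of descriptions that carry at least one subject tag."""
    return (total - untagged) / total * 100


TOTAL = 11_000_000  # assumed constant total, taken from the project aim

print(round(coverage(TOTAL, 5_400_000), 1))  # 50.9 (just before April 2011)
print(round(coverage(TOTAL, 1_500_000), 1))  # 86.4 (at the time of writing)
```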
- Taxonomy subject filtering is in operation to serve the new Discovery Service.
- Taxonomy development work has become business as usual.
- A team of archivists has embraced the new technology to carry out a different type of descriptive work, putting a curatorial dimension into a search engine.
- Taxonomy maintenance
- Adjustments for the ingest of other data sources, e.g. DocumentsOnline
- The importance of writing robust business rules in a language that developers can interpret and transform.
- Testing and tuning, testing and tuning and testing and tuning again.
- The advantages of agile development versus PRINCE-related project management methodologies.
- The fact that software is always imperfect and that we need to accept some undesirable features and work around the issues to achieve a positive outcome, resisting the temptation to hit our heads against the wall.
- The realisation that difficulties around disambiguation, language and personal names can also make your day much more interesting and fun.
- Finally, we already knew this from the start but… we kept learning that in the current climate all projects have to be delivered with very limited financial resources.