MIMAS Metadatabase: Final Project Report

Ann Apps, Ross MacIntyre and Leigh Morris
MIMAS, Manchester Computing, University of Manchester
June 2002

1 Project Objective

2 Background

MIMAS at the University of Manchester is a national data centre for higher and further education and the research community in the UK, providing networked access to key data and information resources to support teaching, learning and research across a wide range of disciplines. This cross-domain, heterogeneous collection of resources includes:

Until now there was no consistent way of discovering information within these MIMAS collections and associated services, except by reading the web pages specific to each service. Although most of these web pages contain high quality information relevant to their particular service, this information is not presented in a standard format and there is not a simple way to search for information across the services.

Some of the resources held at MIMAS are freely available globally, but access to many is restricted: in some cases to members of UK academia, possibly with registration required, and in other cases to subscribers. Where access is restricted, general resource discovery currently finds only shallow top-level information, which may not indicate to a prospective user how appropriate a resource is to their interest.

MIMAS services will be required to provide interfaces consistent with the architecture of the JISC 'Information Environment' for resource discovery. Currently many of the services do not provide these interfaces. Some of the MIMAS services are products hosted and supported by MIMAS, but not developed in-house, making implementation of additional interfaces unlikely.

To overcome all of these problems, consistent, high quality metadata records for the MIMAS services and collections have been created. These metadata records are standards-based, using Dublin Core, XML and standard encoding schemes for appropriate fields. Freely available access to this XML metadata repository, or 'metadatabase', is provided by an application which supports the interfaces required by the Information Environment, enabling information discovery across the cross-domain MIMAS resource.

3 MIMAS Metadata Records

3.1 Cross-Domain Information Discovery

Because the MIMAS service consists of a heterogeneous collection of services and datasets across many disciplines, a common, cross-domain metadata schema is required for their description, hence the choice of qualified Dublin Core. Someone searching for information about, for example, 'economic' will be able to discover results of possible interest across many of the MIMAS services beyond the obvious macro-economic datasets, including JSTOR, census data, satellite images and bibliographic resources. In the future the metadata may be extended to include records according to domain-specific standards, but the MIMAS metadata cross-searching capability would of necessity still be based on the 'core' metadata encoded in qualified Dublin Core.

3.2 An Example Metadata Record

The MIMAS metadata is encoded in XML and stored in a Cheshire II database, described in more detail in section 5, which provides a World Wide Web and a Z39.50 interface.

Using the Web interface to this metadatabase, searches may be made on the fields 'title', 'subject' or 'all', initially retrieving a list of brief results with links to individual full records. An example of a full record for one of the results retrieved by searching for the subject 'science', with web links shown in '[...]' and with an abbreviated description, is:

Title: ISI Web of Science
Creator: MIMAS; ISI
Subject (LCSH): Abstracts; Arts; Books Reviews; Humanities; Letters; Periodicals; Reviews; Science; Social sciences
Subject (UNESCO): Abstracts; Arts; Book reviews; Conference papers; Discussions (teaching method); Periodicals; Science; Social sciences
Subject (Dewey): 300; 500; 505; 600; 605; 700; 705
Description: ISI Citation Databases are multidisciplinary databases of bibliographic information gathered from thousands of scholarly journals.
Publisher: MIMAS, Manchester Computing, University of Manchester
Type (DC): Service
Type (LCSH): Bibliographical citations; Bibliographical services; Citation indexes; Information retrieval; Online bibliographic searching; Periodicals Bibliography; Web databases
Type (UNESCO): Bibliographic databases; Bibliographic services; Indexes; Information retrieval; Online searching
Type (Dewey): 005
Type (MIMAS): bibliographic reference
isPartOf: [ISI Web of Science for UK Education]
hasPart: [Science Citation Index Expanded]
hasPart: [Social Sciences Citation Index]
hasPart: [Arts & Humanities Citation Index]
Access: Available to UK FE, HE and research councils. Institutional subscription required.
MIMAS ID: wo000002

Following a Z39.50 search, records may be retrieved in three formats: Simple Unstructured Text Record Syntax (SUTRS), as either brief or full records, the full records being similar to the example above; GRS-1 (Generic Record Syntax); and a simple tagged reference format. In addition, the MIMAS Metadatabase is compliant with the Bath Profile, an international Z39.50 specification for library applications and resource discovery, and so provides records as simple Dublin Core in XML according to the CIMI Document Type Definition.

3.3 Standard Classification and Encoding Schemes

To provide quality metadata for discovery, subject keywords within the metadata are encoded according to standard classification or encoding schemes. In order to facilitate improved cross-domain searching by both humans and applications where choices of preferred subject scheme might vary, MIMAS Metadata provides subjects encoded according to several schemes. As well as the encoding schemes currently recognised within qualified Dublin Core, Library of Congress Subject Headings (LCSH) and Dewey Decimal, UNESCO subject keywords are also available. In addition, MIMAS-specific subjects are included to capture existing subject keywords on the MIMAS web site service information pages supplied by the content or application creators as well as MIMAS support staff. The use of standard classification schemes will improve resource discovery and will also assist in the provision of browsing structures for subject-based information gateways.

Similar classification schemes are included for 'Type' to better classify the type of the resource for cross-domain searching. Each metadata record includes a 'Type' from the high-level DCMI Type Vocabulary. This is 'Service' in the example above, but for some MIMAS records it will be 'Collection' or 'Dataset'. In addition, the above example includes type indications according to standard schemes, including 'Bibliographical citations' and 'Online searching'. Again the MIMAS-specific resource type is included.

Countries covered by information within a MIMAS service are detailed according to their ISO 3166 names and also their UNESCO names, captured within the 'dcterms:spatial' element of the metadata record and shown on the web display as 'Country'. This is of particular relevance to the macro-economic datasets, such as the IMF databanks, which include data from many countries in the world. Temporal coverage, again of relevance to the macro-economic datasets, is captured within a 'dcterms:temporal' element and encoded according to the W3CDTF scheme. This is displayed as 'Time' and may consist of several temporal ranges. Information about access requirements for a particular MIMAS service is recorded as free text within a 'dc:rights' element and displayed as 'Access'.

3.4 The MIMAS Application Profile

Where possible the metadata conforms to standard qualified Dublin Core, but this is extended for some Dublin Core elements to enable the capture of information which is MIMAS-specific or encoded according to schemes which are not currently endorsed by Dublin Core. These local additions to qualified Dublin Core effectively make up the MIMAS application profile for the metadatabase. The inclusion of UNESCO as a subject, type and spatial classification scheme, described above, is an example of a local extension, as is the capture of MIMAS-specific subjects and types.

Some administrative metadata is included: the name of the person who created the metadata; the creation date; and the identifier of the record within the MIMAS Metadatabase. Capturing the name of the metadata creator will be of use for future quality checks and updating.

3.5 The MIMAS Metadata Hierarchy

Although each of the records within the MIMAS Metadatabase is created, indexed and available for discovery individually, the records represent parts of the service within a hierarchy. In the example above, the record for 'ISI Web of Science' is a 'child' of the top-level record 'ISI Web of Science for UK Education', the umbrella term for the total service offered, and is a 'parent' of several records including 'Science Citation Index Expanded'.

During metadata creation only the 'isPartOf' relation is recorded, as the MIMAS identifier of the parent metadata record. The 'hasPart' fields and the displayed titles and links for parent and child metadata records are included by the MIMAS Metadatabase application (section 5.2). Hard coding 'hasPart' fields into a metadata record would necessitate the inefficient process of updating a parent record whenever a new child record is added. Dynamic generation of these links simplifies the metadata creation and update process, and helps to maintain the consistency of the metadata.

A further navigation hierarchy is provided by the application. If a parent and a child record, according to the 'isPartOf' hierarchy, also have a matching MIMAS subject keyword, the application includes a link from the parent's subject keyword to the particular child record. For example a JSTOR fragment record could include:

Title: JSTOR Ecology & Botany Collection
Subject (MIMAS): [Ecology / Journal of Applied Ecology]
Subject (MIMAS): Botany

where the text 'Ecology / Journal of Applied Ecology' is a web link to the record for that particular journal. Again this subject navigation hierarchy is provided dynamically by the application and does not depend on the accuracy of metadata creation beyond the 'isPartOf' identifier and the matching subject keyword.
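The matching rule described above can be illustrated with a minimal Python sketch. This is not the project's OmniMark implementation; the `Record` class, its field names and the identifiers used in the usage note are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Record:
    """Hypothetical in-memory view of one metadata record."""
    mimas_id: str
    title: str
    is_part_of: Optional[str] = None          # MIMAS identifier of the parent
    subjects: List[str] = field(default_factory=list)  # MIMAS subject keywords


def subject_links(parent: Record, records: List[Record]) -> Dict[str, List[str]]:
    """Map each parent subject keyword to the titles of the child records
    that are 'isPartOf' the parent and share that keyword."""
    links: Dict[str, List[str]] = {}
    for child in records:
        if child.is_part_of != parent.mimas_id:
            continue  # not a child of this parent
        for subject in parent.subjects:
            if subject in child.subjects:
                links.setdefault(subject, []).append(child.title)
    return links
```

With a parent record holding the subjects 'Ecology' and 'Botany' and a single child record titled 'Journal of Applied Ecology' with the subject 'Ecology', the function returns a link for 'Ecology' only, which corresponds to the displayed text 'Ecology / Journal of Applied Ecology'. 'Botany' stays unlinked because no child shares it.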

The child, 'hasPart', links within the MIMAS metadata hierarchy are available in the web interface only. A metadata record retrieved through the Z39.50 interface will include a single 'isPartOf' relation at most, which will consist of the MIMAS identifier of the parent record. Any required linking between records would be provided by the application retrieving the records.

3.6 Metadata Creation

The initial MIMAS metadata covering all the MIMAS services has been created by one person as part of the set-up project, much of it being scraped from the existing MIMAS service web pages. The initial draft of the metadata records for each service was checked by the support staff for that service, thus ensuring quality metadata for each MIMAS service. It is envisaged that the metadata will be maintained by the service support staff in the future, as part of the standard support process for each MIMAS service.

In the absence of a suitable XML authoring tool, the MIMAS metadata is currently created as XML files using an XML template and a text editor. The created XML is validated by parsing against an XML Document Type Definition (DTD) before the record is indexed in the metadatabase. The DTD is available on the project web site.

4 The JISC Information Environment

All MIMAS resources are part of the JISC 'Information Environment' and thus must be consistent with its architecture. The Information Environment will enable resource discovery through the various portals in its 'presentation layer', including the discipline-specific Resource Discovery Network (RDN) hubs. Content providers in the 'provision layer' are expected to disclose their metadata through searching, harvesting and alerting. This means that all resources within the Information Environment should have a Web search interface and at least some of the following for machine-to-machine resource discovery: a Z39.50 (Bath Profile compliant) search interface; an OAI (Open Archives Initiative) interface for metadata harvesting; and an RSS alert capability.

The majority of MIMAS resources have a Web search interface to provide resource discovery within their particular service. A few MIMAS services, COPAC, zetoc and the Archives Hub, provide Z39.50 interfaces. Some services, being commercial products hosted by MIMAS, may never provide Z39.50 searching or OAI metadata. To overcome this lack of requisite interfaces for MIMAS content and access restrictions on some of the services, the MIMAS Metadatabase will act as an intermediate MIMAS service within the 'provision layer' of the Information Environment, functioning as the main resource discovery service for MIMAS content.

5 The Cheshire II Information Retrieval System

The software platform used for the MIMAS Metadatabase is Cheshire II, a next generation online catalogue and full text information retrieval system developed using advanced information retrieval techniques. It is open source software, free for non-commercial uses, and was developed at the University of California-Berkeley School of Information Management and Systems. Experience and requirements from the development of the MIMAS Metadatabase have been fed back into the continuing Cheshire development. Although using evolving software has caused some technical problems, the Cheshire development team, who are project collaborators, have been very responsive in providing new functionality, and this relationship has proved beneficial to both projects. As part of the project a 'sort' capability has been developed within Cheshire.

5.1 Z39.50 via Cheshire

Cheshire provides indexing and searching of XML (or SGML) data according to an XML Document Type Definition (DTD), and a Z39.50 interface. The underlying database for the MIMAS Metadatabase is a single XML data file containing all the metadata records, along with a set of indexes onto the data. The MIMAS metadata XML is mapped to the Z39.50 Bib-1 Attribute Set for indexing and searching.

The application's Z39.50 search results formats are detailed above in section 3.2. The mapping from the MIMAS metadata to the GRS-1 Tagset-G elements is defined in the Cheshire configuration file for the database and is used by Cheshire to return data in GRS-1 format to a requesting client. The other Z39.50 result formats are implemented by bespoke filter programs which transform the raw XML records returned by Cheshire, the 'hooks' to trigger these filters being specified in the configuration file for the database.

The mapping from the MIMAS metadata to simple Dublin Core, as required by the Bath Profile, is straightforward, the base data being qualified Dublin Core, albeit with some loss of information such as subject schemes. In order to obviate this information loss as much as possible, such details are included in parentheses in the supplied record. For example, a Z39.50 XML result for the example in section 3.2 may contain the element:

<subject>(LCSH) Abstracts</subject>

Details of these indicated data mappings are available on the project web site.
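As a rough illustration of this kind of mapping, the following Python sketch flattens a qualified record into simple elements, moving the scheme name into a parenthesised prefix. It is not the project's filter program: the `record` root element, the `scheme` attribute and the input element names are assumptions, and the real MIMAS and CIMI DTDs may differ.

```python
import xml.etree.ElementTree as ET


def to_simple_dc(qualified_xml: str) -> str:
    """Flatten a qualified record into simple elements.

    Assumes (hypothetically) that each child element may carry a
    'scheme' attribute naming its encoding scheme, e.g. LCSH.
    """
    source = ET.fromstring(qualified_xml)
    target = ET.Element("record")  # hypothetical root element name
    for element in source:
        simple = ET.SubElement(target, element.tag)
        text = element.text or ""
        scheme = element.get("scheme")
        # Keep the scheme as a parenthesised prefix so it is not lost.
        simple.text = f"({scheme}) {text}" if scheme else text
    return ET.tostring(target, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<record>"
        "<title>ISI Web of Science</title>"
        '<subject scheme="LCSH">Abstracts</subject>'
        "</record>"
    )
    print(to_simple_dc(sample))
```

Elements without a scheme, such as the title here, pass through unchanged, so only the qualified values gain a prefix.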

5.2 The Cheshire Web Interface

Cheshire also provides a basic, customisable Web interface, 'webcheshire'. The web interface for the MIMAS Metadatabase is built on webcheshire as a bespoke program written in OmniMark (version 5.5). This web program provides a search interface which includes saving session information between web page accesses. It transforms retrieved records from XML to XHTML (version 1.0) for web display. OmniMark was chosen as the programming language for this interface, rather than Perl or TCL (the basic Cheshire interface language), because it is XML (or SGML) aware according to a DTD, a capability which is used for the XML translations involved, and also because of existing expertise and its availability on the MIMAS machine.

The MIMAS Metadatabase web interface provides search results in discrete 'chunks', currently 25 at a time, with 'next' and 'previous' navigation buttons. This is implemented by using the Cheshire capability to request a fixed number of records in the result set, beginning at a particular number within that set. The application remembers the MIMAS identifiers of the results in the retrieved 'chunk', and extracts the record corresponding to a particular MIMAS identifier when an end-user selects a 'full record display'.
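The arithmetic behind this chunked retrieval can be sketched in Python as follows. The function name and the 1-based page numbering are assumptions for illustration; the sketch only computes the start position and record count that would be passed to the retrieval system, and does not call Cheshire itself.

```python
from typing import Tuple

CHUNK_SIZE = 25  # results shown per page, as in the current interface


def chunk_bounds(page: int, total: int, size: int = CHUNK_SIZE) -> Tuple[int, int]:
    """Return (start, count) for a 1-based page of a result set.

    'start' is the 1-based position of the first record to request and
    'count' is how many records to request from that position.
    """
    start = (page - 1) * size + 1
    if page < 1 or start > total:
        raise ValueError("page is outside the result set")
    count = min(size, total - start + 1)
    return start, count


def has_previous(page: int) -> bool:
    """A 'previous' button is needed on every page after the first."""
    return page > 1


def has_next(page: int, total: int, size: int = CHUNK_SIZE) -> bool:
    """A 'next' button is needed while records remain beyond this page."""
    return page * size < total
```

For a result set of 60 records, the third page starts at record 51 and holds the remaining 10 records, so it would show a 'previous' button but no 'next' button.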

To implement the metadata hierarchy navigation functionality described in section 3.5, an additional index, used internally by the application, is created on the 'isPartOf' fields of the records, which hold the MIMAS identifiers of the parent records. When a record is displayed, this index is checked to find all metadata records which indicate the current record as parent, and the titles of these child records are also determined from the database. For each child record found a 'hasPart' link is displayed. Similarly, the title and link for the 'isPartOf' display are determined by a database look-up.
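A minimal Python sketch of this look-up, under stated assumptions, is given below. It models the database as a dictionary keyed by MIMAS identifier and the internal index as an inverted mapping from parent identifier to child identifiers. The data structures and function names are hypothetical, and apart from 'wo000002' the identifiers in the usage note are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Records = Dict[str, dict]  # MIMAS identifier -> {"title": ..., "isPartOf": ...}


def build_child_index(records: Records) -> Dict[str, List[str]]:
    """Invert the stored 'isPartOf' relation: parent id -> child ids."""
    index: Dict[str, List[str]] = defaultdict(list)
    for mimas_id, record in records.items():
        parent_id = record.get("isPartOf")
        if parent_id is not None:
            index[parent_id].append(mimas_id)
    return index


def display_relations(mimas_id: str, records: Records,
                      index: Dict[str, List[str]]) -> dict:
    """Collect the (identifier, title) pairs needed for the displayed
    'isPartOf' and 'hasPart' links of one record."""
    parent_id = records[mimas_id].get("isPartOf")
    is_part_of: Optional[Tuple[str, str]] = None
    if parent_id in records:
        is_part_of = (parent_id, records[parent_id]["title"])
    has_part = [(child_id, records[child_id]["title"])
                for child_id in index.get(mimas_id, [])]
    return {"isPartOf": is_part_of, "hasPart": has_part}
```

Given records 'wo000001' (ISI Web of Science for UK Education), 'wo000002' (ISI Web of Science, part of 'wo000001') and 'wo000003' (Science Citation Index Expanded, part of 'wo000002'), displaying 'wo000002' yields one 'isPartOf' link and one 'hasPart' link. Adding a further child record requires no change to the parent record, only a rebuild of the index.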

6 Future Work

Further enhancement of the MIMAS Metadatabase will take place as part of the Implementing the DNER Technical Architecture (ITAM) project:

Other possible future work, but not currently planned, could include:

7 Conclusion

MIMAS has aimed to describe its collection of datasets and services using quality metadata. Quality assurance has been achieved by having the metadata records for each service checked by the relevant support staff. Continued metadata quality will be ensured by these support staff maintaining the metadata. Subject keywords are included in the metadata according to several standard classification schemes, as are resource types and geographical names. Use of standard schemes enhances the quality of the metadata and enables effective resource discovery.

The project has developed an interoperable solution based on open standards and using leading-edge, open source technology, achieved with a Cheshire II software platform to index Dublin Core records encoded in XML. A spin-off has been improvements to Cheshire following feedback from MIMAS, including the addition of 'sort'. Use of Z39.50 will enable the MIMAS Metadatabase to be integrated into the JISC 'Information Environment', thus providing a valuable resource discovery tool to the stakeholders within that environment.

The MIMAS Metadatabase provides a single point of access into the disparate, cross-domain MIMAS datasets and services. It provides a means for researchers to find and access material to aid in the furtherance of their work, thus assisting in the advancement of knowledge. Learners and their teachers will be able to discover appropriate learning resources across the MIMAS portfolio, improving the educational value of these datasets.

The MIMAS Metadatabase project was funded by the Joint Information Systems Committee (JISC) for the UK Higher and Further Education Councils as part of the JISC Services DNER: Z39.50/Authentication Programme.

15 July 2002, epub@manchester.ac.uk