Guidance

Making your data easy to find: Metadata best practice guide for data publishers

Published 12 November 2021

Summary

This report provides guidance to promote the findability of metadata. Metadata provides a structured description of a dataset that allows a user to discover data and appraise its usefulness to them. The metadata itself, however, needs to be created in a way that maximises the chance that it can be found and used. The Metadata best practice guide gives guidance to geospatial data publishers in the UK about how to create metadata that is complete and accurate, easy to read, discoverable, and reusable. It encourages and promotes the adoption of the UK GEMINI standard, with the aim of providing a consistent approach to geospatial metadata across organisations and departments. This good practice guide focuses on the experience of the metadata end-user to ensure content is optimised for findability through search engines.

About Metadata

What is Metadata and its purpose?

Metadata is the first thing individuals come across when searching for geospatial data. It helps them find the data and tells them about the datasets they have found: what they contain, who produced them, who maintains them, and how they can be used. Metadata should always be created with the end-use case and data re-use in mind, targeting a wide range of potential data re-users.

According to the Geospatial Glossary published on the GOV.UK website, metadata is “data about data (or a service)”. It is a record of the most important features of your data resource. It tells the user what the data resource contains, where it comes from, the frequency with which it will be updated, licence requirements, and the quality controls that have been applied. These are essential descriptors of the data resource and give users the ability to make an informed decision on whether the data resource is suitable for the intended use. Creating metadata is not a one-off event. It needs to be kept up to date and in sync with the data resource it accompanies.

Why is it important to produce quality, discoverable metadata?

Good quality metadata is essential for:

  • Making it easier for users to find your data;
  • Letting users know how to access your data;
  • Letting users know any conditions on reuse of your data;
  • Identifying the lineage of the data resource from source to consumption making the data transparent to users;
  • Providing users in your organisation with an accurate account of your data resources;
  • Showcasing the qualities of the data resource to inform decisions about its suitability for a particular purpose;
  • Ranking favourably on search engines making it more findable;
  • Data platforms and portals to operate efficiently; and
  • Building value and trust, allowing others to use that data with confidence.

To put in place the business processes and tools you need to maintain and update your metadata throughout the data life cycle, you need to consider:

  • What metadata elements you need to capture, modify and update
  • How you can minimise human intervention and automate metadata creation and updating

Tip: Start your metadata journey early. The creator of the data resource should be thinking about the metadata while the resource is being created rather than as a later, additional task.

About this metadata best practice guide

The development of this guide is part of the Geospatial Commission’s Data Improvement Programme and incorporates the results of research and interviews conducted during the project. The UK GEMINI standard was found to be most suitable for this metadata guide, which describes several of the GEMINI elements in detail.

This work was carried out as part of the Geospatial Commission sponsored Data Improvement Programme which aimed to improve the alignment of public sector geospatial data to the FAIR data principles. As such this work contributes to Mission 2 of the UK’s National Geospatial Strategy - Improve access to better location data. This work was part of the Data Discoverability project which aimed to improve the Findability of geospatial data held by the Commission’s partner bodies. The objective of this work was to improve metadata practices across the Geo 6 Partners [footnote 1] and wider geospatial groups across government.

Research conducted in the Data Discoverability 2 (DD2) project shows that most users of geospatial data will use search engines to find the datasets they need. This means that the metadata, which is what search engines rank, must be created in a way that is optimised for search engines (search engine optimisation, SEO).[footnote 2]

In this document the term ‘data resource’ is used in preference to ‘dataset’. This is because a GEMINI record may describe a dataset, a series of datasets, or a (web) service that ‘operates on’ (e.g. provides access to) one or more datasets. It would also cover tile metadata, or any other ‘level’ of metadata.

Best Practice Guide

Understand the data resource for the metadata you are creating

Why it’s important

Before creating any metadata, you need to understand why the data resource was created, what it contains and who is the target audience. The creator of the metadata needs this information to produce quality metadata that will be useful to the users. This is harder to do if you are creating metadata for a data resource created by someone else.

What it means

  1. You will need to have discussions with the creators of the data resource to gain an understanding of what the data is all about. You will then be able to produce metadata that will be discoverable and relevant to its target audience.

  2. If you need to store other items of metadata which are not described in the standards used by your organisation, you can create your own internal extensions. Examples might be to flag personal data, or internal systems and processes.

Follow the latest UK GEMINI standard for metadata

Why it’s important

UK GEMINI is the recommended standard for geospatial data as it:

  • encourages a consistent approach to geospatial metadata across organisations and departments;
  • ensures conformance with the underlying ISO standards and the EU’s INSPIRE Directive[footnote 3]; and
  • supports improved discovery of geospatial data.

GEMINI can also be used to publish details of something that might not be traditionally ‘geographical’, such as a video or a rock sample, so long as you can provide a relevant location and relate it to one or more theme keywords.

What it means

  1. You should refer to the latest UK GEMINI standard and follow its guidance as closely as possible. Some of the elements will be discussed in detail further on in this guidance.

  2. We recommend all public bodies use GEMINI metadata to make their data resources discoverable on data.gov.uk, or for Scottish public bodies, on spatialdata.gov.scot. This is mandatory for data in-scope of the INSPIRE Regulations. This can be done either directly or via a connected national or thematic portal, for example Scottish Government’s spatialdata.gov.scot site or Marine Environmental Data and Information Network (MEDIN).

Data.gov.uk transforms your metadata to be human readable and searchable on data.gov.uk web search and is enhanced by embedding some structured metadata in the web page (using schema.org). It also makes the records available in an Open Geospatial Consortium (OGC) Catalogue service. Data.gov.uk has a GEMINI Harvester [footnote 4] for this purpose.
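As an illustration of what embedded structured metadata can look like, the sketch below builds a schema.org Dataset description as JSON-LD. It is not the output that data.gov.uk actually generates. The function name, the abstract text and the licence URL are hypothetical placeholders, and the choice of properties is an assumption.

```python
import json


def dataset_jsonld(title, abstract, keywords, place, licence_url):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "description": abstract,
        "keywords": keywords,
        "spatialCoverage": {"@type": "Place", "name": place},
        "license": licence_url,
    }


record = dataset_jsonld(
    title="Postcodes of Great Britain produced from Code-Point, 2020",
    abstract="Hypothetical abstract text describing the postcode data.",
    keywords=["Addresses", "Postcodes of Great Britain"],
    place="Great Britain",
    licence_url="https://example.org/licence",  # placeholder, not a real licence
)

# Serialise for embedding in a web page inside a JSON-LD script element.
print(json.dumps(record, indent=2))
```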

Tip: If you use an existing record as a template where you simply change the key fields, make sure you use a new identifier for the record (gmd:fileIdentifier). If you do not, data.gov.uk will overwrite the record that was the template when it harvests it.
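A minimal sketch of the tip above, assuming the record is encoded as ISO 19139 XML and that a UUID is an acceptable identifier in your organisation. The template record and the function name are hypothetical. Real GEMINI records contain many more elements.

```python
import uuid
import xml.etree.ElementTree as ET

# Namespaces used by the ISO 19139 encoding.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def assign_new_file_identifier(root):
    """Replace the file identifier copied from a template with a fresh UUID."""
    node = root.find("gmd:fileIdentifier/gco:CharacterString", NS)
    if node is None:
        raise ValueError("record has no gmd:fileIdentifier element")
    node.text = str(uuid.uuid4())
    return node.text


# Hypothetical, heavily shortened template record.
TEMPLATE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:fileIdentifier>
    <gco:CharacterString>template-record-id</gco:CharacterString>
  </gmd:fileIdentifier>
</gmd:MD_Metadata>"""

root = ET.fromstring(TEMPLATE)
new_id = assign_new_file_identifier(root)
```

Assign the new identifier once, when the record is first created from the template. After that it should stay fixed for the life of the record.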

3. There are a few things to remember when providing metadata:

  • write out URLs (web addresses) in full, for example ‘https://…’;
  • ensure your metadata is kept up to date and in sync with the data resource;
  • where possible, do not use free text as it is difficult to carry out effective search on these fields. Use code lists wherever you can and double check that they are valid (see individual elements for further detail); and
  • Table 1 lists several GEMINI elements that are of vital importance, so make sure you complete them accurately and in full (some of these are discussed in more detail further on in this guidance).

Some of these elements must be used in particular ways if you are in-scope of INSPIRE. The GEMINI guidance provides details on these.

Table 1: Subset of GEMINI elements

UK GEMINI element | Brief description
Title [footnote 5] | Name of the dataset.
Abstract | A summary of what the dataset is about.
Keyword | Indicates the general subject area of the data resource using keywords.
File Identifier | A unique identifier for the metadata file that must never be changed.
Resource locator | A link to where the data resource may be directly accessed online or, where no direct access is available, to an online resource providing information about accessing it.
Responsible organisation | The name of the organisation releasing the dataset.
Lineage | Indication of how the data resource was created.
Limitations on public access | Identifies reasons why the data is not openly available to the public.
Use constraints | Where licence requirements and other restrictions on using the data are noted.
Dataset reference date | Date of publication of the data resource.
Metadata date | Date on which the metadata was created, last updated, or was confirmed as being up to date.
Temporal extent | The date or date range that identifies the currency of the data. It may refer to the period of collection, or the date at which it is deemed to be current.
Extent | Defines the geographical extent of the data resource by linking to the name of a well-defined area (county, national park, etc).
Bounding box | The rectangle enclosing the extent of the data resource, described in latitude and longitude.
Data quality | A quantitative description of the completeness, uniqueness, consistency, timeliness, accuracy and validity of the data.
Conformity | A statement of conformity with the product specification or user requirement against which the data is being evaluated.
Maintenance information | The scope and frequency of updating.
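A completeness check for the elements in Table 1 can be sketched as follows. The dictionary layout of the draft record and the function name are assumptions made for illustration. The check only spots empty elements and cannot tell whether an entry is accurate.

```python
# Element names taken from Table 1.
KEY_ELEMENTS = [
    "Title", "Abstract", "Keyword", "File Identifier", "Resource locator",
    "Responsible organisation", "Lineage", "Limitations on public access",
    "Use constraints", "Dataset reference date", "Metadata date",
    "Temporal extent", "Extent", "Bounding box", "Data quality",
    "Conformity", "Maintenance information",
]


def missing_elements(record):
    """Return the Table 1 elements that are absent or blank in a draft record."""
    missing = []
    for name in KEY_ELEMENTS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Hypothetical draft: the title is filled in, the abstract is blank.
draft = {
    "Title": "Postcodes of Great Britain produced from Code-Point, 2020",
    "Abstract": "   ",
}
gaps = missing_elements(draft)
```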

Write a descriptive title with most critical information in the first 60 characters

Why it’s important

The name or title of your geospatial data is the first chance you have to tell your users what your data resource is about.

Search engines truncate titles at around 60 characters in their search engine results pages. Any information in subsequent characters will be cut off and users will be less likely to understand what the data is about. Longer titles are currently not penalised by search engines, but having critical information stated upfront in the title will improve the user experience.

What it means

  1. Give your dataset a relevant and meaningful title to help people determine whether it is relevant to their problem or purpose.
  2. If possible, keep titles to less than 60 characters as this is what will be published in search engine results pages. Where this is not possible, make sure the first 60 characters of your title contain enough information to be informative to the user.
  3. Provide a title at a real-world object level in a way that represents what users search for.
  4. Find out if a data resource is known by any other names and include them as Alternative titles to reach a wider audience. Note that more than one alternative title can be provided.
  5. Where your organisation has its own rules for creating meaningful titles, try to apply the guidance in this section as far as possible.

Example of a well-structured title: Postcodes of Great Britain produced from Code-Point, 2020.

Component | Value
What (real-world object) | Postcodes
Place | Great Britain
Source / Product | Code-Point
Currency / Publication date | 2020
Chunky Middle keyword | Postcodes of Great Britain
Number of characters in the title | 57
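The length guidance can be checked with a few lines of code. This is a sketch that treats the 60-character cut-off as exact, although search engines only truncate at around that length. The function name is hypothetical.

```python
TITLE_LIMIT = 60  # approximate truncation point in search results


def title_report(title, limit=TITLE_LIMIT):
    """Report the title length and the part a search result would show."""
    return {
        "length": len(title),
        "fits": len(title) <= limit,
        "visible": title[:limit],
    }


example = "Postcodes of Great Britain produced from Code-Point, 2020"
report = title_report(example)
# report["length"] is 57, matching the count given for the example title.
```

When "fits" is false, read the "visible" text on its own and ask whether it still tells the user what the data resource is about.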

Write a meaningful abstract with the most critical information in the first 150 characters

Why it’s important

The Abstract supports the title in helping people to quickly understand the data and make an initial assessment of its suitability for their use or purpose.

Furthermore, search engines take the first 150 characters of the abstract to display in search results. It is therefore vital that the abstract is front-loaded to provide a useful and concise summary to encourage people to click through and read more about your data.

What it means

1. Write a meaningful description of the data in plain English. Use the ‘5 Ws’ as a framework for this, keeping your end-user in mind:

  • What does the data contain?
  • Where does the dataset refer to?
  • When was the data collected?
  • Why was the data collected?
  • Who collects, manages and publishes the data?

2. Use short, concise sentences which are grammatically correct and free from spelling mistakes. Do not use technical terminology or specialist jargon. Avoid using any acronyms or abbreviations.

3. Make the first 150 characters of the summary as compelling as possible. Front-load the first 150 characters with a good description of the data.

4. Include relevant keyword phrases and the INSPIRE theme of the data naturally within the abstract information.

5. The abstract should not include information about background methodology, source datasets, quality processes and so forth, as these belong in the Lineage element. It should not include anything about copyright or licensing as this belongs in Use constraints.
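To preview how an abstract is front-loaded, the sketch below extracts the first 150 characters and reports which keyword phrases appear within them. The abstract text is made up for this example and is not a real record. The function names are hypothetical.

```python
ABSTRACT_LIMIT = 150


def search_snippet(abstract, limit=ABSTRACT_LIMIT):
    """Return the opening characters after collapsing runs of whitespace."""
    text = " ".join(abstract.split())
    return text[:limit]


def front_loaded(abstract, phrases, limit=ABSTRACT_LIMIT):
    """Return the phrases that occur within the opening characters."""
    snippet = search_snippet(abstract, limit).lower()
    return [phrase for phrase in phrases if phrase.lower() in snippet]


# Hypothetical abstract used only for illustration.
abstract = (
    "Soil chemistry measurements for Glasgow, collected in 2018 by a "
    "hypothetical survey team. " + "Further detail follows here. " * 10
)
snippet = search_snippet(abstract)
found = front_loaded(abstract, ["soil chemistry", "Glasgow", "licence"])
```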

The image below is a fictitious example of a well-written abstract derived from an actual British Geological Survey (BGS) record:

Made-up example of a well-written abstract derived from an actual British Geological Survey record

Use keywords responsibly

Why it’s important

Keywords help to quickly and easily convey information about the data to both users and search engines. The term ‘keyword’ in SEO refers to individual words and longer phrases. When used effectively, keywords help people to quickly determine if the data is relevant to their needs and prompt further investigation. In addition, they improve discoverability and lead to higher rankings in search engine results.

This section applies much of the guidance found in the Search engine optimisation (SEO) for data publishers: Best practice guide. In some instances, this guide tailors the SEO guidance for geospatial metadata.

In GEMINI, keywords have different types such as topics, themes and feature types, and places. These are catered for in the following GEMINI elements:

  • Topic Category is one of a shortlist of general topics;
  • Keyword is generally used for the INSPIRE themes and real-world feature types; and
  • Extent is for keywords that specify the place the data is about.

Tip: You should think about keywords when the data resource is being created. It will help you present the data in a way that is easily understood by your users and will ensure alignment between the data resource and its metadata.

What it means

1. Figure 1 shows the relationship between the number of views and the level of keyword granularity (Fat Head, Chunky Middle and Long Tail). Understanding this is relevant to the rest of the discussion around keywords.

Chart showing the relationship between the number of views and the level of keyword granularity (Fat Head, Chunky Middle and Long Tail)

Figure 1: Search demand curve[footnote 6]

2. Use keywords that your users would use. These should include real-world concepts which are meaningful to your users. Users are more likely to search for these than for product names that are totally unrelated to the contents of the dataset. Where possible, this choice of language should influence words used in titles and summaries.

3. Maximise the value of keywords to improve the search engine rankings for your data by sticking to these best practices:

  • Use keywords naturally and in context. Do not write keywords as a list such as “Keywords: Highways; Roads; Motorways; A Roads” where they are obviously not used naturally in the text. Search engines penalise web pages that try to benefit from listing keywords in this way. It negatively affects the ranking of your webpage which in turn makes it harder for users to find your data resource.
  • Include INSPIRE theme names within your keywords. These are widely understood, plain English names for broad themes such as Addresses, Geology, Protected Sites, etc.
  • Supplement INSPIRE theme names with ‘Chunky Middle’ terms. [footnote 7] For example, end-users looking for data about the property market may be more likely to search for ‘sale price’ or ‘property ownership’ than the related INSPIRE theme ‘Cadastral Parcels’.
  • Where possible, use keywords from controlled vocabularies and online registries as these offer improved standardisation and remove ambiguity across organisations and departments.
  • Terms from multiple registries can be used. For example, INSPIRE themes are a source that is widely used for the Fat Head terms, and a domain-specific vocabulary for the Chunky Middle and perhaps Long Tail terms. Free text is also useful when Long Tail terms are not specified in a controlled vocabulary.
  • If you can access one, use a keyword analysis tool to identify the popular keywords being used for a specific use case. There are many such tools available online.
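The advice against writing keywords as a list can be supported by a simple automated check. This sketch flags text that contains a labelled, semicolon-separated keyword list like the one quoted above. The pattern is a rough heuristic chosen for illustration and does not reflect how search engines detect such lists.

```python
import re

# Matches "Keywords:" followed by two or more semicolon-separated terms.
KEYWORD_LIST = re.compile(
    r"\bkeywords?\s*:\s*[^;.]+(?:;\s*[^;.]+)+",
    re.IGNORECASE,
)


def has_keyword_list(text):
    """Return True if the text appears to contain a labelled keyword list."""
    return KEYWORD_LIST.search(text) is not None


listed = "Keywords: Highways; Roads; Motorways; A Roads"
natural = "Road network data covering motorways and A roads in Great Britain."
```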

Write a detailed lineage

Why it’s important

Lineage explains how the data came into existence and the stages it has passed through. It provides information about the events and source data used in the construction of the data resource. The more information you provide about the history of the data resource, the easier it will be for prospective users to assess whether it will meet their requirements.

What it means

Include information about:

  • source material;
  • data collection methods used;
  • quality control processes (note that any report from these processes should be captured in a ‘Data quality’ element);
  • data processing methods used; and
  • any data collection standards used.

The image below is an example of a well-written lineage from BGS (reference: UKGEOS Glasgow Soil Chemistry 03_18 dataset).

An example of a well written lineage from the British Geological Survey

Be clear on any limitations on public access

Why it’s important

Users need to understand whether they can access the data and if not, why not.

What it means

  1. Understand the sensitivity of your data and whether there are any reasons you should or should not make this available.
  2. Clearly state all limitations on public access. For example: “Unpublished - Not publicly available to preserve personal data rights under GDPR (General Data Protection Regulation). Restricted access may be possible if a legal justification to do so exists.” Note that limitations may be because of the licences under which the data was made available to you. The licences under which you make the data available to others should be captured in the Use constraints element.
  3. Where no limitations exist, this should be stated as a signal that the data resource publisher has considered this element.

Be clear on any use constraints

Why it’s important

Let users know if and how they can use the data as early as possible so that they don’t waste time trying to acquire data which they cannot use.

What it means

  1. Make sure you provide all constraints associated with the data resource. Provide URLs to relevant resources, for example a licence document, which can offer users detailed information about the conditions of use.
  2. If the data is available under a variety of licence options, add a Use constraints element for each option, clearly indicating the circumstances under which each applies.

For example, there may be one licence for public sector use, another licence for private sector commercial use, and another licence for non-commercial use.

Provide accurate date information in the correct date fields

Why it’s important

It tells your users about when the data was produced and the period the data covers. This means they can assess whether the data covers the time period they are interested in.

What it means

Make sure you provide the correct dates for the following fields:

  • Dataset reference date – this date element is used for date of publication as well as the date of revision, for example, first published January 2010, last revised January 2020;
  • Temporal extent – the date or date range that identifies the currency of the data. It may refer to the period of collection, or the date at which it is deemed to be current. For example, data published in January 2020 may have been collected in June 2019; and
  • Metadata date – date on which the metadata was created, last updated, or was confirmed as being up to date. If you are using GEMINI-aware software, that may set the metadata date for you.
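The three date fields can be sanity-checked before publication. The sketch below uses the example dates from the list above, with the day of the month chosen arbitrarily. The class, its field names and the two checks are assumptions made for illustration and are not rules taken from the GEMINI standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RecordDates:
    publication: date          # Dataset reference date (publication)
    revision: Optional[date]   # Dataset reference date (revision), if any
    temporal_begin: date       # Temporal extent start
    temporal_end: date         # Temporal extent end
    metadata_date: date        # Metadata date


def date_problems(dates):
    """Return a list of obvious inconsistencies between the date fields."""
    problems = []
    if dates.revision is not None and dates.revision < dates.publication:
        problems.append("revision date is earlier than publication date")
    if dates.temporal_end < dates.temporal_begin:
        problems.append("temporal extent ends before it begins")
    return problems


good = RecordDates(
    publication=date(2010, 1, 1),
    revision=date(2020, 1, 1),
    temporal_begin=date(2019, 6, 1),
    temporal_end=date(2019, 6, 30),
    metadata_date=date(2020, 1, 1),
)
swapped = RecordDates(
    publication=date(2020, 1, 1),
    revision=date(2010, 1, 1),
    temporal_begin=date(2019, 6, 30),
    temporal_end=date(2019, 6, 1),
    metadata_date=date(2020, 1, 1),
)
```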

Let users know if the data is part of a series

Why it’s important

If the data is part of a series, you should let the user know so they can quickly find other datasets in the series. This will reduce their efforts in finding other data resources that may be useful for their requirements.

What it means

Use Parent ID to link the data to other datasets in the series.

Tip: In the abstract of a dataset, you need to mention that it is part of a series and provide further relevant information about the series.

Define information to describe the spatial location of your dataset - Bounding box and Extent

Why it’s important

Geospatial data is about places on the Earth, so it is important for users to quickly and easily understand where a particular data resource refers to. This information must be provided in the metadata so that users do not have to access the resource itself to find it. This can be done using place names or areas defined by coded geographic identifiers or coordinates.

This information needs to be properly defined because place names, geographic identifiers and coordinates are not unique, and are often not suitable for machine processing. Furthermore, they may not be widely recognised by human readers unless they are backed up by the naming or referencing system that defines them.

What it means

1. Provide Extent information to describe the coverage of the data using well-defined place names or location codes, for example electoral regions, local government areas or national parks. It is preferable to include a unique reference to the item in an online registry of place names.

Examples of online registries are:

2. Express the location of the dataset’s Bounding box using minimum and maximum latitude and longitude coordinate values. Only approximate values are required for the bounding box, though values should be provided to at least two decimal places, which is roughly the nearest kilometre. For datasets that cover smaller areas, you may want to provide a more precise bounding box.

3. The bounding box has to be given in the WGS84 coordinate reference system (identified by the EPSG URI). This means all bounding box extents from a range of datasets can be aggregated, portrayed on a map and used in simple spatial search queries without applications needing to perform a variety of complex coordinate conversion calculations. Various desktop GIS and online tools are available to perform coordinate conversions, for example, https://epsg.io/transform.

4. The West bounding coordinate usually has a value less than the value of the East bounding coordinate. The exception is when the extent straddles the 180 degree meridian (near the International Date Line). Make sure that any data entry validation allows for these cases.
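The bounding box rules above can be sketched as a small validation routine. The class and function names are hypothetical. Rounding outwards, so that the rounded box still encloses the data, is an assumption made here and is not a requirement stated in this guide. For scale, 0.01 degrees of latitude is roughly 1.1 km.

```python
import math
from dataclasses import dataclass


@dataclass
class BoundingBox:
    west: float
    east: float
    south: float
    north: float

    def problems(self):
        """Return range and ordering errors; west > east is allowed."""
        found = []
        if not (-90 <= self.south <= 90 and -90 <= self.north <= 90):
            found.append("latitude outside -90..90")
        if not (-180 <= self.west <= 180 and -180 <= self.east <= 180):
            found.append("longitude outside -180..180")
        if self.south > self.north:
            found.append("south is greater than north")
        return found

    def crosses_antimeridian(self):
        """West exceeds east only when the box straddles 180 degrees."""
        return self.west > self.east


def round_outwards(box, places=2):
    """Round to the given decimal places without shrinking the box."""
    scale = 10 ** places

    def down(value):
        return math.floor(round(value * scale, 6)) / scale

    def up(value):
        return math.ceil(round(value * scale, 6)) / scale

    return BoundingBox(west=down(box.west), east=up(box.east),
                       south=down(box.south), north=up(box.north))


# Edinburgh values from Table 2; the Pacific box is a made-up example.
edinburgh = BoundingBox(west=-3.4524, east=-3.0776, south=55.8199, north=56.0016)
pacific = BoundingBox(west=170.0, east=-170.0, south=-20.0, north=-10.0)
rounded = round_outwards(edinburgh)
```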

Table 2: Examples of bounding box – all approximate

Area | North | South | East | West
Great Britain | 61.12 | 49.02 | 2.19 | -8.75
Great Britain (including 12-mile limit seawards) | 61.42 | 48.43 | 2.41 | -9.42
England | 55.85 | 48.43 | 2.41 | -6.48
Edinburgh | 56.0016 | 55.8199 | -3.0776 | -3.4524

The bounding box will often not be very representative of the true location of the data, for example for a linear or complex extent. In these cases, it can be useful to associate an online resource of a browse graphic image showing a picture of the actual extent. This is not currently part of the GEMINI standard, but the underlying encoding supports it and GEMINI allows any such additional elements. The locations entered as keywords in the Extent element should be as specific as possible about which named places within the large bounding box represent the extent of the data resource.

Provide one or more spatial reference systems

Why it’s important

The spatial reference system identifies the way in which the data is spatially referenced in the dataset and could be in the form of coordinates (such as British National Grid), or as geographic identifiers (such as postcodes). The declaration of the coordinate reference system is as essential for coordinate data as the country and area codes are for a landline telephone number.

For coordinate data, the spatial reference system is a fully defined coordinate reference system which provides the mathematics to relate coordinates to real locations on Earth.

Spatial datasets that use the same spatial referencing system can be easily aggregated and compared. Datasets in different spatial referencing systems first need to be processed and converted to a common referencing system in order to be aggregated and compared. That conversion can only be performed, and data can only be successfully used, if the spatial referencing system is properly described. Even so, be aware that coordinate conversion algorithms introduce another source of uncertainty to the location data.

There are hundreds of different coordinate reference systems, either projected coordinates (eastings and northings) or geographic coordinates (latitude and longitude), which are either local/regional or global in scope. The common global coordinate reference system that gives a best fit for the whole earth is currently WGS84, as used by GPS systems. The Government Digital Service recommends using ETRS89 for points in geographical Europe, and WGS84 for the rest of the world. This is because ETRS89 coordinates keep track of the tectonic movement of the European plate (see GOV.UK’s Exchange of location point). The most widely used online registry of coordinate reference systems is the EPSG registry.

Something to be aware of: You may see a coordinate pair and make an assumption on the coordinate reference system based on locale or domain, but without it being explicitly declared, you would be at risk of getting the real world position very wrong. If you assume WGS84 rather than one of its predecessor systems onshore or offshore UK, you could have positional errors of 100m.

For example, the meridian line at Greenwich Observatory is 102m to the west of the 0 degrees longitude line displayed on a hand-held GPS. If you interpret the axis order the wrong way round, your data could plot in the wrong half of the world. These positional errors can be extremely expensive or dangerous, for example civil engineering projects striking subsurface assets, or offshore wells drilled in the wrong place. When we consider use cases like driverless cars and augmented reality, a positional error of only centimetres could be unacceptable.

What it means

  1. At least one spatial reference system must be provided to indicate how the location data is represented in the data resource.

  2. If the dataset uses geographic identifiers (such as Postcodes, Nomenclature of Territorial Units for Statistics (NUTS), Geohashing), supply a resolvable HTTP-URI that provides more information about the geographic identifier system.
  3. If the dataset uses a coordinate reference system, identify it using an entry in a well-known register such as https://epsg.org.

For example, for ETRS89 use EPSG 4258; for British National Grid, use EPSG 27700. A lot of UK-oriented, GEMINI-aware software will provide these values as a drop-down choice.
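A lookup of this kind can be sketched as below. The ETRS89 and British National Grid codes come from the text above. The WGS84 code (4326) and the OGC-style URI pattern are not stated in this guide and are included as assumptions, so check the GEMINI guidance for the exact identifier form it expects.

```python
# EPSG codes for coordinate reference systems mentioned in this guide.
EPSG_CODES = {
    "ETRS89": 4258,
    "British National Grid": 27700,
    "WGS84": 4326,  # assumption: not stated in the guide
}


def epsg_uri(code):
    """Build an OGC-style URI identifying an EPSG coordinate reference system."""
    return f"http://www.opengis.net/def/crs/EPSG/0/{code}"


etrs89_uri = epsg_uri(EPSG_CODES["ETRS89"])
bng_uri = epsg_uri(EPSG_CODES["British National Grid"])
```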

Tip: If your data resource is provided in more than one spatial reference system, use the Lineage element to inform the users what spatial reference system the data was originally collected in, and which ones were added by conversion from those original coordinates.

Useful resources

GEMINI editors:

  1. The Geo 6 Partners are the Geospatial Commission’s six partner bodies: the British Geological Survey, the Coal Authority, HM Land Registry, Ordnance Survey, the UK Hydrographic Office and the Valuation Office Agency.

  2. The Search engine optimisation (SEO) for data publishers: Best practice guide was produced in the Data Discoverability 2 project and is the source for some of the SEO-related guidance in this document. 

  3. Through INSPIRE (Infrastructure for Spatial Information in the European Community), the EU has created common standards across Europe for 34 spatial data themes. On leaving the EU, the UK has fully incorporated the INSPIRE Directive into UK law. The aim of the UK INSPIRE Regulations 2009 is to improve environmental policy-making at all levels of government by establishing a UK Spatial Data Infrastructure making spatial data available for use and re-use using common standards for data and data services. 

  4. “Harvesting means using software built by data.gov.uk, known as a ‘harvester’, to collect metadata from your website and publish them on data.gov.uk. Using the harvester lets you publish more than one record at once. You can also schedule harvests to keep data.gov.uk up to date.” Obtained from Harvest or add data page on data.gov.uk. 

  5. The number is the element ID in GEMINI. 

  6. Graph derived from The Beginner’s Guide to SEO, Chapter 3 – Keyword Research, published by Moz. 

  7. This is search-engine terminology for the terms that target relevant and refined keywords that attract the greatest number of people with the correct intent whilst minimising the effort required to attain a top position on a search engine result page.