GOV.UK Data Labs (Cabinet Office): Related Links
Published 1 June 2022
name | tier and category | description | entry (please enter all the required information in this column) |
---|---|---|---|
Name | Tier 1 - Overview | Colloquial name used to identify the algorithmic tool. | GOV.UK Related Links |
Description | Tier 1 - Overview | Give a basic overview of the purpose of the algorithmic tool. Explain how you’re using the algorithmic tool, including: * how your tool works * how your tool is incorporated into your decision making process Explain why you’re using the algorithmic tool, including: * what problem you’re aiming to solve using the tool, and how it’s solving the problem * your justification or rationale for using the tool * how people can find out more about the tool or ask a question - including offline options and a contact email address of the responsible organisation, team or contact person | Related Links is a recommendation engine built to aid navigation of GOV.UK by providing relevant onward journeys from a content page. The tool uses an algorithm called node2vec to train a model on the last three weeks of user movement data (web analytics data). Three weeks of data was used as our experience with user movement data on the site indicated it was sufficient to capture medium and long term trends while ignoring short-lived spikes in user behaviour. The model is used to predict related links for every page. These new related links are published to GOV.UK. The tool is used to help users find useful information and content, aiding navigation. GOV.UK has approximately 600,000 pieces of content. Previously, related links were created manually, with only approximately 2,000 pieces of content having related links. The tool expanded that to nearly the entirety of the GOV.UK content. For further questions: data-products@digital.cabinet-office.gov.uk |
URL of the website | Tier 1 - Overview | If available, provide the URL reference to a page with further information about the algorithmic tool and its use. This facilitates users searching more in-depth information about the practical use or technical details. This could, for instance, be a local government page, a link to a GitHub repository or a departmental landing page with additional information. | https://github.com/alphagov/govuk-related-links-recommender https://apolitical.co/solution-articles/en/machine-learning-government-algorithm https://snap.stanford.edu/node2vec/ https://dataingovernment.blog.gov.uk/2019/06/19/a-or-b-how-we-test-algorithms-on-gov-uk/ |
Contact email | Tier 1 - Overview | Provide the email address of the organisation, team or contact person for this entry. | The Data Products team, Data Services, Product and Technology, the Government Digital Service: products@digital.cabinet-office.gov.uk |
1.1 Organisation/ department | Tier 2 - Owner and responsibility | Provide the full name of the organisation, department or public sector body that carries responsibility for use of the algorithmic tool. For example, ‘Department for Transport’. | GOV.UK - Government Digital Service |
1.2 Team | Tier 2 - Owner and responsibility | Provide the full name of the team that carries responsibility for use of the algorithmic tool. | The Data Products team, Data Services, Product and Technology, the Government Digital Service |
1.3 Senior responsible owner | Tier 2 - Owner and responsibility | Provide the role title of the senior responsible owner for the algorithmic tool. | n/a |
1.4 Supplier or developer of the algorithmic tool | Tier 2 - Owner and responsibility | Provide the name of any external organisation or person that has been contracted to develop the whole or parts of the algorithmic tool. | The tool was built in-house using open source tooling. No external organisation was involved. |
1.5 External supplier identifier | Tier 2 - Owner and responsibility | If available, provide the Companies House number of the external organisation that has been contracted to develop the whole or parts of the algorithmic tool. You can get a company’s Companies House number by finding company information or using the Companies House API. | n/a |
1.6 External supplier role | Tier 2 - Owner and responsibility | Give a short description of the role the external supplier assumed with regards to the development of the algorithmic tool. | n/a |
1.7 Terms of access to data for external supplier | Tier 2 - Owner and responsibility | Details the terms of access to (government) data applied to the external supplier. | n/a |
2.1 Scope | Tier 2 - Description | Describe the purpose of the tool in terms of what it’s been designed for and what it’s not been designed for. This can include a list of potential purposes that the tool was not designed to fulfil but which could constitute possible common misconceptions in the future | The tool was designed to populate most pages on GOV.UK with up to five related links. Every user sees the same links; the related links are not personalised. |
2.2 Benefit | Tier 2 - Description | Describe the key benefits that the algorithmic tool is expected to deliver, and an expanded justification on why the tool is being used. | The benefit of the tool is that it predicts related links for a page. These related links are helpful to users. They help users find the content they are looking for. They also help a user find tangentially related content to the page they are on; it’s a bit like when you are looking for a book in the library, you might find books that are relevant to you on adjacent shelves. |
2.3 Alternatives considered | Tier 2 - Description | Provide, where applicable, a list of non-algorithmic alternatives considered, or a description of how the decision process was conducted previously. | Previous manual effort of deciding related links led to only 2,000 pages on GOV.UK having related links. Those pages that had related links remained static unless manual effort was made to update them. 98% of pages on GOV.UK did not have related links. |
2.4 Type of model | Tier 2 - Description | Indicate which types of methods or models the algorithm is using. For example, expert system, deep neural network and so on. | We used node2vec, a machine learning algorithm that learns network node embeddings. The way users move around GOV.UK is represented as a graph, which is the input to the algorithm. The nodes represent pages and the edges represent user movement (an edge exists if at least 5 “users” moved between those nodes in the last three weeks). The hyperlinks between pages are also included as edges. We train a model using three weeks of user movement data. This model can then be used to predict related links for a page (cosine similarity is used to identify similar pages). See the blog posts and the GitHub repository linked above for more detail. |
2.5 Frequency of usage | Tier 2 - Description | Provide information on how regularly the algorithmic tool is being used. For example the number of decisions made per month, the number of citizens interacting with the tool, and so on. | The tool updates links every three weeks and thus tracks changes in user behaviour. The average click through rate for related links is about 5% of visits to a content page. For context, GOV.UK supports an average of 6 million visits per day (Jan 2022). True volumes are likely higher owing to analytics consent tracking. We only track users who consent to analytics cookies (see https://www.gov.uk/help/cookies for detail). |
2.6 Phase | Tier 2 - Description | Indicate which of the following phases the tool is currently in: idea, design, development, production or retired. This field includes date and time stamps of creation and any updates. | The tool is in production. Date first live: May 2019 |
2.7 Maintenance | Tier 2 - Description | Give details on the maintenance schedule and frequency of any reviews. For example, specific details on when and how a person reviews or checks the automated decision. | Every three weeks, the machine learning algorithm is retrained on the last three weeks of analytics data, and the resulting model outputs related links that are published, overwriting the existing links. We only track users who consent to analytics cookies (see https://www.gov.uk/help/cookies for details). We developed a way for publishers to add, amend or remove a link from the component; on average this happens two or three times a month. This manual intervention can be temporary, or permanent if the page is submitted for the deny list. |
2.8 System architecture | Tier 2 - Description | If available, provide the URL reference to documentation about the system architecture. For example, a link to a GitHub repository image or additional documentation about the system architecture. | https://github.com/alphagov/govuk-related-links-recommender |
3.1 Process integration | Tier 2 - Oversight | Explain how the algorithmic tool is integrated into the decision-making process and what influence the algorithmic tool has on the decision-making process. Give a more detailed and extensive description of the wider decision-making process into which the algorithmic tool is embedded. | The decision process is fully automated. |
3.2 Provided information | Tier 2 - Oversight | Describe how much and what information the algorithmic tool provides to the decision maker. | n/a |
3.3 Human decisions | Tier 2 - Oversight | Describe the decisions that people take in the overall process, including human review options. | People can recommend changes to the related links on a page. There is a process for amending links manually. Links chosen by human experts are preferred to those generated by the model and persist. |
3.4 Required training | Tier 2 - Oversight | Describe the required training those deploying or using the algorithmic tool must undertake, if applicable; For example, the person responsible for the management of the tool had to complete data science training. | The tool is deployed automatically every three weeks; no human is in the loop for deployment. The tool is owned and monitored by the Data Products team in Data Services, Product and Technology, at the Government Digital Service. The team is multidisciplinary and includes data scientists. |
3.5 Appeals and review | Tier 2 - Oversight | Provide details on the mechanisms that are in place for review or appeal of the decision available to the general public. | GOV.UK has a feedback link, “report a problem with this page”, on every page which allows users to flag incorrect links or links they disagree with. Publishers are also able to submit pages for a deny list or temporary removal. |
4.1 Source data name | Tier 2 - Information on data | If applicable, provide the name of the datasets used. | Web analytics data exported from Google Analytics to BigQuery (a data warehouse) for querying. This data was aggregated and provided the “user movement data”. The hyperlinks between pages were also used as input for the training of the algorithm. We only track users who consent to analytics cookies (see https://www.gov.uk/help/cookies for details). |
4.2 Source data | Tier 2 - Information on data | Gives an overview of the data used to train and run the algorithmic tool. It will also specify whether data is used for training, testing, or operating. It should include which categories of data - for example ‘age’ or ‘address’ - which were used to train the model and which are used as input data for making a prediction. | The node2vec algorithm takes user movement data as input. This is derived from our web analytics data of how users move around the site. This data is represented as a graph, where the nodes are pages and the edges are user movement between those pages. Edges are ignored if they have fewer than five movements in the last three weeks. The hyperlinks between pages are also included in this graph as edges. We only track users who consent to analytics cookies (see https://www.gov.uk/help/cookies for details). The data is used to train the model (there are plenty of blogs online that explain how node2vec works). The model is used to predict, for each page, which pages are most similar to it. Using cosine similarity, we produce a sorted list of pages that should be recommended for each page. We use the top five most similar pages as the related links. Some pages cannot be output as related links; these include pages that might be deemed sensitive or inappropriate, such as Air Accidents Investigation Branch reports. A curated deny list is in place, which was compiled using content designer expertise and feedback. The deny list is revisited and can be updated if inappropriate recommendations are observed. |
4.3 Source data URL | Tier 2 - Information on data | If available, provide a URL to the dataset. | n/a |
4.4 Data collection | Tier 2 - Information on data | Gives information on the data collection process, including the original purpose of data collection. | The purpose of conducting web analytics on GOV.UK is to enable GOV.UK / GDS to obtain a comprehensive view of how people interact with GOV.UK and to identify improvements that can make those interactions simpler and easier for users. This is achieved by collecting performance analytics data from GOV.UK visitors that have provided their consent into a GOV.UK-managed Google Analytics account. We only track users who consent to analytics cookies (see https://www.gov.uk/help/cookies for details). In order to achieve this, GOV.UK deploys some Google Analytics code which places cookies on a user’s device and sends the performance analytics data to the GA account. Each cookie placed in a browser is assigned a unique ClientID allowing its activities to be tracked across multiple GOV.UK visits. This data is then analysed to identify potential improvements that can be made to GOV.UK. The GOV.UK web analytics data is used for secondary purposes like related links. We automatically export this data to BigQuery on Google Cloud Platform, a data warehouse. We query this data warehouse and use aggregated data to train a model. See 5.2 for more detail. |
4.5 Data sharing agreements | Tier 2 - Information on data | Provides further information on data sharing agreements in place. | n/a |
4.6 Data access and storage | Tier 2 - Information on data | Provide details on who has or will have access to this data, how long it’s stored, under what circumstances and by whom. | Following on from 4.4, the related links for each page are output as a file which is stored securely in an Amazon Web Services S3 bucket. The data is then published to the GOV.UK website using the GOV.UK publishing API. |
5.1 Impact assessment name | Tier 2 - Risk mitigation and impact assessment | Provide the name and a short overview of the impact assessment conducted. | A Data Protection Impact Assessment exists for GOV.UK Web Analytics generally. Related links makes use of this data and aggregates it, thus it was considered to sit under this DPIA. |
5.2 Impact assessment description | Tier 2 - Risk mitigation and impact assessment | Give a description of the impact assessments conducted. | The impact assessment mentioned in 5.1 concluded that the purpose of conducting web analytics on GOV.UK is to enable GOV.UK / GDS to obtain a comprehensive view of how people interact with GOV.UK and to identify improvements to make those interactions simpler and easier for users. We only track users who consent to analytics cookies (see https://www.gov.uk/help/cookies for details). |
5.3 Impact assessment date | Tier 2 - Risk mitigation and impact assessment | Provide the date in which the impact assessment was conducted. | n/a |
5.4 Impact assessment link | Tier 2 - Risk mitigation and impact assessment | If available, provide a link to the impact assessment. | Internal document DPIA21-4581900. |
5.5 Risk name | Tier 2 - Risk mitigation and impact assessment | Provide an overview of the common risks for the algorithmic tool. | A recommendation engine can produce links that could be deemed wrong, useless or insensitive by users (e.g. links that point users towards pages that discuss air accidents). |
5.6 Risk description | Tier 2 - Risk mitigation and impact assessment | Give a description of the risks identified. | A recommendation engine can produce links that could be deemed wrong, useless or insensitive by users (e.g. links that point users towards pages that discuss air accidents). |
5.7 Risk mitigation | Tier 2 - Risk mitigation and impact assessment | Provide an overview of how the risks have been mitigated. | We added pages to a deny list that might not be useful for a user (such as the homepage) or might be deemed insensitive (e.g. air accident reports). We also enabled publishers or anyone with access to the tagging system to add/amend or remove links. GOV.UK users can also report problems through the feedback mechanisms on GOV.UK. |
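
The graph construction and link selection described in 2.4 and 4.2 can be illustrated with a minimal Python sketch. This is not the code from the govuk-related-links-recommender repository: the function names, the input shapes, the toy embeddings and the treatment of edges as ordered (source, target) pairs are all assumptions made for illustration. The node2vec training step itself is omitted; the sketch assumes embeddings have already been learned and only shows the edge threshold, the inclusion of hyperlinks, and the deny-list-aware top-five selection by cosine similarity.

```python
import math

MIN_MOVEMENTS = 5  # edge threshold stated in 2.4 and 4.2
MAX_LINKS = 5      # "up to five related links" (2.1)


def build_edges(movements, hyperlinks):
    """Return the set of edges used as graph input.

    movements: dict mapping (source_page, target_page) to the number of
        user movements in the training window (hypothetical input shape).
    hyperlinks: iterable of (source_page, target_page) pairs.
    """
    # Keep only movement edges that meet the minimum count.
    edges = {pair for pair, count in movements.items() if count >= MIN_MOVEMENTS}
    # Hyperlinks between pages are always included as edges.
    edges.update(hyperlinks)
    return edges


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_links(page, embeddings, deny_list, k=MAX_LINKS):
    """Return up to k pages most similar to `page`, skipping denied pages."""
    target = embeddings[page]
    scored = [
        (cosine_similarity(target, vector), other)
        for other, vector in embeddings.items()
        if other != page and other not in deny_list
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [other for _, other in scored[:k]]


# Toy data, invented for illustration only.
movements = {("/a", "/b"): 12, ("/a", "/c"): 3}
hyperlinks = [("/c", "/a")]
edges = build_edges(movements, hyperlinks)

embeddings = {
    "/a": [1.0, 0.0],
    "/b": [0.9, 0.1],
    "/c": [0.0, 1.0],
    "/denied": [1.0, 0.0],
}
links = related_links("/a", embeddings, deny_list={"/denied"})
```

With this toy data, the ("/a", "/c") movement edge is dropped because it has fewer than five movements, the hyperlink from "/c" to "/a" is kept, and "/denied" never appears in the output even though its embedding is identical to that of "/a".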