Monitoring the status of your service
By the time you reach public beta, you must have monitoring in place for your service to identify any problems that might affect it.
Monitoring with the right tools and processes allows you to:
- discover any problems that users have
- get alerts when technical problems occur so you can fix them
- anticipate problems before they happen or become more serious
- discover vulnerabilities or the warning signs of an impending cyber security attack
- improve your service, for example by using performance data to help with capacity planning
Plan your monitoring
You should start planning how to monitor your service during alpha.
During alpha, your team should agree:
- what to monitor in your service
- how to monitor your service
- how to process and record issues
Metrics to monitor
You must track user-related metrics, as well as technical and security metrics. For example, track the percentage of users who can complete a task, as well as available disk space, application programming interface (API) performance and memory usage.
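As a minimal sketch of pairing a user-related metric with a technical one, the example below calculates a task completion percentage and the free disk space on a volume. The function names and the sample counts are illustrative assumptions, not part of any particular monitoring product.

```python
import shutil
from typing import Optional


def completion_percentage(started: int, completed: int) -> Optional[float]:
    """Percentage of users who started a task and went on to complete it."""
    if started == 0:
        return None  # no traffic in this period, so there is nothing to report
    return 100.0 * completed / started


def disk_free_percentage(path: str = "/") -> float:
    """Percentage of disk space still available on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


if __name__ == "__main__":
    # Hypothetical counts; in practice these would come from your analytics data.
    print("task completion %:", completion_percentage(started=200, completed=150))
    print("disk free %:", round(disk_free_percentage("."), 1))
```

Returning `None` when nobody started the task keeps a quiet period from being reported as either a perfect or a failing completion rate.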
How to monitor
Once you’ve agreed what to monitor, your team should:
- set up internal and external monitoring checks
- write monitoring checks
- write alerts
Setting up internal and external monitoring checks
You should set up internal and external monitoring checks.
Internal monitoring runs inside your infrastructure and gives you real-time updates about metrics like memory usage, page load times and network traffic.
External monitoring runs outside your service, so it keeps checking your systems even if your infrastructure goes down.
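An external check can be as simple as a script, hosted somewhere other than your own infrastructure, that requests a public page and reports whether it responded. The sketch below assumes a hypothetical health URL and treats any 2xx response as healthy. The `opener` parameter is an illustrative addition so the check can be exercised without a network connection.

```python
import http.client
import urllib.request
from typing import Callable


def check_endpoint(
    url: str,
    timeout: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers connection failures, timeouts and HTTP error statuses
        # raised by urllib; ValueError covers a malformed URL.
        return False


if __name__ == "__main__":
    # Hypothetical address; replace with a page your users actually rely on.
    healthy = check_endpoint("https://service.example.invalid/health")
    print("service reachable:", healthy)
```

Running a check like this on a schedule from a separate provider or network means it can still raise an alert when your own monitoring is unavailable.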
Writing monitoring checks
You need to decide the type of monitoring checks that are most useful to your service.
A monitoring check is a series of tests that you can run against your systems or overall service to assess their status and tell you if something is wrong.
For example, you might decide you need to see an alert if 1% of users in an hour have problems finishing a transaction. You could also capture access control issues, where an excessive number of login attempts may indicate that someone is trying to brute force a password.
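The two examples above could be sketched as simple threshold checks. The 1% figure comes from the example. The limit of 10 failed logins per account is an assumed value for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

FAILED_TRANSACTION_THRESHOLD_PCT = 1.0  # from the example above
MAX_FAILED_LOGINS_PER_ACCOUNT = 10      # assumed limit, tune for your service


def transaction_failure_alert(
    users_started: int,
    users_failed: int,
    threshold_pct: float = FAILED_TRANSACTION_THRESHOLD_PCT,
) -> bool:
    """True if the share of users who failed in the last hour reaches the threshold."""
    if users_started == 0:
        return False  # no traffic, so nothing to alert on
    return 100.0 * users_failed / users_started >= threshold_pct


def suspected_brute_force(
    failed_login_accounts: Iterable[str],
    max_failures: int = MAX_FAILED_LOGINS_PER_ACCOUNT,
) -> List[str]:
    """Accounts with more failed logins in the period than the allowed limit."""
    counts = Counter(failed_login_accounts)
    return sorted(account for account, n in counts.items() if n > max_failures)


if __name__ == "__main__":
    print(transaction_failure_alert(users_started=1000, users_failed=12))
    print(suspected_brute_force(["alice"] * 11 + ["bob"] * 3))
```

Each function takes the counts for one period, such as the last hour, so the same checks can run in automated tests and against live data.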
You should write monitoring checks at the same time as writing code and treat your checks as tests for your live system.
Learn about managing observability to track the security health of your service.
Writing alerts
Make your alert messages clear and concise. They need to be easy to understand for team members who might be woken up in the night to fix a problem.
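One way to keep alerts consistent is to build every message from the same few fields. The sketch below is an assumption about what a short alert might contain (the service, the symptom, the user impact and a first step); the field names and layout are illustrative only.

```python
def format_alert(service: str, symptom: str, user_impact: str, first_step: str) -> str:
    """Build a three-line alert that can be read quickly on a phone."""
    return (
        f"[{service}] {symptom}\n"
        f"User impact: {user_impact}\n"
        f"First step: {first_step}"
    )


if __name__ == "__main__":
    # Hypothetical values for illustration.
    print(
        format_alert(
            service="Apply for a permit",
            symptom="Payment page returning errors",
            user_impact="Users cannot complete payment",
            first_step="Check the payment provider status, then the operations manual",
        )
    )
```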
Consider creating an operations manual or documentation to help your team deal with problems quickly. Make sure every member of your team has a local copy of the documentation in case your cloud-based documentation storage is unavailable.
Processing and recording issues
You should manage and track errors using a ticketing system that allows you to delegate them to members of your team.
Errors contain useful information. They can tell you about:
- a user problem
- attacks on your service
- failing systems
- problems with capacity
Tracking errors helps you to see which ones are recurring and whether they’re part of the overall service or related to a particular application or machine.
You can combine monitoring test results to better understand what to fix in your service. For example, comparing page-loading tests with failed transactions and application errors allows you to:
- find the parts of your service where the most users are having problems
- identify the cause of problems
- discuss how to fix the cause of problems, for example, disk space or slow performance
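As a rough sketch of combining results, the example below takes per-page figures for load time, failed transactions and application errors, then lists the pages with a high failure rate alongside a possible cause. The data structure, thresholds and page paths are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class PageStats:
    page: str
    median_load_seconds: float
    transactions: int
    failed_transactions: int
    application_errors: int


def likely_causes(
    stats: Iterable[PageStats],
    max_failure_pct: float = 1.0,    # assumed threshold
    max_load_seconds: float = 2.0,   # assumed threshold
) -> Dict[str, List[str]]:
    """Map each page with a high failure rate to hints about the cause."""
    flagged: Dict[str, List[str]] = {}
    for s in stats:
        if s.transactions == 0:
            continue
        failure_pct = 100.0 * s.failed_transactions / s.transactions
        if failure_pct <= max_failure_pct:
            continue
        hints = []
        if s.median_load_seconds > max_load_seconds:
            hints.append("slow page loads")
        if s.application_errors > 0:
            hints.append("application errors")
        flagged[s.page] = hints  # an empty list means further investigation is needed
    return flagged


if __name__ == "__main__":
    sample = [
        PageStats("/apply/start", 0.8, 1000, 5, 0),
        PageStats("/apply/payment", 4.2, 800, 40, 0),
        PageStats("/apply/upload", 1.0, 500, 25, 60),
    ]
    print(likely_causes(sample))
```

The hints only show where failures coincide with slow pages or errors. They are a starting point for the team's discussion, not a diagnosis.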
When fixing problems, always make sure to evaluate the security impact of changes being made.
Make data widely available
Unless it’s not safe to do so, you should make monitoring information and data widely available.
For example, you can share performance reports with other service teams in your department or use a status dashboard, like the operations status page used by GOV.UK Notify, to tell users about any issues.
Reviewing your monitoring processes regularly
You should review your monitoring processes to make sure they align with your support obligations and capabilities.
If someone is called out of hours, you should make sure the issue needs that level of response.
For example, if the issue doesn’t affect users and could wait until the morning, consider changing your alert strategy so that type of error doesn’t prompt an alert in future. It may be possible to implement automated responses to issues that don’t require immediate human intervention.
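An alert strategy like this could be written down as a small routing rule. The sketch below assumes two yes or no questions about each issue; the response names and the function are illustrative, not a recommended policy.

```python
from enum import Enum


class Response(Enum):
    PAGE_ON_CALL = "page the on-call team member"
    AUTOMATED = "run the automated response"
    TICKET = "raise a ticket for working hours"


def choose_response(affects_users: bool, has_automated_fix: bool) -> Response:
    """Decide how an issue should be handled when it is detected."""
    if affects_users:
        return Response.PAGE_ON_CALL  # user-facing issues need a person straight away
    if has_automated_fix:
        return Response.AUTOMATED     # no human needed immediately
    return Response.TICKET            # can wait until the morning


if __name__ == "__main__":
    print(choose_response(affects_users=False, has_automated_fix=False).value)
```

Keeping the rule in one place makes it easier to review after each call-out and to adjust if someone was woken for an issue that could have waited.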
Related guides
You may also find the Uptime and availability guide useful.
Last update:
- Integrated elements on Managing observability and Evaluating the security impact of changes.
- Guide revised to include more detailed advice on how to set up monitoring, write checks and alerts, and how to process and record issues.
- Guidance first published