DBT: Expenses Fraud Detector
A model to detect policy violations, anomalies and possible fraud within expense and Government Procurement Card (GPC) claims.
Tier 1 Information
1 - Name
DBT Expenses Fraud Detector
2 - Description
The Expenses Fraud Detector tool is being used because the financial governance team receive tens of thousands of expense claim lines each month and it is impossible to manually sift through all of these and find all the fraudulent ones. This tool aims to make that process easier for them. The project takes the DBT expenses data and applies rules to the data using computer code. These rules are the encoded version of the expenses policy and are applied in the way a human would apply them. We will use the outcome of this code to identify those employees who have committed policy violations or fraud, to help the finance team investigate further.
3 - Website URL
N/A
4 - Contact email
ai.governance@businessandtrade.gov.uk
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
Department for Business and Trade
1.2 - Team
Financial Governance Team
1.3 - Senior responsible owner
Deputy Director Risk, Assurance, Partnerships and Financial Governance
1.4 - External supplier involvement
No
Tier 2 - Description and Rationale
2.1 - Detailed description
The base of the tool is written in Python. It consists of the codified policy rules, together with a range of other indicators (such as expenditure on leisure activities) which may indicate that fraud is at play. Each suspicious expenditure will be reviewed by a member of the financial governance team, in conjunction with other available information, in order to make a fully informed decision. The code and its results are easy to follow and understand. The algorithm is entirely rules-based and does not use any machine learning or statistical models. Each rule produces a '1' or a '0' to indicate whether its conditions have been met. We will also look at weighting each rule differently, for example if the financial governance team considers over-expenditure on one type of expense more severe than over-expenditure on another. The resultant DataFrame is surfaced using Streamlit (a Python library used for creating dashboards) and is hosted internally on the DBT Data Workspace. It contains the employee name with the score against their expenditure and details of which rules they have contravened.
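As a rough illustration of the approach (not the actual policy rules), a codified rule and weighted score might look like the following sketch; the column names, thresholds and weights shown are assumptions for illustration only.

```python
import pandas as pd

# Illustrative sketch only: column names, thresholds and weights are
# assumptions, not the actual DBT expenses policy rules.
def apply_rules(expenses: pd.DataFrame) -> pd.DataFrame:
    flagged = expenses.copy()

    # Each rule produces a 1 or 0 to indicate whether its conditions are met.
    flagged["over_meal_limit"] = (
        (flagged["category"] == "Subsistence") & (flagged["amount"] > 25)
    ).astype(int)
    flagged["leisure_spend"] = flagged["category"].isin(
        ["Leisure", "Entertainment"]
    ).astype(int)

    # Hypothetical weights, allowing some violations to count more heavily
    # than others when the per-claim score is calculated.
    weights = {"over_meal_limit": 1, "leisure_spend": 3}
    flagged["score"] = sum(flagged[rule] * w for rule, w in weights.items())
    return flagged
```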
2.2 - Scope
This tool aims to make the process of finding fraudulent claims easier for the Financial Governance Team. It is only going to be used by the Financial Governance Team. The scores and outputs of the tool will always be reviewed by a member of this team, before a decision is made on undertaking any further action.
2.3 - Benefit
The tool is being used because the financial governance team receive tens of thousands of expense claim lines each month and it is not possible for them to sift through all of these claims and find all the fraudulent ones. The tool reduces the time spent spot-checking expenses, allowing more time to be spent on targeted expense claims. The tool also helps save the department money by speeding up the identification of fraud and by not paying out expenses that are fraudulent or in violation of the expenses policies.
2.4 - Previous process
We had a dashboard which looked at general trends in expenses data, with some policy rules applied.
2.5 - Alternatives considered
The Governance team considered using an unsupervised model (an isolation forest) to find patterns in the data, help identify possible fraud, and show which factors contribute most to fraudulent activity. DBT trialled this approach with our data but found that the explainability of the model was poor, making its conclusions less transparent. The model's accuracy scoring was also not consistent enough; given the seriousness of the conclusions, the team felt it better to stick with a more straightforward encoding of the rules rather than a statistical approach, at least at this stage. The team chose the current approach for its simplicity and because there is very little labelled data. The current approach met the needs of the financial governance team without being overly complex. The simplicity of this tool also reduces the risk of erroneous conclusions, as the outcomes are easy to explain.
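For context, a minimal sketch of the unsupervised alternative that was trialled might look like the following; the feature names and parameters are assumptions for illustration, not what was actually used.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative only: feature names and parameters are assumptions.
def score_anomalies(features: pd.DataFrame) -> pd.Series:
    # An isolation forest needs no labelled fraud examples; it assigns lower
    # scores to claims that are easier to isolate (i.e. more anomalous), but
    # it is harder to explain why any particular claim was flagged.
    model = IsolationForest(contamination=0.01, random_state=0)
    columns = ["amount", "claims_per_month"]
    model.fit(features[columns])
    return pd.Series(model.decision_function(features[columns]), index=features.index)
```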
Tier 2 - Decision making Process
3.1 - Process integration
The tool speeds up the process of checking monthly expenses. It will be used alongside the current dashboard (as mentioned in 2.4). United Kingdom Shared Business Services (UKSBS), the public sector shared service centre that collects the expenses data and is owned by the Department for Science, Innovation and Technology, the Department for Energy Security and Net Zero, the Department for Business and Trade and UK Research and Innovation, also flags expenses that are out of policy. Together, these tools enable the financial governance team to make an informed decision on whether an expense claim was fraudulent, and the team makes the final, human decision on whether to take further action.
3.2 - Provided information
The current output is in the form of a table. It provides details of the expenditure and a flag for each of the violation columns to indicate if that rule has been violated or not.
3.3 - Frequency and scale of usage
It has only just been released but we expect it to be used multiple times a month, especially after the month-end expenses data is loaded into the Data Workspace.
3.4 - Human decisions and review
If the tool identifies an expense as suspicious, then the financial governance team will use this in conjunction with the raw data and other information, to make a definitive decision on whether an expense was fraudulent/out of policy and if further action is required from the governance team. No decisions are automated.
3.5 - Required training
The tool has been designed and documented in such a way that minimal training is needed. There are extensive guidance and explanatory notes on the dashboard to explain what the rules are doing.
3.6 - Appeals and review
There is no solely automated processing involved; all decisions are taken after thorough human investigation. Mechanisms for review are in line with DBT expenses policy. Users of the tool are able to provide feedback on its performance to the developers so the tool can be improved.
Tier 2 - Tool Specification
4.1.1 - System architecture
The dashboard is hosted on the Data Workspace, a secure DBT platform which hosts datasets and data analysis tools such as pgAdmin, Python (JupyterLab and a custom IDE) and RStudio. The Data Workspace uses the AWS platform for hosting. The code queries the expenses dataset (which is hosted on the Data Workspace) using SQL and then applies the rules using a Python script. The code is stored on an internal GitLab instance. The visualisation uses Streamlit (a Python package); the code for this is also held on the secure platform and is run inside a Docker container.
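A minimal sketch of this pipeline shape is given below; the connection string, table and column names, and the example rule are placeholders for illustration and do not reflect the production configuration.

```python
import pandas as pd
import sqlalchemy
import streamlit as st

# Placeholder connection string; the real dataset lives on the DBT Data Workspace.
engine = sqlalchemy.create_engine("postgresql+psycopg2://user:password@host/expenses")

# Query the expenses dataset with SQL, then apply the codified rules in Python.
expenses = pd.read_sql("SELECT * FROM expense_claims", engine)
expenses["over_meal_limit"] = (expenses["amount"] > 25).astype(int)  # illustrative rule
expenses["score"] = expenses["over_meal_limit"]  # in practice, a weighted sum of rule flags

# Surface the scored table on the internal Streamlit dashboard.
st.title("Expenses Fraud Detector")
st.dataframe(expenses.sort_values("score", ascending=False))
```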
4.1.2 - Phase
Production
4.1.3 - Maintenance
The tool has only recently been deployed. We will iterate on demand from the stakeholder. There are some graphs and rules still to be implemented. Users of the tool are able to provide feedback on its performance to the developers so the tool can be improved.
4.1.4 - Models
Rules-based model.
Tier 2 - Model Specification
4.2.1 - Model name
Expenses Violation Detector
4.2.2 - Model version
1
4.2.3 - Model task
Analyse the raw expenses data and apply rules to help decide whether an expense was fraudulent
4.2.4 - Model input
Expenses and Government Procurement Card (GPC) data
4.2.5 - Model output
A 1 or 0 for each rule, depending on whether the rule's conditions have been met.
4.2.6 - Model architecture
Rules-based algorithm in Python using features derived from the expense and Government Procurement Card (GPC) data.
4.2.7 - Model performance
As this is not a statistical model, the metrics of precision and recall do not necessarily apply. However, we know that there are some limitations, mainly due to a lack of data (for example, we do not have temperature data for the travel destination, so we cannot test whether weather-specific clothing such as snow boots was bought appropriately). In such cases a member of the financial governance team would have to investigate further and decide whether a certain expenditure was appropriate. It is important to remember that this would be a problem even without the model.
4.2.8 - Datasets
DBT Government Procurement Card (GPC) and expenses dataset
4.2.9 - Dataset purposes
The dataset was used to test and validate the outputs of the algorithm
Tier 2 - Data Specification
4.3.1 - Source data name
Government Procurement Card (GPC) MI Data
4.3.2 - Data modality
Tabular
4.3.3 - Data description
Contains DBT Government Procurement Card (GPC) and expense transactions data
4.3.4 - Data quantities
40,000 rows for development and testing
4.3.5 - Sensitive attributes
employee name, approver name, employee number, approver employee number
4.3.6 - Data completeness and representativeness
All government procurement card and expense transactions from DBT employees submitted in the past 5 years. DBT was formerly the Department for International Trade (DIT), so the dataset contains expense claims submitted by DIT employees in the past 5 years. It does not contain the historic expense data of employees who worked for BEIS prior to 2023, when BEIS was merged with DIT.
4.3.7 - Source data URL
Not openly accessible
4.3.8 - Data collection
Collected by UKSBS via their expenses portal when employees submit their expenses. UKSBS provides business administration services and supplies DBT with a platform to collect and administer expense claims. The data is used for the same purpose for which it was collected: to process expense claims.
4.3.9 - Data cleaning
Date strings are converted to a datetime format.
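A minimal sketch of this step, assuming a hypothetical claim_date column in a UK day-first format:

```python
import pandas as pd

# Illustrative only: the column name and date format are assumptions.
def clean_dates(expenses: pd.DataFrame) -> pd.DataFrame:
    expenses = expenses.copy()
    expenses["claim_date"] = pd.to_datetime(
        expenses["claim_date"], format="%d/%m/%Y", errors="coerce"
    )
    return expenses
```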
4.3.10 - Data sharing agreements
We have a data sharing agreement with the Government Internal Audit Agency. In the case that fraud is found to have occurred, we would pass details of the fraud to the Government Internal Audit Agency.
4.3.11 - Data access and storage
Data is kept for 5 years. Only those in the financial governance team or those in DDaT who are working on the data have access. It is stored on the Data Workspace, a secure data warehouse hosted on the department's AWS platform instance. Employee name and approver name are not used to train the model; they are anonymised during the training and testing stages and are only surfaced at the deployment stage.
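As an illustration of the anonymisation described above (the column names are assumptions), names could be dropped while the rules are developed and re-attached only for the dashboard:

```python
import pandas as pd

# Illustrative only: column names are assumptions.
def anonymise(expenses: pd.DataFrame) -> pd.DataFrame:
    # Keep only the employee and approver numbers during training and testing.
    return expenses.drop(columns=["employee_name", "approver_name"])

def add_names_for_dashboard(scored: pd.DataFrame, directory: pd.DataFrame) -> pd.DataFrame:
    # Re-attach names only at the deployment stage, for display on the dashboard.
    return scored.merge(directory, on="employee_number", how="left")
```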
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
This project has been approved by the Information Risk Assurance Process (IRAP) on 02/07/2024. IRAP is DBT’s process for assessing and assuring risk when working with information. It helps us prevent data leaks and cyber-attacks and advances the department’s compliance with its obligations under data protection law. The process makes sure our projects are compliant with UK General Data Protection Regulation (UK GDPR), cyber security and physical security. It keeps the data that we rely on safe and secure from any threats or risks, protecting DBT and the UK government.
A full Data Protection Impact Assessment was approved for the project on 20/06/2024.
IRAP Approved as low green risk with the following conditions:
- Access controls are followed for data workspace
- The data is removed from Shared SharePoint folder (cloud-based storage using Microsoft) once uploaded to data workspace
- Any further work would require a new IRAP assessment.
5.2 - Risks and mitigations
DPIA risk and mitigation: Risk 1: In the event of a security breach, if the algorithm results show that an employee has, or is suspected to have, committed persistent fraud amounting to thousands of pounds, this could result in a risk of psychological harm to the individual.
Mitigation 1: The project confirmed that it would only process the employee number in the analysis. This reduces the personal data involved, as it is difficult to directly identify DBT staff from the employee number, which is not publicly accessible. The project does not use the employee name when processing the rules for the analysis; the employee name is only used at the final stage so that it appears on the dashboard.
Risk 2: Data subjects may be surprised that their data is being processed in this way. This is a separate activity that takes place after the claim has been made. Data subjects may be contacted if fraud is suspected of having occurred and requested to pay back expenses.
Mitigation 2: When data subjects fill in expenses they agree to the Civil Service travel and expenses policy. The exercise is to find the top employees who are breaking departmental expense policy and/or acting in a suspicious manner, which could possibly indicate that they are committing fraud. These employees will not incur any 'automatic' penalties or repercussions. The outputs will be carefully reviewed by a member of the financial governance team to be certain that fraud has been committed. In a case of fraud specifically, the financial governance team would not take any action immediately, except for notifying the DBT Counter-Fraud team. We would always notify the individual in advance if reimbursement was to be required. Decisions of this nature would fall to the Compliance and Transparency Lead or an individual senior to them in their line management chain. This may result in the individual being unable to access the expenses services.
Risk 3: Risk of false-positives, especially those who submit proportionally more expenses than average (e.g. those who travel a lot).
Mitigation 3: A human will review all outputs of the model before taking any action.