It’s April 2020. Italy has shut its borders and anecdotal reports are driving global concern that Black and South Asian populations carry a higher mortality risk from COVID-19. Since death registrations don’t contain ethnicity and existing methods of adding ethnicity had a several-week lag, confirming or refuting this seemed as crucial as it was out of reach. The one dataset able to provide this information was the census.
For the past 200 years, the census has been the primary way of providing an accurate picture of society at national and local levels, but pressures on people’s time, women’s shift to the workplace and the increase in spam calls have meant that the Office for National Statistics (ONS) has needed to change the way national data is collected and processed.
However, whilst this ad hoc case study illustrated the importance of data linkage, Cleaton underlined the sustained need to scale data linkage in a way that is less reactive, bespoke and resource-intensive. Amidst the international trend of fewer people responding to surveys, analysts need to be able to accurately combine pre-existing data sets, including administrative data.
Administrative data is the data collected to enable provision of a service and, according to Cleaton, it is a major untapped resource that can be used to fill the widening holes in census and survey data. A joined-up approach to administrative data across the public sector is a core pillar of both the UK Statistics Authority’s five-year Statistics for the Public Good strategy and of the Department for Science, Innovation and Technology and the Department for Digital, Culture, Media & Sport’s National Data Strategy.
Rising to the challenge, Cleaton and her fellow ONS data engineers are finding innovative ways of transforming admin data, which can be anything from tax records, benefits and education data to information from utility suppliers, into a format that allows it to be used for statistical insight.
“Beyond the challenge of multi-party coordination, connecting disparate administrative datasets with different unique identifiers, such as NHS numbers or National Insurance Numbers, can be complex and time-consuming. Data linkage bridges this gap by utilising demographic information like names, addresses, date of birth, sex, and ethnicity to join records accurately. This process is crucial for informing evidence-based decision-making,” Cleaton noted.
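To make the idea concrete, here is a minimal sketch of the simplest case: joining two sources that share no common identifier by matching on cleaned-up demographic fields. The field names, records and the `match_key` and `link_exact` helpers are hypothetical, chosen for illustration only, and are not drawn from any ONS pipeline.

```python
from datetime import date

def match_key(record):
    """Build a comparable key from demographic fields (hypothetical field names)."""
    name = " ".join(record["name"].lower().split())        # collapse case and spacing
    postcode = record["postcode"].replace(" ", "").upper()  # ignore postcode formatting
    return (name, record["date_of_birth"], postcode)

def link_exact(left, right):
    """Pair records whose normalised keys agree exactly and are unique on the right."""
    index = {}
    for rec in right:
        index.setdefault(match_key(rec), []).append(rec)
    links = []
    for rec in left:
        candidates = index.get(match_key(rec), [])
        if len(candidates) == 1:  # skip ambiguous keys rather than guess
            links.append((rec["id"], candidates[0]["id"]))
    return links

# Two made-up sources with different identifiers (H... and T...).
health = [
    {"id": "H1", "name": "Ann  Example", "date_of_birth": date(1990, 12, 10), "postcode": "ab1 2cd"},
    {"id": "H2", "name": "Bob Sample", "date_of_birth": date(1985, 6, 23), "postcode": "EF3 4GH"},
]
tax = [
    {"id": "T7", "name": "ann example", "date_of_birth": date(1990, 12, 10), "postcode": "AB1 2CD"},
    {"id": "T9", "name": "Bob Sample", "date_of_birth": date(1985, 6, 24), "postcode": "EF3 4GH"},
]

links = link_exact(health, tax)
```

In this sketch the first pair links despite differences in case and spacing, while the second does not because the dates of birth differ by a day. Exact matching of this kind misses records with typos or recording errors, which is where weighing evidence across several fields becomes useful.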
Combining advances in data science and computing capability, automated data linkage underlies the Reference Data Management Framework (RDMF), which provides high-quality reference data and indexing services about businesses, addresses, people, classifications and geographies, and brings the best practices of software engineering to ONS’ data teams.
As part of the drive towards automating and scaling data linkage, Cleaton highlights how the Reproducible Analytical Pipeline (RAP) paradigm is enhancing the traditional bespoke method of analysis, including data linkage. By incorporating elements of software engineering, data linkage pipelines can be made reproducible, auditable, efficient, and high-quality.
She underscored, “The bespoke approach of data linkage often takes months, but through utilisation of RAP skills and sharing of best practice in methodology and coding through informal communities of practice, ONS’ data teams are increasingly accelerating their ability to unlock data benefits to their customers, users and wider society.
“The most exciting digital transformation I’ve seen has been the subtle but impactful shift in upskilling of analysts and engineers, as they move away from tedious and repetitive work on bespoke scripts and towards more efficient coding methods that free up time for innovative research and experimentation. As analysts develop their coding skills, we’re starting to see cases where that dedicated coding and design time has been able to automate months of manual data input.”
Having spent much of her early career as an academic researcher, Cleaton highlighted her enthusiasm for developing individuals from statistical, social research or economics backgrounds by tapping into their creative potential. Doing so “brings us closer to unlocking the full value of citizens’ data and helping society plan for the future,” she added.
Alongside her mentoring role at the ONS, Cleaton has been spearheading the modernisation of the RDMF’s Inter-Departmental Business Register.
Cleaton’s primary responsibility lies in driving the transformation of the RDMF’s Business Index. Through the application of open-source software, Cleaton and her colleagues have been able to leverage deterministic and probabilistic data linkage, that is, automated linkage methods that rely on the balancing of evidence to determine whether entities in disparate data systems can be considered the same.
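One common way to formalise this “balancing of evidence” is the Fellegi-Sunter approach, in which each field that agrees or disagrees adds or subtracts a weight. The sketch below is an illustration under stated assumptions rather than the ONS method: the fields, the m and u probabilities, the threshold and the `match_weight` and `is_match` helpers are all hypothetical.

```python
import math

# Illustrative probabilities per field (assumptions, not ONS or Splink values):
#   m = P(field agrees | records refer to the same business)
#   u = P(field agrees | records refer to different businesses)
PARAMS = {
    "name": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "phone": (0.80, 0.001),
}

def match_weight(a, b):
    """Sum log2 likelihood ratios: agreement adds evidence, disagreement subtracts it."""
    total = 0.0
    for field, (m, u) in PARAMS.items():
        if a[field] == b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def is_match(a, b, threshold=5.0):
    """Treat a pair as the same entity when the combined evidence clears the threshold."""
    return match_weight(a, b) >= threshold

# Made-up business records from three hypothetical sources.
register = {"name": "acme bakery", "postcode": "AB12CD", "phone": "0123456789"}
tax_return = {"name": "acme bakery", "postcode": "AB12CE", "phone": "0123456789"}
neighbour = {"name": "zenith cycles", "postcode": "AB12CD", "phone": "0987654321"}

same_business = is_match(register, tax_return)
different_business = is_match(register, neighbour)
```

Here a mistyped postcode counts against the first pair, but agreement on name and phone outweighs it. The second pair shares only a postcode, which is weak evidence because many distinct businesses share one. In practice the m and u values would be estimated from the data rather than set by hand.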
“The modernisation of the Inter-Departmental Business Register (IDBR) has needed to adapt to the rapid expansion of small, online businesses, crucial to understanding the job market and industrial growth. Machine learning has been a major driver of the probabilistic data linkage we use. The Ministry of Justice’s award-winning ML model, Splink, has made it possible for us to update records near real-time for researchers and policymakers at a scale of over 100 million records.”
The pandemic highlighted the fundamental importance of data linkage in revealing new insights that have the potential to inform decision-making up and down the country.
Just as Cleaton and her team are connecting siloed datasets to reveal this insight, the de-siloing of software engineering skills and analytical knowledge is nurturing creativity that shows promise in building an innovative and resilient public sector.