The importance of synthetic data in people analytics

According to David Creelman, 'synthetic data' is something more HR analytics professionals are going to have to start using. So isn't it time they knew more about what it does?

May 27, 2024

Synthetic data is an odd concept.

The idea is that if we don’t have enough data, then we just make up a bunch more.

This seems completely implausible and yet it works.

In the world of HR, not only is it a way to get more data, it’s a fantastic way to anonymize data.

In fact, with privacy legislation, people analytics teams may be pushed into using synthetic data purely for anonymization.

So, given that you may be required to use synthetic data for some analytics purposes, it’s a good idea to start familiarizing yourself with the concept.

How to create synthetic data

To create synthetic data you need some specialized tools.

These tools will:

  • Make sense of your existing data: Statistical tools study the existing data and look for patterns
  • Generate new synthetic data: The tools then create a large data set based on the patterns they have learned
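To make those two steps concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular vendor's tool works. It assumes the "patterns" can be summed up as column averages and the relationships between columns, which is a big simplification. The tenure and salary figures are made up for the example, and the function names are hypothetical.

```python
import numpy as np

def fit_patterns(real):
    """Step 1: summarize the real data as column means and a covariance matrix."""
    return real.mean(axis=0), np.cov(real, rowvar=False)

def generate_synthetic(mean, cov, n_rows, seed=0):
    """Step 2: draw brand-new rows that follow the learned patterns."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

# Made-up "real" records for illustration: tenure (years) and salary (thousands)
rng = np.random.default_rng(42)
tenure = rng.uniform(0, 20, size=200)
salary = 40 + 3 * tenure + rng.normal(0, 5, size=200)
real = np.column_stack([tenure, salary])

mean, cov = fit_patterns(real)
synthetic = generate_synthetic(mean, cov, n_rows=5000)  # 200 real rows become 5,000 synthetic ones
```

None of the 5,000 synthetic rows is copied from a real employee, yet the link between tenure and salary carries over. A model this simple can also produce impossible values, such as negative tenure, which is one reason the real tools are considerably more sophisticated.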

Several vendors sell such tools. If you go down this path, you will want to speak to a fair number of them, because the field is still relatively new and vendors’ specializations and capabilities can vary a lot.

Furthermore, it’s not so much the tool as the expertise of the people using the tool that really matters.

So, you will have to find vendors or experts who have a detailed understanding of how to create synthetic data or you run the risk of getting a data set that is inaccurate or not adequately anonymous.
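As a rough illustration of what "inaccurate" and "not adequately anonymous" can mean in practice, the sketch below runs two crude checks on a synthetic data set. It assumes numeric data held in NumPy arrays; the function names and the sample figures are hypothetical. Passing these checks does not prove a data set is accurate or anonymous; a proper assessment goes much further.

```python
import numpy as np

def fidelity_gap(real, synthetic):
    """Accuracy check: largest difference between the two correlation matrices."""
    diff = np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    return float(np.max(np.abs(diff)))

def exact_copies(real, synthetic, decimals=2):
    """Anonymity check: count synthetic rows that reproduce a real row after rounding."""
    real_rows = {tuple(row) for row in np.round(real, decimals)}
    return sum(tuple(row) in real_rows for row in np.round(synthetic, decimals))

# Made-up figures for illustration: tenure (years) and salary (thousands)
real = np.array([[1.0, 50.0], [5.0, 60.0], [10.0, 80.0]])
synthetic = np.array([[1.0, 50.0], [4.2, 58.1], [9.5, 77.3]])

gap = fidelity_gap(real, synthetic)     # small gap: the patterns were preserved
copies = exact_copies(real, synthetic)  # 1: the first synthetic row is a real person's record
```

A small gap suggests the synthetic data kept the relationships found in the real data. Any exact copy is a warning sign that a real individual's record has leaked into the supposedly anonymous set.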

Predictive analytics

The most natural use of synthetic data is for predictive analytics.

As always, when we talk about predictive analytics, the use-case that springs to mind is flight risk.

If you don’t have enough data to do predictive analysis on flight risk, then synthetic data might do the trick.

The big tech companies have found that synthetic data is highly effective in supporting machine learning.

Any time you are using machine learning, you may find synthetic data useful.

Legal compliance

For ethical reasons, we usually want to keep personal data anonymous.

A growing number of regulations also compel organizations to keep personal data anonymous.

If we are analyzing properly prepared synthetic data then we no longer need to worry about violating the law; we can happily share the data with different analysts in different parts of the world to get their insights.

If we are studying patterns, and not making decisions about individuals, then synthetic data may be the way to go.

We still need real data much of the time

While synthetic data allows us to anonymize data for some applications, a lot of the time we need to use real personal data.

This means we will still need rigorous processes for protecting privacy – especially around who can see what data.

These processes will make it difficult to do some analytics, particularly in global companies. But this is simply something analytics professionals will need to learn to manage.

It can feel like a lot of pressure to grapple with strange concepts like synthetic data.

Nevertheless, people analytics professionals do need to take the lead in this area, because no one else is likely to understand the people data and analysis issues well enough to guide the use of synthetic data and stay compliant with the law.