
The Quiet Cost of Synthetic Data
By Jodie Shaw

The role of market research has been quietly changing. Product teams that once delayed feature launches to wait for fieldwork now treat those pauses as delivery risk. Marketing leaders who once budgeted months for qualitative research now ask for “directional insight” that fits sprint cycles. Insight teams that once had the authority to stop a decision are now expected to validate decisions that have already cleared internal review.

Artificially generated datasets are being used to simulate consumer behaviour, fill demographic gaps in surveys, generate personas, and justify product and marketing moves. What started as a technical workaround for privacy and data scarcity is increasingly treated as a general substitute for research.

The appeal is obvious. Synthetic data is faster than fieldwork and cheaper than recruitment. It avoids much of the legal and governance friction now attached to real customer data. It produces large samples on demand, formatted to fit modern delivery cycles. It rarely causes delay. It rarely causes trouble.

But this shift is not being driven by a belief that synthetic data offers deeper understanding of people. It is being driven by a belief that waiting for people is now operationally unacceptable.

A privacy fix that escaped its category

Synthetic data did not originate as a way to understand markets. In 1993, Harvard statistician Donald Rubin proposed it as a solution to a government problem: how to release usable population data without exposing real individuals. Agencies were under pressure to make datasets available while re-identification risks were rising. The solution was procedural rather than interpretive. Generate artificial records that preserve the statistical structure of the original data, but contain no actual people.

For decades, synthetic data stayed in that lane. Hospitals used it to test systems without exposing patient histories. Banks used it to stress-test fraud and compliance controls. Engineers used simulated data to train systems in scenarios that were unsafe or impractical to capture in real life. It was an infrastructure tool, deployed where real data was too sensitive, too scarce, or too risky.

Its limits were well understood. Synthetic datasets could reproduce historical distributions and correlations. They could not surface behavioural shifts that had no precedent in the source data. They were designed to scale what was already known, not to discover what was changing.

Modern synthetic data systems are trained on historical datasets and generate new records that preserve the statistical properties of the original data. The output is designed to match distributions, correlations, and co-occurrence patterns while ensuring that no generated record corresponds to a real individual or event. Functionally, this makes synthetic data a replication engine. It recombines existing behaviour at scale. It does not introduce new information. It does not generate new motivations. It cannot explain why people are starting to behave differently.

Where the objective is to stress-test systems, simulate rare events, or train machine-learning models without exposing sensitive information, this works. Where the objective is discovery, it does not.
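To make that distinction concrete, here is a minimal, hedged sketch of what "preserving statistical properties" means in practice. It uses a deliberately simple Gaussian generator and an invented two-column history (visits and spend); it is not any vendor's method, and real systems use far richer models, but the principle is the same: the correlations of the past are reproduced faithfully, and anything absent from the past cannot appear.

```python
# Minimal sketch of synthetic data as a "replication engine".
# Assumes tabular data and a simple Gaussian generator; commercial tools
# (copulas, GANs, diffusion models) are more sophisticated, but the
# principle is the same: the output mirrors the statistics of the input.
import numpy as np

rng = np.random.default_rng(0)

# Invented historical data: two correlated behaviours, monthly visits and spend.
n = 5_000
visits = rng.poisson(6, n)
spend = 20 * visits + rng.normal(0, 15, n)
historical = np.column_stack([visits, spend]).astype(float)

# "Train" the generator: capture the means and covariance of the history.
mean = historical.mean(axis=0)
cov = np.cov(historical, rowvar=False)

# Generate synthetic records: no real individual, same statistical shape.
synthetic = rng.multivariate_normal(mean, cov, size=n)

print("historical correlation:", np.corrcoef(historical, rowvar=False)[0, 1].round(3))
print("synthetic correlation: ", np.corrcoef(synthetic, rowvar=False)[0, 1].round(3))

# A shift that begins *after* the training cutoff -- say, a segment that
# keeps visiting but stops spending -- cannot appear in the synthetic
# sample, because nothing in the fitted statistics encodes it.
```

The two correlations match almost exactly, which is the point: the generator is judged on how well it replicates the history, not on whether the history still describes the market.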

When GDPR slowed internal data sharing across Europe, analytics teams found themselves blocked from moving customer data. Legal review became a prerequisite for reuse. Consent audits stretched timelines from weeks into quarters. Synthetic data allowed work to continue without triggering regulatory risk.

At the same time, organisations compressed decision cycles. Product roadmaps are locked before research begins. Feature backlogs are prioritised ahead of fieldwork. Insight is expected to be incremental, not directional. Research that introduces contradiction is treated as disruption rather than signal.

Synthetic data fits this environment better than real research. It can be generated on demand. It does not require recruitment. It does not challenge assumptions embedded in planning cycles. It produces numbers that align.

That is why it is now used to fill survey gaps, generate personas, simulate buyer behaviour, and validate decisions already underway. Not because it reveals something new, but because it removes the need to wait.


Where synthetic data breaks

Marketing and market research are not concerned primarily with how people have behaved in the past. They are concerned with how people are beginning to behave differently. They are tasked with detecting emerging needs, shifting values, declining trust, and new trade-offs before those changes are fully visible in transactional data.

Synthetic data is structurally unsuited to that task because it is trained on historical inputs. Its outputs preserve yesterday’s behavioural logic and present it as continuity. It can reproduce past correlations at scale, but it cannot surface motivations that have not yet appeared in the data. It cannot detect cultural inflexions that have not yet produced measurable patterns. It cannot explain why consumers abandon a category, lose trust in a brand, or change how they interpret value.

Yet this is the role it is increasingly being asked to play. Artificial personas are now generated from transaction histories. Simulated survey responses are used to fill gaps in underpowered studies. Buyer behaviour is modelled from historical data and presented as future likelihood. In each case, leadership treats replication as if it were discovery.

Product and marketing leaders are choosing not to wait for people. They are choosing coherence over contradiction, roadmaps over recruitment, and they are choosing models that confirm assumptions already locked into planning cycles.

Synthetic data does not push back, surface awkward trade-offs, introduce dissenting signals, or reveal uncomfortable new truths.

When replication replaces discovery, weak signals are filtered out as noise. Early discomfort is treated as statistical variation. Outliers are smoothed away. Deviations from historical norms are absorbed back into modelled continuity.

By the time the error becomes visible, it is no longer diagnosable as a data problem. It shows up as declining relevance, eroding brand trust, or unexpected churn.

The closed loop

The standard defence is that synthetic data can be triangulated with social listening, search data, and behavioural analytics. In practice, that tightens the same loop.

Historical behaviour generates synthetic data. Synthetic data informs strategic assumptions. Those assumptions shape product and marketing decisions. Those decisions shape future behaviour. That behaviour is then fed back into the system as training data.

At no point does human explanation re-enter the system. Organisations stop learning from people and start learning from their own past. The more synthetic data is used, the more that loop hardens. Over time, it becomes structurally harder to notice when customer motivations shift because the models that inform strategy are trained on behaviour that existed before the shift.
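For readers who want to see the loop in miniature, the toy simulation below (plain Python, invented numbers, no particular vendor's workflow) models one common pattern: a small real sample padded each period with a much larger synthetic sample drawn from last period's model, with the model then refit on the blend.

```python
# Toy illustration of the closed loop described above: each period, a small
# real sample is padded with synthetic responses drawn from last period's
# model, and the model is refit on the blend. The estimate that informs
# strategy lags further and further behind the real shift.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)

true_share = 0.10     # real share of customers adopting a new behaviour
model_share = 0.10    # what the model currently "believes"
n_real, n_synth = 200, 9_800   # underpowered fieldwork, padded synthetically

for period in range(1, 9):
    true_share = min(0.9, true_share + 0.08)          # the world moves on

    real = rng.random(n_real) < true_share            # small real sample
    synthetic = rng.random(n_synth) < model_share     # drawn from the model

    # The "evidence" that informs strategy is the blended dataset, so last
    # period's belief dominates the update.
    model_share = np.concatenate([real, synthetic]).mean()

    print(f"period {period}: true share {true_share:.2f}, "
          f"model estimate {model_share:.2f}")
```

The arithmetic is not the point; the structure is. The more heavily the blend is weighted toward synthetic records, the more slowly the system notices that the population it claims to model has moved.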

Who is choosing this

The push toward synthetic data is not coming from insight teams. It is coming from leadership. CMOs favour it because it produces fast evidence for campaigns that are already booked. Product leaders favour it because it fits sprint cycles and does not threaten roadmaps. CFOs favour it because it stabilises research spend and removes variable costs. Procurement favours it because it can be purchased as a platform rather than commissioned as a project. Agencies favour it because it accelerates delivery timelines and protects margins. Internal analytics teams favour it because it expands their remit into strategy.

None of these incentives are irrational. They are simply misaligned with the requirements of discovery.

What synthetic data offers is not better understanding of people. It offers better alignment with organisational tempo, budget logic, and political risk management.

It removes the most destabilising feature of real research: contradiction.


What is being lost

Synthetic data is not going away. It will remain essential in regulated environments, AI training, simulation, and data-sharing contexts where real data cannot be used.

The loss does not come from using it. The loss comes from using it where conversation is required.

When replication replaces discovery, organisations do not just make weaker decisions. They lose the capability to notice when their understanding of customers is no longer current. That capability cannot be rebuilt quickly. It requires recruitment pipelines, qualitative skill, tolerance for contradiction, and leadership willing to pause when findings are inconvenient.

Synthetic data does not fill that gap. It replaces the habit of listening. When leadership finally realises that something fundamental has shifted, the organisation no longer has a reliable way to find out what it was.

That failure does not appear on a dashboard. It appears when a company stops recognising the people it claims to serve.

 

FAQs

What is synthetic data?

Synthetic data is artificially generated data created by algorithms rather than collected from real people or events. It is designed to mimic the statistical patterns of real datasets without containing actual personal information, which makes it attractive for privacy-sensitive or restricted use cases.

How is synthetic data created?

Synthetic data is typically generated using statistical models or machine learning systems trained on historical datasets. These systems learn distributions, correlations, and relationships in the original data and then produce new records that resemble those patterns without reproducing real individuals.

What is synthetic data used for?

It is commonly used for software testing, machine-learning training, simulation, and data sharing in regulated environments. Organisations also use it to fill gaps in datasets, stress-test systems, or enable analysis when real data cannot be accessed due to privacy, legal, or cost constraints.

Is synthetic data as reliable as real data?

Synthetic data can be reliable for replicating known patterns and testing systems, but it cannot reveal new behaviours, motivations, or emerging trends. Its reliability depends entirely on the quality, scope, and relevance of the original data it was trained on.

What are the risks of using synthetic data?

The main risk is treating synthetic data as a substitute for real-world insight. Because it is based on past behaviour, it can reinforce outdated assumptions, miss early signals of change, and create false confidence when used for discovery, strategy, or understanding human decision-making.