Synthetic Data - What's Possible and What's (Still) Not

Method with a Future

Synthetic data is currently a topic of heated debate. While some see it as a game changer for market research, many questions remain: How is such data generated? How reliable is it? And what realistic role can it play in the methodological portfolio of corporate research?

An objective assessment, with a critical look at potential and limitations.

Axel Schomborg and Fatemeh Aarabi, Produkt + Markt

A Buzzword? And the Questions Behind It.

The topic of "synthetic data" has been gaining renewed attention in market research discussions for some time now. The approach is not entirely new; rather, a concept that has been known for a while is currently attracting considerably more interest.

However, there is widespread uncertainty about what this trend actually means for market research. Caught between interest and skepticism, many market researchers are wondering whether synthetic data could soon complement or even replace traditional methods. At the same time, they are critically examining whether artificially generated datasets are truly of sufficient quality to meaningfully enhance surveys and analyses based on real-world data.

For (corporate) market researchers, this raises several key questions:
  • What exactly are synthetic data—beyond the hype?
  • What specific advantages do they offer to market research practice?
  • What are the limitations and risks of this method?
  • Will we soon no longer need traditional market research—or are entirely new opportunities emerging here?

Synthetic Data – A Brief Overview

Synthetic data are artificially generated datasets designed to replicate, supplement, or simulate real data. Unlike traditional market research data, they are not directly derived from surveys or observations of real people or events but are created using statistical models, machine learning methods, or specialized algorithms.

A basic distinction is drawn between qualitative and quantitative synthetic data. Qualitative synthetic data primarily include artificially generated texts or categorical features, such as typical target group statements, often produced with Large Language Models (LLMs). Quantitative synthetic data, on the other hand, are based on numbers and statistical patterns. They are mainly used to replace missing values in datasets, extend samples, or simulate hypothetical scenarios (“what if” analyses).

In corporate market research, synthetic data offer diverse applications, such as efficiently representing hard-to-reach target groups, conducting privacy-compliant analyses, or quickly and cost-effectively testing initial hypotheses before conducting extensive empirical studies.

How Synthetic Data is Created

One aspect is particularly important: "Synthetic" does not mean "random" or "arbitrary". Responsibly generated synthetic data adheres to strict methodological principles. It is derived from statistically validated models that reliably replicate real-world relationships and structures. For market research, this means that synthetic data is not meant to simply replace real surveys but to complement them: it aims to fill data gaps, improve the quality of small samples, and allow new market scenarios to be explored without risk, at a speed and depth that traditional methods alone could not achieve.

To ensure this, various statistical, machine learning, and deep learning techniques are employed. Common to all of them is the idea of replicating real data structures so accurately that the generated data is robust and meaningful for market research purposes.

Four main approaches can be distinguished:
 
  1. Classical statistical methods
    Here, fundamental statistical metrics—such as means, medians, or frequencies—are derived from real data. These metrics then form the basis for generating synthetic data.
  2. Generative Adversarial Networks (GANs)
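    GANs are neural networks consisting of two competing components: a generator that produces artificial data and a discriminator that tries to tell this data apart from real data. Through this competition, the generated data becomes increasingly similar to the real data.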
  3. Large Language Models (LLMs)
    LLMs generate synthetic data by learning and replicating linguistic and textual structures. They are particularly suitable for creating qualitative, contextually meaningful data, such as target group statements.
  4. Machine Learning Methods (ML)
    Machine learning models identify and learn complex statistical relationships within a real dataset. From these models, new synthetic data can be generated that closely resembles the original statistical patterns.
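
As an illustration of the first approach (classical statistical methods), the following minimal sketch derives relative frequencies from a small survey dataset and draws synthetic respondents from them. The data, the variable names, and the helper functions `fit_frequencies` and `generate_synthetic` are assumptions made for this example and do not describe a specific production method. Because whole answer combinations are counted, the relationship between the two variables is preserved.

```python
import random
from collections import Counter

def fit_frequencies(records):
    """Estimate the relative frequency of each observed answer combination."""
    counts = Counter(records)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}

def generate_synthetic(frequencies, n, seed=0):
    """Draw n synthetic records in proportion to the estimated frequencies."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    combos = list(frequencies)
    weights = [frequencies[c] for c in combos]
    return rng.choices(combos, weights=weights, k=n)

# Hypothetical "real" survey data: (age group, uses the product)
real = (
    [("18-29", "yes")] * 30
    + [("18-29", "no")] * 10
    + [("30-49", "yes")] * 20
    + [("30-49", "no")] * 40
)

freqs = fit_frequencies(real)
synthetic = generate_synthetic(freqs, n=1000)
```

The synthetic sample can only contain answer combinations that occur in the real data, which is one concrete way to see why the quality of the source data matters so much.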

Each of these approaches has specific strengths and weaknesses in terms of complexity, control, flexibility, and expressiveness. LLMs, as the name suggests, are primarily focused on linguistic tasks and are therefore a powerful technology for text-based applications. For generating quantitatively robust, statistically controlled synthetic data, machine learning methods present a promising path.

Opportunities for Corporate Market Research

Synthetic data offers a range of concrete advantages in operational market research—especially where traditional methods encounter practical, temporal, or economic limitations.

The application areas can be clustered as follows:
 
  • Imputation refers to the targeted replacement of missing values based on statistical patterns within the dataset.
  • Augmentation and expansion refer to the synthetic extension of existing samples (datasets and/or variables), for example, to enhance modelability, particularly in segmentation and target group analyses.
  • Creation involves generating new data samples based on (various) secondary data.
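
To make the first of these application areas more tangible, here is a deliberately simple imputation sketch: missing values are replaced with the mean of the observed values in the same segment. The field names (`segment`, `satisfaction`) and the function `impute_by_group` are hypothetical, and real projects would typically use richer, model-based imputation.

```python
from statistics import mean

def impute_by_group(rows, group_key, value_key):
    """Fill missing values (None) with the mean of observed values in the same group.

    Falls back to the overall mean if a group has no observed values.
    Assumes at least one observed value exists in the dataset.
    """
    observed = {}
    for row in rows:
        if row[value_key] is not None:
            observed.setdefault(row[group_key], []).append(row[value_key])
    group_means = {g: mean(vals) for g, vals in observed.items()}
    overall = mean(v for vals in observed.values() for v in vals)

    completed = []
    for row in rows:
        filled = dict(row)  # copy, so the original data stays untouched
        if filled[value_key] is None:
            filled[value_key] = group_means.get(filled[group_key], overall)
        completed.append(filled)
    return completed

# Hypothetical survey rows with two missing satisfaction ratings
rows = [
    {"segment": "A", "satisfaction": 4},
    {"segment": "A", "satisfaction": 2},
    {"segment": "A", "satisfaction": None},
    {"segment": "B", "satisfaction": 5},
    {"segment": "B", "satisfaction": None},
]
completed = impute_by_group(rows, "segment", "satisfaction")
```

Filling every gap with a group mean reduces the variance in the data, which is one reason why more careful, pattern-based methods are usually preferred in practice.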

Synthetic data can also be used to simulate hypothetical scenarios. However, in practical application, it is important to understand that such simulations are always based on already collected data. This means that new questions can only be interpreted within the framework of the existing data structure. Thus, it is a "retrospective look into the future." The quality and relevance of the original data are therefore crucial for the significance of synthetic scenarios.
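
The dependence on existing data can be illustrated with a small "what if" sketch. It recomputes an overall usage rate under hypothetical segment shares, using only rates that were actually observed. The segment names, the rates, and the function `simulate_overall_rate` are assumptions for illustration.

```python
def simulate_overall_rate(segment_rates, scenario_shares):
    """Weighted average of observed segment rates under hypothetical segment shares."""
    if abs(sum(scenario_shares.values()) - 1.0) > 1e-9:
        raise ValueError("scenario shares must sum to 1")
    missing = set(scenario_shares) - set(segment_rates)
    if missing:
        # A scenario cannot include segments for which nothing was observed.
        raise KeyError(f"no observed data for segments: {sorted(missing)}")
    return sum(segment_rates[s] * share for s, share in scenario_shares.items())

# Hypothetical usage rates observed in an existing study
observed_rates = {"18-29": 0.75, "30-49": 0.25}

baseline = simulate_overall_rate(observed_rates, {"18-29": 0.5, "30-49": 0.5})
what_if = simulate_overall_rate(observed_rates, {"18-29": 0.75, "30-49": 0.25})
```

A scenario that includes a segment never surveyed (for example "50+") fails here by design: the simulation has nothing to draw on outside the existing data structure.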

The ability to synthesize data can also be very helpful with regard to data protection requirements. If real survey data is fully synthesized—meaning an artificial dataset is created that is statistically comparable but no longer contains original data—this data can be shared with third parties without the risk of personal data exposure.

Where Are the Limits?

As diverse as the possibilities of synthetic data may be, they have clear methodological, practical, and ethical limitations.

A central issue is validity. Synthetic data do not reflect real responses from actual people but are based on patterns, probabilities, and model assumptions. They are only as good as the data on which they are based and can only reproduce what was already present in the underlying information. Truly new, unforeseen insights can rarely be derived from them.

Bias can also persist unnoticed. If the source material contains weaknesses or structural imbalances, these are often not corrected through modeling but are instead systematically reinforced. Without careful examination, there is a risk that synthetic data may present a misleading picture of representativeness.
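
One simple form such an examination could take is a comparison of group shares in the source sample with known population shares before any data is synthesized. The sketch below assumes that reliable population shares are available. The group labels, the figures, and the function `representation_gaps` are hypothetical.

```python
from collections import Counter

def representation_gaps(sample_groups, population_shares):
    """Return sample share minus population share for each group.

    Positive values mean the group is over-represented in the sample,
    negative values mean it is under-represented.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {
        group: counts.get(group, 0) / n - share
        for group, share in population_shares.items()
    }

# Hypothetical source sample and population structure
sample = ["urban"] * 80 + ["rural"] * 20
population = {"urban": 0.6, "rural": 0.4}

gaps = representation_gaps(sample, population)
```

In this example, urban respondents are over-represented by about 20 percentage points. Synthetic data generated from this sample without correction would carry the same imbalance forward.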

Another aspect concerns regulatory requirements. Even if synthetic data are often unproblematic in terms of data protection law, transparency is still crucial: Where do the data come from? How were they generated? What assumptions are they based on? Clients and internal stakeholders must be able to understand how the results are produced, especially when they are incorporated into strategic decisions.

Our Approach at Produkt + Markt – Informed, Differentiated, Practical

At Produkt + Markt, we rely on a methodologically sound, practice-tested approach to synthetic data. Our goal is to employ it where it creates real value for corporate market research—complementing traditional empirical methods, not replacing them.

In the field of quantitative data, our approach is based on the use of machine learning techniques. We deliberately work with ML methods because they allow us to model realistic, robust data structures—without relying entirely on fully trained deep learning models and without the risk of retrospective bias. This approach combines flexibility with methodological transparency and provides a reliable foundation for a wide range of market research questions.
 
We are happy to support you as Data Guides and help unlock the full potential of your projects.
Fatemeh Aarabi – Data Scientist Produkt + Markt

We do not adhere to a rigid standard process. Instead, we view synthetic data as a modular component within a broader methodological offering: flexible in application, tailored to each specific study, and always accompanied by expert consultation. Our solutions are developed in close collaboration with research teams and project leaders – with the aim of meaningfully supporting data-driven decision-making processes, not replacing them.
 
