Synthetic Data in Practice - Case Studies

Christoph Fritsch and Fatemeh Aarabi, Produkt + Markt

From concept to concrete application: this technology addresses typical data-related challenges in market research. The use cases of Imputation, Augmentation, Expansion, and Creation demonstrate how the generation of statistically robust data can meaningfully extend and enrich traditional research methods.
In our first article on the topic of "Quantitative Synthetic Data," we looked beyond the hype to explore the concept of synthetic data and its potential. Now, we move from theory to practice. This article examines four powerful use cases—Imputation, Augmentation, Expansion, and Creation—that demonstrate how this technology can be applied to solve specific, everyday challenges in market research.
We start with one of the most common issues: missing data.

Imputation: Precisely Filling Data Gaps

Imputation refers to the targeted replacement of missing values in a dataset. In any real-world data collection, gaps are unavoidable: they arise from survey dropouts, participant fatigue, or simply the refusal to answer sensitive questions (item non-response). While it may seem straightforward to simply delete cases with missing values, this approach can significantly reduce the statistical power of the analysis and, more critically, introduce substantial bias if the non-responses are not entirely random. Simple methods, such as replacing missing values with the mean or median, may fill the gaps but often create new issues by distorting the natural variance of the data and weakening the relationships between variables.

A far more sophisticated solution is offered by modern machine learning-based imputation. Here, an ML model is trained on the complete data and learns the complex, multivariate relationships between all variables. Based on this understanding, it then predicts the most likely value for each missing entry on a case-by-case basis. This preserves the original data structure, including its variance and correlations, ensuring that subsequent analyses are both stable and valid.
Practical Application Example:
Imagine a large-scale tracking study on brand health where the Net Promoter Score (NPS) is a key metric. A small yet significant portion of respondents answers all questions about brand perception and usage but skips the final NPS question. Excluding these cases would result in the loss of valuable data. Through imputation, a model can be trained on the complete responses. It learns how various perception aspects correlate with the NPS values of other respondents. Based on the specific answers to brand perception, the model can then predict a statistically plausible NPS value for those participants who skipped the question. This preserves the entire dataset, stabilizes the data foundation, and ensures that the final NPS calculation is as representative as possible.
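The model-based approach described above can be sketched in a few lines. The following is a minimal, illustrative example using scikit-learn's IterativeImputer on invented data; the variables, distributions, and sample size are assumptions for demonstration, not figures from an actual study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Hypothetical tracking-study data: three brand-perception ratings (1-5)
# plus an NPS score (0-10) that correlates with them.
n = 200
perception = rng.integers(1, 6, size=(n, 3)).astype(float)
nps = np.clip(perception.mean(axis=1) * 2 + rng.normal(0, 1, n), 0, 10)
data = np.column_stack([perception, nps])

# Simulate item non-response: 10% of respondents skipped the NPS question.
skipped = rng.choice(n, size=n // 10, replace=False)
data[skipped, 3] = np.nan

# Model-based imputation: each missing value is predicted from the other
# variables, preserving the multivariate structure of the dataset.
imputer = IterativeImputer(random_state=0)
completed = imputer.fit_transform(data)

assert not np.isnan(completed).any()
```

Instead of dropping the respondents who skipped the question, the imputer predicts a plausible NPS value for each of them from their brand-perception answers, keeping the full sample intact.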
While imputation is the ideal solution for repairing individual, incomplete data points, market research often faces a larger challenge: entire subgroups that are too small for reliable analysis. This is where the next powerful use case comes into play.
 

Augmentation: Mastering the Challenge of Small Sample Sizes

Augmentation is a sophisticated generative technique developed to tackle the critical challenge of insufficient sample sizes, particularly within specific subgroups of a dataset. In market research, the “small N problem” frequently arises, where a central study segment is represented by too few respondents or there is an imbalance between target audiences. This data scarcity renders standard statistical analyses unreliable, making it practically impossible to draw robust conclusions.

The core objective of augmentation is to generate new, synthetic data points of such high quality that they are statistically indistinguishable from those of the genuinely underrepresented group. This extends far beyond simple methods like duplicating datasets. Instead, the model learns not only the characteristics of each variable individually but also the complex, multivariate relationships and dependencies among all variables simultaneously. It learns, for example, how age, geographic location, and expressed attitudes collectively influence purchasing behavior within this specific group.

The augmentation process isolates the actual respondents of the target group (or, alternatively, uses the entire dataset) and trains a generative model exclusively on their data. After training, this model can draw new data points from the learned distribution. These are not mere duplicates of the original data but entirely novel combinations of features that follow the patterns the model has learned. Each synthetic case is thus a plausible, statistically consistent new member of the subgroup. Combining the augmented data with the original dataset enables more meaningful and robust multivariate analyses that would otherwise not be possible.
Practical Application Example:
An automotive manufacturer is conducting a market study for a new electric luxury vehicle. Among the 500 respondents, only 50 participants are identified who both have a high income and currently own an electric luxury vehicle from a competitor—a crucial segment for acquiring new customers. However, this sample of 50 individuals is too small for a comprehensive driver analysis to understand which features motivate a brand switch. Through augmentation, a model is trained based on these 50 actual respondents. It then generates 100 new, synthetic cases exhibiting the same demographic, psychographic, and behavioral key characteristics. With this dataset expanded to 150 cases, the manufacturer can now perform a more reliable driver analysis.
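As a rough sketch of this process, the example below fits a simple generative model (a Gaussian mixture, standing in for the more sophisticated generative models used in practice) on an invented 50-person segment and samples 100 new synthetic cases; all variables and values are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical segment: 50 respondents described by income (kEUR),
# age, and a brand-switch intent score.
segment = np.column_stack([
    rng.normal(150, 20, 50),   # income
    rng.normal(45, 8, 50),     # age
    rng.normal(6, 1.5, 50),    # switch intent
])

# Fit a generative model on the real subgroup only, then sample new,
# statistically consistent cases from the learned joint distribution.
model = GaussianMixture(n_components=2, random_state=0).fit(segment)
synthetic, _ = model.sample(100)

# Combine real and synthetic cases for downstream driver analysis.
augmented = np.vstack([segment, synthetic])
print(augmented.shape)  # (150, 3)
```

The sampled cases are new feature combinations drawn from the learned joint distribution, not copies of existing respondents, which is what distinguishes augmentation from simple duplication or reweighting.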
While augmentation enriches a dataset by synthetically generating more respondents, there is another powerful form of enrichment that operates differently. Instead of adding people, expansion adds knowledge by supplementing each case with new variables, creating a more holistic and meaningful dataset.
 

Expansion: Gaining Deeper Context Through New Dimensions

This powerful technique enables researchers to go beyond respondents' mere answers and gain deeper, more strategic insights. This enrichment can be achieved through two distinct and valuable approaches. The first approach generates new variables by uncovering latent structures within the existing dataset. The second approach links the dataset with an external information source to add new knowledge.

Internal Expansion – Generating New Variables from Existing Data

The most significant insights of a dataset often do not lie in the answers to a single question, but in the complex interplay of multiple responses. Internal expansion analyzes the patterns and interrelationships among several existing variables to create a new, composite variable. This new variable typically represents a higher-level concept or latent construct that would be difficult or impossible to measure directly.
Practical Application Example:
A financial service provider possesses a dataset from a large study with various questions regarding customer perception (e.g., "offers reliable service", "has transparent fees", "resolves issues quickly"). Through internal expansion, the relationships between these variables are analyzed to generate a robust index value for each respondent.
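One common way to derive such a composite index is to extract the first principal component of the related items. The sketch below uses invented rating data and scikit-learn's PCA; the "trust index" name, the item values, and the sample size are hypothetical illustrations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical ratings (1-5) for three related items:
# reliable service, transparent fees, quick issue resolution.
ratings = rng.integers(1, 6, size=(300, 3)).astype(float)

# The first principal component serves as a composite "trust index"
# that weights the items by their shared variance.
scaled = StandardScaler().fit_transform(ratings)
trust_index = PCA(n_components=1).fit_transform(scaled).ravel()

# Append the new, derived variable to each case.
expanded = np.column_stack([ratings, trust_index])
print(expanded.shape)  # (300, 4)
```

Each respondent thus receives a new variable that was never asked directly but summarizes the latent construct behind the individual items.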
A side note: even though internal expansion, such as the creation of indices, has long been a proven practice in market research (also discussed in the first part), it is technically also synthetic data. The difference from more modern approaches lies primarily in the complexity of the calculation methods and the underlying statistical models. Both, however, generate new information that was not directly collected in this form, a key similarity that clearly places internal expansion within the realm of synthetic data generation.

External Expansion – Linking Datasets with External Information

This second approach addresses the well-known challenge of data silos, where valuable information is dispersed across multiple, unconnected studies. External expansion creates a statistical bridge to transfer knowledge from one dataset to another. This allows market researchers to enrich a primary dataset with variables and insights from a completely different study without ever asking those questions to the original respondents.
Practical Application Example:
A trading company possesses data from a large-scale customer satisfaction study, which contains detailed information on purchasing motives and service preferences. Independently, the company regularly collects transaction data from its CRM system, capturing customer behaviors such as purchase frequency and spending per transaction, but lacking deeper attitudinal data. Through external expansion, the deeper attitudes from the satisfaction study can be statistically transferred to the customer data from the CRM. The result is a comprehensive, synthetically enriched dataset where real transactions are complemented by valuable psychographic and motivational dimensions. This provides the company with a more holistic understanding of its customers and enables the derivation of targeted, personalized marketing and sales strategies without needing to directly survey all customers.
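A simple way to sketch this transfer is a "bridge" model: it learns how the attitude measured in the survey relates to the behavioural variables both datasets share, then predicts a synthetic attitude score for each CRM customer. All variable names, distributions, and sample sizes below are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Survey dataset: shared behavioural variables (purchase frequency,
# spend per transaction) plus a directly measured attitude score.
freq = rng.poisson(5, 400).astype(float)
spend = rng.gamma(2, 30, 400)
attitude = 0.4 * freq + 0.02 * spend + rng.normal(0, 1, 400)
survey_X = np.column_stack([freq, spend])

# Bridge model: learns how the attitude relates to the variables
# that both datasets have in common.
bridge = RandomForestRegressor(random_state=0).fit(survey_X, attitude)

# CRM dataset: same behavioural variables, but no attitude data.
crm_X = np.column_stack([rng.poisson(5, 1000).astype(float),
                         rng.gamma(2, 30, 1000)])

# Transfer: predict a synthetic attitude score for every CRM customer.
crm_attitude = bridge.predict(crm_X)
print(crm_attitude.shape)  # (1000,)
```

The quality of such a transfer depends entirely on how well the shared variables explain the attitude in the survey data, which is why the overlap between the two datasets must be chosen carefully.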
While all previously discussed methods start with a primary, respondent-based dataset, Creation breaks this pattern. This technique moves us from refining real data to simulating a complete virtual market, synthesizing information from multiple sources to explore scenarios that have not yet occurred.
 

Creation: Simulating Markets and Exploring Scenarios

Creation is a particularly ambitious approach, whose benefits should be viewed with caution, as it is the furthest removed from real, empirically collected data. Creation involves the complete generation of a synthetic dataset to simulate a market or customer group. This process does not rely on a single primary dataset but synthesizes information from multiple, diverse sources. The result is a virtual environment populated by synthetic consumers, whose collective behavior attempts to mirror the real world. This simulation environment enables companies to conduct risk-free and cost-efficient 'what-if' analyses to test strategic decisions before implementing them in the real world.

However, it is crucial to remember the principle we discussed in our previous post: these models enable a 'retrospective look into the future.' They can only generate outcomes based on the patterns and relationships of the data with which they were trained. They cannot truly predict novel market shocks or unforeseen consumer behaviors.
Practical Application Example:
A mobile phone manufacturer is deciding on the feature set for its next flagship model. It has access to insights on consumer preferences for various attributes such as camera quality, battery life, and screen size, as well as data on price sensitivity. Using Creation, a synthetic market is built. The model generates a population of synthetic consumers whose demographic distribution mirrors the target market. Each synthetic consumer is assigned a preference set that aligns with the underlying research insights. The manufacturer can then introduce different virtual phone configurations into this simulated market in order to optimize its product strategy.
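A toy version of such a simulation can be sketched directly: generate a population of synthetic consumers with preference weights and price sensitivities, then let each one choose between virtual configurations according to utility. Every number and configuration name below is an arbitrary assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

# A population of 10,000 synthetic consumers, each with preference
# weights for camera, battery, and screen, plus a price sensitivity.
n = 10_000
prefs = rng.dirichlet([2, 2, 1], size=n)       # feature weights
price_sens = rng.uniform(0.5, 2.0, size=n)

# Two virtual phone configurations: feature scores (0-1) and a price index.
configs = {
    "camera_flagship": (np.array([0.9, 0.6, 0.7]), 1.2),
    "battery_flagship": (np.array([0.6, 0.9, 0.7]), 1.0),
}

# Each consumer picks the configuration with the higher utility
# (feature appeal minus price penalty): a 'what-if' market share test.
utilities = {name: prefs @ feats - price_sens * price
             for name, (feats, price) in configs.items()}
choice = np.where(utilities["camera_flagship"] > utilities["battery_flagship"],
                  "camera_flagship", "battery_flagship")
share = (choice == "camera_flagship").mean()
print(f"camera_flagship share: {share:.1%}")
```

Swapping in different configurations and rerunning the simulation gives the kind of risk-free scenario comparison described above, within the limits already noted: the result only reflects the assumptions fed into the model.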
Conclusion
The application of quantitative synthetic data provides market researchers with a multitude of valuable opportunities to make their analyses more effective, deeper, and more robust. At Produkt + Markt, we consciously focus on the application areas of imputation, augmentation, and expansion. We believe that reliable insights should always be based on real collected base data to ensure the relevance and timeliness of the insights gained. For this reason, we refrain from purely virtual simulation approaches, as these, despite their methodological appeal, ultimately stray too far from empirically grounded data to provide reliable decision-making foundations for our clients.
Inquiry

Interested? Get in touch!
