Enterprise AI term

Synthetic data

Data that is artificially generated rather than obtained from direct measurement of real-world events, designed to preserve selected statistical properties of a source dataset while removing direct identifiers and reducing re-identification risk.

In practice

Synthetic data is produced by statistical models, simulators, or generative neural networks; the statistical and generative models are typically fitted to a source dataset. It is used to augment scarce training samples, to share data with third parties under tighter privacy constraints, to test pipelines without exposing production records, and to stress-test models against rare scenarios. It is not automatically privacy-safe: a generator that overfits its training set can memorize and reproduce source records. Utility for downstream tasks must likewise be demonstrated empirically rather than assumed.
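The leakage risk described above can be checked with a simple distance test. The following is a minimal sketch, assuming numeric records that have already been scaled to comparable ranges; the function names, the sample rows, and the threshold value are hypothetical and chosen only for illustration. A low score does not prove the dataset is privacy-safe; it only flags the most obvious copies.

```python
import math


def distance_to_closest_record(synthetic_row, source_rows):
    """Euclidean distance from one synthetic row to its nearest source row."""
    return min(math.dist(synthetic_row, row) for row in source_rows)


def leaked_fraction(synthetic_rows, source_rows, threshold):
    """Share of synthetic rows lying within `threshold` of some source row."""
    hits = sum(
        1
        for row in synthetic_rows
        if distance_to_closest_record(row, source_rows) <= threshold
    )
    return hits / len(synthetic_rows)


# Hypothetical (amount, items) records; the second synthetic row is an exact copy.
source = [(120.0, 3.0), (75.5, 1.0), (980.0, 12.0)]
synthetic = [(118.2, 2.0), (75.5, 1.0), (640.0, 7.0)]

# Assumed threshold: anything this close is treated as a reproduced record.
print(leaked_fraction(synthetic, source, threshold=0.01))
```

In this toy data one of the three synthetic rows reproduces a source row exactly, so it is counted as a leak. In a real check the threshold would need to be calibrated, for example against distances between held-out real records.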

Worked example

A bank generates a synthetic transactions dataset from its production ledger to share with a fintech vendor during a proof of concept. Before sharing, it measures that a downstream fraud-detection model trained on the synthetic data loses less than two percentage points of accuracy relative to the same model trained on the real data.
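The acceptance test in this example can be sketched as a comparison of two sets of predictions. This is an illustrative sketch only: it assumes both models are scored on the same held-out set of real, labelled transactions, which the example does not specify, and the function names and the toy predictions are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def utility_gap_pp(real_preds, synth_preds, labels):
    """Accuracy lost, in percentage points, by training on synthetic data."""
    return 100.0 * (accuracy(real_preds, labels) - accuracy(synth_preds, labels))


def passes_utility_gate(real_preds, synth_preds, labels, max_drop_pp=2.0):
    """True when the drop stays under the agreed limit (two points here)."""
    return utility_gap_pp(real_preds, synth_preds, labels) < max_drop_pp


# Toy held-out labels: 10 fraudulent (1) and 90 legitimate (0) transactions.
labels = [1] * 10 + [0] * 90

# Hypothetical predictions: the real-data model misses 5 cases,
# the synthetic-data model misses 6.
real_preds = [1 - y if i < 5 else y for i, y in enumerate(labels)]
synth_preds = [1 - y if i < 6 else y for i, y in enumerate(labels)]

print(round(utility_gap_pp(real_preds, synth_preds, labels), 2))
print(passes_utility_gate(real_preds, synth_preds, labels))
```

Here the synthetic-data model is about one percentage point less accurate, so it passes the two-point gate. Accuracy is used because the example names it; on heavily imbalanced fraud data a team might agree on a different metric.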



This definition is maintained by Moweb partners and used in live client engagements. For how synthetic data applies to your estate, or to challenge a working definition, speak to a partner.
