Synthetic data is data that is not obtained from direct measurement but is generated (synthesized) by machine learning techniques from a real dataset, one that itself cannot be shared, e.g. for privacy reasons. In this sense, it can be described as ‘machine learning enabling machine learning’.
The aim of synthetic data is thus to convey the high-level relationships within the data without disclosing single data points (which might describe individuals), thereby enabling privacy-preserving data mining. To deserve the name, synthetic data should:
1. Preserve the utility of the data set, by preserving at least global, but preferably also local characteristics;
2. Protect the privacy of individuals, by preventing the risk of individual (record) disclosure or attribute disclosure.
The promise is that, for most purposes, models can be learned on the synthetic data instead of the real data without significantly reducing the utility of the data or of the resulting models.
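The two requirements and the promise above can be illustrated with a small sketch. It makes several assumptions that are not part of the text: the "real" dataset is a toy table of two correlated numeric attributes, the generator is a plain multivariate Gaussian fit (a deliberately simple stand-in, not the synthesizers evaluated in the reference below), and the function names are hypothetical. It compares a global characteristic (the correlation matrix), counts synthetic rows that exactly copy a real row, and fits the same linear model on both datasets.

```python
import numpy as np

def synthesize_gaussian(real, n_samples, seed=0):
    """Fit a multivariate Gaussian to `real` and sample new rows from it.
    Assumption: a stand-in generator chosen for brevity."""
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

def max_correlation_gap(real, synthetic):
    """Largest absolute difference between the two correlation matrices."""
    diff = np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    return float(np.abs(diff).max())

def count_copied_records(real, synthetic):
    """Number of synthetic rows that exactly duplicate a real row."""
    real_rows = {tuple(row) for row in real}
    return sum(tuple(row) in real_rows for row in synthetic)

# Toy "real" dataset (assumption): two correlated numeric attributes.
rng = np.random.default_rng(42)
x = rng.normal(50.0, 10.0, size=2000)
y = 0.8 * x + rng.normal(0.0, 5.0, size=2000)
real = np.column_stack([x, y])

synthetic = synthesize_gaussian(real, n_samples=2000, seed=1)

# Utility: is the global correlation structure preserved?
gap = max_correlation_gap(real, synthetic)

# Privacy (very crude check): are any real records reproduced verbatim?
copies = count_copied_records(real, synthetic)

# Promise: a model learned on synthetic data resembles one learned on real data.
slope_real, _ = np.polyfit(real[:, 0], real[:, 1], 1)
slope_synth, _ = np.polyfit(synthetic[:, 0], synthetic[:, 1], 1)

print(f"max correlation gap: {gap:.4f}")
print(f"exact copies of real records: {copies}")
print(f"slope on real: {slope_real:.3f}, slope on synthetic: {slope_synth:.3f}")
```

The absence of exact copies is only a sanity check and not a privacy guarantee: assessing record or attribute disclosure risk properly requires dedicated measures. Likewise, a Gaussian fit can capture only global, linear structure, which is why preserving local characteristics is listed as the harder, preferable goal.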
Further reading:
- Hittmeir, M., Ekelhart, A., & Mayer, R. (2019, August). On the Utility of Synthetic Data: An Empirical Evaluation on Machine Learning Tasks. In Proceedings of the 14th International Conference on Availability, Reliability and Security (p. 29). ACM.