Synthetic data is data that is generated algorithmically, mainly by trained AI, to approximate original data so that it can be used for the same purpose. In the example of fake portraits mentioned previously, the system used AI to learn the properties of real photos of people in order to generate realistic images of human faces. Generally, the algorithm first learns the patterns, correlations, and statistical properties of the real-world sample data. As a result, a trained AI can generate synthetic data that is statistically very similar to real-world data, which is a true gold mine.
The Rise of Synthetic Data in AI
Having good-quality data is at the core of every successful project. It is the most important and most challenging part of building a successful AI. Collecting real-world data is expensive, complicated, and requires a lot of time, patience, and energy. In 2006, British mathematician Clive Humby said, “Data is the new oil,” a phrase that has since spread worldwide and been proven correct. Data holds tremendous value: today, entire industries are powered by data. But not everyone is lucky enough to have sufficient data resources to complete their projects. As a result, many are resorting to creating their own data, which is affordable, effective, and efficient. This is where synthetic data comes in.
In a Gartner report on synthetic data (“Maverick Research: Forget About Your Real Data – Synthetic Data Is the Future of AI”, Leinar Ramos, Jitendra Subramanyam, 24 June 2021), the authors predicted that by 2030, most of the data used in AI would be artificially generated by rules, statistical models, simulations, or other techniques.
The importance of synthetic data in AI is demonstrated by the fact that the most common application of this type of data is AI/ML model training. Machine learning models are increasingly reliant on synthetic data for training purposes, as synthetic data can, in some cases, match or even outperform real-world data and can be critical in developing high-quality AI models.
Another typical application of synthetic data is as test data. For testing purposes, synthetic data that mimics production or operational data is simpler to generate than rule-based test data. Synthetic data is also very useful for stress-testing AI models with rare real-world events or patterns, and it can help reduce biases that are often present in real-world data. It is also critical for data-driven testing and software development.
AI as the Central Engine of Synthetic Data Generation
The most effective method for creating synthetic datasets is using AI. To produce dependable synthetic data, the crucial requirement is to have a sample of original (real-life based) data, which we can feed into our synthetic data generator so that it can learn the data’s statistical properties, such as correlations between variables, distributions, and patterns. The larger the original data sample, the better the quality of the resulting synthetic data.
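To make the idea of “learning statistical properties” a bit more concrete, here is a minimal sketch that uses only the Python standard library. It is not an AI model: it simply estimates the means, standard deviations, and correlation of two columns and then samples new rows from a bivariate normal distribution with those parameters. The “real” height and weight data, the function names, and the normality assumption are all hypothetical choices made for illustration.

```python
import math
import random

def fit_gaussian_2d(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y),
            "corr": cov / (std_x * std_y)}

def sample_synthetic(params, count, rng):
    """Draw new rows from the fitted bivariate normal distribution."""
    (mx, my), (sx, sy), rho = params["mean"], params["std"], params["corr"]
    residual = math.sqrt(max(0.0, 1 - rho ** 2))
    rows = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into y so that x and y keep the learned correlation.
        y = my + sy * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows

rng = random.Random(0)

# Hypothetical "real" sample: height (cm) and a correlated weight (kg).
real = []
for _ in range(500):
    height = rng.gauss(170, 10)
    real.append((height, 0.9 * (height - 100) + rng.gauss(0, 5)))

params = fit_gaussian_2d(real)
synthetic = sample_synthetic(params, 500, rng)
synthetic_params = fit_gaussian_2d(synthetic)

print("real corr:     ", round(params["corr"], 2))
print("synthetic corr:", round(synthetic_params["corr"], 2))
```

The synthetic rows contain none of the original records, yet their summary statistics should land close to those of the input sample. Real generators learn far richer structure than two means and a correlation, but the workflow (fit on real data, then sample) is the same.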
Fundamentally, there are two principal methods for acquiring synthetic data:
- Using generative models
- Adopting conventional techniques, such as utilizing specialized tools and software in conjunction with procuring data from third-party sources
Both approaches can be employed to produce diverse varieties of synthetic data. Among generative models, Generative Adversarial Networks (GANs) are a well-known example. A GAN consists of two sub-models: a generator and a discriminator. The generator’s role is to fabricate artificial data, whereas the discriminator’s objective is to judge whether a given sample is authentic or generated. Since the two sub-models oppose each other, the term “adversarial” is used to describe this approach.
As for conventional techniques: although you can always develop your own AI to generate synthetic data, there are alternative options available if you’re solely interested in the final product or lack the resources or expertise to build your own AI. For instance, the Python programming language has several implemented options for generating synthetic data, as described in this Towards Data Science blog. Additionally, multiple synthetic data generation tools are available, some of which are free to use but with limited resources. One such example is MOSTLY AI, an AI-driven synthetic data generator that provides up to 100k rows of synthetic data per day for free (following registration). The picture below displays some additional examples of synthetic data generation tools, and more information about them is available in this Turing article.
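The generator-versus-discriminator loop can be sketched in a few lines of plain Python. This is a deliberately tiny toy, assuming one-dimensional data, a linear generator, and a logistic discriminator with hand-written gradients; real GANs use neural networks and a deep learning framework. All names and hyperparameters here are hypothetical, and a toy like this tends to oscillate rather than converge cleanly.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def train_toy_gan(real_mean=4.0, real_std=1.0, steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator: fake = a * noise + b
    w, c = 0.1, 0.0   # discriminator: P(real) = sigmoid(w * x + c)
    for _ in range(steps):
        real = rng.gauss(real_mean, real_std)
        noise = rng.gauss(0, 1)
        fake = a * noise + b

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        grad_w = -(1 - d_real) * real + d_fake * fake
        grad_c = -(1 - d_real) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimize -log D(fake), i.e. try to fool D.
        d_fake = sigmoid(w * fake + c)
        grad_fake = -(1 - d_fake) * w
        a -= lr * grad_fake * noise
        b -= lr * grad_fake
    return (a, b), (w, c)

(a, b), (w, c) = train_toy_gan()
print("generator scale and shift:", a, b)
print("discriminator weight and bias:", w, c)
```

The point of the sketch is the alternation: the discriminator is updated to tell real from generated samples, then the generator is updated using the discriminator’s feedback.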
Synthetic Data Applications
Having established the definition, significance, primary purpose, and acquisition methods of synthetic data, we will now go through a brief survey of some real-world instances where synthetic data forms the essential ingredient.
One example of a synthetic data application is Amazon’s famous product Alexa, which I’m sure we have all heard of. Amazon trained Alexa on synthetic data to recognize requests in multiple languages. When Amazon adds a new language to Alexa’s system, the data available for the machine learning model is extremely scarce. That’s where synthetic data kicks in.
Another interesting example of usage is Waymo, which began as Google’s self-driving car project. Waymo uses synthetic data to train self-driving vehicles. They’ve created an environment where a vehicle is trained on labeled synthetic data along with real data to drive safely while recognizing objects and following traffic rules. The created environment imitates both good and bad road situations for self-driving cars to learn from.
Some industries that can benefit from synthetic data most are financial services, healthcare, manufacturing, security, and social media. These industries have already started leveraging synthetic data in significant ways, but there is still a vast untapped potential waiting to be discovered. Ongoing research and exploration into synthetic data applications are crucial to unlocking its full potential.
Let’s look at a slightly different example of a synthetic data application. Andrej Karpathy, a computer scientist who served as the director of artificial intelligence and Autopilot Vision at Tesla, developed an impressive language model named makemore. makemore takes one text file as input, where each line is assumed to be one training object, and generates output objects similar to the provided input. He made the code open-source and very easy to adjust to your own needs. You can also play with this tool on your own, without any hassle, since it’s very user-friendly, and there are even detailed instructions on how to use it published on Karpathy’s official YouTube channel. makemore is one of the generative models discussed in the previous section, and it’s somewhat of a lite version of ChatGPT, which has already become popular worldwide.
In this blog, makemore was used to generate new Bosnian names. This seemed like an adequate example since most new parents and parents-to-be struggle with a creative name choice for their baby. In order to use makemore successfully, the first step was cloning the Git repository into the desired destination folder using the following command in the terminal:
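To get a feel for “one training object per line in, similar objects out”, here is a minimal sketch of a character-level generator. It uses a simple count-based bigram table (each character predicts the next one), which is a much simpler technique than the neural networks trained later in this post. The short list of names, the boundary markers, and the function names are hypothetical and only stand in for a real input file.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical boundary markers, assumed absent from names

def train_bigrams(words):
    """Count how often each character follows another across all words."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        chars = [START] + list(word) + [END]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_word(counts, rng, max_len=20):
    """Generate one word by repeatedly sampling the next character."""
    out, prev = [], START
    while len(out) < max_len:
        followers = counts[prev]
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

# Illustrative training objects, one per entry (one per line in a real file).
training = ["amra", "amila", "emina", "emir", "lejla", "lamija", "mirza", "selma"]

counts = train_bigrams(training)
rng = random.Random(42)
generated = [sample_word(counts, rng) for _ in range(5)]
print(generated)
```

With so little data the output is rough, but it already shows the principle: the model only ever produces character transitions it has seen in the training names.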
git clone https://github.com/karpathy/makemore.git
Now, I was able to view and edit the main code file ‘makemore.py’ to adjust it to my goal. I used the Visual Studio Code editor to edit and test the code, which I strongly recommend since it’s easy to use but also very powerful. However, you can use any Python-friendly editor that suits you. The code did not require a lot of editing (it can be used without any editing at all, which will be explained later), and with the clearly provided instructions on how to use makemore and helpful comments throughout the entire code, the edits took me only a couple of minutes once I had an initial grasp of the code structure. Since I wanted to get new Bosnian names as a result, I needed to provide Bosnian names as input for training as well. I collected a small sample of around 300 Bosnian names for girls and boys and saved it in a text file named ‘imena.txt’. To match this, I made a change in the code as well. The initial code uses English names for training from a default file named ‘names.txt’. I provided a path to the new file ‘imena.txt’ instead of the original default file name ‘names.txt’ in the arguments definition section:
parser.add_argument('--input-file', '-i', type=str, default='/Users/admin/Desktop/Blog-SyntheticData/Code/makemore/imena.txt', help="input file with things one per line")
I also edited the default output working directory path:
parser.add_argument('--work-dir', '-o', type=str, default='/Users/admin/Desktop/Blog-SyntheticData/Code/makemore/output', help="output working directory")
and changed the default maximum number of optimization steps to a value that suited my needs (10,000) after testing it a few times (it was initially set to -1, meaning infinite):
parser.add_argument('--max-steps', type=int, default=10000, help="max number of optimization steps to run for, or -1 for infinite.")
One additional change I made in the code was adding functionality to save new sample names in an output text file where I could review them more clearly. I added this code snippet in print_samples function definition by adding a new for loop, in the end, to write those new name samples in a text file named ‘output_ba.txt’, each new name in a new line:
for new_name in new_samples:
    with open('/Users/admin/Desktop/Blog-SyntheticData/Code/makemore-master/output_ba.txt', 'a') as f:
        if new_name != '':
            f.write(new_name)
            f.write('\n')
After all changes are applied and saved, the file can finally be run using VS Code (or any other dev environment you’d like to use). Another, easier way to run and use makemore is through terminal commands, without editing the initial code. You just need to open the terminal in the destination folder where your files are saved. The main script can be run using the following command:
python3 makemore.py --input-file imena.txt --work-dir output --max-steps 10000
When using the previous command, there is no need to edit the default input and output (working directory) paths or the maximum number of optimization steps in the code; you just define them directly in the command. You can set any other input argument through the terminal in the same way; you just need to check the name of the desired argument as defined in the code. The only thing that will be missing if you run the file from the terminal without editing the code is the output text file with new names, but new names are constantly being printed in the terminal as the code executes, so you won’t miss much. By editing the code, I just wanted to show how user-friendly it is and how easy it is to understand and edit. Note that the terminal command can be slightly different if you have ‘python’ installed instead of ‘python3’, which was used in the example above.
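The way command-line values override the defaults in the code can be seen in a small standalone sketch. It rebuilds only the three arguments discussed in this post with Python’s argparse module; it is not the makemore script itself, and the default ‘out’ for the working directory is an assumed placeholder. Passing a list to parse_args simulates what you would type in the terminal.

```python
import argparse

def build_parser():
    """Rebuild only the three arguments discussed in this post."""
    parser = argparse.ArgumentParser(description="sketch of command-line overrides")
    parser.add_argument('--input-file', '-i', type=str, default='names.txt',
                        help="input file with things one per line")
    # 'out' is an assumed placeholder default, used here only for illustration.
    parser.add_argument('--work-dir', '-o', type=str, default='out',
                        help="output working directory")
    parser.add_argument('--max-steps', type=int, default=-1,
                        help="max number of optimization steps to run for, or -1 for infinite.")
    return parser

parser = build_parser()

# No arguments given: the defaults written in the code are used.
defaults = parser.parse_args([])
print(defaults.input_file, defaults.work_dir, defaults.max_steps)
# names.txt out -1

# Arguments given on the command line replace the defaults.
overridden = parser.parse_args(
    ['--input-file', 'imena.txt', '--work-dir', 'output', '--max-steps', '10000'])
print(overridden.input_file, overridden.work_dir, overridden.max_steps)
# imena.txt output 10000
```

Note that argparse turns the dashes in an option name into underscores, so ‘--input-file’ is read in the code as ‘input_file’.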
After the main file was successfully run, the model started to generate new meaningful names after only a few iterations. And I must admit, most of the new names sound really awesome. Several examples of training and newly generated names are shown in the table below.
To be honest, some names can sound a bit extreme, like džeon, miliš, hekslanjan, or arh, but as the old Latin saying goes: De gustibus non est disputandum (In matters of taste, there can be no disputes!). However, the output names improve significantly with more iterations. The more iterations we run and the larger the input sample we provide, the better the result we can expect. You can also use this model, for example, to generate names for your new business or pet. The possibilities of its use are endless.
Pros and Cons
Synthetic data has significant benefits, including being cost-effective, efficient, easy to generate, (ideally) unbiased, and able to replicate rare or restricted real data. Since this type of data is generated under controlled conditions, we can also significantly improve data integrity, which we’ve discussed in one of our previous blogs, “How to Ensure Data Integrity?”. In our other blog, “Data centric AI / Big data vs. Good data,” we mentioned the importance of good and big data for successful AI models, and synthetic data can be both!
However, there are some downsides, such as the possibility of missing outliers that may contain crucial information and the fact that the quality of generated synthetic data greatly depends on the input data and the generation model. Additionally, users may be reluctant to accept synthetic data, and it cannot completely replicate real-world events.
Overall, synthetic data is an excellent tool for the AI world, but its benefits and drawbacks must be carefully considered before use.