What Is Synthetic Data, and How Does It Help AI?




What is synthetic data?

Synthetic data is artificially generated information that is modeled on real data. It can be structured, as in a database, or unstructured. Either way, the goal is to synthesize it from real datasets so that it resembles the original information without containing it.

Synthetic data can be produced directly from real data, or indirectly from a model of that data, which removes any direct link to identifiable records.
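As a rough illustration of the indirect route, the sketch below fits a very simple model to a handful of records and then samples new rows from the model alone. The sample records, the column meanings and the function names are hypothetical, and the model treats each column as an independent normal distribution, which is a simplifying assumption that ignores relationships between columns.

```python
import random
import statistics

def fit_column_model(rows):
    """Estimate (mean, standard deviation) for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def sample_synthetic(model, n, seed=0):
    """Draw n synthetic rows from the fitted model, never from the real rows."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in model) for _ in range(n)]

# Hypothetical "real" records: (age, systolic blood pressure).
real_rows = [(34, 120.0), (51, 135.5), (47, 128.0), (29, 118.5), (62, 142.0)]

model = fit_column_model(real_rows)               # the only step that touches real data
synthetic_rows = sample_synthetic(model, n=1000)  # generated from the model alone
```

Once the model is fitted, the real rows are no longer needed to generate data. Production generators use far richer models than this, but the separation between fitting and sampling is the same idea.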

In artificial intelligence and business analytics applications, synthetic data can be used in contexts such as simulations and visualizations, which can help identify technical issues in fields such as engineering, financial services and health care. According to a 2021 Accenture report, the National Institutes of Health is using a synthetic dataset to improve its research into the challenges facing COVID-19 patients, without using real patient data.

“While the pandemic has illustrated potential health research-focused use cases for synthetic data, we see potential for the technology in a range of other industries,” writes Fernando Lucini, Accenture’s global head of data science and machine learning engineering.

How does synthetic data work?

Synthetic data, as the name suggests, stands in for real data: it is used in place of real-world records and handled in much the same way.

This approach enables some important use cases. For example, it can be much faster than traditional data collection processes, according to Akhil Docca, senior product marketing manager for NVIDIA’s NGC cloud services platform.

“The idea is that you can create a really great model with synthetic data, import real data, refine it, and you’re in production a lot faster,” Docca said. “So for us, it shortens that time to market, if you will.”
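The workflow Docca describes, building a model on synthetic data and then refining it on real data, can be sketched with a toy example. Everything here is an assumption made for illustration: the linear relationships, the noise levels, the dataset sizes and the helper names are invented, and a one-variable linear model stands in for whatever model a real project would use.

```python
import random

def make_data(rng, n, slope, intercept, noise=0.1):
    """Generate n noisy (x, y) points along y = slope * x + intercept."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        data.append((x, slope * x + intercept + rng.gauss(0.0, noise)))
    return data

def sgd_fit(data, w=0.0, b=0.0, lr=0.05, epochs=200):
    """Fit y = w * x + b with plain stochastic gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def mse(data, w, b):
    """Mean squared error of the line (w, b) on the given data."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
# Plentiful synthetic data from a simulator that only approximates reality.
synthetic = make_data(rng, 500, slope=2.0, intercept=1.0)
# A small real dataset whose true relationship differs somewhat.
real = make_data(rng, 20, slope=2.5, intercept=0.5)

# Step 1: build the model on synthetic data alone.
w_syn, b_syn = sgd_fit(synthetic, epochs=50)
# Step 2: refine it on the real data, starting from the synthetic weights.
w_ref, b_ref = sgd_fit(real, w=w_syn, b=b_syn, epochs=200)

print("error on real data before refinement:", mse(real, w_syn, b_syn))
print("error on real data after refinement: ", mse(real, w_ref, b_ref))
```

The point of the second step is that training continues from the weights learned on synthetic data rather than starting from zero, so the scarce real data is spent on correcting the model rather than building it.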

Nyla Worker, a solutions architect at NVIDIA who focuses on visual simulation and deep learning, says working with synthetic data in this way, called bootstrapping, has great potential, especially for business intelligence. In other cases, synthetic data can effectively supplement the actual data used.

For example, Worker cited the widely reported use of augmented data in training autonomous vehicles. While automakers have worked hard to train vehicles for real-world situations, some conditions are difficult to capture in collected data, which creates the potential for error.

“They’ll have the data maybe in sunny conditions, but not at night, or not in all these other conditions,” she says. “This is exactly where synthetic data would come in and fill in the gaps in your data set.”
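Filling gaps of this kind starts with finding them. The sketch below counts how many real samples exist for each driving condition and works out how many synthetic samples would be needed to reach a target. The condition names, the counts and the target are hypothetical, and the step that would actually render synthetic scenes is left out.

```python
from collections import Counter

def plan_gap_fill(labels, conditions, target_per_condition):
    """Return how many synthetic samples each condition needs to reach the target."""
    counts = Counter(labels)
    return {
        condition: max(0, target_per_condition - counts.get(condition, 0))
        for condition in conditions
    }

# Hypothetical real dataset: plenty of sunny footage, little at night, no snow.
real_labels = ["sunny"] * 900 + ["rain"] * 80 + ["night"] * 20

plan = plan_gap_fill(
    real_labels,
    conditions=["sunny", "rain", "night", "snow"],
    target_per_condition=500,
)
print(plan)  # {'sunny': 0, 'rain': 420, 'night': 480, 'snow': 500}
```

Conditions that are already well covered receive no synthetic samples, so the generated data is concentrated where the real data is thin.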

In some cases, legal restrictions, such as privacy requirements, may prevent access to certain data. Synthetic data can address these problems while preserving data privacy. For example, Microsoft’s AI Lab built a synthetic data generator to detect human trafficking without tracking personally identifiable information related to monitored subjects. This helped create a model that nonprofit organizations could use to assess the impact of human trafficking.

“By using synthetic data, we provide a level of indirection,” Microsoft’s Artificial Intelligence Lab explained on its website.