Child sexual abuse material found in AI training dataset: Report

A report released Wednesday by the Stanford Internet Observatory found thousands of pieces of child sexual abuse material in a dataset used to train artificial intelligence image generators.

The researchers previously believed that generative machine learning models created explicit images of minors by combining the systems' understanding of adult pornography with benign, non-sexual photographs of children.

The latest report, “Identifying and Eliminating CSAM in Generative ML Training Data and Models,” revealed that AI systems are also learning to generate these images because they are being trained using child abuse photographs.

“While our previous work has indicated that generative ML models can and do produce Child Sexual Abuse Material (CSAM), that work assumed that the models were able to produce CSAM by combining two ‘concepts,’ such as child and explicit act, rather than the models understanding CSAM due to being trained on CSAM itself,” the report explained.

The researchers sought to determine how much explicit material they could find in the LAION-5B dataset, a massive database of indexed online images used to train AI-powered image generators. The LAION dataset has been used to train Stable Diffusion and Google's Imagen.

Google opted not to make Imagen public after an audit "uncovered a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes."

The LAION-5B dataset was "fed by essentially unguided crawling" and, consequently, "includes a significant amount of explicit material."

“We identified 3,226 dataset entries of suspected CSAM, much of which was confirmed as CSAM by third parties,” the report stated.

For size, liability, and copyright reasons, the database does not contain the actual images themselves. Instead, it stores metadata about the material, including a description and a link to the original image.

The report recommended the removal of the explicit material from the original hosting URLs and the dataset. It noted that the removal of the images is “already in progress” as a result of its findings.

“While the amount of CSAM present does not necessarily indicate that the presence of CSAM drastically influences the output of the model above and beyond the model’s ability to combine the concepts of sexual activity and children, it likely does still exert influence,” the report continued.

The researchers also noted that the amount of explicit material found was "a significant undercount."

The Associated Press reported that the group worked with the Canadian Centre for Child Protection and other charities to report the content to law enforcement and have it removed.

LAION told the news outlet it “has a zero tolerance policy for illegal content and in an abundance of caution, we have taken down the LAION datasets to ensure they are safe before republishing them.”

Stanford Internet Observatory’s chief technologist David Thiel explained that the sexual abuse material went unnoticed because AI projects were “effectively rushed to market,” the AP reported.

“Taking an entire internet-wide scrape and making that dataset to train models is something that should have been confined to a research operation, if anything, and is not something that should have been open-sourced without a lot more rigorous attention,” Thiel explained.

Stability AI, which took over the development of Stable Diffusion, stated that it “has taken proactive steps to mitigate the risk of misuse.” The software company noted that it only hosts filtered versions of the AI product.

“Those filters remove unsafe content from reaching the models,” Stability AI stated. “By removing that content before it ever reaches the model, we can help to prevent the model from generating unsafe content.”

Candace Hathaway is a staff writer for Blaze News.