How your data ends up in AI training sets
On a winter’s day in Moscow when a thick blanket of snow covered the streets, a man slipped as he passed a children’s playground, steadying himself on the bonnet of a parked car.
In a city where average winter temperatures are about -7C, this should have been an entirely unremarkable event. But while the man in the video likely never gave it a second thought, his slip was recorded on one of Moscow’s high-tech surveillance cameras and shared among thousands of gig workers all over the world, one of whom posted it on YouTube. There it provides a fascinating glimpse into the global AI industry and its insatiable thirst for data.
The man slipping on the ice was one of many ordinary people whose images have been posted on Toloka, a Russian gig work platform registered in the Netherlands and Switzerland but popular with workers in the Global South. He had walked past one of the 200,000 or so cameras in Moscow that use computer vision technology, a branch of machine learning in which software learns to automatically recognise patterns in images.
To help AI systems learn to recognise these patterns, gig workers are enlisted to label the images: videos captured by street cameras are constantly fed back to workers so that the system’s performance can be improved. In this instance, workers were asked to tag whether the man in the video “runs”, “rides a bike” or performs “another action”.
A recent investigation from the Bureau of Investigative Journalism (TBIJ) revealed that Toloka had recruited workers to train facial recognition software for Tevian and NTechLab, companies that have supplied surveillance tech to Moscow and other Russian cities. Both have been sanctioned by the EU for their role in “serious human rights violations in Russia, including arbitrary arrests and detentions”.
The man’s face is not blurred, but nor is it particularly identifiable. However, with a reverse image search I quickly found a match for the children’s playground in the background, which featured in a Russian news piece about NTechLab’s surveillance tech, and some possible matches for its precise location in Moscow. In another frame from the same video, we see a completely legible licence plate – demonstrating how easy it is to begin to deanonymise the people in the videos.
While Moscow’s use of facial recognition for policing is extensive, the technology is also used widely by law enforcement in Western countries, particularly in the US. The UK government recently pledged to expand facial recognition to crack down on shoplifting.
Clearview AI and PimEyes, two of the best-known facial recognition companies that have worked with US law enforcement, built their software off the back of vast numbers of images scraped from the internet. Chances are, if there are publicly available photos of you – perhaps in your tagged photos on Facebook or on a long-dormant Flickr account – they will already have been scraped and used to train facial recognition software. Tools such as “Have I Been Trained?” allow people to check if any of their photos have been processed in this way.
And with panic setting in that the generative AI boom might be about to falter because companies such as OpenAI have already tapped most of the public internet to train their models, AI executives are scrambling for fresh sources. Some have signed licensing deals to access data from platforms such as Tumblr and even long-forgotten relics like Photobucket in order to maintain and improve their products.
A rapidly growing industry has also sprung up to broker training data to AI companies with more niche requirements. Datasets available on this marketplace include images of conflict, protest crowds and adult content; recordings of phone conversations on various topics; and social media posts classified by the sentiment they express.
Is there anything we can do to stop our data from being monetised by AI companies and turned into fodder for ever more sophisticated surveillance tools? Wired recently published a guide containing helpful instructions on how to opt out of AI training on platforms such as Google, WordPress and Tumblr – while also being realistic about the fact that the horse has already bolted.
“Any companies building AI have already scraped the web, so anything you’ve posted is probably already in their systems. Companies are also secretive about what they have actually scraped, purchased, or used to train their systems,” the guide says.
In other words, our personal data is constantly being harvested, monetised, shared and reshared without our knowledge or consent. While this process is usually invisible, the man who slipped in the street in Moscow and the Pakistani gig worker who uploaded the video to YouTube gave us a rare peek inside the machine.
Reporter: Niamh McIntyre
Tech editor: Jasper Jackson
Deputy editors: Katie Mark and Chrissie Giles
Editor: Franz Wild
Production editor: Frankie Goodway
Fact checker: Ero Partsakoulaki
This story was produced in partnership with the Pulitzer Center’s AI Accountability Network. Our reporting on Big Tech is funded by Open Society Foundations. None of our funders have any influence over our editorial decisions or output.