Propaganda poisoning.

The debate around GenAI has been going round in circles for a while now, and much better writers have commented on its usefulness, its likely impact on society, and the potential pitfalls along the way.

One way GenAI can (will?) fail is through its training data, which becomes increasingly hard to curate as ‘botshit’ fills the internet with junk content (reportedly, OpenAI already spends more than 50% of its time trying to remove GenAI-generated content from the data used to train the next generation of ChatGPT).

If the training set is bad, then the output will be bad. Garbage In, Garbage Out. To get a sense of how the training of these models happens, consider this headline: OpenAI Training Bot Crawls ‘World’s Lamest Content Farm’ 3 Million Times in One Day.

In essence, OpenAI’s crawler has been stuck in a loop of self-referential, randomly generated pages with zero content (the URL web.sp.am tells it all!). You can’t expect high-quality output from a training set like this.
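
To see how a crawler can rack up millions of requests against a single worthless domain, here is a minimal sketch (purely illustrative: spam.example, the five-links-per-page rule and the page budget are invented, and nothing is actually fetched). The trick is that the farm generates its pages on the fly, so URL deduplication alone never stops the crawl.

```python
from collections import deque
import random

def fake_content_farm_page(url: str) -> list[str]:
    """Simulate a spam content farm: every page links to five more
    randomly generated pages on the same domain, so the crawl frontier
    never runs dry."""
    rng = random.Random(url)  # deterministic per URL, like server-generated junk
    return [f"https://spam.example/{rng.getrandbits(32):08x}" for _ in range(5)]

def naive_crawl(start: str, budget: int = 10_000) -> int:
    """Breadth-first crawler with URL deduplication but no per-domain cap.
    Dedup does not save it: the farm never serves the same URL twice."""
    seen, frontier = {start}, deque([start])
    fetched = 0
    while frontier and fetched < budget:
        url = frontier.popleft()
        fetched += 1  # pretend we fetched and ingested the page
        for link in fake_content_farm_page(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

if __name__ == "__main__":
    # The crawler happily burns its entire budget on one worthless domain.
    print(naive_crawl("https://spam.example/"))  # -> 10000
```

A per-domain budget or some notion of content quality would cut this off, which is exactly the curation work described above.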

What’s worse, this creates a self-defeating loop. As usual, Doctorow puts the dynamics at play very clearly: “Google and Microsoft’s full-court press for “AI powered search” imagines a future for the web in which search-engines stop returning links to web-pages, and instead summarize their content. The question is, why the fuck would anyone write the web if the only “person” who can find what they write is an AI’s crawler, which ingests the writing for its own training, but has no interest in steering readers to see what you’ve written?”

But one thing I haven’t seen mentioned in the discussions so far is the risk that the content is not just your standard junk of rehashed content, but disinformation or propaganda junk. Imagine OpenAI’s crawler got stuck instead in a loop of millions of self-generated (subtle or not) pro-Russia, neo-capitalist, or negationist pages. What would the biases in the model’s outputs be?

Such ‘data poisoning’ is already used by artists to protect their works from AI models. Propaganda poisoning of GenAI models is probably already happening under the cover of government agencies (for those who doubt that online disinformation by governments is real, I highly recommend Maria Ressa’s memoir, How to Stand Up to a Dictator).
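
To make poisoning at scale concrete, here is a deliberately toy sketch (the sentences are invented, and this is nothing like how frontier models are actually trained): a tiny bigram model trained on a corpus where injected propaganda simply outnumbers the legitimate text.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus: list[str]) -> dict:
    """Tiny bigram 'language model': for each word, count which words follow it."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            follows[a][b] += 1
    return follows

def most_likely_next(model: dict, word: str) -> str:
    """Return the most frequent continuation seen in training."""
    return model[word.lower()].most_common(1)[0][0]

# A small 'clean' corpus...
clean = ["the reports are accurate", "the reports are verified"] * 10

# ...drowned out by machine-generated propaganda injected into the crawl.
poison = ["the reports are fabricated by the enemy"] * 1_000

model = train_bigram_model(clean + poison)
print(most_likely_next(model, "are"))  # -> 'fabricated'
```

Real poisoning attacks are far subtler than brute repetition, but the arithmetic is the same: whoever can flood the training set shifts what the model treats as the most likely answer.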

We know that debiasing AIs is a difficult proposition, and it will only get harder as we scale them. So, with the ability to generate content at scale, we also get the risk of biases and disinformation at scale as more people use - and trust - ChatGPT and its cousins.