
SA authors’ books ‘stolen’ to train AI bots

Acclaimed SA author Zakes Mda is one of those whose novels were used to train artificial intelligence tools

During Trump’s first term, AI was still finding its footing. Now, it’s practically marching down the runway in a flashy outfit, demanding attention. Stock image. (123RF/SEMISATCH)

Books by Nobel Prize winner Nadine Gordimer, Zakes Mda and other South African authors have allegedly been “stolen” to help artificial intelligence bots churn out texts demanded by users.

In what has been described as the biggest-ever act of copyright violation, more than 183,000 books from around the world were allegedly pirated for the US-based “Books3” dataset and used to train generative AI tools for corporations such as Meta.

Authors and publishers are now fighting back, including the US Authors Guild and 17 best-selling authors who filed a class-action suit against OpenAI and its ChatGPT bot.

The pirated books appear to have been illegally downloaded via BitTorrent, a peer-to-peer file-sharing protocol widely used to distribute pirated books and films. A database of the pirated books used was published by a US journalist.

Mda was shocked to learn that his book The Heart of Redness was among them.  

“Oh my goodness! I knew generally and vaguely that books were being used illegally to train AI. But I had no idea that my books specifically were used. Nor Nadine’s. Of course, I am quite outraged,” he said.

The theft of books for training AI raises ethical questions about respecting intellectual property rights and highlights the need for stricter regulations and enforcement mechanisms in the digital age. It is essential for society to find a balanced approach that allows innovation while upholding authors' rights and ensuring fair compensation for their work.

—  Kelly Ann Mawa

Mda said he would take “drastic steps” through any available platform, including the Authors Guild, to get compensation from “these thieves of my intellectual property”.

“This is a double whammy for me because I have just returned from Sweden where I discovered one of my novels has been republished without my permission by a British publisher since 2019 and I never received a cent for it,” he said. 

Webber Wentzel attorney and partner Carla Collett said that under South African law, the unauthorised use, reproduction or adaptation of the authors’ books in South Africa would amount to copyright infringement.

“There is also the possibility that use, reproduction or adaptation of the authors’ books outside of South Africa would amount to copyright infringement, given that the Berne Convention provides for the concept of ‘national treatment’ for the benefit of its member states.”

Collett said authors could potentially sue the person behind the creation of the dataset in the country in which that person committed copyright infringement. 

“The South African Copyright Act, which is now 45 years old, certainly did not contemplate artificial intelligence technologies when it was drafted. Even though there are proposed amendments to the act which should, in theory, bring the law into the 21st century, it will be interesting to see how the legislation, courts and organisations will balance the multitude of competing rights,” she said.  

However, forensic analyst Jason Jordaan said there is immense misunderstanding about AI, which might not be a threat to intellectual property.

“It is a very complex area of computer science and even experienced computer scientists do not always understand it, so it has become something ‘magical’. Books like these are actually used to train large language models,” said Jordaan.

“While I do believe the use of pirated materials like this could constitute an intellectual property violation, I am not convinced that the concept of AI is a threat to intellectual property. If a human reads multiple books and learns a writing style from these, and then writes a book in a similar style, but the content and story are not the same, then it is never considered a copyright violation. But now, when a program does it, we are saying it is a copyright violation simply because it is not human.”

Jacana Media spokesperson Kelly Ann Mawa said the authors’ plight takes centre stage when considering the exploitation of books in training AI.

“At the outset, their struggle revolves around the loss of control over their work, their words, and the potential repercussions on their earnings and perhaps even their reputation,” she said.  

The practice of using books to educate AI systems has a dual nature, with advantages and controversies, said Mawa.

“On the one hand, exposing AI to an extensive array of literary works facilitates a deeper comprehension of human language, resulting in more authentic and human-like responses. This, in turn, yields practical benefits such as enhancing chatbots and refining speech recognition technologies. 

“Nevertheless, it is imperative that this process adheres to legal and authorised procedures. Respecting copyright laws serves a twofold purpose: it guarantees that authors and publishers receive equitable compensation for their creative endeavours and safeguards the well-being of the publishing industry,” she said.

In essence I asked ChatGPT to do a spin disabusing us of the notion that AI is bad for publishing. In my humble opinion it did a rather poor job overall of a narrative where we need not fear AI in publishing. 

—  Makhosi Khoza

Spokesperson Amanda van Rhyn said Penguin Random House was unwavering in fiercely championing and protecting the human element of creativity, while examining the ways transformational AI technology can help improve publishing operations.   

“We encourage regulators and lawmakers to keep front of mind the important implications of these technologies for the owners of copyrighted content and the need for transparency regarding the data and content used to train AI models,” said Van Rhyn.

“Specifically with regard to generative AI models, Penguin Random House maintains that the unauthorised ingestion of copyrighted content to train such models is a copyright infringement. We call on AI developers to guarantee the transparency of their training datasets and respect the important and legitimate interests of copyright owners, including our authors and illustrators.” 

Author and media personality Makhosi Khoza, who investigated “the dying art of writing” in his publication, said AI in itself is not a bad thing.

“The problem is that the advance of technology in the publishing space comes with the potential dilemma of extinguishing more jobs than it can create — no amount of spin can fix this.” 

