Friday, August 16, 2024

Artificial Intelligence’s Data Needs: Can They Be Met Legally and Logistically?

Three of the problems with AI that I identified in previous posts concern the data used to train its large language models.  First is the sheer volume of information needed to create more advanced capabilities.  Second is the data’s legal status, which has spawned several large lawsuits, and doubtless many more small ones, charging copyright infringement.  Third is the distortion that results when chatbots take in output from themselves or from other chatbots.  What has been in the press lately about these issues, and what does it mean, not only for this aspect of AI but for AI in general?

Apparently, “Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI” (Annie Gilbertson, WIRED, July 16th).  The problem is that “tech companies are turning to controversial tactics to feed their data-hungry artificial intelligence models, vacuuming up books, websites, photos, and social media posts, often unbeknownst to their creators.”  Everyone anywhere near the field, let alone companies’ legal personnel, should know that electronic versions of books and published articles are as subject to copyright law as hardcopy editions.  That has long been documented in statements such as “no part of this book may be reproduced in any form or by any means without the prior written permission of the Publisher, excepting brief quotes…”, which I took from a random 1968 paperback.  It is understandable, though, for lay people not to know whether the same protection extends to the likes of videos and other less formally protected online material.  It may also be difficult, in these data-absorbing efforts, to avoid off-limits products, but the problem must still be solved.

That’s why, at least per Nico Grant and Cade Metz in the New York Times on July 19th, we are seeing, or should see, “The Push to Develop Generative A.I. Without All the Lawsuits.”  The partial solution to the copyright problem here is rights holders “building A.I. image generators with their own data” and then selling access for AI development.  Two companies already starting that are “the major stock photo suppliers Getty Images and Shutterstock,” which will pay photographers when their work is used this way.  Fair play, or so it seems.

Otherwise, “The Data That Powers A.I. Is Disappearing Fast” (Kevin Roose, The New York Times, July 19th).  Per research “by the Data Provenance Initiative, an M.I.T.-led research group,” “three commonly used A.I. training data sets” have so far had only 5 percent of their data restricted (though “25 percent… from the highest-quality sources”), but the pullback is in progress.  A conclusive legal definition of permissible data use is not here yet, as “A.I. companies have claimed that their use of public web data is legally protected under fair use.”  Perhaps, per the author, “if you take advantage of the web, the web will start shutting its doors.”
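In practice, one of the main doors being shut is the robots.txt file, through which sites tell crawlers, including AI data collectors such as OpenAI’s GPTBot and Common Crawl’s CCBot, what they may fetch.  Below is a minimal sketch, using only the Python standard library, of how a compliant crawler would check those restrictions; the example.com URLs are placeholders, not a real site’s policy.

```python
import urllib.robotparser

# Minimal sketch of a compliant check before scraping a page for training
# data.  Sites "shut their doors" by listing AI crawler user-agents
# (e.g., OpenAI's GPTBot, Common Crawl's CCBot) in robots.txt.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder URL
rp.read()

for agent in ("GPTBot", "CCBot", "*"):
    ok = rp.can_fetch(agent, "https://example.com/some-article")
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Whether scrapers actually honor such files is, of course, part of the dispute the article describes.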

Another way out was described in Forbes Daily on July 24th: “The Internet Isn’t Big Enough To Train AI.  One Fix?  Fake Data.”  “OpenAI’s ChatGPT, the chatbot that helped mainstream AI, has already been trained on the entire public internet, roughly 300 billion words including all of Wikipedia and Reddit” (italics in original), meaning that “at some point, there will be nothing left.”  One company, Gretel, wants to provide AI firms with “fake data made from scratch,” which is not totally new, as “Anthropic, Meta, Microsoft and Google have all used synthetic data in some capacity to train their models.”  Two issues with it are that “it can exaggerate biases in an original dataset and fail to include outliers,” which “could make AI’s tendency to hallucinate even worse.”  That is, if it does not “simply fail to produce anything new.”  We will find out, probably within the year, whether artificial data is a worthwhile partial or complete substitute.
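To make the bias-and-outliers worry concrete, here is a toy sketch, my own illustration and not Gretel’s actual method, of the naive end of synthetic data: fit simple statistics to a real sample, then draw “fake data” from the fit.

```python
import random
import statistics

# Toy illustration (not any vendor's real method): make "fake data from
# scratch" by fitting a normal distribution to a small real sample, then
# sampling new rows from that fit.
real_incomes = [31_000, 34_500, 29_800, 33_200, 30_900, 250_000]  # one outlier

mu = statistics.mean(real_incomes)      # ~68,200 -- dragged up by the outlier
sigma = statistics.stdev(real_incomes)  # huge spread for the same reason

random.seed(1)
synthetic = [round(random.gauss(mu, sigma)) for _ in range(6)]
print(synthetic)
# The fake rows lose the real data's shape: the tight cluster near 30,000
# is gone, and impossible values (negative incomes) can appear -- a small
# version of the exaggerated bias and mishandled outliers the article warns of.
```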

To the point of the third first-paragraph problem is “What happens when you feed AI-generated content back into an AI model?  Put simply: absolute chaos” (Maggie Harrison Dupre, Futurism.com, July 26th).  Per a recent study, “AI models trained on AI-generated material will experience rapid ‘model collapse’… as an AI model cannibalizes AI-generated data, its outputs become increasingly bizarre, garbled, and nonsensical.”  The problem is out there now, as “there are thousands of AI-powered spammy ‘news’ sites cropping up in Google; Facebook is quickly filling with bizarre AI imagery… Very little of this content is marked as AI-generated, meaning that web scraping, should AI companies continue to attempt to gather their data from the digital wilds, is becoming a progressively dubious means of collecting AI training data.”
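A toy simulation, my construction rather than the cited study’s setup, shows why collapse happens: treat a “model” as nothing more than a word-frequency table, train each generation only on the previous generation’s output, and watch the diversity drain away.

```python
import random
from collections import Counter

# Toy illustration of model collapse (not the cited study's setup): each
# generation's "model" is just a word-frequency table estimated from the
# previous generation's own output.
corpus = ["the"] * 50 + ["cat"] * 25 + ["sat"] * 15 + \
         ["quietly"] * 7 + ["yesterday"] * 3

random.seed(0)
for gen in range(30):
    counts = Counter(corpus)                       # "train" on current data
    words, weights = zip(*counts.items())
    corpus = random.choices(words, weights, k=50)  # sample the model's output
    if gen % 5 == 0:
        print(f"generation {gen}: {sorted(set(corpus))}")
# Sampling noise makes rare words drop out, and once a word is gone it can
# never return; over the generations the vocabulary typically shrinks
# toward the model repeating its most common outputs.
```

Real language models are vastly more complex, but the one-way loss of rare material is the same mechanism the study describes.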

Despite the hope in the second story above, none of this looks good for future AI releases.  These problems will not be easy to solve.  We already have the issue that AI is nowhere near ready to produce even page-length writing releasable without human scrutiny, and the concerns here will, most likely, keep that capability at bay.  Until then, AI will fail even to approximate the utility expected by its customers and backers.  That means, even setting aside other obstacles such as insufficient power for fundamentally more advanced releases, that artificial intelligence is in deep trouble.  All should govern themselves accordingly.
