Three of the problems I identified with AI in previous posts concern the data used to train its large language models. The first is the sheer volume of information needed to create more advanced capabilities. The second is the data’s legal status, which has prompted several large lawsuits, and doubtless many more small ones, charging copyright infringement. The third is distortion from chatbots taking in output from themselves or others. What has been in the press lately about these issues, and what does it mean, not only for this aspect of AI but for AI in general?
Apparently,
“Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI”
(Annie Gilbertson, WIRED, July 16th). The problem has been that “tech companies are
turning to controversial tactics to feed their data-hungry artificial
intelligence models, vacuuming up books, websites, photos, and social media
posts, often unbeknownst to their creators.”
Everyone anywhere near the field, let alone companies’ legal personnel, should know that electronic versions of books and published articles are as subject to copyright law as hardcopy editions, as long documented in statements such as “no part of this book may be reproduced in any form or by any means without the prior written permission of the Publisher, excepting brief quotes…,” which I took from a random 1968 paperback. It is understandable, though, for laypeople not to know whether that also applies to the likes of videos and other less formally protected online material. It may also be difficult, in these data-absorbing efforts, to avoid off-limits products, but the problem must still be solved.
That’s why, at least per Nico Grant and Cade Metz in the New York Times on July 19th, we are seeing, or should see, “The Push to Develop Generative A.I. Without All the Lawsuits.” The partial solution to the copyright problem here is those owning the rights to data “building A.I. image generators with their own data,” then selling AI-development access. Two companies already starting that are “the major stock photo suppliers Getty Images and Shutterstock,” which will pay photographers when their work is thus used. Fair play, or so it seems.
Otherwise, “The Data That Powers A.I. Is Disappearing Fast” (Kevin Roose, The New York Times, July 19th). Per research “by the Data Provenance Initiative, an M.I.T.-led research group,” “three commonly used A.I. training data sets” have so far had only 5 percent of their data restricted (though “25 percent… from the highest-quality sources”), but the restriction effort is in progress.
A conclusive definition of legal information use is not here yet, as “A.I. companies have claimed that their use of public web data is legally protected under fair use.” Perhaps, per the author, “if you take advantage of the web, the web will start shutting its doors.”
Another way
out was described in Forbes Daily on July 24th: “The Internet
Isn’t Big Enough To Train AI. One
Fix? Fake Data.” “OpenAI’s ChatGPT, the chatbot that helped
mainstream AI, has already been trained on the entire public internet,
roughly 300 billion words including all of Wikipedia and Reddit” (italics in
original), meaning that “at some point, there will be nothing left.” A company, Gretel, wants to provide AI firms
with “fake data made from scratch,” which is not totally new, as “Anthropic, Meta,
Microsoft and Google have all used synthetic data in some capacity to train
their models.” Two issues with it are that
“it can exaggerate biases in an original dataset and fail to include outliers,”
which “could make AI’s tendency to hallucinate even worse,” assuming it does not “simply fail to produce anything new.” We will find out, probably within the year, whether artificial data is a worthwhile partial or complete substitute.
To the point of the third problem in my first paragraph is “What happens when you feed AI-generated content back into an AI model? Put simply: absolute chaos” (Maggie Harrison Dupre, Futurism.com, July 26th). Per a recent study, “AI models trained on AI-generated material will experience rapid ‘model collapse’… as an AI model cannibalizes AI-generated data, its outputs become increasingly bizarre, garbled, and nonsensical.” The problem is out there now, as “there are thousands of AI-powered spammy ‘news’ sites cropping up in Google; Facebook is quickly filling with bizarre AI imagery… Very little of this content is marked as AI-generated, meaning that web scraping, should AI companies continue to attempt to gather their data from the digital wilds, is becoming a progressively dubious means of collecting AI training data.”
Despite the hope in the second story above, none of this looks good for future AI releases. These problems will not be easy to solve. We already have the issue that AI is nowhere near ready to produce even page-length writing releasable without human scrutiny; the concerns here will, most likely, keep that capability at bay. Until then, AI will fail even to approximate the utility expected by its customers and backers. That means, even without regard to other obstacles such as insufficient power for fundamentally more advanced releases, that artificial intelligence is in deep trouble. All should govern themselves accordingly.