sourcegraph
February 25, 2024

For more than 20 years, Kit Loffstadt has written fan fiction exploring alternate universes for Star Wars heroes and Buffy the Vampire Slayer villains, sharing her stories for free online.

But in May, Ms. Loffstadt stopped posting her work after learning that a data company had copied her stories and fed them into the artificial intelligence technology behind the viral chatbot ChatGPT. Frustrated, she hid her work behind a locked account.

Ms. Loffstadt also helped organize an act of rebellion against AI systems last month. Along with dozens of other fan fiction writers, she published a flood of irreverent stories online to overwhelm and confuse the data collection services that feed writers’ work into artificial intelligence technology.

Ms. Loffstadt, a 42-year-old voice actress from South Yorkshire, said: “Each of us has to do everything we can to show them the output of our creativity, rather than letting the machines harvest whatever they want.”

Fan fiction writers are just one of the groups now campaigning against artificial intelligence systems as a frenzy over the technology sweeps Silicon Valley and the world. In recent months, social media companies like Reddit and Twitter, news organizations like The New York Times and NBC News, and writers and performers such as the author Paul Tremblay and the actress Sarah Silverman have all come out against AI sucking up their data without permission.

Their protests have taken different forms. Writers and artists are locking up their files to protect their work or boycotting certain sites that publish AI-generated content, while companies like Reddit want to charge for access to their data. At least 10 lawsuits have been filed against artificial intelligence companies this year, alleging that they trained their systems on the creative work of artists without their consent. Last week, Ms. Silverman and the authors Christopher Golden and Richard Kadrey sued OpenAI, the maker of ChatGPT, and others over AI's use of their work.

At the heart of the rebellion is a new realization that information online—stories, artwork, news articles, message board posts, and photos—can have enormous untapped value.

The new wave of AI — dubbed “generative AI” because of the text, images and other content it generates — is built on complex systems like large language models that can generate human-like prose. These models are trained on a variety of data so they can answer people’s questions, mimic writing styles, or create comedy and poetry in abundance.

That has set off a hunt by tech companies for more data to feed their artificial intelligence systems. Google, Meta and OpenAI have essentially used information from across the internet, including large databases of fan fiction, troves of news articles and collections of books, most of which is freely available online. In tech industry parlance, this is known as “scraping” the internet.

OpenAI’s GPT-3, an AI system released in 2020, spans 500 billion “tokens,” each representing a portion of a word, drawn mostly from text found online. Some AI models span more than a trillion tokens.

The practice of scraping the internet has been around for a long time, and the companies and nonprofits that do it have mostly disclosed it. But the companies whose data was being taken did not fully understand the practice or see it as especially problematic. That changed after ChatGPT debuted in November and the public learned more about the underlying AI models that power chatbots.

“What’s happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, founder and CEO of artificial intelligence company Nomic. “Before, the thought was to get value out of your data by opening it up to everyone and running ads. Now, the idea is to lock down your data because when you use it as input to artificial intelligence, you can extract more value.”

In the long run, the data protests may have little impact. Deep-pocketed tech giants like Google and Microsoft already sit on vast amounts of proprietary information and have the resources to license even more. But as the era of easy-to-crawl content draws to a close, small AI upstarts and nonprofits that hoped to compete with big corporations may not have access to enough content to train their systems.

OpenAI said in a statement that ChatGPT was trained on “licensed content, public content, and content created by human AI trainers.” It added, “We respect the rights of creators and authors and look forward to continuing to work with them to protect their interests.”

In a statement, Google said it was participating in talks about how publishers would manage their content in the future. “We believe everyone will benefit from a vibrant content ecosystem,” the company said. Microsoft did not respond to a request for comment.

After ChatGPT became a global phenomenon last year, a data rebellion broke out. In November, a group of programmers filed a class-action lawsuit against Microsoft and OpenAI, claiming the companies violated their copyrights after their code was used to train AI-powered programming assistants.

In January, Getty Images, which provides stock photos and videos, sued Stability AI, an artificial intelligence company that creates images from text descriptions, claiming the startup used copyrighted photos to train its systems.

Then in June, the Los Angeles-based Clarkson Law Firm filed a 151-page class-action lawsuit against OpenAI and Microsoft, describing how OpenAI collected data on minors and saying web scraping violated copyright law and constituted “theft.” On Tuesday, the firm filed a similar lawsuit against Google.

“The data insurgency we’re seeing across the country is society’s way of resisting the notion that Big Tech simply has the right to take any and all information from any source and make it their own,” said Ryan Clarkson, the founder of Clarkson.

Eric Goldman, a professor at Santa Clara University School of Law, said the lawsuit’s arguments are broad and unlikely to be accepted by the courts. But he said the wave of lawsuits has only just begun, with a “second and third wave” coming that will define the future of artificial intelligence.

Big corporations are also resisting AI crawlers. In April, Reddit said it wanted to charge for access to its application programming interface (API), a method by which third parties can download and analyze the social network’s vast database of person-to-person conversations.

Reddit CEO Steve Huffman said at the time that his company “doesn’t need to give all this value away for free to some of the biggest companies in the world.”

That same month, Stack Overflow, a question-and-answer site for computer programmers, said it would also require artificial intelligence companies to pay for data. The site has nearly 60 million questions and answers. Its move was previously reported by Wired.

News organizations are also resisting AI systems. In an internal memo in June on the use of generative AI, The Times said AI companies should “respect our intellectual property.” A Times spokesman declined to elaborate.

For individual artists and writers, fighting AI systems means rethinking where they publish.

Nicholas Kole, 35, an illustrator in Vancouver, British Columbia, was appalled at how an AI system could replicate his distinctive art style and suspected the technology had scraped his work. He plans to continue posting his work on Instagram, Twitter and other social media sites to attract clients, but he has stopped posting on sites such as ArtStation, which publish AI-generated content alongside human-generated content.

“It felt like wanton theft from me and other artists,” Mr. Kole said. “It filled my stomach with existential dread.”

Writers on Archive of Our Own, a fan fiction database of more than 11 million stories, have been pressuring the site to ban data scraping and AI-generated stories.

Dozens of writers rose up in May when some Twitter accounts shared examples of ChatGPT mimicking the style of fan fiction popular on Archive of Our Own. They locked their stories and wrote subversive content to mislead AI scrapers. They also urged the leaders of Archive of Our Own to stop allowing AI-generated content.

Betsy Rosenblatt, a University of Tulsa law professor who provides legal advice to Archive of Our Own, said the site has a “maximum inclusiveness” policy and doesn’t want to be in the position of discerning which stories were written with artificial intelligence.

For Ms. Loffstadt, the fan fiction writer, the struggle with artificial intelligence came as she was writing a story about Horizon Zero Dawn, a video game in which humans battle AI-powered robots in a post-apocalyptic world. In the game, she said, some of the robots are good and others are bad.

But in the real world, she said, “they are twisted to do bad things because of arrogance and corporate greed.”


