
Generative AI Is Making Everyone Even More Thirsty for Your Data

The outcry over Zoom’s tweak to its data policy shows how the race to build more powerful AI models creates new pressure to source training data, including by juicing it from users.

Zoom, the company that normalized attending business meetings in your pajama pants, was forced to unmute itself this week to reassure users that it would not use personal data to train artificial intelligence without their consent.

A keen-eyed Hacker News user last week noticed that an update to Zoom’s terms and conditions in March appeared to essentially give the company free rein to slurp up voice, video, and other data, and shovel it into machine learning systems.

This is an edition of WIRED’s Fast Forward newsletter, a weekly dispatch from the future by Will Knight, exploring AI advances and other technology set to change our lives.

New Terms

The new terms stated that customers “consent to Zoom’s access, use, collection, creation, modification, distribution, processing, sharing, maintenance, and storage of Service Generated Data” for purposes including “machine learning or artificial intelligence (including for training and tuning of algorithms and models).”

The discovery prompted critical news articles and angry posts across social media. Soon, Zoom backtracked. On Monday, Zoom’s chief product officer, Smita Hasham, wrote a blog post stating, “We will not use audio, video, or chat customer content to train our artificial intelligence models without your consent.” The company also updated its terms to say the same.

Those updates seem reassuring enough, but of course many Zoom users or admins for business accounts might click “OK” to the terms without fully realizing what they’re handing over. And employees required to use Zoom may be unaware of the choice their employer has made. One lawyer notes that the terms still permit Zoom to collect a lot of data without consent. A spokesperson for the company, CJ Lin, says that customers get to choose whether to enable generative AI features or share their content with Zoom to help it improve its products.

Feed the Monster

The kerfuffle shows the lack of meaningful data protections at a time when the generative AI boom has made the tech industry even hungrier for data than it already was. Companies have come to view generative AI as a kind of monster that must be fed at all costs—even if it isn’t always clear what exactly that data is needed for or what those future AI systems might end up doing.

The ascent of AI image generators like DALL-E 2 and Midjourney, followed by ChatGPT and other clever-yet-flawed chatbots, was made possible thanks to huge amounts of training data—much of it copyrighted—that was scraped from the web. And all manner of companies are currently looking to use the data they own, or that is generated by their customers and users, to build generative AI tools.

Zoom is already on the generative AI bandwagon. In June, the company introduced two text-generation features for summarizing meetings and composing emails about them. Zoom could conceivably use data from its users’ video meetings to develop more sophisticated algorithms. These might summarize or analyze individuals’ behavior in meetings, or perhaps even render a virtual likeness for someone whose connection temporarily dropped or hasn’t had time to shower.

Previously Unimagined Outcomes

The problem with Zoom’s effort to grab more data is that it reflects the broad state of affairs when it comes to our personal data. Many tech companies already profit from our information, and many of them like Zoom are now on the hunt for ways to source more data for generative AI projects. And yet it is up to us, the users, to try to police what they are doing.

“Companies have an extreme desire to collect as much data as they can,” says Janet Haven, executive director of the think tank Data & Society. “This is the business model—to collect data and build products around that data, or to sell that data to data brokers.”

The US lacks a federal privacy law, leaving consumers more exposed to the pangs of ChatGPT-inspired data hunger than people in the EU. Proposed legislation, such as the American Data Privacy and Protection Act, offers some hope of providing tighter federal rules on data collection and use, and the Biden administration’s AI Bill of Rights also calls for data protection by default. But for now, public pushback like that in response to Zoom’s moves is the most effective way to curb companies’ data appetites. Unfortunately, this isn’t a reliable mechanism for catching every questionable decision by companies trying to compete in AI.

In an age when the most exciting and widely praised new technologies are built atop mountains of data collected from consumers, often in ethically questionable ways, it seems that new protections can’t come soon enough. “Every single person is supposed to take steps to protect themselves,” Haven says. “That is antithetical to the idea that this is a societal problem.”

The Wild, Wild Western Desert

In other words, we have been pulled into a wild western desert, devoid of laws, governance, enforcement, jurisprudence, and due process. The laws that governed our prior universe no longer apply, because the mechanics of data acquisition have changed and the sourcing is virtually undetectable, even to the language models themselves.

While we await some form of regulatory guidance from Washington, another data-related risk grows daily, and over it we have no control: model poisoning. By manipulating the data on which a deep learning model trains, an attacker can either degrade the model indiscriminately (an untargeted attack) or steer its output to produce results favorable to the attacker (a targeted attack).
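To make the distinction concrete, here is a minimal, hypothetical sketch of an untargeted label-flipping attack in Python using scikit-learn. The toy dataset, the logistic regression model, and the 30 percent flip rate are illustrative assumptions, not a description of any real incident.

# A minimal sketch of untargeted label-flipping poisoning (illustrative
# assumptions only; real-world attacks are subtler and harder to detect).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Build a toy binary classification dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: train on clean labels.
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("clean accuracy:", clean_model.score(X_test, y_test))

# Untargeted poisoning: flip the labels of 30 percent of training examples.
rng = np.random.default_rng(0)
poisoned = y_train.copy()
idx = rng.choice(len(poisoned), size=int(0.3 * len(poisoned)), replace=False)
poisoned[idx] = 1 - poisoned[idx]

# Retrain on the poisoned labels; accuracy on the clean test set degrades.
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, poisoned)
print("poisoned accuracy:", poisoned_model.score(X_test, y_test))

A targeted variant would instead corrupt only carefully chosen examples so that the model misbehaves on specific inputs while overall accuracy stays high, which is what makes targeted poisoning so much harder to detect.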

Model poisoning is just the tip of the iceberg among the data integrity issues that threaten AI reliability. All models are under the microscope of risk and resiliency research as investigators explore how issues like feedback loops and AI bias can make AI output unreliable.

Google never went commercial with its best LLM, LaMDA, one of the most powerful language models in the world. LaMDA is the model that Google engineer Blake Lemoine claimed was sentient. After all, it DID pass the Turing Test. Lemoine was terminated after sharing his opinion publicly, and a startup called Character.ai was launched by two founders who had worked on LaMDA.

Whether LaMDA is sentient or not, and whether other LLMs come even closer to independent human thought and emotion, will not change the arc of progress toward a state we will be completely unable to control or manage: the near future.

For those who think that this outcome is over-hyped, I only hope you are correct, though I know better.

Author

Steve King

Managing Director, CyberEd

King, an experienced cybersecurity professional, has served in senior leadership roles in technology development for the past 20 years. He has founded nine startups, including Endymion Systems and seeCommerce. He has held leadership roles in marketing and product development, operating as CEO, CTO and CISO for several startups, including Netswitch Technology Management. He also served as CIO for Memorex and was the co-founder of the Cambridge Systems Group.

 
