The AI Industry Was Built on Copyrighted Content Nobody Asked to Use. California Wants That on the Record.
AB 2013, xAI, and the Copyright Reckoning the AI Industry Has Been Avoiding
Elon Musk's xAI spent the last few months of 2025 trying to convince a federal court that California had no business asking how Grok was built. It failed. On March 6, U.S. District Judge Jesus Bernal denied xAI's request to block AB 2013 — California's Generative AI Training Data Transparency Act — finding the company hadn't shown a likelihood of success on its constitutional claims. The law stands. The disclosures are due. And the industry will have to answer an uncomfortable question about where all that training data came from — because it only looks Legalish to me.
AB 2013, signed by Governor Newsom in September 2024, requires any developer of a generative AI system available to Californians to post documentation on its website about the data used to train its models. The required disclosures include "the sources or owners of the datasets," "whether the datasets include any data protected by copyright, trademark, or patent," "whether the datasets were purchased or licensed by the developer," and the specific dates those datasets were first used in development. Cal. Civ. Code § 3111(a).
The statute calls all of this a "high-level summary." Twelve mandatory line items is not most people's definition of high-level — and the legislature never defined the term. That ambiguity is xAI's strongest remaining argument as the case proceeds. If you don't know what level of specificity constitutes compliance, you don't know when you're violating the law. That's a due process problem with teeth.
But the vagueness issue may be the least of the industry's worries. The more uncomfortable question buried in section (a)(5) is the copyright one: did your training data include copyrighted material? For the overwhelming majority of frontier AI developers, the honest answer is almost certainly yes — and the legal framework around how that copying occurred makes the situation thornier than most public statements acknowledge.
Training a large language model requires ingesting enormous volumes of text, images, and code scraped from the internet. Most of that content is copyrighted by default. Under MAI Systems Corp. v. Peak Computer, Inc., 991 F.2d 511 (9th Cir. 1993), even loading copyrighted material into RAM temporarily constitutes reproduction under the Copyright Act. Under Sega Enterprises v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992), intermediate copying — copying that occurs during a process even if the end product doesn't retain the original — can itself constitute infringement, regardless of what the finished model looks like. A developer who scraped the web, trained on it, and deleted everything afterward isn't necessarily in the clear. The copying already happened.
The companies best positioned to defend themselves are the ones that owned the platforms: Meta had user-generated content under broad license agreements. Google had Books, YouTube, and Search. Twitter, now X, fed its own firehose to xAI. Everyone else has been largely operating on a fair use bet — and fair use is not a license. It's an affirmative defense you don't get to claim until you're already in court. No court has yet held that training a frontier model on scraped copyrighted content constitutes fair use at scale. The upside of that bet was building a multi-hundred-billion-dollar industry. The downside, if the bet is wrong, could be existential.
The historical analogy the industry reaches for — VCRs, MP3 players, Google Books — is weaker than it appears. In Sony Corp. v. Universal City Studios, 464 U.S. 417 (1984), the Supreme Court protected Sony from liability for time-shifting. In Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015), Google's book scanning survived as transformative. But in both eras, enforcement pressure fell disproportionately on individual users — not the companies deploying the technology. The record labels sued teenagers over file-sharing; the movie studios went after individual downloaders. The companies that built the infrastructure largely escaped. That asymmetry may not hold when the product is the ingestion.
AB 2013 won't resolve any of this directly. It has no enforcement mechanism of its own and imposes no penalties on its face. But the exposure it creates runs on multiple tracks. The California AG — who celebrated the AB 2013 ruling and is actively building an AI accountability program — has broad authority to act under existing consumer protection law, including the Unfair Competition Law. The statute's silence on private rights of action won't stop plaintiffs' lawyers from finding hooks in other statutes once disclosures are on the record. And most acutely, the copyright plaintiffs already in federal court — Getty Images, The New York Times, the authors' class actions — have been fighting brutal discovery battles to find out exactly what went into these models. AB 2013 potentially hands them that information voluntarily, posted on a public website, in the company's own words.
There is also a broader context worth noting: Congress attempted to impose a ten-year moratorium on state AI laws as part of federal budget reconciliation this year, and the Senate voted 99-1 to strip it from the bill. Until Congress actually writes a federal AI framework — something it has so far shown little appetite for — states like California are the only ones writing the rules.
xAI called AB 2013 "a trade-secrets-destroying disclosure regime." It might also be described as a reckoning.
Also on Legalish: Anthropic Told DoW Secretary Hegseth to Pound Sand. Now It's Being Treated Like Huawei.
Further Reading
The litigation that AB 2013 disclosures will feed:
- The New York Times Co. v. Microsoft Corp., No. 1:23-cv-11195 (S.D.N.Y.) — The landmark suit alleging OpenAI trained on NYT's copyrighted articles without license. Now consolidated into the OpenAI MDL.
- In re: OpenAI, Inc. Copyright Infringement Litigation, MDL No. 3143 (S.D.N.Y.) — The consolidated multidistrict litigation bringing together the NYT case and related publisher and author suits against OpenAI.
- Getty Images (US), Inc. v. Stability AI, Inc., No. 3:25-cv-06891 (N.D. Cal.) — Getty's US suit over Stability AI's use of its image library to train Stable Diffusion. Refiled in N.D. Cal. after the Delaware case was terminated; motion to dismiss pending, trial set January 2028.
- Getty Images (US), Inc. v. Stability AI Ltd, [2025] EWHC 2863 (Ch) — The UK parallel, decided November 2025. The first major court ruling anywhere on AI training data and copyright. The High Court largely rejected Getty's claims, finding no secondary copyright infringement — but Getty won a narrow trademark ruling on outputs bearing distorted Getty watermarks. The training claims were withdrawn because Stable Diffusion was trained outside the UK. Instructive for what these cases look like on the merits.
Legalish is supported by Lynch LLP — Trademark · Copyright · Patents