Why CORPUS vs. Scraping, Buy-Outs, or Major Deals

If you are choosing a training data foundation for a generative music model, you have four practical options: scrape the open web, buy a catalogue outright, sign a deal with a major label, or license through a system like CORPUS. The first three each fail one of three conditions that a sustainable framework has to satisfy at the same time: legal defensibility, fair compensation, and economic scalability. This page explains how, and what CORPUS does differently.
The three conditions are not negotiable in the markets we are building for. An automotive OEM shipping a model in millions of vehicles cannot accept latent litigation risk. A hospital integrating a therapeutic device cannot deploy on a dataset whose provenance it cannot demonstrate to a regulator. A startup cannot compete if the cost of clean data is in the tens of millions before a single product ships. Any framework that meets one or two of the conditions but not all three pushes its users into the gap between them.
Why scraping fails
Scraping is cheap and convenient. It is also legally indefensible at commercial scale, and the liability does not stop at the company that trained the model.
- No consent. Streaming catalogues, archives, and open-web sources are ingested without contributor agreement. GEMA, SACEM, and others have already exercised opt-outs under Article 4 of the EU DSM Directive, making continued unlicensed training on their repertoires structurally non-compliant in the EU.
- No attribution, no compensation. Contributors whose work shapes the model receive nothing. As models become commercially valuable, that asymmetry becomes the basis for litigation.
- Downstream liability. Music tools, games, automotive integrators, clinical deployments — all inherit the exposure of the model they license. Indemnities are only as good as the vendor's balance sheet.
- Fair Use is narrowing. Warhol v. Goldsmith and the active Authors Guild cases have tightened the four-factor test; for music training, three of four factors weigh against scraping. Relying on Fair Use is a bet that the case law will reverse direction.
Scraping solves a data acquisition problem by deferring a legal problem. The legal problem is now arriving.
Why buy-outs fail
Buy-outs give you legal certainty for the catalogue you buy. They fail on cost, on coverage, and on the relationship they create with the people who made the music.
- Cost. A competitive model needs hundreds of thousands of hours of music. Even modest catalogues cost six to seven figures; a full training base through buy-outs runs to nine figures before any infrastructure or compute. That is a barrier only the major platforms can clear.
- Static catalogues. A bought catalogue is fixed at acquisition. It cannot adapt to new model architectures that need different annotations, new application domains that need underrepresented genres, or new regulatory requirements that demand fuller provenance. Every gap requires another acquisition.
- Exclusion of independent and regional rights holders. Buy-outs concentrate negotiations on catalogues large enough to be worth structuring a deal around. Independent musicians, regional CMOs, and non-Western traditions are systematically excluded — exactly the material a diverse training corpus needs most.
- No ongoing participation. Once paid, contributors have no share in the value the model goes on to create. This reproduces the structural problem that made the current crisis.
Buy-outs solve the legal problem at a price that only entrenches the incumbents who already have catalogues to sell.
Why major label deals are not enough
The Universal/Udio, Warner/Udio, Warner/Suno, and Merlin/Udio agreements solve a legal problem. They do not solve a verification problem — and in AI training, the verification problem is the only one that determines whether a deal actually changes anything.
You cannot read a training set off model weights. A model trained on millions of tracks processes them in random batches across billions of parameter updates. The influence of any individual track is, in practice, indistinguishable from noise. No regulatory inspection, audit procedure, or forensic technique can reliably reconstruct what was in the training data from the trained model.
This has two consequences for anyone considering a model trained under a major deal:
- Compliance is a declaration, not a demonstration. "We trained only on licensed music" is a contractual claim with no external verification mechanism. The deal certifies that some music is licensed; it does not certify that nothing else is in the model.
- The commercial incentive to retain the unlicensed base does not go away. A model trained only on Western-commercial repertoire performs worse than a model trained on the full breadth of recorded music. No commercially rational company will accept that gap if it can avoid it, and no external party can tell the difference. The incentive is structural.
For a procurement officer in a regulated industry, the question is whether the model can be defended under audit. A licensing regime that cannot be inspected depends on trust in the same parties whose past behaviour made it necessary. The longer-form argument is in The Wrong Debate.
What CORPUS does differently
CORPUS treats verification as a structural property of the system, not an attestation layered on top of it.
- Licensing on the input side, by explicit opt-in. Every contribution enters under a recorded agreement that names all rights holders and their splits, captured as an immutable snapshot at the moment of upload. See Ownership and Consent.
- Append-only provenance. Every contribution, license, and training run is logged in a tamper-evident registry. Procurement teams and regulators can inspect what went into a model — not as a vendor claim, but as a property of the architecture. See Audit Trail.
- Licensing trained models, not training data. CORPUS primarily licenses the models we train on the corpus, not the corpus itself. Where partners need to train their own architectures on CORPUS data, training runs inside CORPUS infrastructure — keeping the relationship with contributors intact. See Access Models.
- Contributors share in what they help build. Royalties follow downstream model usage, and CRPS (Corpus Participation Rights) accumulate as a lasting stake. The relationship is sustainable, which is what keeps the data flowing as the corpus grows. See How Royalties Flow.
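To make "immutable snapshot" and "tamper-evident registry" concrete, here is a minimal sketch of one common way to build such a log: a hash chain, where every entry commits to the one before it. This is an illustration under stated assumptions, not a description of the CORPUS implementation. The names Registry and consent_snapshot, the use of SHA-256, and the representation of splits in basis points are all hypothetical choices made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(kind: str, payload: dict, prev_hash: str) -> str:
    """Hash an entry's content together with the hash of the entry before it."""
    body = json.dumps({"kind": kind, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def consent_snapshot(track_id: str, splits_bps: dict) -> dict:
    """Hypothetical consent record: names every rights holder and their split.

    Splits are integers in basis points (assumption) and must total 10000.
    """
    if sum(splits_bps.values()) != 10_000:
        raise ValueError("splits must sum to 10000 basis points")
    return {"track_id": track_id, "splits": dict(splits_bps)}


class Registry:
    """Hypothetical append-only log; entries are only ever added at the end."""

    def __init__(self) -> None:
        self._entries = []

    def append(self, kind: str, payload: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        # JSON round trip copies the payload, so later changes to the
        # caller's dict cannot alter what was recorded.
        frozen = json.loads(json.dumps(payload))
        entry = {"kind": kind, "payload": frozen, "prev": prev,
                 "hash": _digest(kind, frozen, prev)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for e in self._entries:
            if e["prev"] != prev:
                return False
            if e["hash"] != _digest(e["kind"], e["payload"], prev):
                return False
            prev = e["hash"]
        return True


registry = Registry()
registry.append("contribution",
                consent_snapshot("track-001",
                                 {"composer-a": 6000, "performer-b": 4000}))
registry.append("training_run",
                {"model": "demo-model", "tracks": ["track-001"]})
print(registry.verify())  # True

# Simulate someone rewriting a recorded split after the fact.
registry._entries[0]["payload"]["splits"]["composer-a"] = 9000
print(registry.verify())  # False
```

The point of the design is that an auditor does not have to trust whoever operates the log: recomputing the hashes shows whether any recorded contribution, license, or training run was changed after it was written. A production system would also need signed entries and an externally published head hash, which this sketch leaves out.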
A corpus that compounds
Scraping freezes a snapshot of the past. A bought catalogue is fixed at acquisition. A label deal hands over the back catalogue that was already cleared to sell. All three give a model a dataset that stops improving the moment the contract is signed.
A licensed, contributor-fed corpus moves the other way. It keeps growing as musicians keep contributing, so it stays current with how music actually sounds now, and it fills the gaps a model needs rather than the gaps a catalogue happened to contain. Underrepresented genres, regional traditions, and new styles enter while they are still new, each one widening the expressive range every model trained on the corpus inherits.
This is a structural advantage. A corpus built on an ongoing, fairly compensated relationship is the only one of the four options that gets better over time, because contributor participation is the mechanism that keeps the data, and its quality, flowing.
CORPUS and general-purpose generation
CORPUS is not a Suno competitor. A licensed corpus will not match seventy million scraped tracks for general-purpose pop generation, and that is not the problem CORPUS is solving. The markets it is built for — automotive, healthcare, interactive media, brand-sensitive advertising, cultural and educational deployment — need contextual precision, cultural specificity, and provenance that holds up under audit. The longer-form argument for that market shift is in The Wrong Debate.
Next: Access Models.