The Problem CORPUS Is Solving
CORPUS is being built in the open. Some of what you read here is live, some is still design intent — expect it to evolve.
Companies need legal training data for music AI. Musicians need a way to be paid when their work is used to train models. Neither has a viable path today.
Scraping invites lawsuits. Buy-outs cost six to nine figures and freeze the catalogue. Major-label deals settle the licence but not the problem of verifying what a model was actually trained on. And the markets that could absorb adaptive music at scale (automotive, healthcare, games, robotics) can't deploy without a foundation they can defend to procurement and regulators.
No legal training data source
- Scraping is the cheap default and increasingly indefensible. Article 4 of the EU DSM Directive lets rights holders opt out of text and data mining, and GEMA and other CMOs already have. The EU AI Act adds mandatory provenance disclosure for general-purpose models. Models trained on scraped data are structurally non-compliant in major markets.
- Buy-outs give legal certainty at prohibitive cost. A competitive model needs hundreds of thousands of hours of music. Even modest catalogues cost six to seven figures; a full training base runs to nine. Only the largest platforms clear that bar, and the catalogue is static the moment it's acquired.
- Major-label deals solve a legal problem but leave the verification problem untouched: you cannot read a training set off model weights. The full case is in Why CORPUS.
No economic participation for musicians
Scraping ingests work without consent or compensation. Buy-outs end the relationship at purchase — no share in what the model goes on to generate. Streaming already showed where this leads: Spotify's market cap exceeded $80 billion while the musicians who supply its catalogue debate whether streaming income covers their recording costs. Generative AI will amplify the same asymmetry if the licensing layer is built on the same logic.
New markets blocked at deployment
Automotive, healthcare, games, and robotics need a different kind of asset: music that behaves, not music that is played. A vehicle's sonic environment has to track driver state, weather, and route in real time. A therapeutic device has to follow a patient's condition. A game engine has to score the player's choices. None of that runs on a fixed library.
Adaptive sound in vehicles, therapeutic music in healthcare, responsive environments in games and XR, brand-sensitive advertising, semantic interaction in robotics, cultural and educational deployment — all need training data that is diverse, rights-cleared, and auditable. No existing source provides all three. See Applications for where this is taking shape, and The Wrong Debate in the journal for the longer-form case that adaptive sound, not song generation, is the consequential market.
CORPUS is built so the same protocol addresses all three problems at once. Contributors keep their rights and accumulate a lasting stake; data stays inside CORPUS infrastructure; the system is open and auditable. The next page covers what that means in concrete terms.