VFF - The signal in the noise

Atlantic Maps Four Music Datasets Powering AI Models

The Atlantic's Alex Reisner has created a searchable public database of four music datasets used to train AI models, including two massive collections of 12 million and 9 million tracks. The datasets have been downloaded thousands of times, with Google and Stability AI confirming their use in research papers. The discovery highlights the scale of music data being fed into AI systems and raises questions about artist consent and compensation.

  • The Atlantic identified and made searchable four datasets containing music used to train AI models
  • Two datasets contain 12 million and 9 million tracks respectively, with two others holding over 100,000 songs each
  • Google and Stability AI have confirmed using these datasets in published research
  • The datasets have been downloaded thousands of times, though exact usage remains difficult to track

This disclosure exposes the scale and sources of the music data powering generative AI systems, addressing a critical gap in transparency around AI training practices. Artists and rights holders have limited visibility into whether their work is being used to train commercial AI models, which makes this database a rare window into the data actually fueling the industry.

Companies developing music-generating AI and other generative models rely on large-scale training datasets, often sourced from public or semi-public repositories. Understanding which datasets are in use helps stakeholders track potential licensing and rights issues, while also revealing competitive intelligence about training approaches.

  • The music industry lacks effective mechanisms to track and control use of artist work in AI training, creating ongoing legal and ethical exposure for AI developers
  • Public datasets remain a primary source for AI training despite growing scrutiny, suggesting regulatory frameworks have not yet constrained data sourcing practices
  • Transparency tools like this database may become necessary for artists and rights holders to identify and challenge unauthorized use of their work

Monitor whether this disclosure prompts legal action from artists or rights organizations against companies confirmed to have used these datasets. Watch for industry responses around data licensing standards and whether AI developers shift toward licensed or proprietary training data sources.
