This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.
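To illustrate what "well-structured JSON" buys you over raw wikitext, here is a minimal sketch of pulling plain text out of one record. The field names (`abstract`, `sections`, `has_parts`) are assumptions based on the announcement's description, not a confirmed schema, and the record itself is a made-up example:

```python
import json

# Hypothetical record shaped like a Structured Contents entry;
# real field names in the beta may differ.
record = json.loads("""
{
  "name": "Example article",
  "abstract": "A short plain-text summary.",
  "sections": [
    {"name": "History",
     "has_parts": [{"type": "paragraph", "value": "Some text."}]}
  ]
}
""")

def extract_text(rec):
    """Collect the abstract plus every paragraph value into one list."""
    parts = [rec.get("abstract", "")]
    for section in rec.get("sections", []):
        for part in section.get("has_parts", []):
            if part.get("type") == "paragraph":
                parts.append(part["value"])
    return [p for p in parts if p]

print(extract_text(record))
```

The point is that a pipeline like this needs no wikitext parser or HTML scraper; each article arrives pre-segmented into typed parts you can filter directly.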
The problem is, this assumes that the kind of AI creators who scrape relentlessly (and there are a fair few that do) would not only take this data source directly, but then add an exception to their scrapers to skip Wikipedia's site. I doubt they would bother.
You can already download a torrent of the whole thing; nobody needs to hand it to anyone.
https://en.m.wikipedia.org/wiki/Wikipedia:Database_download