Wikipedia is giving AI developers its data to fend off bot scrapers

Tea@programming.dev · 9 days ago

Eager Eagle@lemmy.world · 9 days ago

This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.
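As a rough illustration of what "working directly with structured JSON" could look like, here is a minimal sketch that reads one JSON object per line. The JSON Lines layout and the field names `name` and `abstract` are assumptions made for the example, not something the announcement quoted above confirms.

```python
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed article per non-empty line (JSON Lines layout assumed)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical sample records; the real dataset's schema may differ.
sample = [
    '{"name": "Example article", "abstract": "A short summary."}',
    "",
    '{"name": "Another article", "abstract": "Another summary."}',
]

titles = [article["name"] for article in iter_articles(sample)]
print(titles)
```

With a real file, the same function could be fed an open file handle instead of the in-memory list, since it only needs an iterable of lines.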

r00ty@kbin.life · 9 days ago

The problem is that this assumes the AI creators who scrape relentlessly (and there are a fair few of them) would, after taking this data source directly, also add an exception to their scrapers to skip Wikipedia's site. I doubt they would bother.
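For what it's worth, the exception itself would be small. A minimal sketch, assuming a crawler that checks each URL against a skip list before fetching; `SKIP_DOMAINS` and `should_skip` are hypothetical names, not taken from any real scraper:

```python
from urllib.parse import urlparse

# Hypothetical skip list: domains whose content is already available as a dataset.
SKIP_DOMAINS = {"wikipedia.org"}


def should_skip(url: str) -> bool:
    """Return True if the URL's host is a skipped domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in SKIP_DOMAINS)


print(should_skip("https://en.wikipedia.org/wiki/Kaggle"))
print(should_skip("https://example.com/"))
```

So the obstacle is less the effort involved and more whether the people running these scrapers care enough to add the check.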

Wikipedia Kaggle Dataset using Structured Contents Snapshot