• Eager Eagle@lemmy.world
      English
      arrow-up
      22
      ·
      8 days ago

      This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.

      • r00ty@kbin.life
        arrow-up
        19
        ·
        8 days ago

        The problem is that this assumes the AI creators who scrape relentlessly (and there are a fair few) would, after taking this data source directly, also add an exception to their scrapers to skip Wikipedia's site. I doubt they would bother.