• Eager Eagle@lemmy.world
    English
    arrow-up
    22
    ·
    9 days ago

    This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.

    • r00ty@kbin.life
      arrow-up
      19
      ·
      9 days ago

      The problem is that this assumes the kind of AI creators who scrape relentlessly (and there are a fair few who do) would, even if they took this data source directly, then add an exception to their scrapers to skip Wikipedia’s site. I doubt they would bother.