Hacking on Wikipedia: links for 2008-03-20
I was recently talking to a friend about writing a screen-scraping tool for Wikipedia.
My first two thoughts:
- Scraping can be problematic at best
  - Dealing with nasty, nasty markup
  - Prone to change
- There has to be a better way
  - Preferably someone else has already done the legwork
After some digging, I uncovered the following bits:
- Wikipedia Database Download: You can get it as XML or SQL, with varying degrees of detail (e.g. all revisions, all languages, with comments)
- MediaWiki API: Web service for various tasks (mostly just searching at the moment). Can get results in many formats, including XML, YAML, and JSON
- Ruby client for Wikipedia API: A Ruby wrapper in its infancy
The API is very new, so your mileage may vary.
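To give a feel for what calling the API instead of scraping might look like, here is a minimal Python sketch of a search request that asks for JSON. The endpoint URL (English Wikipedia's `api.php`), the `User-Agent` string, and the helper names `build_search_url`, `extract_titles`, and `search` are my own assumptions for illustration, and the parameters reflect the `list=search` query module as I understand it, so check the API documentation before relying on any of it.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the English Wikipedia endpoint; other wikis expose their own api.php.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(term, limit=5):
    """Build a full-text search URL that asks the API for a JSON response."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_titles(payload):
    """Pull the page titles out of an already-decoded search response."""
    return [hit["title"] for hit in payload.get("query", {}).get("search", [])]


def search(term, limit=5):
    """Run the search over HTTP and return matching titles (needs network access)."""
    request = urllib.request.Request(
        build_search_url(term, limit),
        # Hypothetical User-Agent; use one that identifies your own tool.
        headers={"User-Agent": "wiki-search-sketch/0.1 (example script)"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_titles(json.load(response))
```

Calling `search("screen scraping")` should return a short list of page titles. Building the URL and parsing the response are kept in separate functions so each can be checked without touching the network.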
Update:
Rob Cakebread pointed me at a couple more resources:
- DBpedia: a project to extract semantic information from Wikipedia and make it available on the web
- Linking Open Data dataset cloud: shows how DBpedia fits into the Linking Open Data project