Monday 26 April 2010

Wikipedia API User Agent String in PHP and cURL

We run a site that pulls data from wikipedia.org, and recently it stopped working. The site used the following code to query the wikipedia.org API and check whether a page with the given title exists.
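// Build the query URL; $search holds the page title to look up.
$url = sprintf('http://en.wikipedia.org/w/api.php?action=query&titles=%s&prop=info&format=json', urlencode($search));
$f = fopen($url, 'r');
// Read the whole response body.
$res = '';
while (!feof($f)) {
    $res .= fgets($f);
}
fclose($f);
// Decode the JSON response into a PHP array.
require_once 'Zend/Json.php';
$val = Zend_Json::decode($res);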
Once this has executed, $val is an array with the response details.
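As a rough sketch of how the decoded array might be inspected, the snippet below checks whether any returned page actually exists. It assumes the usual shape of an action=query response, where pages sit under query.pages and a non-existent title carries a "missing" key. The function name wikipediaPageExists and the sample responses are made up for illustration, and json_decode() stands in for Zend_Json::decode() so the example runs without Zend Framework.

```php
<?php
// Hypothetical helper: returns true if at least one page in a decoded
// action=query response is neither "missing" nor "invalid".
function wikipediaPageExists(array $val): bool
{
    $pages = $val['query']['pages'] ?? [];
    foreach ($pages as $page) {
        if (!array_key_exists('missing', $page) && !array_key_exists('invalid', $page)) {
            return true;
        }
    }
    return false;
}

// Hand-written sample responses (assumed shape, not captured from the live API).
$found = json_decode('{"query":{"pages":{"23976719":{"pageid":23976719,"ns":0,"title":"Football"}}}}', true);
$missing = json_decode('{"query":{"pages":{"-1":{"ns":0,"title":"No such page","missing":""}}}}', true);

var_dump(wikipediaPageExists($found));
var_dump(wikipediaPageExists($missing));
```

With Zend Framework available, the same function should accept the array returned by Zend_Json::decode($res) directly.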
The problem we encountered was that this code began failing with an HTTP 403 status. A 403 (Forbidden) response means the server understood the request but refuses to fulfil it.
A quick investigation turned up the page meta.wikimedia.org/wiki/User-Agent_policy, which explains that API requests must now include a User Agent string; requests without one are refused. A User Agent string is sent in the User-Agent HTTP request header and describes the software making the request. Browsers send one with every request.
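To make the idea of a descriptive User Agent string concrete, here is a small sketch that assembles one in the general "name/version (contact) library/version" shape that the policy page suggests. The helper name buildUserAgent, the application name, and the contact URL are all placeholders, and the exact format Wikimedia prefers may change, so check the policy page before relying on it.

```php
<?php
// Hypothetical helper: builds a User-Agent value that names the client,
// gives a contact point, and notes the PHP version in use.
function buildUserAgent(string $name, string $version, string $contact): string
{
    return sprintf('%s/%s (%s) PHP/%s', $name, $version, $contact, PHP_VERSION);
}

// Placeholder values; substitute your own application name and contact details.
$userAgent = buildUserAgent('ExampleWikiLookup', '1.0', 'https://example.org/contact');
echo $userAgent, "\n";
```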
The problem was that fopen() doesn't send a User Agent string by default.
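Strictly speaking, PHP's HTTP stream wrapper sends no User Agent unless one is configured, but it does accept one through the user_agent php.ini setting or a stream context. The sketch below shows the stream context route as an alternative; the User Agent value is a placeholder, and the fopen() call is left commented out so the snippet runs without network access.

```php
<?php
// Placeholder User-Agent value; use something that identifies your own site.
$userAgent = 'ExampleWikiLookup/1.0 (https://example.org/contact)';

// The "http" wrapper reads the User-Agent header from the user_agent option.
$context = stream_context_create([
    'http' => ['user_agent' => $userAgent],
]);

// The context is passed as the fourth argument to fopen():
// $f = fopen($url, 'r', false, $context);

// Confirm the option was stored on the context.
$options = stream_context_get_options($context);
echo $options['http']['user_agent'], "\n";
```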
This is where cURL comes in (www.php.net/manual/en/intro.curl.php). cURL is a library for communicating over various internet protocols, and it lets you set request headers explicitly. The code above, rewritten to use cURL, is as follows:
$url = sprintf('http://en.wikipedia.org/w/api.php?action=query&titles=%s&prop=info&format=json', urlencode($search));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response body from curl_exec() instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Identify your site or application, as the User Agent policy requires.
curl_setopt($ch, CURLOPT_USERAGENT, 'your website address or app name');
$res = curl_exec($ch); // false if the request failed
curl_close($ch);
require_once 'Zend/Json.php';
$val = Zend_Json::decode($res);
This small change was all that was needed to satisfy Wikipedia's User Agent requirement.

9 comments:

  1. Thank you so much for this article; I was looking at all the wrong reasons for the new 403 my script was receiving.

    One question though (I'm new to php) -- what does "Zend_Json::decode($res);" do (including the handling in the Zend_Json.php file)?

    Thanks,
    Casey

  2. Hi Casey

    Zend_Json::decode() is part of the Zend Framework (http://zendframework.com/manual/en/zend.json.basics.html) and is used here to turn the JSON string that Wikipedia returns into native PHP associative arrays.

  3. Thanks! Exactly what I wanted to know.

  4. Hi moo. Thank you for your example. I've managed to get a response; however, the array only gives me this information from Wikipedia (I searched for football):

    [pages] => Array
        (
            [23976719] => Array
                (
                    [pageid] => 23976719
                    [ns] => 0
                    [title] => Football
                    [touched] => 2011-04-28T16:48:04Z
                    [lastrevid] => 426407882
                    [counter] =>
                    [length] => 89919
                )

        )


    How would I actually get the information about the subject football?

    Thanks,

    DIM3NSION

  5. Hi DIM3NSION

    To get the actual content, I used the parse action from the wiki API. So, the URL in your case would be

    http://en.wikipedia.org/w/api.php?action=parse&page=football&redirects=1&format=json&prop=text

    I hope this helps.


    Michael

  6. Perfect, thanks for that. I'm struggling to figure out a way of parsing just the introduction text that accompanies the article. Any ideas?

    Thanks,

    DIM3NSION

  7. In follow-up to my previous message, moo, I've been able to cut it down by adding &section=0 to the end of the URL. However, this also returns the images, and all I want is the introduction text. I'd much appreciate your help on this topic.

    thanks,
    DIM3NSION

  8. Is good, I want to use it on my website http://www.satelliteview.org

  9. Thank you for this helpful post.
