Monday 26 April 2010

Wikipedia API User Agent String in PHP and cURL

We run a site that pulls data from wikipedia.org and recently the site stopped working. The site was using the following code to interact with the wikipedia.org API; this code queries the API to see if a page with the given title exists.
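// Build the query URL; urlencode() makes the title safe to embed in it.
$url = sprintf('http://en.wikipedia.org/w/api.php?action=query&titles=%s&prop=info&format=json', urlencode($search));
$f = fopen($url, 'r');
if ($f === false) {
    // Without this guard, feof() on a failed handle never returns true.
    throw new RuntimeException('Could not open ' . $url);
}
$res = '';
while (!feof($f)) {
    $res .= fgets($f);
}
fclose($f);
require_once 'Zend/Json.php';
// Decode the JSON response into a PHP array.
$val = Zend_Json::decode($res);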
Once this has executed, $val is an array with the response details.
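As a rough sketch of what to do with $val next: in the API's default JSON format, results for a query like this are nested under query, then pages, keyed by page ID, and a title with no page normally comes back with a "missing" entry (or "invalid" for a malformed title). The helper name wikipediaPageExists() and the sample arrays below are made up for illustration, and the response shape is an assumption that should be checked against a real response.

```php
<?php
// Hypothetical helper: true if the decoded response describes at least
// one page that is neither missing nor invalid.
function wikipediaPageExists(array $val)
{
    if (!isset($val['query']['pages']) || !is_array($val['query']['pages'])) {
        return false; // unexpected shape, treat as "not found"
    }
    foreach ($val['query']['pages'] as $page) {
        if (!is_array($page)) {
            continue;
        }
        // Missing or invalid titles are flagged with a "missing" or "invalid" key.
        if (!array_key_exists('missing', $page) && !array_key_exists('invalid', $page)) {
            return true;
        }
    }
    return false;
}

// Illustrative sample data, not captured from the live API.
$found = array('query' => array('pages' => array(
    '12345' => array('pageid' => 12345, 'ns' => 0, 'title' => 'PHP'),
)));
$missing = array('query' => array('pages' => array(
    '-1' => array('ns' => 0, 'title' => 'No such page', 'missing' => ''),
)));

var_dump(wikipediaPageExists($found));   // bool(true)
var_dump(wikipediaPageExists($missing)); // bool(false)
```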
Then this code began failing with an HTTP 403 status. A 403 (Forbidden) response means the server understood the request but refuses to fulfil it.
A quick investigation turned up the page meta.wikimedia.org/wiki/User-Agent_policy, which explains that requests to the API must now include a User-Agent string; requests without one are refused. The User-Agent is an HTTP request header, sent by browsers and other clients, that identifies the software making the request.
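The policy asks for a string that identifies the client and gives its operator's contact details. A minimal sketch of building one is below. The function name, application name, version, URL and email address are all placeholders, and the layout (name/version followed by contact details in parentheses) is just one common convention, not something the policy enforces.

```php
<?php
// Hypothetical helper: assembles a descriptive User-Agent string.
function buildUserAgent($appName, $version, $siteUrl, $contactEmail)
{
    // Strip CR/LF so no value can smuggle an extra header line into the request.
    $clean = function ($value) {
        return trim(str_replace(array("\r", "\n"), '', $value));
    };

    return sprintf(
        '%s/%s (%s; %s)',
        $clean($appName),
        $clean($version),
        $clean($siteUrl),
        $clean($contactEmail)
    );
}

echo buildUserAgent('ExampleWikiChecker', '1.0', 'http://www.example.com/', 'webmaster@example.com');
// ExampleWikiChecker/1.0 (http://www.example.com/; webmaster@example.com)
```

A string built this way can be passed wherever the HTTP client expects its User-Agent value.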
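The problem was that fopen(), as called above, sends no User-Agent string: PHP's HTTP stream wrapper only adds one when the user_agent ini setting or a stream context supplies it. Rather than configure that, we moved to a library that gives direct control over the request.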
This is where cURL comes in (www.php.net/manual/en/intro.curl.php). cURL is a library for communicating over various internet protocols and allows you to set headers in requests. The same code above, rewritten to use cURL, is as follows:
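$url = sprintf('http://en.wikipedia.org/w/api.php?action=query&titles=%s&prop=info&format=json', urlencode($search));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response body from curl_exec() instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Identify the client; replace this with your own site address or application name.
curl_setopt($ch, CURLOPT_USERAGENT, 'your website address or app name');
$res = curl_exec($ch);
if ($res === false) {
    // Read the error message before the handle is closed.
    $error = curl_error($ch);
    curl_close($ch);
    throw new RuntimeException('cURL request failed: ' . $error);
}
curl_close($ch);
require_once 'Zend/Json.php';
$val = Zend_Json::decode($res);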
This small change was all that was needed to satisfy Wikipedia's User-Agent requirement.
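For scripts that cannot use cURL, PHP's HTTP stream wrapper accepts a User-Agent string through a stream context. The following is a sketch of that alternative, not the code we deployed. The helper name, the User-Agent value and the 10 second timeout are placeholders.

```php
<?php
// Hypothetical helper: builds a stream context whose HTTP requests
// carry the given User-Agent string.
function buildWikipediaContext($userAgent)
{
    return stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice
        ),
    ));
}

$context = buildWikipediaContext('ExampleWikiChecker/1.0 (http://www.example.com/)');

// Usage (performs a real HTTP request, so it is left commented out here):
// $res = file_get_contents($url, false, $context);
```

The same context can be passed as the fourth argument to fopen() if the read loop needs to stay as it was.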