Mooduino: Wikipedia API User Agent String in PHP and cURL

Monday, 26 April 2010

We run a site that pulls data from wikipedia.org, and recently it stopped working. The site used the following code to query the wikipedia.org API and check whether a page with the given title exists.

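// Build the API query; urlencode() keeps the title safe inside the query string.
$url = sprintf('http://en.wikipedia.org/w/api.php?action=query&titles=%s&prop=info&format=json', urlencode($search));
// Open the URL through PHP's HTTP stream wrapper and read the whole response.
$f = fopen($url, 'r');
$res = '';
while (!feof($f)) {
    $res .= fgets($f);
}
fclose($f);
// Decode the JSON response into a PHP array.
require_once 'Zend/Json.php';
$val = Zend_Json::decode($res);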

Once this has executed, $val is an array with the response details.
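As a rough illustration of what those response details look like: for a prop=info query, the decoded array nests one entry per requested title under query, then pages, and a title with no page carries a missing key instead of a pageid. The sketch below assumes that response shape and uses PHP's built-in json_decode() in place of Zend_Json so that it runs on its own. The helper name wikipediaPageExists() and both sample responses are made up for the example.

```php
<?php
// Hypothetical helper (not from the original site): returns true when the
// decoded prop=info response contains a page entry that has a pageid and
// is not flagged as missing.
function wikipediaPageExists(array $val): bool
{
    if (!isset($val['query']['pages']) || !is_array($val['query']['pages'])) {
        return false;
    }
    foreach ($val['query']['pages'] as $page) {
        if (is_array($page) && isset($page['pageid']) && !array_key_exists('missing', $page)) {
            return true;
        }
    }
    return false;
}

// Hand-written, trimmed sample responses (illustrative only).
$found = json_decode('{"query":{"pages":{"23976719":{"pageid":23976719,"ns":0,"title":"Football"}}}}', true);
$missing = json_decode('{"query":{"pages":{"-1":{"ns":0,"title":"No such page","missing":""}}}}', true);

var_dump(wikipediaPageExists($found));   // bool(true)
var_dump(wikipediaPageExists($missing)); // bool(false)
```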
The problem was that this code started returning a 403 HTTP status. The 403 (Forbidden) status code means the server understood the request but refuses to fulfil it.
A quick investigation turned up the page meta.wikimedia.org/wiki/User-Agent_policy, which explains that API requests must now include a User-Agent header and that requests without one are refused. Browsers send this header with every request; it describes the software making the request.
The problem was that fopen() doesn't send a User-Agent string by default; it only does so when the user_agent ini setting or a stream context supplies one.
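Strictly speaking, PHP's HTTP stream wrapper can send a User-Agent: it reads a user_agent option from a stream context passed to fopen(). A minimal sketch of that alternative follows, assuming allow_url_fopen is enabled. The helper names, the timeout and the exception are illustrative choices, not part of the original site's code.

```php
<?php
// Hypothetical helper: builds a stream context whose HTTP requests carry
// the given User-Agent header. The "http" options also apply to https:// URLs.
function userAgentContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice for this sketch
        ],
    ]);
}

// Hypothetical helper: reads the whole response through fopen(), as the
// original code did, but with the context supplied as the fourth argument.
function fetchWithFopen(string $url, string $userAgent): string
{
    $f = fopen($url, 'r', false, userAgentContext($userAgent));
    if ($f === false) {
        throw new RuntimeException('Could not open ' . $url);
    }
    $res = stream_get_contents($f);
    fclose($f);
    if ($res === false) {
        throw new RuntimeException('Could not read from ' . $url);
    }
    return $res;
}
```

Calling fetchWithFopen($url, 'your website address or app name') would then return the raw JSON string, ready for decoding.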
We switched to cURL (www.php.net/manual/en/intro.curl.php), a library for communicating over various internet protocols that lets you set request headers, including the User-Agent. The same code, rewritten to use cURL, is as follows:

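// Build the API query; urlencode() keeps the title safe inside the query string.
$url = sprintf('http://en.wikipedia.org/w/api.php?action=query&titles=%s&prop=info&format=json', urlencode($search));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response body as a string instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Identify the client, as the User-Agent policy requires.
curl_setopt($ch, CURLOPT_USERAGENT, 'your website address or app name');
$res = curl_exec($ch);
curl_close($ch);
require_once 'Zend/Json.php';
$val = Zend_Json::decode($res);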

This small change was all that was needed to satisfy Wikipedia's User-Agent requirement.
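The snippet above assumes the request succeeds. A hedged sketch of a wrapper that surfaces transport errors and non-200 statuses (such as the 403) is given below. The function names, the exceptions, the timeout and the https scheme are assumptions made for illustration, and json_decode() stands in for Zend_Json so the example has no external dependency.

```php
<?php
// Hypothetical helper: builds the same prop=info query URL as in the post.
function wikipediaInfoUrl(string $title): string
{
    $query = http_build_query([
        'action' => 'query',
        'titles' => $title,
        'prop'   => 'info',
        'format' => 'json',
    ], '', '&');
    return 'https://en.wikipedia.org/w/api.php?' . $query;
}

// Hypothetical helper: performs the GET and returns the decoded JSON,
// throwing if cURL fails or the server answers with anything but 200.
function fetchWikipediaJson(string $url, string $userAgent): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // seconds; arbitrary for this sketch

    $res = curl_exec($ch);
    $error = curl_error($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // The handle is released when $ch goes out of scope.

    if ($res === false) {
        throw new RuntimeException('cURL error: ' . $error);
    }
    if ($status !== 200) {
        throw new RuntimeException('Unexpected HTTP status ' . $status);
    }
    $val = json_decode($res, true);
    if (!is_array($val)) {
        throw new RuntimeException('Response was not valid JSON');
    }
    return $val;
}
```

With these in place, the lookup becomes $val = fetchWikipediaJson(wikipediaInfoUrl($search), 'your website address or app name'); and a refused request raises an exception instead of passing an error page to the decoder.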

9 comments:

jarlaxle18028 May 2010 at 00:18
Thank you so much for this article; I was looking at all the wrong reasons for the new 403 my script was receiving.

One question though (I'm new to PHP) -- what does "Zend_Json::decode($res);" do (including the handling in the Zend_Json.php file)?

Thanks,
Casey

Unknown 10 July 2010 at 18:31
Hi Casey

Zend_Json::decode() is part of the Zend Framework (http://zendframework.com/manual/en/zend.json.basics.html) and is used here to turn the JSON string that Wikipedia returns into native PHP associative arrays.

Casey 13 July 2010 at 00:22
Thanks! Exactly what I wanted to know.

DIM3NSION 1 May 2011 at 13:38
Hi moo. Thank you for your example. I've managed to get a response; however, the array only gives me this information from Wikipedia (I searched for football):

[pages] => Array
    (
        [23976719] => Array
            (
                [pageid] => 23976719
                [ns] => 0
                [title] => Football
                [touched] => 2011-04-28T16:48:04Z
                [lastrevid] => 426407882
                [counter] =>
                [length] => 89919
            )

    )

How would I actually get the information about the subject football?

Thanks,

DIM3NSION

Unknown 1 May 2011 at 14:33
Hi DIM3NSION

To get the actual content, I used the parse action from the wiki API. So, the URL in your case would be

http://en.wikipedia.org/w/api.php?action=parse&page=football&redirects=1&format=json&prop=text

I hope this helps.

Michael

DIM3NSION 2 May 2011 at 16:14
Perfect, thanks for that. I'm struggling to figure out a way of parsing just the introduction text that accompanies the article. Any ideas?

Thanks,

DIM3NSION

DIM3NSION 3 May 2011 at 10:58
Following up on my previous message, moo: I've been able to cut it down by adding &section=0 to the end of the URL. However, this also returns the images, and all I want is the introduction text. I'd much appreciate your help with this.

thanks,
DIM3NSION

Anonymous 28 July 2011 at 05:57
It's good. I want to use it on my website, http://www.satelliteview.org

Anonymous 19 December 2013 at 22:36
Thank you for this helpful post.