Saturday, August 28, 2010

Data Scraping Part-2

Here is the second article on data scraping. As I mentioned in my previous post, I was working on scraping data from an ASP.NET MVC website, and while doing that I ran into one more problem.

One field on the page was visible only to logged-in users. So I first had to create an ASP.NET session with PHP cURL, and only then could I request that particular page. How do you do that? The procedure is as follows.

First, log in through PHP cURL so that the site issues a session cookie, and then reuse the same cURL handle to request the restricted page. Here is the code for the login step.

$url = $this->login_url;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
// Disabling certificate verification is insecure; do it only while testing.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8');

curl_setopt($ch, CURLOPT_AUTOREFERER, 1);

curl_setopt($ch, CURLOPT_POSTFIELDS, 'userName='.urlencode($this->user).'&password='.urlencode($this->pass).'&rememberMe=true&rememberMe=false');

// Keep the session cookie and send it back on later requests.
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

// Send the login request so that the server actually creates the session.
$login_response = curl_exec($ch);

Two options in the above code are new: CURLOPT_USERAGENT and CURLOPT_POSTFIELDS.

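If you look carefully at CURLOPT_POSTFIELDS, you will see that the data is different from ordinary post fields: rememberMe is sent twice. Almost every login page has a "remember me" option, and this is most likely because the ASP.NET MVC checkbox helper renders it as a checkbox plus a hidden field with the same name, so the browser posts both values when the box is ticked.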
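
Because the same field name appears twice, an associative array cannot describe this post body. As a rough sketch, the string could be built from a list of name/value pairs instead. The helper name build_post_body and the credentials below are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): builds an
// application/x-www-form-urlencoded body from [name, value] pairs,
// so the same field name may appear more than once.
function build_post_body(array $pairs): string
{
    $parts = [];
    foreach ($pairs as $pair) {
        $parts[] = urlencode($pair[0]) . '=' . urlencode($pair[1]);
    }
    return implode('&', $parts);
}

// Placeholder credentials, for illustration only.
$body = build_post_body([
    ['userName', 'demo user'],
    ['password', 'p@ss&word'],
    ['rememberMe', 'true'],
    ['rememberMe', 'false'],
]);
```

The resulting $body could then be passed to CURLOPT_POSTFIELDS in place of the hand-concatenated string.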

The other new option is CURLOPT_USERAGENT. It identifies the browser and operating system that the request claims to come from.

How can you build these two values correctly? I used the Live HTTP Headers extension for Firefox. You can find it at http://livehttpheaders.mozdev.org/

Install it in Firefox and then simply browse the website you want to get information from. The extension captures the headers and post data of every request your browser sends, and you can copy those values into your cURL request. Use them in the above code, and then use the following code to get the restricted information on the page.
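
If you save a captured request as text, a small parser can pull out the values that cURL needs. The following is only a sketch: parse_captured_request is a hypothetical helper, and the sample capture (host, path and credentials) is invented, so a real capture may need extra handling.

```php
<?php
// Hypothetical helper: splits a raw captured HTTP request into the
// method, path, headers and body. Assumes a request line, header lines,
// a blank line and then the post data.
function parse_captured_request(string $raw): array
{
    // Normalise line endings, then separate the header block from the body.
    $raw = str_replace("\r\n", "\n", trim($raw));
    $sections = explode("\n\n", $raw, 2);
    $lines = explode("\n", $sections[0]);

    // The first line looks like "POST /some/path HTTP/1.1".
    $requestLine = explode(' ', array_shift($lines));

    $headers = [];
    foreach ($lines as $line) {
        // Split on the first colon only; header values may contain colons.
        $pair = explode(':', $line, 2);
        if (count($pair) === 2) {
            $headers[trim($pair[0])] = trim($pair[1]);
        }
    }

    return [
        'method'  => $requestLine[0],
        'path'    => isset($requestLine[1]) ? $requestLine[1] : '',
        'headers' => $headers,
        'body'    => isset($sections[1]) ? $sections[1] : '',
    ];
}

// Invented sample capture, for illustration only.
$capture = "POST /Account/LogOn HTTP/1.1\n"
    . "Host: www.example.com\n"
    . "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8\n"
    . "Content-Type: application/x-www-form-urlencoded\n"
    . "\n"
    . "userName=demo&password=secret&rememberMe=true&rememberMe=false";

$request = parse_captured_request($capture);
```

Here $request['headers']['User-Agent'] would go into CURLOPT_USERAGENT and $request['body'] into CURLOPT_POSTFIELDS.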

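// Reuse the same handle, which now carries the session cookie.
curl_setopt($ch, CURLOPT_URL, $page_url);
// Switch back from POST to a normal GET request.
curl_setopt($ch, CURLOPT_POST, false);
// $store holds the HTML of the restricted page.
$store = curl_exec($ch);
curl_close($ch);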
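
If the login fails, many sites simply send the login form back instead of the restricted page. A quick sanity check on the returned HTML can catch that. This is a rough heuristic and not something from the site in question: looks_like_login_page is a hypothetical helper that assumes the login form contains a password input.

```php
<?php
// Hypothetical heuristic: treat the page as a login form if it still
// contains an <input> whose type is "password".
function looks_like_login_page(string $html): bool
{
    return preg_match('/<input[^>]*type\s*=\s*["\']?password/i', $html) === 1;
}
```

You could call looks_like_login_page($store) after the request above and re-check the post fields and user agent if it returns true.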

Hope this helps.


