A Simple PHP Web Crawler in 5 Minutes

We have all heard of crawlers, and building one may seem like a big undertaking at first. Once we get our hands dirty, however, it turns out not to be that complex. Here we will walk through the steps for building a very basic crawler in PHP, which can then be enhanced into something bigger and more complex if needed. Since the idea is the same, this holds true for any other language. To start off, a very basic crawler involves:

1) Open the URL and read its contents.

2) Parse the HTML content using your own custom parser or any DOM parser, and extract whatever information is required.

With the above steps in mind, let's get into coding:

In PHP we can use file_get_contents("http://…") to open a URL and read its content. However, it often fails (for example, when the remote server rejects the request) with a warning such as:

PHP Warning:  file_get_contents(“http://): failed to open stream: HTTP request failed!
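If you still want to try file_get_contents() before moving on, here is a minimal sketch that sets a timeout and a User-Agent and turns the warning into a checkable return value. The function name fetchWithFileGetContents, the agent string, and the 10-second timeout are made up for illustration, and it assumes allow_url_fopen is enabled in php.ini.

```php
<?php
// Hypothetical helper (not from the original article): fetch a URL or
// local path with file_get_contents() and return null when it fails.
function fetchWithFileGetContents(string $url): ?string
{
    // Options for the http stream wrapper; the values are illustrative.
    $context = stream_context_create([
        'http' => [
            'timeout'    => 10,                   // give up after 10 seconds
            'user_agent' => 'SimpleCrawler/0.1',  // made-up agent string
        ],
    ]);

    // @ suppresses the PHP warning; failure shows up as false instead.
    $data = @file_get_contents($url, false, $context);

    return $data === false ? null : $data;
}
```

Calling fetchWithFileGetContents("http://google.com/") would then give either the page content or null, so the caller can check the result instead of relying on the warning.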

We will therefore use curl for this purpose. For curl to work, the PHP curl extension must be installed and enabled. On a Debian-based machine use:     sudo apt-get install php5-curl     (on newer releases, where PHP 5 is no longer packaged, the package is named php-curl)

Then restart the Apache server using:    sudo /etc/init.d/apache2 restart

Now you are ready to use curl in PHP code. Let's see the code to open a URL and get its content:

$ch = curl_init();  // initialize curl
curl_setopt($ch, CURLOPT_HEADER, 0);  // do not include the response headers in the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the data instead of printing it to the browser
curl_setopt($ch, CURLOPT_URL, "http://google.com/");
$data = curl_exec($ch);  // returns the content as a string, or false on failure
curl_close($ch);
echo $data;  // see the content
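The same calls can be wrapped in a small reusable function. The sketch below is only one way to do it: the name fetchUrl, the $error parameter, the timeouts, the redirect limit, and the agent string are all assumptions added for illustration. It also follows redirects and reports failures instead of echoing whatever came back.

```php
<?php
// Hypothetical wrapper around the cURL calls shown above.
// Returns the response body, or null on failure (details go in $error).
function fetchUrl(string $url, ?string &$error = null): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => false,  // body only, no response headers
        CURLOPT_RETURNTRANSFER => true,   // return the data instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,      // illustrative limit
        CURLOPT_CONNECTTIMEOUT => 10,     // seconds, illustrative
        CURLOPT_TIMEOUT        => 20,     // seconds, illustrative
        CURLOPT_USERAGENT      => 'SimpleCrawler/0.1',  // made-up agent string
    ]);

    $data = curl_exec($ch);
    if ($data === false) {
        // Transport-level failure: DNS, connection, unsupported protocol, ...
        $error = curl_error($ch);
        return null;
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status < 200 || $status >= 300) {
        $error = "Unexpected HTTP status $status";
        return null;
    }

    return $data;
}
```

A call such as $data = fetchUrl("http://google.com/", $err); leaves either the page content in $data or a short description of the problem in $err.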

Once we have the contents, they have to be parsed to extract useful information. In our case we will try to find all the links present in the page, and for that we can use a regular expression instead of a DOM parsing library.

All the links can be matched using:

if (!empty($data)) {
    // Match all <a>...</a> tags; group 1 captures each tag's attribute text (including href).
    preg_match_all('/<a([^>]+)>(.*?)<\/a>/i', $data, $links);

    $links = !empty($links[1]) ? $links[1] : FALSE;
    print_r($links);    // print the captured link attributes, which are stored as an array
}
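The pattern above captures the whole attribute text of each tag. To pull out just the URLs, one option is a second pattern that targets the href value directly. This is a simplified sketch: the function name extractLinks is made up, and it only handles href values wrapped in single or double quotes.

```php
<?php
// Hypothetical helper: return the distinct href values of all <a> tags.
// Simplified on purpose: unquoted href values are not matched.
function extractLinks(string $html): array
{
    // Group 1 is the opening quote, group 2 is the URL up to the same quote.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';

    preg_match_all($pattern, $html, $matches);

    // Drop duplicates and reindex so the result is a plain list.
    return array_values(array_unique($matches[2]));
}
```

For example, extractLinks($data) on the page fetched earlier would give an array of URL strings, some of which may be relative paths such as /about.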

Similarly, we can get images and other information from the page.
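For images, here is a rough sketch that uses PHP's bundled DOMDocument class rather than a regular expression, mostly to show what the DOM parser route looks like. The function name extractImageSources is made up, and it assumes the dom extension is available.

```php
<?php
// Hypothetical helper: return the src attribute of every <img> tag.
function extractImageSources(string $html): array
{
    if (trim($html) === '') {
        return [];  // loadHTML() does not accept an empty string
    }

    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect parse errors quietly
    // instead of emitting a warning for each one.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');  // empty string when there is no src
        if ($src !== '') {
            $sources[] = $src;
        }
    }

    return $sources;
}
```

The same loop works for other tags: swapping 'img' and 'src' for 'a' and 'href' would give the links instead.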

That's it! We have opened a URL, read its contents, and got all the links present in that page.
