We have all heard of crawlers, and building one may seem like a big undertaking. However, once we get our hands dirty it turns out to be less complex than it looks. Here we will walk through the steps for building a very basic crawler in PHP, which can then be extended into something bigger and more complex if needed. Since the idea is the same, this holds true for any other language. To start off, a very basic crawler involves:
1) Open the URL and read its contents.
2) Parse the HTML content using your own parser or any DOM parser, and extract whatever information is required.
With these steps in mind, let's get into the code:
In PHP we can use file_get_contents("http://…") to open a URL and read its content. However, it needs allow_url_fopen to be enabled, gives little control over the request, and often fails with a warning such as:
PHP Warning: file_get_contents(http://…): failed to open stream: HTTP request failed!
Hence we will use curl for this purpose. For curl to work, the PHP curl extension has to be installed and enabled. On a Debian-based machine use: sudo apt-get install php5-curl (on newer releases the package is named php-curl)
Then restart the Apache server using: sudo /etc/init.d/apache2 restart
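Before going further, it can help to confirm that the extension is actually loaded. The snippet below is a minimal sketch of such a check; the helper name curl_is_available is ours and not part of PHP. It assumes PHP 7 or newer because of the return type declaration.

```php
<?php
// Hypothetical helper: true when the curl extension is loaded in this PHP runtime.
function curl_is_available(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curl_is_available()) {
    $info = curl_version(); // array describing the libcurl build PHP is linked against
    echo "curl is enabled, libcurl " . $info['version'] . "\n";
} else {
    echo "curl is not enabled; install the extension and restart the web server\n";
}
```

Keep in mind that the command-line PHP and the Apache module may read different configuration files, so it is safest to run this check through the web server if that is where the crawler will run.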
Now you are ready to use curl in PHP code. Let's see the code to open a URL and get its content:
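$ch = curl_init(); // initialize curl
curl_setopt($ch, CURLOPT_HEADER, 0); // leave the response headers out of the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the data instead of printing it to the browser
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects, if any
curl_setopt($ch, CURLOPT_URL, "http://google.com/");
$data = curl_exec($ch); // returns the content as a string, or FALSE on failure
if ($data === FALSE) { echo curl_error($ch); } // report why the request failed
curl_close($ch);
echo $data; // see the content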
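For a crawler that visits many pages, it is convenient to wrap the fetching step in a function that fails quietly instead of printing a broken page. The following is a rough sketch under a few assumptions: the function name fetch_page, the user agent string, the timeout, and the redirect limit are all illustrative choices of ours, and the code assumes PHP 7.1 or newer (for the nullable return type).

```php
<?php
// Hypothetical helper: fetches $url and returns the response body,
// or null when the request fails or the server answers with an HTTP error.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => false,             // body only, no response headers
        CURLOPT_RETURNTRANSFER => true,              // return the data instead of printing it
        CURLOPT_FOLLOWLOCATION => true,              // follow redirects...
        CURLOPT_MAXREDIRS      => 5,                 // ...but not forever (assumed limit)
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,   // give up if the connection hangs
        CURLOPT_TIMEOUT        => $timeoutSeconds,   // give up if the whole transfer hangs
        CURLOPT_USERAGENT      => 'BasicCrawler/0.1' // made-up name; some sites reject empty agents
    ]);

    $body   = curl_exec($ch);                        // string on success, false on failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 when no HTTP response was received

    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

It could then be used as $data = fetch_page("http://google.com/"); followed by a check for null before parsing.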
Once we have the contents, they have to be parsed to get some useful information out. In our case we will try to find all the links present in the page, and for that a regular expression is enough; we do not need a DOM parsing library.
All the links can be matched using:
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $data, $links); // matches all the <a></a> tags
$links = !empty($links[0]) ? $links : FALSE; // FALSE when no link was found
print_r($links); // $links[0] holds the full tags, $links[1] the attributes, $links[2] the link text
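The match above gives us whole tags, while a crawler usually wants just the target of each link. Here is one possible sketch that pulls out the href values directly. The function name extract_hrefs is hypothetical, and the pattern only handles quoted attribute values, so treat it as a starting point and not as a complete HTML parser.

```php
<?php
// Hypothetical helper: returns the unique href values of all <a> tags in $html.
// Only quoted values (href="..." or href='...') are recognised.
function extract_hrefs(string $html): array
{
    $hrefs = [];
    // (?=\s) makes sure we match <a ...> and not tags such as <abbr>;
    // \shref requires whitespace before the attribute, so data-href is skipped.
    $pattern = '/<a(?=\s)[^>]*?\shref\s*=\s*(["\'])(.*?)\1/is';

    if (preg_match_all($pattern, $html, $matches)) {
        foreach ($matches[2] as $href) {
            $href = html_entity_decode(trim($href)); // turn &amp; back into &
            if ($href !== '' && $href[0] !== '#') {  // skip empty and in-page anchors
                $hrefs[] = $href;
            }
        }
    }
    return array_values(array_unique($hrefs));       // drop duplicates, reindex from 0
}
```

Calling print_r(extract_hrefs($data)); on the page fetched earlier would list the link targets, which can then be fed back into the crawler as the next URLs to visit.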
Similarly, we can get images and other information from the page.
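As an illustration of "similarly", the same idea can be generalised to any tag and attribute, for example the src of every img tag. This is only a sketch: extract_attribute is a name we made up, and like the link pattern it assumes quoted attribute values.

```php
<?php
// Hypothetical helper: returns the unique values of attribute $attr
// on every <$tag ...> element found in $html (quoted values only).
function extract_attribute(string $html, string $tag, string $attr): array
{
    $pattern = sprintf(
        '/<%s(?=\s)[^>]*?\s%s\s*=\s*(["\'])(.*?)\1/is',
        preg_quote($tag, '/'),   // escape the names so they are matched literally
        preg_quote($attr, '/')
    );
    preg_match_all($pattern, $html, $matches);

    // $matches[2] holds the text between the quotes for every match
    return array_values(array_unique(array_map('trim', $matches[2])));
}
```

For images this would be called as extract_attribute($data, 'img', 'src'), where $data is the page content fetched earlier.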
That's it! We have opened a URL, read its contents, and got all the links present in that page.