PHP: Screen Scraping ESEA

Recently I had some free time and decided to automate some of my common tasks. And let me tell you honestly, I hate screen scraping. It’s an annoying, tedious task: writing a regex for this, another for that, and then finding out my hours were wasted because those regexes won’t work on another site.

That’s a thing of the past.

I’m super excited about this find. Maybe I’m the last to discover it, but it’s just too awesome to pass up.

The project is called PHP Simple HTML DOM Parser.

Literally, this takes almost all of the frustration out of screen scraping. Here’s an excerpt from a quick and dirty script I wrote to log in and grab my stats from ESEA.

<?php
require_once('simple_html_dom.php'); // get our class
define('COOKIE_JAR','./cookie/cookie'); // cookie jar for cURL
/**
* We need to set up our referrer and user agent, for cURL
**/
$referrer = "http://www.esportsea.com/";
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4";
/**
* str = our direct link to our user page
**/
$str = "http://www.esportsea.com/users/<your user id>";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$str);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_REFERER, $referrer);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE_JAR);
curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE_JAR);
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // timeout in seconds, as an integer
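$result = curl_exec($ch);
if ($result === false) {
    // bail out early if the request failed, instead of parsing an empty page
    $error = curl_error($ch);
    curl_close($ch);
    die("cURL request failed: " . $error);
}
curl_close($ch);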
// done with cURL
$html = new simple_html_dom(); // create our HTML object
$html->load($result);
$pug_stats = $html->find('#body-matches-pug table tr'); // load up pug stats
// this loops through each <tr>
$output = array();
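foreach ($pug_stats as $stat)
{
    $img   = $stat->find('img', 0); // game icon
    $match = $stat->find('a', 0);   // match link, text is the score
    $srv   = $stat->find('a', 1);   // server link
    // find() returns null when nothing matches, so skip header or odd rows
    if (!$img || !$match || !$srv) {
        continue;
    }
    $row = array(); // start each row fresh
    $row['game'] = trim($img->title);
    $row['link'] = trim($match->href);
    $row['score'] = trim($match->plaintext);
    $row['srv'] = trim($srv->href);
    $row['srv_txt'] = trim($srv->plaintext);
    $output[] = $row;
}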
echo "<pre>",print_r($output,true),"</pre>";
exit;

And that’s it.

Now take a look at that, and realize how much stuff I’m not forced to do.

I know this isn’t some great new invention, loading the source into a DOM object and parsing it, but man, this almost eliminates the need to think about screen scraping entirely.
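
For comparison, here is a rough sketch of that same "load it into a DOM object and parse it" idea using only PHP's built-in DOMDocument and DOMXPath classes instead of Simple HTML DOM. The sample markup and the parse_pug_rows() helper are made up for illustration: I'm assuming a page shaped like the one my selector above expects, and the real ESEA markup may well differ.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$sample = <<<HTML
<div id="body-matches-pug">
  <table>
    <tr><th>Game</th><th>Match</th><th>Server</th></tr>
    <tr>
      <td><img src="cs.png" title="Counter-Strike"></td>
      <td><a href="/match/1">16-10</a></td>
      <td><a href="/server/7">Server 7</a></td>
    </tr>
  </table>
</div>
HTML;

// Illustrative helper (not part of any library): pull the stat rows out of the markup.
function parse_pug_rows($markup)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep libxml warnings internal while loading.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = array();
    // XPath counterpart of the CSS selector '#body-matches-pug table tr'
    foreach ($xpath->query('//*[@id="body-matches-pug"]//table//tr') as $tr) {
        $img = $xpath->query('.//img', $tr)->item(0);
        $links = $xpath->query('.//a', $tr);
        if ($img === null || $links->length < 2) {
            continue; // header row or unexpected layout
        }
        $rows[] = array(
            'game'    => trim($img->getAttribute('title')),
            'link'    => trim($links->item(0)->getAttribute('href')),
            'score'   => trim($links->item(0)->textContent),
            'srv'     => trim($links->item(1)->getAttribute('href')),
            'srv_txt' => trim($links->item(1)->textContent),
        );
    }
    return $rows;
}

$rows = parse_pug_rows($sample);
print_r($rows);
```

It works, but having to translate the selector into XPath by hand is exactly the kind of busywork the find() call saves me.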