A journey into php-cli and scraping
I recently had a couple of days to myself and I wanted to experiment more with this php-cli thing I’d been thinking about. To help the process (and feed my guitar addiction; I have a serious problem) I decided to write a script to hit up the Stupid Deal page at Musician’s Friend and send me an email if the deal of the day matched a given term list.
Prep
I’m pretty sure all Windows installs of php include php-cli, but to check, run this at the command prompt:
php -v
You should see something like the below; note the (cli):
PHP 5.2.6 (cli) (built: May 2 2008 18:02:07)
Copyright (c) 1997-2008 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2008 Zend Technologies
    with Xdebug v2.0.3, Copyright (c) 2002-2007, by Derick Rethans
Assuming that all worked out, here are the requirements for the script:
1. Must work like a *nix cli program; it’s just going to make things easier for me. For example, the program should be executed like:
C:\ProjectFiles\php_cli>php check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"
2. Must have error checking and validation.
3. Must prevent duplicate notifications.
4. Must provide a “help” mode (--help, -help, -h, -?).
5. Must be able to run as an Automated Task (the Windows Cron equivalent).
Argument Handling
To begin, I needed to change the way passed parameters are interpreted. Before version 5.3, php handled parameters passed to scripts in a pretty messed up way; but there’s a function available in the notes of the php manual that helps a lot.
inc.php
function arguments($argv) {
    $_ARG = array();
    foreach ($argv as $arg) {
        if (preg_match('#^-{1,2}([a-zA-Z0-9]*)=?(.*)$#', $arg, $matches)) {
            $key = $matches[1];
            switch ($matches[2]) {
                case '':
                case 'true':
                    $arg = true;
                    break;
                case 'false':
                    $arg = false;
                    break;
                default:
                    $arg = $matches[2];
            }
            /* make unix like -afd == -a -f -d */
            if (preg_match("/^-([a-zA-Z0-9]+)/", $matches[0], $match)) {
                $string = $match[1];
                for ($i = 0; strlen($string) > $i; $i++) {
                    $_ARG[$string[$i]] = true;
                }
            } else {
                $_ARG[$key] = $arg;
            }
        } else {
            $_ARG['input'][] = $arg;
        }
    }
    return $_ARG;
}
Using the above function works like so:
C:\ProjectFiles\php_cli>php check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"
$input = arguments($argv);
print_r($input);
/*
Array
(
    [input] => Array
        (
            [0] => check_for_guitars.php
        )

    [search] => guitar,amp,tablature
    [email] => foo@bar.com
)
*/
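For comparison, here is a rough sketch of the same parsing with PHP’s built-in getopt(), which accepts long options on Windows from 5.3 onward. The wrapper name parse_cli_options() and the shape of the array it returns are my own assumptions, not part of the original script.

```php
<?php
// Sketch only: parse_cli_options() is a hypothetical wrapper, not code from the script.
function parse_cli_options()
{
    // "search:" and "email:" require a value; "help" and "h" are plain flags.
    $opts = getopt('h', array('search:', 'email:', 'help'));
    if ($opts === false) {
        $opts = array(); // older PHP versions return false on failure
    }

    return array(
        'search' => isset($opts['search']) ? $opts['search'] : null,
        'email'  => isset($opts['email']) ? $opts['email'] : null,
        // getopt() stores flags with the value false, so test with isset().
        'help'   => isset($opts['help']) || isset($opts['h']),
    );
}

print_r(parse_cli_options());
```

Unlike arguments(), this only recognises the options that are listed up front, so a mistyped option is ignored rather than showing up in the result.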
Now that we can access the passed variables, we need to validate them like in any other script. The code below checks whether a key is present in the $input array (and long enough); if not, it goes into a loop that prompts on STDIN and validates each reply, breaking out of the loop as soon as the value is valid.
//make sure we have a value for "search"
$validate_search = FALSE;
if(!array_key_exists('search',$input)){
    $validate_search = TRUE;
} else {
    if(strlen($input['search']) <= 2){
        $validate_search = TRUE;
    }
}

if($validate_search){
    echo "Please enter what to search for:\n";
    while(1){
        $input['search'] = trim(fgets(STDIN)); // reads one line from STDIN
        if(strlen($input['search']) > 2){//it's a valid string
            break;
        }
        echo "Please enter something to search for ";
        echo "(more than 2 characters):\n";
        echo "Example: \"guitar,bass,dvd\"\n";
    }
}
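Since the same prompt-and-validate loop is needed for every option, it could be pulled into one helper. This is only a sketch: prompt_until_valid() and is_valid_search() are names I made up for illustration. It also stops when the input stream ends, which a bare while(1) loop would not.

```php
<?php
// Hypothetical helper (not in the original script): keeps asking until
// $validator accepts the answer, or returns null when the stream runs out.
function prompt_until_valid($stream, $question, $validator)
{
    echo $question . "\n";
    while (($line = fgets($stream)) !== false) {
        $value = trim($line);
        if (call_user_func($validator, $value)) {
            return $value;
        }
        echo "That value is not valid, please try again:\n";
    }
    return null; // end of input, e.g. no console attached
}

// Same rule as the search check: more than 2 characters.
function is_valid_search($value)
{
    return strlen($value) > 2;
}

// Usage in the script (STDIN is predefined by php-cli):
// $input['search'] = prompt_until_valid(STDIN, 'Please enter what to search for:', 'is_valid_search');
```

Passing the stream in as a parameter means the helper can be exercised with a php://memory stream instead of a real console.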
//make sure we have a valid email address
$validate_email = FALSE;
if(!array_key_exists('email',$input)){
    $validate_email = TRUE;
} else {
    if(!checkEmail_basic($input['email'])){
        $validate_email = TRUE;
    }
}

if($validate_email){
    echo "Please enter an email to send the alert to:\n";
    while(1){
        $input['email'] = trim(fgets(STDIN)); // reads one line from STDIN
        if(checkEmail_basic($input['email'])){//it's a valid email
            break;
        }
        echo "Please enter a valid email address:\n";
    }
}
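The checkEmail_basic() helper isn’t shown above. As a stand-in, a minimal version could lean on filter_var(), which is available from PHP 5.2; treat this as an assumption about what the helper does rather than the actual implementation.

```php
<?php
// Assumed stand-in for checkEmail_basic(); the real helper is not shown in the post.
function checkEmail_basic($email)
{
    // filter_var() returns the address on success and false on failure.
    // The is_string() guard rejects the boolean true that arguments()
    // produces for a bare "--email" with no value.
    return is_string($email) && filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(checkEmail_basic('foo@bar.com'));  // a well-formed address
var_dump(checkEmail_basic('not-an-email')); // no @ part
```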
Help
For the help mode, there’s an example in the php manual that maintains the *nix tradition of “--help, -h or -?”, like the below:
C:\ProjectFiles\php_cli>php check_for_guitars.php --help
Takes a given string (--search) and searches the
Stupid Deal of the Day for a match. If a match is
found an email is sent to (--email)

  Usage:
  check_for_guitars.php <option> <option>

  With the --help, -help, -h, or -? options, you can get this help.

  Example:
  check_for_guitars.php --search="term1" --email="foo@bar.com"
The accompanying php code works like the below:
<?php
/**
 * Check if help was requested
 */
if(isset($argv[1]) && in_array($argv[1], array('--help', '-help', '-h', '-?'))) {
?>
Takes a given string (--search) and searches the
Stupid Deal of the Day for a match. If a match is
found an email is sent to (--email)

  Usage:
  <?php echo $argv[0]; ?> <option> <option>

  With the --help, -help, -h, or -? options, you can get this help.

  Example:
  <?php echo $argv[0]; ?> --search="term1" --email="foo@bar.com"
<?php
}
?>
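The check above only looks at the first argument. If the help flag should work in any position, a small helper along these lines would do it; wants_help() is a hypothetical name, not something from the original script.

```php
<?php
// Hypothetical helper: true when any argument after the script name is a help flag.
function wants_help($argv)
{
    $help_flags = array('--help', '-help', '-h', '-?');
    // array_slice() skips $argv[0], which is the script name.
    return count(array_intersect(array_slice($argv, 1), $help_flags)) > 0;
}

// Usage in the script:
// if (wants_help($argv)) { /* print the help text and exit */ }
```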
Now that the above is done, things are starting to work just like a traditional web app.
Grab and Parse Page
The first thing we need to do is get the actual page. To do this I used Snoopy.
require_once 'Snoopy.class.php'; // adjust the path; skip if inc.php already loads Snoopy

$uri_to_check = 'http://www.musiciansfriend.com/stupid';
$snoopy = new Snoopy;
$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
$snoopy->referer = "http://www.yahoo.com/";
$snoopy->fetch($uri_to_check);
$results = $snoopy->results;
The above fetches $uri_to_check and puts the entire response into a string, $results. Now we need to parse $results and find all the values we need. Here’s how to get the page title, which sits in the page’s h1 tag:
$pattern = "'<[^>]*h1[^>]*>(.*?)<[^>]*/h1[^>]*>'";
// fall back to an empty title if the page has no h1
$page_title = preg_match($pattern, $results, $match) ? $match[1] : '';
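One thing to watch with a pattern like this: without the s modifier, "." does not match line breaks, so a heading that spans several lines would be missed. Here is a slightly more forgiving sketch; extract_first_h1() is a hypothetical helper and the sample markup is made up, not taken from the real page.

```php
<?php
// Hypothetical helper wrapping the same idea as the pattern above.
function extract_first_h1($html)
{
    // i: match <H1> as well as <h1>; s: let "." span line breaks inside the heading.
    if (!preg_match('#<h1\b[^>]*>(.*?)</h1\s*>#is', $html, $match)) {
        return null; // no heading found
    }
    // Drop nested tags and collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', strip_tags($match[1])));
}

// Made-up markup for illustration only.
$sample = "<div><h1 class=\"deal\">Example\n <b>Guitar</b> Pack</h1></div>";
echo extract_first_h1($sample), "\n"; // prints "Example Guitar Pack"
```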
Next, split $input['search'] into its comma-separated terms, check each one against the page title, and collect the terms that match in an array:
//check if there's a match for any of the passed search terms
//(--search arrives as one comma-separated string, so split it first)
$terms = is_array($input['search']) ? $input['search'] : explode(',', $input['search']);
$total = count($terms);
$match_for = array();
$FOUND = FALSE;
for($i=0;$i<$total;$i++){
    if(stristr($page_title, trim($terms[$i])) !== FALSE) {
        $match_for[] = trim($terms[$i]);
        $FOUND = TRUE;
    }
}
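The matching step can also be written as a small function, which makes it easy to try against a few titles. This is a sketch under my own naming (find_matching_terms() is hypothetical) and the title is invented for the example.

```php
<?php
// Hypothetical helper: returns the search terms that occur in $title.
function find_matching_terms($title, $search)
{
    $matches = array();
    foreach (explode(',', $search) as $term) {
        $term = trim($term);
        // stripos() is case-insensitive; skip empty terms left by stray commas.
        if ($term !== '' && stripos($title, $term) !== false) {
            $matches[] = $term;
        }
    }
    return $matches;
}

// Invented title for illustration.
$match_for = find_matching_terms('Acme Electric Guitar Pack', 'guitar, amp,tablature');
$FOUND = count($match_for) > 0;
print_r($match_for);
```

Keep in mind this is plain substring matching, so a short term like "amp" will also match a title containing "Example".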
Basically, if $FOUND is TRUE, then check whether an alert has already been sent today and send a new alert if not:
$htmlmessage = <<<HTML
Match found for <a href="$uri_to_check">%%search%%</a><br>
Title: %%title%% <br>
Sale Price: %%sale_price%%<br>
Original Price: %%og_price%%<br>
HTML;

if($FOUND){
    //check if the search was done today...
    $sql = "SELECT * FROM mf_checks
            WHERE title = '".$DB->es($page_title)."'
            AND DATE_FORMAT(`date_checked`,'%m') = '".date('m')."'
            AND DATE_FORMAT(`date_checked`,'%d') = '".date('d')."'
            AND DATE_FORMAT(`date_checked`,'%Y') = '".date('Y')."'
            LIMIT 1";
    $DB->query($sql);
    if($DB->getNumRows() == '1'){
        //alert has already been sent so break out...
        echo "Already sent today... exiting...";
        exit;
    }

    //match was found so get the price now
    $price_arr = explode('<div style="font-size:3em;color:#FF0000;font-weight:normal;padding:20px 0;">',$results);
    $price_arr = explode("\n",$price_arr['1']);
    $sale_price = strip_tags($price_arr['0']);
    $og_price = str_replace('Reg ','',strip_tags($price_arr['1']));

    $htmlmessage = str_replace(array('%%search%%','%%title%%','%%sale_price%%','%%og_price%%'),array('"'.implode(', ',$match_for).'"',$page_title,$sale_price,$og_price),$htmlmessage);

    $mail = new Mailer();
    $mail->From = $input['email'];
    $mail->FromName = $input['email'];
    $mail->Subject = 'Found: '.$page_title;
    $mail->AltBody = strip_tags($htmlmessage);
    $mail->MsgHTML($htmlmessage);
    $mail->AddAddress($input['email']);
    if($mail->Send()){
        echo "Mail Sent";
    } else {
        echo "Mail Not Sent";
    }

    //add to the db
    $sql = "INSERT INTO mf_checks SET
            term = '".$DB->es(implode(', ',$match_for))."',
            title = '".$DB->es($page_title)."',
            sale_price = '".$DB->es($sale_price)."',
            og_price = '".$DB->es($og_price)."',
            date_checked = now(),
            alert_sent = '1'";
    $DB->query($sql);
}
Automating
To have the script check automatically at a regular interval, you have to set up an Automatic Task in Start->Programs->Accessories->System Tools->Task Scheduler. Set the schedule on the Triggers tab of a new task and add something like the below as the program to run on the Actions tab:
C:\php\php-win.exe C:\ProjectFiles\php_cli\check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"
Note the full path to php-win.exe. If you use “php” by itself you’ll get an annoying dos box popping up every time the script executes; use the full path to your php-win.exe program.