Made of Everything You're Not

Thoughts on programming, people and life

A journey into php-cli and scraping

I recently had a couple of days to myself and wanted to experiment more with this php-cli thing I’d been thinking about. To help the process (and feed my guitar addiction; I have a serious problem) I decided to write a script that checks the Stupid Deal of the Day page at Musician’s Friend and sends me an email if the deal of the day matches a given list of terms.

Prep

I’m pretty sure all Windows installs of PHP include php-cli, but to check, run this at the command prompt:

php -v

You should see something like the below; note (cli):

PHP 5.2.6 (cli) (built: May  2 2008 18:02:07)
Copyright (c) 1997-2008 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2008 Zend Technologies
with Xdebug v2.0.3, Copyright (c) 2002-2007, by Derick Rethans

Assuming it’s all worked out here are some additional requirements:
1. Must work like a *nix CLI program; it’s just going to make things easier for me. For example, the program should be executed like:

C:\ProjectFiles\php_cli>php check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"

2. Must have error checking and validation.
3. Must prevent duplicate notifications.
4. Must provide a “help” mode (--help, -help, -h, -?).
5. Must be able to run as an automated task (the Windows equivalent of cron).

Argument Handling

To begin, I needed to change the way passed parameters are interpreted. Before version 5.3, PHP handled parameters passed to scripts in a pretty messed up way (the built-in getopt() didn’t even work on Windows); but there’s a function available in the user notes of the PHP manual that helps a lot.
inc.php

function arguments($argv) {
   $_ARG = array();
   foreach ($argv as $arg) {
       if (preg_match('#^-{1,2}([a-zA-Z0-9]*)=?(.*)$#', $arg, $matches)) {
           $key = $matches[1];
           switch ($matches[2]) {
               case '':
               case 'true':
               $arg = true;
               break;
               case 'false':
               $arg = false;
               break;
               default:
               $arg = $matches[2];
           }
 
           /* make unix like -afd == -a -f -d */            
           if(preg_match("/^-([a-zA-Z0-9]+)/", $matches[0], $match)) {
               $string = $match[1];
               for($i=0; strlen($string) > $i; $i++) {
                $_ARG[$string[$i]] = true;
               }
           } else {
               $_ARG[$key] = $arg; 
           }            
       } else {
           $_ARG['input'][] = $arg;
       }        
   }
   return $_ARG; 
}

Using the above function works like so:

C:\ProjectFiles\php_cli>php check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"
$input = arguments($argv);
print_r($input);
/*
Array
(
    [input] => Array
        (
            [0] => check_for_guitars.php
        )
 
    [search] => guitar,amp,tablature
    [email] => foo@bar.com
)
*/
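
On PHP 5.3 and later, the built-in getopt() understands long options on every platform, so the same parsing can probably be done without the helper. The sketch below is only an illustration under that assumption: the wrapper name read_cli_options() and its defaults are made up for this example and are not part of the original script. Note that getopt() does not recognise "-?", so that variant of the help flag would still need its own check.

```php
<?php
// Hypothetical wrapper around PHP's built-in getopt(); not part of the original script.
function read_cli_options() {
    // "search:" and "email:" require a value; "help" and "h" are plain flags.
    $options = getopt('h', array('search:', 'email:', 'help'));
    if (!is_array($options)) {
        $options = array(); // getopt() returns false when parsing fails
    }
    // Fill in defaults so later code can test values without array_key_exists().
    return array(
        'search' => isset($options['search']) ? $options['search'] : '',
        'email'  => isset($options['email']) ? $options['email'] : '',
        'help'   => isset($options['help']) || isset($options['h']),
    );
}

$input = read_cli_options();
print_r($input);
```

Run with the same command line as above, $input['search'] and $input['email'] should hold the two passed strings; with no arguments both fall back to empty strings.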

Now that we can access the passed variables we need to validate and verify them like any other script. The code below checks whether a key is present (and valid) in the $input array; if not, it goes into a loop that prompts for a value, reads one line from STDIN and validates it, breaking out of the loop once the value is valid.

//make sure we have a value for "search"
$validate_search = FALSE;
if(!array_key_exists('search',$input)){
	$validate_search = TRUE;
} else {
	if(strlen($input['search']) <= 2){
		$validate_search = TRUE;
	}
}
 
if($validate_search){
	echo "Please enter what to search for:\n"; 
	while(1){
 
		$input['search'] = trim(fgets(STDIN)); // reads one line from STDIN
		if(strlen($input['search']) > 2){//it's a valid string (more than 2 characters)
			break;
		}
		echo "Please enter something to search for ";
		echo "(at least 3 characters):\n";
		echo "Example: \"guitar,bass,dvd\"\n";
	}
}
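
Since the search and email checks share the same prompt-and-retry shape, the loop could be pulled into one function. This is a rough sketch rather than the script's actual code: prompt_until_valid() and is_long_enough() are hypothetical names, and the input stream is passed in as a parameter (normally STDIN) so the loop can be tried without a terminal.

```php
<?php
// Hypothetical helper: keep asking until $is_valid accepts the answer.
// $stream is normally STDIN; it is a parameter so the loop can be exercised in isolation.
function prompt_until_valid($question, $retry_message, $is_valid, $stream) {
    echo $question . "\n";
    while (($line = fgets($stream)) !== false) {
        $value = trim($line);
        if (call_user_func($is_valid, $value)) {
            return $value; // valid answer, stop asking
        }
        echo $retry_message . "\n";
    }
    return null; // input ended (EOF) before a valid answer arrived
}

// Assumed rule for this example: a search string needs more than 2 characters.
function is_long_enough($value) {
    return strlen($value) > 2;
}

// Simulate a user typing "ab" and then "guitar,amp" using an in-memory stream.
$fake_stdin = fopen('php://memory', 'r+');
fwrite($fake_stdin, "ab\nguitar,amp\n");
rewind($fake_stdin);

$search = prompt_until_valid(
    'Please enter what to search for:',
    'Please enter something to search for (more than 2 characters):',
    'is_long_enough',
    $fake_stdin
);
echo "Searching for: $search\n";
```

Returning null at end of input is a design choice made for this sketch: it keeps the loop from spinning forever when the script runs with no console attached, for example from a scheduled task.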
//make sure we have a valid email address
$validate_email = FALSE;
if(!array_key_exists('email',$input)){
	$validate_email = TRUE;
} else {
	if(!checkEmail_basic($input['email'])){
		$validate_email = TRUE;
	}
}
 
if($validate_email){
	echo "Please enter an email to send the alert to:\n"; 
	while(1){
 
		$input['email'] = trim(fgets(STDIN)); // reads one line from STDIN
		if(checkEmail_basic($input['email'])){//it's a valid email
			break;
		}
		echo "Please enter a valid email address:\n";
	}
}
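
checkEmail_basic() isn’t shown in this post. A minimal stand-in, assuming PHP 5.2 or later where the filter extension is available, could lean on filter_var(). Treat it as a guess at the behaviour, not the function from the download.

```php
<?php
// Assumed stand-in for checkEmail_basic(); the real one is not shown in the post.
function checkEmail_basic($email) {
    // A flag passed without a value arrives as boolean true, so reject non-strings first.
    if (!is_string($email)) {
        return false;
    }
    // filter_var() returns the address on success and false on failure.
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(checkEmail_basic('foo@bar.com'));
var_dump(checkEmail_basic('not-an-email'));
var_dump(checkEmail_basic(true));
```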

Help

For the help mode, there’s an example in the PHP manual that maintains the *nix tradition of “--help, -h or -?”, like the below:

C:\ProjectFiles\php_cli>php check_for_guitars.php --help
 
Takes a given string (--search) and searches the
Stupid Deal of the Day for a match. If a match is
found an email is sent to (--email)
 
 Usage:
 check_for_guitars.php <option>
 
 <option> With the --help, -help, -h,
 or -? options, you can get this help.
 
 Example:
 check_for_guitars.php --search="term1" --email="foo@bar.com"

The accompanying php code works like the below:

<?php
/**
 * Check if help mode was requested
 */
if(isset($argv[1]) && in_array($argv[1], array('--help', '-help', '-h', '-?'))) {
?>
Takes a given string (--search) and searches the 
Stupid Deal of the Day for a match. If a match is 
found an email is sent to (--email)
 
 Usage: 
 <?php echo $argv[0]; ?> <option>
 
 <option> With the --help, -help, -h,
 or -? options, you can get this help.
 
 Example:
 <?php echo $argv[0]; ?> --search="term1" --email="foo@bar.com"
<?php } ?>

Now that the above is done things are starting to work just like a traditional web app.

Grab and Parse Page

The first thing we need to do is get the actual page. To do this I used Snoopy.

$uri_to_check = 'http://www.musiciansfriend.com/stupid';
$snoopy = new Snoopy;
$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
$snoopy->referer = "http://www.yahoo.com/";
$snoopy->fetch($uri_to_check);
$results = $snoopy->results;

The above fetches $uri_to_check and stores the entire response body as a string in $results. Now we need to parse $results and find all the values we need. Here’s how to get the page title (the contents of the first h1 tag):

$pattern = "'<[^>]*h1[^>]*>(.*?)<[^>]*/h1[^>]*>'";
preg_match($pattern, $results, $match);
$page_title = $match['1'];
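
A regex like this is sensitive to markup changes and leaves nested tags inside the title. As an alternative, here is a small sketch using PHP’s DOM extension instead; it is a different technique from the one the script uses, first_h1_text() is a hypothetical name, and the sample markup is invented for illustration.

```php
<?php
// Hypothetical helper: return the text of the first <h1> in $html, or null if there is none.
function first_h1_text($html) {
    if (trim($html) === '') {
        return null; // nothing to parse
    }
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $headings = $doc->getElementsByTagName('h1');
    if ($headings->length === 0) {
        return null;
    }
    // textContent drops nested tags such as <span> or <a>.
    return trim($headings->item(0)->textContent);
}

// Made-up markup for illustration; the real page layout will differ.
$sample = '<html><body><h1 class="deal"><span>Example Brand</span> Electric Guitar</h1></body></html>';
echo first_h1_text($sample), "\n";
```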

Next, check whether any of the terms in $input['search'] appear in the page title and collect the ones that do:

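//--search arrives as a comma separated string; split it into an array of terms first
if(!is_array($input['search'])){
	$input['search'] = explode(',', $input['search']);
}
 
//check if there's a match in the passed $input['search'] array
$total = count($input['search']);
$match_for = array();
$FOUND = FALSE;
for($i=0;$i<$total;$i++){
	if(stristr($page_title, trim($input['search'][$i])) !== FALSE) {
		$match_for[] = trim($input['search'][$i]);
		$FOUND = TRUE;
	} 
}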
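
The same check could also be wrapped in a function that works directly on the comma separated string from --search. This is only a sketch: find_matching_terms() is a hypothetical name and the title is made up. Keep in mind that this kind of matching is by substring, so a short term like "amp" also matches a title containing "Example".

```php
<?php
// Hypothetical helper: which of the comma separated terms appear in the title?
function find_matching_terms($title, $search) {
    $matches = array();
    foreach (explode(',', $search) as $term) {
        $term = trim($term);
        if ($term === '') {
            continue; // ignore empty pieces such as the one after a trailing comma
        }
        // stripos() is case-insensitive and returns false when nothing is found
        if (stripos($title, $term) !== false) {
            $matches[] = $term;
        }
    }
    return $matches;
}

// The title below is made up for the example.
$match_for = find_matching_terms('Acoustic Guitar Bundle', 'guitar, amp,tablature,');
$FOUND = count($match_for) > 0;
print_r($match_for);
```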

Basically, if $FOUND is TRUE, then check whether an alert has already been sent today and send a new alert if not:

$htmlmessage = <<<HTML
Match found for <a href="$uri_to_check">%%search%%</a><br>
Title: %%title%% <br>
Sale Price: %%sale_price%%<br>
Original Price: %%og_price%%<br>
HTML;
if($FOUND){
 
	//check if the search was done today...
	$sql = "SELECT * FROM mf_checks WHERE title = '".$DB->es($page_title)."' AND DATE_FORMAT(`date_checked`,'%m') = '".date('m')."' AND DATE_FORMAT(`date_checked`,'%d') = '".date('d')."' AND DATE_FORMAT(`date_checked`,'%Y') = '".date('Y')."' LIMIT 1";
	$DB->query($sql);
	if($DB->getNumRows() == '1'){ //alert has already been sent so break out...
		echo "Already sent today... exiting...";
		exit;
	}
 
	//match was found so get the price now
	$price_arr = explode('<div style="font-size:3em;color:#FF0000;font-weight:normal;padding:20px 0;">',$results);
	$price_arr = explode("\n",$price_arr['1']);
	$sale_price = strip_tags($price_arr['0']);
	$og_price = str_replace('Reg ','',strip_tags($price_arr['1']));
 
	$htmlmessage = str_replace(array('%%search%%','%%title%%','%%sale_price%%','%%og_price%%'),array('"'.implode(', ',$match_for).'"',$page_title,$sale_price,$og_price),$htmlmessage);
 
	$mail = new Mailer();
	$mail->From = $input['email'];
	$mail->FromName = $input['email'];
	$mail->Subject = 'Found: '.$page_title;
	$mail->AltBody = strip_tags($htmlmessage);
	$mail->MsgHTML($htmlmessage);
	$mail->AddAddress($input['email']);
	if($mail->Send()){
		echo "Mail Sent";
	} else {
		echo "Mail Not Sent";
	}
 
	//add to the db 
	$sql = "INSERT INTO mf_checks SET term = '".$DB->es(implode(', ',$match_for))."', title = '".$DB->es($page_title)."', sale_price = '".$DB->es($sale_price)."', og_price = '".$DB->es($og_price)."', date_checked = now(), alert_sent = '1'";
	$DB->query($sql);
}
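
If you want to try the duplicate check without a database, the idea can be sketched with a flat file. To be clear, this swaps the mf_checks table for a JSON file and is not how the script above does it; load_sent_log(), already_alerted_today() and record_alert() are hypothetical names.

```php
<?php
// Hypothetical flat-file stand-in for the mf_checks table: a JSON map of "date|title" keys.
function load_sent_log($log_file) {
    $sent = is_file($log_file) ? json_decode(file_get_contents($log_file), true) : array();
    return is_array($sent) ? $sent : array(); // empty or corrupt log: treat as no alerts sent
}

// True when an alert for this exact title was already recorded today.
function already_alerted_today($log_file, $title) {
    $sent = load_sent_log($log_file);
    return isset($sent[date('Y-m-d') . '|' . $title]);
}

// Remember that an alert for this title went out today.
function record_alert($log_file, $title) {
    $sent = load_sent_log($log_file);
    $sent[date('Y-m-d') . '|' . $title] = true;
    file_put_contents($log_file, json_encode($sent), LOCK_EX);
}

// Demo with a temporary file and a made-up title.
$log_file = tempnam(sys_get_temp_dir(), 'mf_');
$title = 'Acoustic Guitar Bundle';
var_dump(already_alerted_today($log_file, $title)); // before anything is recorded
record_alert($log_file, $title);
var_dump(already_alerted_today($log_file, $title)); // after recording
unlink($log_file);
```

Keying on the date plus the title mirrors what the SQL query checks: the same title, on the same day.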

Automating

To have the script check automatically at a regular interval, you have to set up a task in Start->Programs->Accessories->System Tools->Task Scheduler and add something like the below as the action of a new task (the Actions tab; the Triggers tab only controls when it runs):

C:\php\php-win.exe C:\ProjectFiles\php_cli\check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"

Note the full path to php-win.exe. If you use “php” by itself you’ll get an annoying DOS box popping up every time the script executes.

Code

Download Check Guitar

This entry was written by Eric Lamb and posted on Thursday, January 1st, 2009 at 2:14 pm and is filed under Code, Programming. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Copyright © 2008 - 2012 Eric Lamb - All rights reserved