Using Web Bots to hunt for B2B marketing leads

January 3, 2019

How We Obtained Vendor Email Addresses

Let’s use Houzz.com as our target for this example (for educational purposes). Our goal was to obtain email addresses from the businesses listed in their online vendor directory.

The core problem was that the email addresses ARE NOT directly available on the Houzz website.

Below, we’ll walk you through the strategy and implementation we used to overcome this challenge and acquire the necessary data.

The Houzz bot at work: the console reporting back results

Analysis of Site and Strategy Used

Houzz vendor listings

Targets Indexed

So, we used the Houzz vendor listing pages to index all the vendors, who were our initial targets.
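The indexing step can be sketched roughly as follows. This is a minimal illustration rather than the bot's actual code: it uses PHP's built-in DOMXPath in place of the CSSQuery helper, the function names `listing_page_urls` and `extract_vendor_links` are hypothetical, and the `pro-title` class and the 15-per-page offset are taken from the script further down, so they may no longer match the live site.

```php
<?php
// Hypothetical helper: build the listing URLs for the first $pages result pages.
// Assumes the first page has no suffix and later pages use "/p/{offset}".
function listing_page_urls(string $base, int $pages, int $perPage = 15): array {
    $urls = [];
    for ($i = 0; $i < $pages; $i++) {
        $urls[] = ($i === 0) ? $base : $base . '/p/' . ($i * $perPage);
    }
    return $urls;
}

// Hypothetical helper: pull vendor profile links out of one listing page's HTML.
function extract_vendor_links(string $html): array {
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings for malformed HTML
    $xpath = new DOMXPath($doc);
    $links = [];
    // Match <a> elements whose class attribute contains the "pro-title" token
    $query = '//a[contains(concat(" ", normalize-space(@class), " "), " pro-title ")]';
    foreach ($xpath->query($query) as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || strpos($href, 'javascript:') === 0) {
            continue; // skip placeholder anchors
        }
        $links[] = ['href' => $href, 'company' => trim($a->textContent)];
    }
    return $links;
}

// Sample markup standing in for a real listing page
$sample = '<html><body>'
    . '<a class="pro-title" href="http://example.com/pro/acme">Acme Landscapes</a>'
    . '<a class="pro-title" href="javascript:;">Placeholder</a>'
    . '<a class="other" href="http://example.com/x">Ignored</a>'
    . '</body></html>';

$links = extract_vendor_links($sample);
$urls  = listing_page_urls('http://example.com/professionals', 3);
```

In a real run, each URL from `listing_page_urls` would be fetched in turn and its HTML passed to `extract_vendor_links`, with a short pause between requests.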

Vendor Detail Page

Email Workaround

We then programmed our bot to visit each vendor’s individual profile page on Houzz and collect any relevant details available there. As anticipated, no email address was listed directly on these pages. However, each profile did provide the vendor’s official website URL, which gave us our next target.
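As a rough sketch of this hop from profile page to vendor website, the snippet below pulls the first usable website link out of a profile page. The helper name `extract_website_url` is hypothetical, the `proWebsiteLink` class comes from the script further down and is an assumption about the page markup, and DOMXPath stands in for the CSSQuery helper.

```php
<?php
// Hypothetical helper: return the vendor's external website from a profile page, or null.
function extract_website_url(string $profileHtml): ?string {
    $doc = new DOMDocument();
    @$doc->loadHTML($profileHtml); // suppress warnings for malformed HTML
    $xpath = new DOMXPath($doc);
    $query = '//a[contains(concat(" ", normalize-space(@class), " "), " proWebsiteLink ")]';
    foreach ($xpath->query($query) as $a) {
        $url    = trim($a->getAttribute('href'));
        $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
        // Accept only well-formed http(s) URLs; skip "javascript:;" and similar placeholders
        if (filter_var($url, FILTER_VALIDATE_URL) && in_array($scheme, ['http', 'https'], true)) {
            return $url;
        }
    }
    return null;
}

// Sample markup standing in for a real profile page
$profile = '<html><body>'
    . '<a class="proWebsiteLink" href="javascript:;">Website</a>'
    . '<a class="proWebsiteLink" href="https://www.example.com/">Website</a>'
    . '</body></html>';

$website = extract_website_url($profile);
$missing = extract_website_url('<html><body><p>No links here</p></body></html>');
```

Returning `null` when nothing usable is found lets the caller record "N/A" for that vendor and move on.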

Vendor’s Website

Obtain Payload

Our bot was then directed to the vendor’s own website. We instructed it to scour the various pages of that site (commonly looking at “Contact Us,” “About Us,” or footer information) specifically in search of an email address – our desired “payload.”
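The search for the payload might look something like the sketch below. It is an illustration under assumptions, not the bot's exact logic: `candidate_pages` and `extract_emails` are hypothetical names, the page paths are guesses at common conventions, and the regular expression is a pragmatic pattern that will miss obfuscated addresses.

```php
<?php
// Hypothetical helper: pages worth checking on a vendor site. The paths are guesses.
function candidate_pages(string $siteUrl): array {
    $base = rtrim($siteUrl, '/');
    return [$base . '/', $base . '/contact', $base . '/contact-us', $base . '/about'];
}

// Hypothetical helper: return the unique, lowercased email addresses found in a page.
function extract_emails(string $html): array {
    preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $html, $m);
    $emails = [];
    foreach ($m[0] as $candidate) {
        // Asset names such as "logo@2x.png" look like emails to the regex; drop them
        if (preg_match('/\.(png|jpe?g|gif|svg|webp)$/i', $candidate)) {
            continue;
        }
        $emails[] = strtolower($candidate);
    }
    return array_values(array_unique($emails));
}

// Sample markup standing in for a vendor's contact page
$html = '<p>Contact: <a href="mailto:Info@Example.com">Info@Example.com</a></p>'
    . '<img src="logo@2x.png"><footer>sales@example.com</footer>';

$emails = extract_emails($html);
$pages  = candidate_pages('https://www.example.com/');
```

In practice each candidate page would be fetched with `file_get_contents` and the loop would stop at the first page that yields an address.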

Here is the PHP code for the bot’s logic, which you can access in the Bitbucket repository.

The MySQL database insertion code is commented out in the provided script. This is in case you prefer to store the retrieved data in a database rather than a file. In this specific implementation, I opted to place the results directly into a CSV file.

The script writes two files:

links.txt: written in Step 1, with one Houzz vendor profile URL per line. Step 2 reads this file back as its list of targets.
archs.csv: written in Step 2, with one row per vendor holding the details collected from the Houzz profile plus any email address found on the vendor’s own website.

Feedback and progress updates are printed to the terminal during the script’s execution using fwrite(STDOUT).
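A minimal sketch of the output side, assuming a simplified four-column row: the helper `save_vendor` is hypothetical, and an in-memory stream stands in for the CSV file so the example leaves nothing on disk.

```php
<?php
// Hypothetical helper: append one vendor row to an open CSV stream and report progress.
function save_vendor($csv, $log, int $id, array $vendor): void {
    // fputcsv quotes fields containing commas or spaces, so "Smith, Jones Design" stays in one column
    fputcsv($csv, [$id, $vendor['company'], $vendor['website'], $vendor['email']], ',', '"', '\\');
    fwrite($log, $id . ': ' . $vendor['company'] . ' -> ' . ($vendor['email'] ?: 'no email') . "\r\n");
}

$csv = fopen('php://memory', 'w+'); // in-memory stand-in for the CSV file
$log = defined('STDOUT') ? STDOUT : fopen('php://output', 'w'); // STDOUT exists only under the CLI

save_vendor($csv, $log, 1, [
    'company' => 'Smith, Jones Design',
    'website' => 'https://example.com',
    'email'   => 'hello@example.com',
]);
save_vendor($csv, $log, 2, ['company' => 'Acme Landscapes', 'website' => 'N/A', 'email' => '']);

// Read back what was written
rewind($csv);
$written = stream_get_contents($csv);
```

Writing the row and the progress line in the same place keeps the console output in step with what actually landed in the file.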

```php
<?php
    //You can get these files from my Bitbucket repo: https://bitbucket.org/nicknguyenzrd/houzzbot/
    require("crawler.php");
    require("CSSQuery.php");

    /* Uncomment below to store data in MySQL
    $servername = "localhost";
    $username = "root";
    $password = "";
    $dbname = "invoice";

    // Create connection
    $conn = new mysqli($servername, $username, $password, $dbname);

    // Check connection
    if ($conn->connect_error) {
        die("Connection failed: " . $conn->connect_error);
    }
    */

    //Step 1: Gather Houzz Links
//links.txt is where we'll dump the profile links; it is opened in append mode inside the loop below
$type = 1; //Record type written to the CSV (and to the `type` column if you use MySQL)
//Data Placeholder Array
$data['href']=array();
$data['company']=array();
$data['type']=array();
$data['id']=array();
$id = 1; //Running ID for each vendor link

//Deal with multiple result pages with The All Powerful Iterative Loop
$base = "http://www.houzz.com/professionals/landscape-architect/orange-county";
$p = 0; //Result offset used in the URL; advanced by 15 at the end of each pass
for ($i = 1; $i <= 30; $i++) {
    //The first page has no suffix; every page after it uses "/p/{offset}"
    $pageUrl = ($i === 1) ? $base : $base . "/p/" . $p;
    $html = @file_get_contents( $pageUrl );
    if ($html === false || $html === "") {
        //Stop paging if a listing page cannot be fetched
        fwrite(STDOUT, "Could not load listing page: " . $pageUrl . "\r\n");
        break;
    }
    $doc = new DOMDocument();
    @$doc->loadHTML( $html ); //Suppress warnings for malformed HTML

    //Webpage loaded for us
    $css = new CSSQuery( $doc );
    $arr = $css->query( 'a.pro-title' );

foreach ( $arr as $a ) {
    $hrefNode = $a->attributes->getNamedItem( 'href' );
    //Skip anchors without a real URL (missing href or JavaScript placeholder)
    if ( $hrefNode === null || $hrefNode->value === "javascript:;" ) {
        continue;
    }
    //Store link and company name
    $data['id'][]      = $id;
    $data['href'][]    = $hrefNode->value;
    $data['company'][] = $a->nodeValue;
    $data['type'][]    = 1;

    //Append the link to our list of links and echo it to the console
    $handle = fopen('links.txt',"a+");
    $somecontent = $hrefNode->value."\r\n"; // trim() strips the line ending when Step 2 reads the file back
    fwrite($handle,$somecontent);
    fwrite(STDOUT, $somecontent);
    fclose($handle);
    $id++;
}
$p=$p+15;
sleep(1);
unset($doc);
unset($css);
//var_dump( $data );
}

//Step 2: Gather company details. Houzz doesn't list email addresses, so we improvise: follow each vendor's website link and look for a contact email there.
//Double-check that we're dealing with valid URLs first, because a bad one can derail the whole run.
//Note: the loop below relies on filter_var() instead; this helper is kept as an alternative.
function get_valid_url( $url ) {
    //Single-quoted strings so "$" and "\" reach the regex engine unchanged
    $regex  = '((https?|ftp)://)?'; // Scheme
    $regex .= '([a-z0-9+!*(),;?&=$_.-]+(:[a-z0-9+!*(),;?&=$_.-]+)?@)?'; // User and Pass
    $regex .= '([a-z0-9.-]*)\.([a-z]{2,})'; // Host (TLDs longer than 3 letters allowed)
    $regex .= '(:[0-9]{2,5})?'; // Port
    $regex .= '(/([a-z0-9+$_-]\.?)+)*/?'; // Path
    $regex .= '(\?[a-z+&$_.-][a-z0-9;:@&%=+/$_.-]*)?'; // GET Query
    $regex .= '(#[a-z_.-][a-z0-9+$_.-]*)?'; // Anchor
    //"~" is the delimiter because the pattern itself contains "/"; "i" makes it case-insensitive
    return preg_match('~^' . $regex . '$~i', $url) === 1;
}

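//Reopen links.txt for reading: any handle left over from Step 1 has already been closed
$handle = fopen("links.txt", "r");
$id = 1; //Restart the counter so the CSV IDs begin at 1
if ($handle) {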
while ( ( $line = fgets( $handle ) ) !== false ) {
    $email="";
    $website="";
    $url="";
    $name="";
    $company="";
    $phone="";
    $link="";
    $tier="";
    $location="";
    $license="";
    $error="";
    $sql="";

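    //Houzz Link to profile (trimmed to drop the trailing line ending)
    $link = trim($line);
    if ($link === "") {
        continue; //Skip blank lines
    }
    $html = @file_get_contents( $link ); //Use @ to suppress warnings for unreachable pages
    if ($html === false || $html === "") {
        fwrite(STDOUT, "Could not load profile: ".$link."\r\n");
        continue;
    }

    $doc = new DOMDocument();
    // Suppress errors for malformed HTML
    @$doc->loadHTML( $html );

    $css               = new CSSQuery( $doc );
    $data['link'][]    = $link;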

    //Company Name
    $nrr               = $css->query( 'a.profile-full-name' );
    if (isset($nrr[0]) && $nrr[0]->textContent) {
        $data['company'][] = $nrr[0]->textContent;
        $company=$nrr[0]->textContent;
        fwrite(STDOUT, "Starting: ".$id.":".$nrr[0]->textContent."\r\n");
    } else {
         $company = "N/A";
         fwrite(STDOUT, "Starting: ".$id.": Company Name Not Found\r\n");
    }

    //Website and Email Addresses TODO add conditional statement
    $arr               = $css->query( 'a.proWebsiteLink' );
    $website_found = false;
    foreach ( $arr as $a ) {
        $url= $a->attributes->getNamedItem( 'href' )->value;

        // Basic URL validation before attempting to crawl
        if (filter_var($url, FILTER_VALIDATE_URL)) { // More robust URL validation
            $data['website'][] = $url;
            $website = $url;
            $website_found = true; // Mark that a website was found
            fwrite(STDOUT, "Attempting site: ".$url."\r\n");

            // Simple email extraction - a real crawler would be more sophisticated
            $site_content = @file_get_contents($url); // Use @ to suppress errors for unreachable sites
            if ($site_content !== FALSE) {
                 // Regex to find email addresses
                if (preg_match('/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/', $site_content, $matches)) {
                    $email = $matches[0];
                    // Output indicating email address discovered CLI
                    fwrite(STDOUT, "Found Email Address: ".$email."\r\n");
                } else {
                     fwrite(STDOUT, "No Email Address Found on Site\r\n");
                }
            } else {
                fwrite(STDOUT, "Could not retrieve content from site: ".$url."\r\n");
                $error .= "Could not retrieve content from site; ";
            }

            // Only try the first valid URL found
            break;
        } else {
            fwrite(STDOUT, "Invalid Website URL found: ".$url."\r\n");
             $error .= "Invalid Website URL; ";
        }
    }
     if (!$website_found) {
        $website = "N/A";
         fwrite(STDOUT, "No Website URL found on Houzz profile\r\n");
     }

    //Phone Number
    $phone_found = false;
    $crr = $css->query( 'span.pro-contact-text' );
    foreach ( $crr as $c ) {
        if($c->nodeValue!=="Website") { // Exclude the "Website" text itself
            $phone = trim($c->nodeValue); // Trim whitespace
            $data['phone'][] = $phone;
            $phone_found = true;
            break; // Assume only one phone number listed this way
        }
    }
    if (!$phone_found) {
        $phone = "N/A";
    }

    //All company details (Contact, Location, License, Tier)
    $info = $css->query( 'div.info-list-text' );
    $name = "N/A";
    $location = "N/A";
    $license = "N/A";
    $tier = "N/A";

    foreach ( $info as $i ) {
        $text = trim($i->nodeValue); // Trim whitespace from text content

        //Person to contact
        if (strpos( $text, "Contact:" )!==FALSE) {
            $name = str_replace( "Contact:",'', $text );
            $name = trim($name);
            $data['contact'][] =$name;
        }
        //Address/Location
        if (strpos( $text, "Location:" )!==FALSE) {
            $location = str_replace( "Location:",'', $text );
            $location = trim($location);
            $data['location'][]=$location;
        }
        //License Number
        if (strpos( $text, "License Number:" )!==FALSE) {
            $license=str_replace( "License Number:",'', $text );
            $license=trim($license);
            $data['license'][] =$license;
        }
        //Tier (Typical Job Costs)
        if (strpos( $text, "Typical Job Costs:" )!==FALSE) {
            $tier =str_replace( "Typical Job Costs:",'', $text );
            $tier=trim($tier);
            $data['tier'][]=$tier;
        }
    }

    // Write architect contact information into a CSV file
    $wr= fopen('archs.csv',"a+");
    // Use fputcsv for proper CSV formatting and escaping
    fputcsv($wr, [$id, $type, $company, $phone, $website, $email, $link, $name, $location, $license, $tier]);

    //Uncomment the line below to echo each saved row to the CLI
    //fwrite(STDOUT, implode(", ", [$id, $company, $phone, $website, $email])."\r\n");
    $id++;
    fclose($wr);
/*  Uncomment below if you'd rather insert the scraped data into a MySQL database.
    A prepared statement keeps quotes in company names from breaking the query.
    $stmt = $conn->prepare("INSERT INTO ip_oppurtunities(`type`,`company`,`phone`,`website`,`email`,`link`,`contact`,`location`,`license`,`tier`)
        VALUES (?,?,?,?,?,?,?,?,?,?)");
    $stmt->bind_param("isssssssss", $type, $company, $phone, $website, $email, $link, $name, $location, $license, $tier);

    if ($stmt->execute()) {
        fwrite(STDOUT, $id.'-'.$company." Added \r\n");
    } else {
        // Log the error and move on to the next vendor instead of dying
        fwrite(STDOUT, "Error: ".$company." ".$stmt->error."\r\n");
    }
    $stmt->close();
*/
}
fclose($handle); // Close the links file after processing
// $conn->close(); // This should be outside the while loop if using DB insertion
}
```