
Using Web Bots to Hunt for B2B Marketing Leads
How We Obtained Vendor Email Addresses
For this example (and for educational purposes only), we’ll use Houzz.com as our target. Our goal was to obtain email addresses for the businesses listed in its online vendor directory.
The core problem: the email addresses ARE NOT directly available on the Houzz website.
Below, we’ll walk you through the strategy and implementation we used to overcome this challenge and acquire the necessary data.
The Houzz bot at work… the console reporting back results
Analysis of Site and Strategy Used
Houzz vendor listings
Targets Indexed
We used the Houzz vendor listing pages to index all of the vendors; these were our initial targets.
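The listing pages paginate in offsets of 15 results, with the first page lacking the `/p/` suffix. As a quick sketch of that indexing step (the helper name is ours; the base URL and `/p/{offset}` pattern come from the script below):

```php
<?php
// Sketch: generate the paginated listing URLs the bot will visit.
// Page 1 is the bare listing URL; subsequent pages append "/p/{offset}"
// in steps of 15. buildListingUrls() is our own helper name.
function buildListingUrls(string $base, int $pages, int $step = 15): array
{
    $urls = [$base]; // first page has no /p/ suffix
    for ($i = 1; $i < $pages; $i++) {
        $urls[] = $base . "/p/" . ($i * $step); // page 2 -> /p/15, page 3 -> /p/30, ...
    }
    return $urls;
}

foreach (buildListingUrls("http://www.houzz.com/professionals/landscape-architect/orange-county", 3) as $u) {
    echo $u, "\n";
}
```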

Vendor Detail Page
Email Workaround
We then programmed our bot to visit each vendor’s individual profile page on Houzz and collect any relevant details available there. As anticipated, no email address was listed directly on these pages. They did, however, provide the vendor’s official website URL. This gave us our next target.
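To illustrate that extraction step, here is a minimal sketch using plain DOMDocument/DOMXPath rather than the CSSQuery helper the full bot uses; the `a.proWebsiteLink` class name comes from the script below, and the sample HTML is fabricated:

```php
<?php
// Sketch: pull the vendor's external website URL out of a Houzz profile
// page. The full bot uses a CSSQuery helper; DOMXPath with an equivalent
// selector works the same way. Sample HTML is made up for the demo.
function extractWebsiteLinks(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ suppresses warnings on malformed markup
    $xp = new DOMXPath($doc);
    $urls = [];
    $query = '//a[contains(concat(" ", normalize-space(@class), " "), " proWebsiteLink ")]';
    foreach ($xp->query($query) as $a) {
        $urls[] = $a->getAttribute("href"); // the bot's next target
    }
    return $urls;
}

$sample = '<div><a class="proWebsiteLink" href="http://example-vendor.com">Website</a></div>';
echo extractWebsiteLinks($sample)[0], "\n";
```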

Vendors Website
Obtain Payload
Our bot was then directed to the vendor’s own website, where we instructed it to scour the site’s pages (commonly “Contact Us,” “About Us,” or the footer) in search of an email address: our desired “payload.”
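A sketch of that hunt might probe a few likely pages in turn; note the candidate path list and the injectable fetch callback are our own assumptions here, while the full script below scans only the single URL it is handed:

```php
<?php
// Sketch: hunt for an email address ("the payload") across a vendor
// site's likely pages. The candidate paths and the $fetch callback are
// our own assumptions; a real run would pass fn($url) => @file_get_contents($url).
function findEmailOnSite(string $baseUrl, callable $fetch): ?string
{
    $paths = ["", "/contact", "/contact-us", "/about", "/about-us"]; // common spots
    foreach ($paths as $path) {
        $html = $fetch(rtrim($baseUrl, "/") . $path);
        if ($html === false) {
            continue; // page unreachable, try the next candidate
        }
        if (preg_match('/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/', $html, $m)) {
            return $m[0]; // first match wins
        }
    }
    return null; // no email anywhere we looked
}

// Stubbed fetcher so the demo runs offline:
$stub = fn(string $url) => strpos($url, "/contact") !== false
    ? "<p>Reach us at info@example.com</p>"
    : false;
echo findEmailOnSite("http://example-vendor.com", $stub), "\n";
```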
Here is the PHP code for the bot’s logic, which you can access in the Bitbucket repository.
The MySQL database insertion code is commented out in the provided script, in case you prefer to store the retrieved data in a database rather than a file. In this implementation, we opted to write the results directly to a CSV file.
The script writes two files:
- In Step 1, a plain-text file (links.txt) indexing each vendor’s Houzz profile URL.
- In Step 2, that file is read back in as the bot’s input: it visits each profile, follows the vendor’s website link, searches that external site for an email address, and records the results in a CSV file (archs.csv).
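The hand-off between the two files can be sketched as follows; the file names match the full script (links.txt and archs.csv), while the sample row data is fabricated and temp files keep the demo tidy:

```php
<?php
// Sketch of the two-file hand-off: Step 1 writes one target URL per line,
// Step 2 reads them back and appends a properly escaped CSV row per vendor.
$linksFile = tempnam(sys_get_temp_dir(), "links"); // stand-in for links.txt
$csvFile   = tempnam(sys_get_temp_dir(), "archs"); // stand-in for archs.csv

// Step 1: index the targets
$wh = fopen($linksFile, "w");
fwrite($wh, "http://www.houzz.com/pro/example-vendor\r\n");
fclose($wh);

// Step 2: consume the targets and record results
$in  = fopen($linksFile, "r");
$out = fopen($csvFile, "w");
while (($line = fgets($in)) !== false) {
    // fputcsv handles quoting/escaping (e.g. commas in company names)
    fputcsv($out, [1, 1, "Example, Vendor LLC", "555-0100", trim($line), "info@example.com"]);
}
fclose($in);
fclose($out);

echo file_get_contents($csvFile);
```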
Feedback and progress updates are written to the terminal during the script’s execution using fwrite(STDOUT).
```php
<?php
// You can get these files from the Bitbucket repository: https://bitbucket.org/nicknguyenzrd/houzzbot/
require("crawler.php");
require("CSSQuery.php");
/* Uncomment the block below to store data in MySQL
$servername = "localhost";
$username = "root";
$password = "";
$dbname = "invoice";
// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
*/
//Step 1: Gather Houzz Links
//Open the links file for reading; Step 2 will consume it after Step 1 appends the profile URLs
$handle = fopen("links.txt", "r");
$type=1;
//Data placeholder array
$data['href']=array();
$data['company']=array();
$data['type']=array();
$data['id']=array();
$id=1;
//Deal with multiple page results with The All Powerful Iterative Loop
for ($i = 1; $i <= 30; $i++) {
$doc = new DOMDocument();
if($i===1) {
$p=0; //The first page has a different URL (no /p/ suffix)
$doc->loadHTML( file_get_contents( "http://www.houzz.com/professionals/landscape-architect/orange-county"));
} else {
//Every page after the first appends "/p/{offset}"
$doc->loadHTML( file_get_contents( "http://www.houzz.com/professionals/landscape-architect/orange-county/p/" . $p ) );
}
//Webpage loaded for us
$css = new CSSQuery( $doc );
$arr = array();
$arr = $css->query( 'a.pro-title' );
foreach ( $arr as $a ) {
$href = $a->attributes->getNamedItem( 'href' )->value;
//Skip javascript placeholder links; keep only real profile links
if ( $href !== "javascript:;" ) {
//Store link and company name
$data['id'][]=$id;
$data['href'][] = $href;
$data['company'][] = $a->nodeValue;
$data['type'][]=1;
//Append to our list of links (use a separate handle so the Step 2 read handle stays open)
$wh = fopen('links.txt',"a+");
$somecontent = $href."\r\n"; // Use \r\n for Windows/Linux compatibility
fwrite($wh,$somecontent);
fwrite(STDOUT, $somecontent);
fclose($wh);
$id++;
}
}
$p=$p+15;
sleep(1);
unset($doc);
unset($css);
//var_dump( $data );
}
//Step 2: Gather company details. Houzz doesn't list email addresses, so we'll have to improvise and visit each vendor's own website to acquire the target email address, if it's listed there.
//Make sure we double-check we're dealing with valid URLs, because a bad URL can really break things once the bot is fired up!
function get_valid_url( $url ) { //Hand-rolled validator, kept as an alternative to filter_var(); not called below
$regex = "((https?|ftp)://)?"; // Scheme
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$regex .= "([a-z0-9-.]*)\.([a-z]{2,3})"; // Host or IP
$regex .= "(:[0-9]{2,5})?"; // Port
$regex .= "(/([a-z0-9+\$_-]\.?)+)*/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+/\\$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor
return preg_match("/^$regex$/i", $url);
}
if ($handle) {
while ( ( $line = fgets( $handle ) ) !== false ) {
$email="";
$website="";
$url="";
$name="";
$company="";
$phone="";
$link="";
$tier="";
$location="";
$license="";
$error="";
$sql="";
$doc = new DOMDocument();
// Suppress errors for malformed HTML
@$doc->loadHTML( file_get_contents( trim($line) ) ); // Trim whitespace from line
$css = new CSSQuery( $doc );
//Houzz Link to profile
$data['link']=$line;
$link=trim($line);
//Company Name
$nrr = $css->query( 'a.profile-full-name' );
if (isset($nrr[0]) && $nrr[0]->textContent) {
$data['company'][] = $nrr[0]->textContent;
$company=$nrr[0]->textContent;
fwrite(STDOUT, "Starting: ".$id.":".$nrr[0]->textContent."\r\n");
} else {
$company = "N/A";
fwrite(STDOUT, "Starting: ".$id.": Company Name Not Found\r\n");
}
//Website and Email Addresses TODO add conditional statement
$arr = $css->query( 'a.proWebsiteLink' );
$website_found = false;
foreach ( $arr as $a ) {
$url= $a->attributes->getNamedItem( 'href' )->value;
// Basic URL validation before attempting to crawl
if (filter_var($url, FILTER_VALIDATE_URL)) { // More robust URL validation
$data['website'][] = $url;
$website = $url;
$website_found = true; // Mark that a website was found
fwrite(STDOUT, "Attempting site: ".$url."\r\n");
// Simple email extraction - a real crawler would be more sophisticated
$site_content = @file_get_contents($url); // Use @ to suppress errors for unreachable sites
if ($site_content !== FALSE) {
// Regex to find email addresses
if (preg_match('/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/', $site_content, $matches)) {
$email = $matches[0];
// Output indicating email address discovered CLI
fwrite(STDOUT, "Found Email Address: ".$email."\r\n");
} else {
fwrite(STDOUT, "No Email Address Found on Site\r\n");
}
} else {
fwrite(STDOUT, "Could not retrieve content from site: ".$url."\r\n");
$error .= "Could not retrieve content from site; ";
}
// Only try the first valid URL found
break;
} else {
fwrite(STDOUT, "Invalid Website URL found: ".$url."\r\n");
$error .= "Invalid Website URL; ";
}
}
if (!$website_found) {
$website = "N/A";
fwrite(STDOUT, "No Website URL found on Houzz profile\r\n");
}
//Phone Number
$phone_found = false;
$crr = $css->query( 'span.pro-contact-text' );
foreach ( $crr as $c ) {
if($c->nodeValue!=="Website") { // Exclude the "Website" text itself
$phone = trim($c->nodeValue); // Trim whitespace
$data['phone'][] = $phone;
$phone_found = true;
break; // Assume only one phone number listed this way
}
}
if (!$phone_found) {
$phone = "N/A";
}
//All company details (Contact, Location, License, Tier)
$info = $css->query( 'div.info-list-text' );
$name = "N/A";
$location = "N/A";
$license = "N/A";
$tier = "N/A";
foreach ( $info as $i ) {
$text = trim($i->nodeValue); // Trim whitespace from text content
//Person to contact
if (strpos( $text, "Contact:" )!==FALSE) {
$name = str_replace( "Contact:",'', $text );
$name = trim($name);
$data['contact'][] =$name;
}
//Address/Location
if (strpos( $text, "Location:" )!==FALSE) {
$location = str_replace( "Location:",'', $text );
$location = trim($location);
$data['location'][]=$location;
}
//License Number
if (strpos( $text, "License Number:" )!==FALSE) {
$license=str_replace( "License Number:",'', $text );
$license=trim($license);
$data['license'][] =$license;
}
//Tier (Typical Job Costs)
if (strpos( $text, "Typical Job Costs:" )!==FALSE) {
$tier =str_replace( "Typical Job Costs:",'', $text );
$tier=trim($tier);
$data['tier'][]=$tier;
}
}
// Write architect contact information into a CSV file
$wr= fopen('archs.csv',"a+");
// Use fputcsv for proper CSV formatting and escaping
fputcsv($wr, [$id, $type, $company, $phone, $website, $email, $link, $name, $location, $license, $tier]);
//Uncomment below to also echo each row to the CLI
//fwrite(STDOUT, $company." written to archs.csv\r\n");
$id++;
fclose($wr);
/* Uncomment below if you'd rather insert the scraped data into a MySQL database
//NOTE: escape these values first (e.g. $conn->real_escape_string()) or use a prepared statement
$sql = "INSERT INTO ip_oppurtunities(`type`,`company`,`phone`,`website`,`email`,`link`,`contact`,`location`,`license`,`tier`)
VALUES (1,'$company','$phone','$website','$email','$link','$name','$location','$license','$tier')";
if ($conn->query($sql) === TRUE) {
fwrite(STDOUT, $id.'-'.$company." Added \r\n");
} else {
$error=mysqli_error($conn);
fwrite(STDOUT, "Error: ".$company."=[".$sql."]".$error."\r\n");
echo $error;
die(); // Consider logging error and continuing instead of dying
}
$id++;
*/
}
fclose($handle); // Close the links file after processing
// $conn->close(); // This should be outside the while loop if using DB insertion
}