
Writing recipes with Data Miner

Lesson 7: Containers and XPath

Updated: 12/15/2016 by David

A) Find Containers:

[Screenshot: containers]

Data Miner works by first identifying the containers that surround your data and then extracting elements from within those containers.

In a <TABLE>, a container is a <TR> (row); in a list, a container can be either a <DIV> or an <LI> that contains all the elements of one item.

B) Use Relative XPath

[Screenshot: item]

Once a container is identified, we specify the values inside it using a relative XPath from the container to the element we want to extract:

For example: if the XPath identifying the container is //table/tr, then the relative XPath for the first column in that row would be td[1] and for the second column td[2].
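The way a container XPath and a relative XPath combine can be sketched as follows. The resolveRelative helper is hypothetical and for illustration only; Data Miner resolves relative XPaths for you internally:

```javascript
// Hypothetical helper showing how a relative XPath resolves against a container XPath.
// Data Miner performs this step itself; this sketch only illustrates the concept.
function resolveRelative(containerXpath, relativeXpath) {
    if (relativeXpath.indexOf(".//") === 0)      // ".//x" = any descendant of the container
        return containerXpath + "//" + relativeXpath.slice(3);
    if (relativeXpath.indexOf("./") === 0)       // "./x" = direct child of the container
        return containerXpath + "/" + relativeXpath.slice(2);
    return containerXpath + "/" + relativeXpath; // bare "x" also means a direct child
}

console.log(resolveRelative("//table/tr", "td[1]")); // "//table/tr/td[1]" - first column of each row
console.log(resolveRelative("//table/tr", "td[2]")); // "//table/tr/td[2]" - second column
```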

Lesson 8: Creating your own Recipes

Updated: 12/15/2016 by David

Watch a Step by Step Guide

Watch the guide to see how to scrape a list from Etsy.com. Using the Chrome Developer Tools, we will show you how to find the XPath for a website element and then enter it into Data Miner to extract the information.

Start by opening up a "New Recipe" from within the Data Miner window. We've provided the xpaths from the video to help you follow along.

  • XPath for the Item containers: //div[@class='buyer-card card']
  • XPath for the Item URL: .//a[contains(@class, 'card-title')]/@href
  • XPath for the Item Title: .//a[contains(@class, 'card-title')]
  • XPath for the Item Price: .//span[contains(@class, 'currency ')]
  • XPath for the Item Shop Name: .//a[contains(@class, 'card-shop-name')]
  • XPath for the Item Shop URL: .//a[contains(@class, 'card-shop-name')]/@href
  • Next page: //div[contains(@class, 'pagination')]//span[contains(@class, 'ss-navigateright')]

Lesson 9: JavaScript Snippets

Updated: 12/4/2016 by Ben

A) Clean up data after you scrape with JavaScript:

Example of a data clean-up script:
var cleanup = function(results) {
  // loop through each row of results and change each column

  //debugger;

  $.each(results, function(){
    this.values[0] = "xxxx -" + this.values[0];
    this.values[1] = this.values[1] + "- yyyyy";
  });

  return results; // return modified results
};
                
Using JavaScript, you can clean up your scraped results and do more sophisticated data extraction than is possible with XPath alone. Data Miner will pass the scraped data to a JavaScript function that you provide. You can then modify the data and pass it back to Data Miner for saving into your data collection.

With custom JavaScript you can:

  • Extract Email addresses from text
  • Remove unwanted text from scraped data
  • Change currency types or units
  • Separate or join column data
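As an example of the first item above, here is a minimal cleanup sketch that extracts an email address from scraped text. The regex and the column index are assumptions for illustration; plain forEach is used here, though Data Miner snippets can equally use $.each as in the example above:

```javascript
// Sketch: keep only the first email address found in column 0 of each row.
// "results" has the shape Data Miner passes to cleanup: rows with a "values" array.
var cleanup = function(results) {
    var emailRe = /[\w.+-]+@[\w-]+\.[\w.-]+/; // simple pattern, not a full RFC validator
    results.forEach(function(row) {
        var match = row.values[0].match(emailRe);
        row.values[0] = match ? match[0] : ""; // blank out the cell if no email was found
    });
    return results;
};
```

For instance, `cleanup([{ values: ["Contact: jane@example.com for info"] }])` turns column 0 into "jane@example.com".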

B) Click on elements before scraping

You can provide your own JavaScript function that Data Miner will run before it scrapes the data. Pre- and post-scraping hooks give you the power to do any work before or after scraping is performed.


Examples of how Pre and Post hooks can help you:

  • With Pre-hook, you can wait for an element to be present on the page before starting the scrape process.
  • With Pre-hook, you can fill in a form and submit it before scraping the page.
  • With Pre-hook, you can click on an item on the page or do AJAX calls.
  • With Post-hook, you can clean up your data or click on a button.
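The hook shapes described above boil down to a workflow object with two optional functions. This is a bare skeleton with placeholder bodies, based on the preScrape/postScrape signatures shown later in this lesson:

```javascript
// Minimal shape of a Data Miner workflow with pre- and post-scrape hooks.
// The bodies are placeholders; fuller examples appear later in this lesson.
var workflow = {
    preScrape: function(request, callBack) {
        // do any clicking / waiting here, then hand control back to Data Miner:
        callBack();
    },
    postScrape: function(results) {
        // modify the scraped rows here, then return them:
        return results;
    }
};
```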

C) Filling forms with JavaScript:

Scrape Search Page
var workflow = {
    paginationType: "ajax",

    fillForm: function(context, resolve) {
        console.log("starting POST hook");

        if (!context.inputData)
            context.inputData = {
                name: "pizza",
            };

        return [{
            type: "text",
            selector: "input[name$='find_desc']",
            value: context.inputData.name,
            waitAfter: 1
        }, {
            type: "button",
            selector: ".main-search_submit",
            done: function() {
                resolve();
            }
        }];
    }
};             

With Data Miner you can automatically fill forms by uploading a CSV into your Collections and using a form-filling recipe. To create a form-filling recipe you must include the JavaScript snippet and update the selectors to the right attributes for your site. In addition, make sure the CSV column titles match the key names exactly (in the example above, "pizza" is the value of the key "name"). Once the recipe is complete, run a job with the CSV as the source collection and your new form-filling recipe as the recipe.
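The CSV-to-recipe mapping works because each column title becomes a key on context.inputData. A minimal sketch of that mapping follows; the csvRowToInputData helper is hypothetical, since Data Miner performs this step for you:

```javascript
// Hypothetical sketch of how a CSV row becomes context.inputData.
// Column titles become keys, which is why the CSV header must match the key names in fillForm.
function csvRowToInputData(headers, row) {
    var inputData = {};
    headers.forEach(function(title, i) {
        inputData[title] = row[i];
    });
    return inputData;
}

// A CSV with the header "name" and the row "pizza" yields { name: "pizza" },
// which is exactly the shape read by context.inputData.name in the snippet above.
console.log(csvRowToInputData(["name"], ["pizza"]));
```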

Watch an Example Below

Now Give Form Filling a try!

Download the CSV file HERE
And head over to YELP




See even more examples of JavaScript hooks below:


/* --------------------------------------------------------------------------------------------------------------------

Here is an example of a pre-scrape hook. In this example an element is found using the jQuery selector ".tsd_name > a".
Then the element is clicked, we wait 2 seconds for the page to change, and then we tell Data Miner to continue
to scrape the page.

*/

var workflow = {

    "preScrape": function(request, callBack) {
        console.log("starting Pre-scrape hook");

        var $el = $(".tsd_name > a"); // Element to click on.
        var waitTime = 2; // Wait for n seconds and then continue to scrape the page

        if ($el.length > 0) {
            $el[0].click();
        }

        setTimeout(function() {
            callBack();
        }, waitTime * 1000);
    }
};


/* --------------------------------------------------------------------------------------------------------------------

Here is another example of a pre-scrape hook. In this example the pre-scrape hook will wait up to 5 seconds for
the element specified by the jQuery selector #footer to appear on the page. In a loop we test for the presence of #footer and,
if it is not present, wait for 1 second and repeat the loop. Once the element is found we call callBack, which transfers
control back to Data Miner to continue to scrape.

*/

var workflow = {

    /* --------------------------------------------------------------
    preScrape function:
        Will be executed before any scraping is done. Must call callBack to give execution control back to Data Miner

    Input:
        request: Context for the request. URL, scraping, parameter etc.
        callBack: callback function to be called when all the pre-scraping work is done.
    Return:
        nothing

    ------------------------------------------------------------- */
    "preScrape": function(request, callBack) {
        console.log("starting Pre-scrape hook");
        //debugger;

        var condition = "#footer"; // Wait for presence of this element before scraping
        var loopCounterMax = 5; // Maximum number of seconds to wait before giving up
        var loopCounter = 0;

        var wait = function() {
            var $test = $(condition);

            if ($test.length > 0 || loopCounter > loopCounterMax) {
                if (callBack)
                    callBack();    // Must be called at the end when all the PreScrape work is done

            } else {
                loopCounter++;
                setTimeout(wait, 1000);
            }
        };

        wait();
    },
}


/* --------------------------------------------------------------------------------------------------------------------

Here is an example of a post-scrape hook. In this example you are given the data that was scraped from the page in
the form of an array. You can then modify the results and return the array back to Data Miner.

*/

var workflow = {

    /* --------------------------------------------------------------
    postScrape function:
        Will be executed after the scraping is finished. You will get the scraped results and can
        clean up or modify them

    Input:
        results: Scraped data array
    Return:
        results: Modified data array

    ------------------------------------------------------------- */
    "postScrape": function(results) {
        console.log("starting Post-scrape hook");

      // loop through each row of results and change each column

      //debugger;

      $.each(results, function(){
        this.values[0] = "xxxx -" + this.values[0];
        this.values[1] = this.values[1] + "- yyyyy";
      });

      return results; // return modified results

    }
};

/* --------------------------------------------------------------------------------------------------------------------

Here is an example of a scrape hook. You can replace the scrape functionality of Data Miner entirely by providing your own
scrape function, which will be called instead of the scrape function of Data Miner.

*/

var workflow = {

    /* --------------------------------------------------------------
    scrape function:
        Will be executed instead of the default [originalScrape] scrape function of Data Miner.
        The XPaths in the Data Miner UI will be ignored. However, the number of columns of data returned
        must match the number of columns specified in the UI.

    Input:
        request: Context for the request. URL, scraping, parameter etc.
        originalScrape: the default scrape function of Data Miner.
        callBack: callback function to return the results.
    Return:
        results: Modified data array

    ------------------------------------------------------------- */
    "scrape": function(request, originalScrape, callBack) {
        console.log("starting scrape hook");

        var results = [];
        results.push({
            "values": [
                "1234", "1234"
            ]
        });

        callBack(results);
    }
};

 /* --------------------------------------------------------------
For splitting names (splits by a space)
Use cleanup
------------------------------------------------------------- */
var cleanup = function(results) {
	//debugger;
	$.each(results, function() {
		// assumes columns 2 and 3 both contain the full "First Last" name
		var x = this.values[2].indexOf(" ");
		this.values[2] = this.values[2].substring(0, x); // keep the first name
		this.values[3] = this.values[3].substring(x + 1, this.values[3].length); // keep the last name, without the leading space
	});
	return results; // return modified results
};

 /* --------------------------------------------------------------
Split names by the space when they are in "Last, First" format; also removes the comma
--------------------------------------------------------------*/
var cleanup = function(results) {
	//debugger;
	$.each(results, function() {
		// assumes columns 1 and 2 both contain the full "Last, First" name
		var x = this.values[1].indexOf(" ");
		var y = this.values[1].indexOf(",");
		this.values[1] = this.values[1].substring(x + 1, this.values[1].length); // keep the first name
		this.values[2] = this.values[2].substring(0, y); // keep the last name, without the comma
	});
	return results; // return modified results
};

 /* --------------------------------------------------------------
Replace any non-alphanumeric character (except parentheses and underscores) with a "-"
--------------------------------------------------------------*/
var cleanup = function(results) {
	//debugger;
	$.each(results, function() {
		this.values[1] = this.values[1].replace(/[^a-z0-9()_]/gi, '-');
	});
	return results; // return modified results
};
 /* --------------------------------------------------------------
Click a Button
--------------------------------------------------------------*/
var workflow = {
	"preScrape": function(request, callBack) {
		console.log("starting Pre-scrape hook");
		var condition = "a[class~='xxxx']";
		var $test = $(condition);
		if ($test.length > 0) {
			$test[0].click();
			var wait = function() {
				callBack();
			};
			setTimeout(wait, 3000);
		} else callBack();
	}
};

/* --------------------------------------------------------------
Button click and close
--------------------------------------------------------------*/
var workflow = {
	"preScrape": function(request, callBack) {
		console.log("starting Pre-scrape hook");
		//debugger;
		var condition = "button[data-lira-action~='edit-contact-info']"; // Wait for presence of this element before scraping
		var $test = $(condition);
		if ($test.length > 0) {
			$test[0].click();
			var wait = function() {
				callBack();
			};
			setTimeout(wait, 3000);
		} else callBack();
	},
	"postScrape": function(results) {
		console.log("starting Post-scrape hook");
		var $close = $(".dialog-close");
		if ($close.length > 0) {
			$close[0].click();
		}
		return results;
	}
};

/* --------------------------------------------------------------
Filter Data Miner results
 --------------------------------------------------------------*/
var workflow = {
	"postScrape": function(results) {
		console.log("starting Post-scrape hook");
		// loop through each row of results and change each column
		//debugger;
		var results2 = [];
		$.each(results, function() {
			//	debugger;
			//console.log("in each", this);
			if (this.values[2] !== "https://www.linkedin.com/") results2.push(this);
		});
		return results2; // return modified results
	}
};


/* --------------------------------------------------------------
Auto Scrolling with an interval and a max (Twitter)
 --------------------------------------------------------------*/
var workflow = {
	"preScrape": function(request, callBack) {
		console.log("starting Pre-scrape hook");
		//debugger;
		var waitTime = 3000; // milliseconds
		var maxLoopCount = 50;
		var count = 0;
		var loopCount = 0;

		function loop() {
			loopCount++;
			if ($("li[class~='stream-item']").length !== count && loopCount < maxLoopCount) {
				window.scrollTo(0, document.body.scrollHeight);
				count = $("li[class~='stream-item']").length;
			} else {
				clearInterval(tid); // stop the interval before handing control back
				if (callBack) callBack();
			}
		}
		var tid = setInterval(loop, waitTime);
	}
};


/* --------------------------------------------------------------
Isolate data by Index
 --------------------------------------------------------------*/
var cleanup = function(results) {
	//debugger;
	$.each(results, function() {
		this.values[3] = this.values[3].substring(0, 13);
		this.values[4] = this.values[4].substring(14, 30);
	});
	return results; // return modified results
};


/* --------------------------------------------------------------
Using form filling with drop-down menus. The following JavaScript will click to open a form,
click a drop-down menu, select an item from within the list and then click submit.

A CSV with a column titled "location", containing values from 0 through the number of elements in the
drop-down menu, can be injected into a selector, allowing you to select and search different items in a
drop-down when injecting plain text doesn't work.
 --------------------------------------------------------------*/
var workflow = {
    paginationType: "ajax",

    fillForm: function(context, resolve) {
        console.log("starting POST hook");

        if (!context.inputData)
            context.inputData = {
                location: "0", //starting from 0, the location is where the item lives within the list in the drop down.

            };

        return [{
            type: "button",
            selector: "a[class~='XXXX']", //open button selector
            waitAfter: 2
        },{
            type: "button",
            selector: "*[class~='XXXX']", //form button selector
            waitAfter: 2
        },{
            type: "button",
            selector: "*[id~='XXXX" + context.inputData.location + "']", // inputData.location is the number defined
            // above or injected from the CSV, and then added to the drop-down item selector.
            waitAfter: 2
        },{
            type: "button",
            selector: "button[name~='skipandexplore']", //submit button selector
            done: function() {
                resolve();
            }
        }];
    }
};

Get Beta version of Data Miner

Preview New Features

  • Bug fixes
  • Run your own custom JavaScript code

Note: We recommend that you run the Production and Beta versions of Data Miner side by side, so that you can fall back to the Production version if you find a blocking issue in the Beta. Each version runs independently, and they don't interfere with each other.

Download latest Beta version.