The first step to a successful scrape, and to picking the right recipe, is understanding the two page styles that Data Miner looks for: a search page and a detail page.
A search page typically has a search bar and as you search, the page will show you a list of items found.
Each item will have a URL to a detail page.
Search pages usually span multiple pages and require pagination, which is how you move to the next page of results.
A detail page displays a single item from the Search page.
The detail page typically has a larger image, more detailed information and contact info.
Detail pages typically do not have more than one page and won’t require pagination.
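To make the relationship concrete, here is a minimal Python sketch of what a search-page scrape typically yields: one row per item found, each carrying the URL of its detail page. The column names and URLs below are made up for illustration; your recipe’s output columns will differ.

```python
import csv
import io

# Hypothetical search-page results: one row per item found,
# each row carrying the URL of that item's detail page.
search_results_csv = """name,price,detail_url
Widget A,9.99,https://example.com/items/widget-a
Widget B,14.50,https://example.com/items/widget-b
Widget C,7.25,https://example.com/items/widget-c
"""

# Parse the CSV and collect the detail-page URLs, as a job later would.
rows = list(csv.DictReader(io.StringIO(search_results_csv)))
detail_urls = [row["detail_url"] for row in rows]

print(detail_urls)
```

The `detail_url` column is the key piece: it is what lets a job move from the search-page level down to each individual detail page.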
Some recipes are first tested by Data Miner; these are indicated by a “$” at the beginning of the recipe name.
Recipe type is most important when running a job. A job requires URLs to access the deeper levels of data, and these URLs can only be extracted from search pages with search-page recipes. To learn about jobs and multi-level scraping, continue to Lesson 5.
In an automated job, Data Miner uses the results of a search-page recipe to automatically visit each detail page and scrape it with the detail-page recipe.
1) Starting from a search page, find the data you want to scrape and extract it using a public recipe, as you did in Lesson 1. This will extract the list information as well as the URLs of the individual detail pages.
2) Download the results by clicking “Download” in the bottom right corner. It will download as a CSV.
3) Navigate to your Data Collections folder. To get to the Data Collections folder and the Jobs page, click “Collections” in the bottom right corner of the Data Miner window.
4) Upload the CSV containing the URLs. Click “Upload a CSV” and select the CSV file from the first scrape.
5) Once the CSV is uploaded, it’s time to create the job. To create a job, click on the Jobs tab in the left-hand panel and fill out the necessary fields.
6) Once the Job is saved, it will appear at the top of the Jobs page. Press Run.
7) The first URL will open in a new window. Data Miner will scrape the information, close the window, and move on to the next URL.
8) As the job runs, you can check its progress by clicking the Data Collections tab and then clicking on the output file that you named earlier. If you have scraped all the data you need, click Stop/Close in the pop-up window, or wait until the job reaches the end of the URLs and stops automatically.
9) Once finished, click on Data Collections, select the output file, and download it in your preferred format, Excel (XLS) or CSV.
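The job workflow above can be sketched conceptually in Python. The `scrape_detail_page` function is a stand-in for Data Miner applying a detail-page recipe (this is not Data Miner’s actual API, and the URLs are made up): the job reads the uploaded CSV of detail URLs, visits each one in turn, and collects the output.

```python
import csv
import io
import time

def scrape_detail_page(url):
    """Stand-in for Data Miner applying a detail-page recipe to one URL."""
    return {"url": url, "status": "scraped"}

def run_job(urls, wait_seconds=15):
    """Visit each detail URL in turn, pausing between scrapes.
    The 15-second default mirrors Data Miner's default wait time."""
    output = []
    for url in urls:
        output.append(scrape_detail_page(url))
        time.sleep(wait_seconds)  # give the next page time to load
    return output

# Hypothetical CSV from the first (search-page) scrape.
uploaded_csv = """detail_url
https://example.com/items/1
https://example.com/items/2
"""
urls = [row["detail_url"] for row in csv.DictReader(io.StringIO(uploaded_csv))]
results = run_job(urls, wait_seconds=0)  # 0 here only so the sketch runs instantly
print(len(results))
```

The loop mirrors steps 6–8: one window per URL, scrape, close, move on, until the list of URLs is exhausted.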
Jobs have many advanced tools to assist with your scraping needs. We will cover the two most useful ones in this section; we recommend exploring the rest on your own to take full advantage of jobs.
To find these tools, go to the Jobs page and click Advanced Options at the bottom of the page.
Wait Time Between Scrapes is a setting to use when troubleshooting. Data Miner runs entirely on your computer, which means each scrape depends on your own Internet speed.
For that reason, if your connection is slow and the page has not finished loading before Data Miner tries to scrape it, the scrape will return a failure. To prevent this, simply increase the wait time to 60 or 90 seconds and press “Save”. By default, it is set to 15 seconds.
Please note: never decrease the wait time below 15 seconds; doing so could cause the job to fail.
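The wait-time guidance above amounts to a simple rule, sketched here in Python (this is our own illustration, not Data Miner’s implementation): never let the configured value drop below the 15-second default, and raise it toward 60–90 seconds when scrapes fail.

```python
MIN_WAIT_SECONDS = 15           # Data Miner's default; never go below this
TROUBLESHOOT_WAIT_SECONDS = 60  # raise to 60-90 s when pages load too slowly

def effective_wait(requested_seconds):
    """Clamp a requested wait time so it never drops below the safe minimum."""
    return max(MIN_WAIT_SECONDS, requested_seconds)

print(effective_wait(5))   # below the minimum, so it is clamped up to 15
print(effective_wait(90))  # a troubleshooting value, kept as-is
```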
Job Pagination is a tool that extends your scraping capability. There will be times when a detail page has multiple pages that you would like to scrape; that is when this tool is useful.
All you have to do is check “Follow Paginated Page Results” and then enter the maximum number of pages you expect each detail page to have.
Please note: the detail-page recipe must have pagination capability. If it does not and you’re not sure how to add it, continue on to Lessons 7-9!
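Job pagination can be pictured with the sketch below. The `?page=N` URL pattern is a made-up assumption for illustration; in practice Data Miner follows the recipe’s own pagination, not a URL template. The idea is simply that each detail URL expands into up to the configured maximum number of pages.

```python
def paginated_urls(detail_url, max_pages):
    """Sketch: expand one detail URL into the pages a paginated job might visit.
    Assumes a hypothetical ?page=N query pattern, for illustration only."""
    return [f"{detail_url}?page={n}" for n in range(1, max_pages + 1)]

# With a maximum of 3 pages, one detail URL yields up to 3 pages to scrape.
print(paginated_urls("https://example.com/items/1", 3))
```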