Semalt Shares An Easy Way Of Extracting Information From Websites
Web scraping is a popular method of obtaining content from websites. A specially programmed algorithm visits the main page of a site and follows all internal links, collecting the contents of the elements (such as divs) you specify. The result is a ready-made CSV file containing all the necessary information in a strict order. Such a CSV can later be used to create nearly unique content, and as a table the data is valuable in itself. Imagine the entire product list of a construction shop presented in a table, with every field and characteristic filled in for each product, type, and brand. Any copywriter working for an online store would be happy to have such a CSV file.
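To make this concrete, here is a minimal Python sketch of how such a CSV of products could be read back into structured records. The column names (title, brand, price) and the sample rows are invented for illustration; they are not what Portia will produce for your site.

```python
import csv
import io

# Hypothetical sample of the kind of CSV a scraper might produce;
# the columns (title, brand, price) are illustrative only.
raw = """title,brand,price
Hammer 500g,ToolCo,9.99
Drill X200,BoreMaster,79.00
"""

# csv.DictReader turns each row into a dict keyed by the header row.
products = list(csv.DictReader(io.StringIO(raw)))
for p in products:
    print(p["title"], "-", p["brand"])
```

With the data in this form, each row is a dictionary, which is easy to filter, sort, or hand off to a copywriter's workflow.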
There are many tools for extracting data from websites (web scraping), and don't worry if you aren't familiar with any programming language: in this article I will show one of the easiest ways, using Scrapinghub.
First of all, go to scrapinghub.com, register, and log in.
The next step, which asks about your organization, can simply be skipped.
You will then land on your profile page, where you need to create a project.
Here you need to choose an algorithm (we will use "Portia") and give the project a name. Let's pick something distinctive, for example "111".
Now we get into the working space of the algorithm, where you need to type the URL of the website you wish to extract data from. Then click "New Spider".
Next, go to the page that will serve as an example. Its address is updated in the header. Click "Annotate This Page".
Move your mouse cursor to the right to make the menu appear. Here we are interested in the "Extracted item" tab, where you need to click "Edit Items".
For now, an empty list of fields is displayed. Click "+ Field".
Everything is simple here: you need to create a list of fields. For each one, enter a name (in this case, a title and content) and specify whether the field is required ("Required") and whether it can vary ("Vary"). If you mark a field as "Required", the algorithm will simply skip pages where it cannot fill that field; if you leave it unflagged, the process could run indefinitely.
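The "Required" behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Portia's actual implementation; the field names match the title/content example, but the sample items are invented.

```python
# Fields configured as "Required" in the example above.
REQUIRED_FIELDS = ["title", "content"]

def keep_item(item, required=REQUIRED_FIELDS):
    """Keep an extracted item only if every required field is present and non-empty."""
    return all(item.get(field) for field in required)

# Hypothetical extracted items: the second one is missing its content,
# so a required-field check would skip that page.
items = [
    {"title": "Page A", "content": "Some text"},
    {"title": "Page B", "content": ""},
]
kept = [it for it in items if keep_item(it)]
```

Here only "Page A" survives the filter, which is exactly the skipping behavior the "Required" flag describes.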
Now simply click on the field we need and indicate what it is:
Done? Then click "Save Sample" in the website header. After that, you can return to the working space. Now that the algorithm knows how to get the data, we need to set a task for it. To do this, click "Publish Changes".
Go to the task board, click "Run Spider", choose the website and priority, and click "Run".
Well, scraping is now in progress. You can see its speed by pointing your cursor at the number of sent requests:
The rate at which finished rows are added to the CSV is shown by pointing at another number.
To see a list of the items extracted so far, just click on this number. You will see something similar to this:
When it's finished, the result can be saved by clicking this button:
That's it! Now you can extract information from websites without any experience in programming.