Multi-task crawling

This type of usage can be fully automated to run on the server side as a background task once it is configured. The user adds a ‘Site’ object that is managed by the service and associated with the user’s session and/or account.

Processing takes place on the service side and includes crawling and, if configured, scraping. The crawler processes collected URLs in parallel. Depending on the configuration, the process can be finite or infinite, with scheduled re-crawl periods and a limit on the maximum number of URLs processed during one iteration.
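As a rough illustration of the settings mentioned above, a Site configuration could be modelled as follows. This is only a sketch: the class name, the field names, and the convention that a missing re-crawl period means a finite, single-pass process are assumptions made for the example, not the service's actual configuration schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteConfig:
    """Hypothetical container for the per-Site settings described in the text."""

    root_urls: list[str]
    # Upper bound on URLs processed during one crawl iteration (assumed name).
    max_urls_per_iteration: int = 1000
    # Assumption: None means a finite, one-time crawl; a number means the
    # site is re-crawled indefinitely with this period.
    recrawl_period_seconds: Optional[int] = None
    # Scraping is optional and only runs when configured.
    scraping_enabled: bool = False

    def is_finite(self) -> bool:
        """Return True when no re-crawl schedule is set."""
        return self.recrawl_period_seconds is None


one_time = SiteConfig(root_urls=["http://example.com/"])
hourly = SiteConfig(root_urls=["http://example.com/"], recrawl_period_seconds=3600)
```

Here `one_time` describes a finite process, while `hourly` describes an infinite one that repeats every hour.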

A Site can also have several root URLs, which serve as entry points for the site scan on each re-crawl. The re-crawl process consists of downloading resources from the root URLs and collecting the URL links they contain. The same process is then repeated for each collected URL, and so on.
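The re-crawl loop described above can be sketched as a breadth-first traversal. The sketch below is an assumption-laden illustration, not the service's implementation: it runs sequentially rather than in parallel, the function names are hypothetical, and an in-memory dictionary stands in for real downloads and link extraction.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_iteration(
    root_urls: Iterable[str],
    fetch_links: Callable[[str], list[str]],
    max_urls: int,
) -> list[str]:
    """Run one re-crawl iteration and return the URLs in processing order."""
    queue = deque()
    seen = set()
    # Start from the root URLs, ignoring duplicates.
    for url in root_urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    processed = []
    # Stop when there is nothing left or the per-iteration limit is reached.
    while queue and len(processed) < max_urls:
        url = queue.popleft()
        processed.append(url)
        # "Download" the resource and collect the links it contains.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return processed


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fake_fetch_links(url: str) -> list[str]:
    """Return the links found on a page of the in-memory site."""
    return SITE.get(url, [])


all_urls = crawl_iteration(["http://example.com/"], fake_fetch_links, max_urls=10)
limited = crawl_iteration(["http://example.com/"], fake_fetch_links, max_urls=2)
```

With the limit set to 2, the iteration stops after the root page and the first collected link, which mirrors the per-iteration URL limit mentioned earlier.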

If configured, downloaded content is processed by a scraping algorithm as a separate processing stage in isolated processes.

Results of the site processing (crawling and scraping) can be downloaded one by one from the resource management page in the administrative management panel. Alternatively, periodic creation of a data archive can be scheduled, with either a notification containing a link or a callback HTTP upload request that delivers the results to the target destination.

In this mode, processing of the Site is fully automated, including crawling and scraping.

Crawling and scraping do not run strictly one after the other: scraping lags behind crawling, and the lag depends on the overall load on the host server and on the number of tasks performing long-running I/O operations.
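One common way to decouple two such stages is a queue between them, so that the crawler hands off downloaded content and continues while scraping catches up later. The sketch below illustrates that idea only. It is an assumption, not a description of how the service is built: it uses a thread instead of the isolated processes mentioned earlier, and a word count stands in for a real scraping algorithm.

```python
import queue
import threading


def run_pipeline(pages: dict[str, str]) -> dict[str, int]:
    """Feed 'crawled' pages to a separate scraping worker through a queue."""
    pending = queue.Queue()
    results = {}

    def scraper() -> None:
        # The scraping stage consumes items whenever they become available.
        while True:
            item = pending.get()
            if item is None:  # Sentinel: the crawler has finished.
                break
            url, content = item
            # Stand-in for scraping: count the words on the page.
            results[url] = len(content.split())

    worker = threading.Thread(target=scraper)
    worker.start()

    # The crawling stage hands each page off and moves on without waiting.
    for url, content in pages.items():
        pending.put((url, content))
    pending.put(None)

    worker.join()
    return results


word_counts = run_pipeline({"http://example.com/": "a b c", "http://example.com/a": "x"})
```

In this arrangement the delay between a page being queued and being scraped grows with the backlog in the queue, which corresponds to the load-dependent time lag described above.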

Dedicated server or multi-host

This type of DC mode uses one or more virtual or physical host servers to deploy a dedicated installation of the DC service with the Control Panel and management console.

After configuration, the customer can use the service management console directly on the dedicated, isolated installation. The management console makes it possible to create, change, manage the state of, and delete sites, with a number of configuration settings.

All types of service accounts can be used in test mode as a free trial. All test modes are limited in the number of requests, the size of data, and the trial period, but are otherwise fully functional, so they give an idea of what the service can do.
The “Professional plan” works in unlimited mode, which includes support.

All accounts come with a REST API, so the corresponding requests can be performed automatically by programs. For examples of API usage, click here.

