16 Oct

Under the Hood of News Hub (main functionality)

1. Crawling:

  • scan websites
  • analyze and parse web pages
  • detect and collect URL links and other web resources.
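The link-collection step above can be sketched with the standard-library HTML parser. The page fragment and base URL here are hypothetical stand-ins; the real crawler would fetch pages over HTTP first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Parse a page and collect absolute URLs from its anchor tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL.
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical page fragment; a real crawler feeds in downloaded HTML.
html = '<a href="/world/item1">World</a> <a href="https://example.com/tech">Tech</a>'
collector = LinkCollector("https://example.com/news")
collector.feed(html)
print(collector.links)
```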

2. Download resources from web servers using automatically collected or user-provided URLs, including dynamically rendered (JavaScript) web pages, and store them in a sharded local raw file storage.
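One common way to lay out a sharded raw file store is to key each resource by a hash of its URL and use hash prefixes as directory levels, so millions of files never pile up in one directory. This is a sketch of that idea, not News Hub's actual layout; the `raw` root and two-level sharding are assumptions.

```python
import hashlib
import os

def shard_path(url, root="raw"):
    """Map a URL to a sharded on-disk path: root/ab/cd/<full-hash>.
    The first two hash-prefix levels spread files across directories."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], digest)

path = shard_path("https://example.com/article/42")
print(path)
```

The downloader would write the fetched bytes (or the rendered DOM, for JavaScript-heavy pages) to that path.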

3. Process web pages with customizable algorithms such as unstructured textual content scraping, statistical data mining, NLP data mining, and so on.
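A minimal sketch of the first two processing stages named above, assuming nothing about News Hub's internals: a crude tag-stripping text scraper followed by a toy statistical step (term frequencies).

```python
import re
from collections import Counter

def extract_text(html):
    """Strip tags and collapse whitespace - a crude stand-in for a
    real unstructured-content scraper."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def top_terms(text, n=3):
    """Toy statistical-mining step: the n most frequent terms."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

# Hypothetical downloaded page.
page = "<h1>Markets rally</h1><p>Markets closed higher as tech stocks led the rally.</p>"
text = extract_text(page)
print(text)       # Markets rally Markets closed higher as tech stocks led the rally.
print(top_terms(text))
```

In the real pipeline these algorithms are pluggable, so an NLP stage (entity extraction, classification) can replace or follow the statistical one.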

4. Store results in local SQL DB storage built on a distributed multi-host, multi-process architecture model.
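As an illustration of the local SQL storage step, here is a minimal sketch using SQLite; the `pages` schema is hypothetical, and an in-memory database stands in for a per-host shard.

```python
import sqlite3

# In-memory DB stands in for one node's local SQL storage;
# a multi-host deployment would point each worker at its own shard.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pages (
    url        TEXT PRIMARY KEY,
    fetched_at TEXT NOT NULL,
    title      TEXT
)""")
conn.execute(
    "INSERT INTO pages (url, fetched_at, title) VALUES (?, ?, ?)",
    ("https://example.com/news/1", "2023-10-16T00:00:00Z", "Sample headline"),
)
conn.commit()
row = conn.execute(
    "SELECT title FROM pages WHERE url = ?",
    ("https://example.com/news/1",),
).fetchone()
print(row[0])  # Sample headline
```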

5. Crawling, processing, and data archiving management.

6. Distributed data architecture tasks such as aging, purging, statistical analysis, and more.
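An aging/purging task typically reduces to deleting records older than a cutoff. A minimal sketch against the same kind of hypothetical `pages` table (SQLite here; ISO-8601 date strings compare correctly as text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, fetched_at TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?)", [
    ("https://example.com/old", "2023-01-01"),
    ("https://example.com/new", "2023-10-01"),
])

def purge_older_than(conn, cutoff):
    """Aging task: drop records fetched before the cutoff date."""
    cur = conn.execute("DELETE FROM pages WHERE fetched_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount

purged = purge_older_than(conn, "2023-06-01")
print(purged)  # 1
```

In a distributed setup each host would run such a task against its own shard, on a schedule.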

7. Task scheduling and balancing, using either the task management service of the multi-host architecture or a real-time, multi-threaded, load-balancing client-server architecture.
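One simple balancing policy such a service might use is least-loaded dispatch: each new task goes to the host with the fewest tasks so far. This is a generic sketch, not News Hub's scheduler; host and task names are made up.

```python
import heapq

def assign_tasks(tasks, hosts):
    """Least-loaded balancing sketch: pop the host with the smallest
    current load from a min-heap, give it the task, push it back."""
    heap = [(0, host) for host in hosts]  # (load, host)
    heapq.heapify(heap)
    assignment = {host: [] for host in hosts}
    for task in tasks:
        load, host = heapq.heappop(heap)
        assignment[host].append(task)
        heapq.heappush(heap, (load + 1, host))
    return assignment

plan = assign_tasks(["t1", "t2", "t3", "t4", "t5"], ["hostA", "hostB"])
print(plan)
```

With equal task weights this degenerates to round-robin; a real scheduler would weight the load by task cost or host capacity.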
