16 Oct

Under the Hood of News Hub (main functionality)

1. Crawling:

  • scan websites
  • analyze and parse web pages
  • detect and collect URLs, links, and web resources.

2. Download resources from web servers using automatically collected or provided URLs, including dynamic JS-rendered web pages, and store them in sharded local raw file storage.

3. Process web pages with customizable algorithms such as unstructured text scraping, statistical data mining, NLP data mining, and so on.

4. Store results in local SQL DB storage with a distributed multi-host and multi-process architecture model.

5. Crawling, processing, and data archiving management.

6. Distributed data architecture tasks such as aging, purging, statistics, and more.

7. Task scheduling and balancing using the task management service of the multi-host architecture or a real-time multi-threaded load-balancing client-server architecture.
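
As a rough illustration of steps 1 and 2 above, the sketch below collects URLs from a page and picks a shard for storing the raw download. It is not the News Hub or HCE-DC implementation: the names `LinkCollector`, `collect_urls`, and `shard_path`, the `raw/shard_NN/` directory layout, and the shard count of 16 are all assumptions made for this example, and it uses only the Python standard library.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs found in href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL and
                # drop the "#fragment" part so duplicates collapse.
                absolute, _ = urldefrag(urljoin(self.base_url, value))
                is_web = absolute.startswith(("http://", "https://"))
                if is_web and absolute not in self.urls:
                    self.urls.append(absolute)


def collect_urls(html, base_url):
    """Return the unique URLs of links and resources, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.urls


def shard_path(url, shard_count=16):
    """Map a URL to a file path inside one of `shard_count` shards.

    The layout is hypothetical; hashing the URL keeps the mapping
    stable, so the same URL always lands in the same shard.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % shard_count
    return f"raw/shard_{shard:02d}/{digest}.html"
```

Calling `collect_urls(page_html, page_url)` gives the list of URLs to queue for download, and `shard_path(url)` gives a place to write each downloaded page. Pages rendered by JavaScript would need a headless browser before parsing, which this sketch leaves out.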

15 Oct

News Hub facts you wanted to know about

The system runs on HCE-DC, a multipurpose, high-productivity, scalable, and extensible engine for internet data mining.

It consists of several sub-products and technologies of the HCE project:

  1. HCE-node – network transport cluster application
  2. Distributed Crawler (DC) service
  3. Distributed Tasks Manager (DTM) service
  4. Web administration management console
  5. Tools and libraries for crawling and scraping algorithms, with a REST API and bindings for Python and PHP development environments.

It provides flexible configuration, automated deployment, and easy integration with 3rd party data mining and analysis projects.

21 Oct

NewsHub introduces graphical top terms time-tracking tool

NewsHub, a leading news website, introduces a tool for analytics, review, and visual presentation of news articles

NewsHub has recently introduced a tool that helps readers track the news and presents it visually, making the news more engaging, more complete, and easier to comprehend.

NewsHub is known as a source of reliable, trending news and events from across the globe. Available in Chinese, US, Ukrainian, and German versions, the website ensures that no one is left out of what is happening around the world.

The new tool, called the Top Terms Time-tracking Tool, also known as the 4T, tracks news within particular time ranges. This improves the user experience: users can now look for news by time range without necessarily spending long hours searching, especially for events from long ago.

The tool not only reveals the timelines of top terms from different news articles, but also comes with a small chart widget. The widget displays a two-day time range and opens a full chart covering a seven-day period when it is clicked.

The 4T is well suited to building different types of charts and views for different time periods. It also helps with analytics and review of the news for different uses. Fast detection and visual representation are further strengths of the tool.

NewsHub is accessible to users across the globe, with news grouped into categories for ease of selection and review, and the addition of the Top Terms Time-tracking Tool makes the user experience better and more efficient.
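
The idea of tracking top terms over a time range can be sketched as follows. This is only an illustration of the concept, not the 4T's actual algorithm: the function name `top_terms_timeline`, the stop-word list, the simple letter-only tokenizer, and the `(date, text)` input format are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict
from datetime import date, timedelta

# A tiny, illustrative stop-word list (assumption, not the 4T's list).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "on", "for", "is"}


def top_terms_timeline(articles, end_day, days=2, top_n=3):
    """Count the most frequent terms per day over a window of days.

    articles: iterable of (date, text) pairs.
    Returns {term: [count on each day, oldest first]} for the
    `top_n` terms with the highest total count in the window.
    """
    # Days in the window, oldest first, ending at end_day.
    window = [end_day - timedelta(days=offset)
              for offset in range(days - 1, -1, -1)]

    per_day = defaultdict(Counter)
    for day, text in articles:
        if day in window:
            words = re.findall(r"[a-z]+", text.lower())
            per_day[day].update(w for w in words if w not in STOPWORDS)

    # Rank terms by their total count across the whole window.
    totals = Counter()
    for counts in per_day.values():
        totals.update(counts)
    top = [term for term, _ in totals.most_common(top_n)]

    return {term: [per_day[day][term] for day in window] for term in top}
```

With `days=2` the result corresponds to the range shown by the small widget, and with `days=7` to the range of the full chart; each list of counts could be passed to a charting library as one series.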

About NewsHub

NewsHub is arguably the largest news database and a source of the latest news and happenings across the globe. It crawls news from over 460 online sources all over the world using TagsReaper Scraping and extracts over 60,000 news pages daily. The platform allows users to generate, download, and share PDF digests by email.

22 Feb

News to Digest: NewsHub aggregation service now powered with Tags Reaper

The newest news aggregation service, NewsHub, provides users with the latest digests in PDF from more than 450 major sources all over the world.

NewsHub uses the TagsReaper.com crawling and scraping platform, which processes more than 60,000 pages daily and generates PDF digests, split by language and subject, ready to download.

NewsHub now not only delivers the freshest PDF digests to subscribers but also provides the largest and most recent database of news articles for search, selection, and analysis by external apps over an API.

03 Dec

Tags Reaper November Changelog

Demo form

  • The Visual Tag Picker tool supports the “Multi Item” or “Product” selection mode.
  • The Visual Tag Picker tool supports switching CSS, JavaScript, and Highlight on and off.
  • The demo form supports switching between the “Multi Item” and “Product” scraping modes.
  • Visual results view in a new tab, and the visual tab as the main results view in simple mode, including item enumeration, image visualization, error codes and summary, and automatic switching depending on format.
  • View tabs for the API request and response.

03 Nov

Tags Reaper October Changelog

Demo Form

  • Added a Templates library of 150 US news media websites for Template scraping;
  • Import and export of templates;
  • Redesign of the Demo form: simple and advanced modes, captcha “one click go”.

Administration Control Panel

  • Management of the Templates library for Template scraping in site configuration;
  • Import and export of templates;
  • Import and export of sites;
  • Full life cycle management of a site as a data source, including CRUD for sites and configuration of crawling and scraping settings, automated through a simple form-based user interface;
  • Full life cycle management of user accounts, including permission and role assignment;
  • Multi-host demo installation with a set of US sites crawled and scraped using News scraping algorithms. Tested throughput is 6-12K articles per day; the estimate is about 50K.


© 2015-2016 TagsReaper. All rights reserved.