03 Nov

Tags Reaper October Changelog

Demo Form

  • Added a Templates library of 150 US news media websites for Template scraping;
  • Import and export of templates;
  • Redesigned the Demo form: simple and advanced modes, “one click go” captcha.

Administration Control Panel

  • Management of the Templates library used by Template scraping for site configuration;
  • Import and export templates;
  • Import and export sites;
  • Full life-cycle management of a site as a data source, including CRUD for the site and configuration of its crawling and scraping settings, automated through a simple form-based user interface;
  • Full life-cycle management of user accounts, including assignment of permissions and roles;
  • Multi-host demo installation with a set of US sites crawled and scraped using the News scraping algorithms. Tested throughput is 6-12K articles per day; the estimated figure is about 50K.

DC service

  • Digest creation engine and utility with HTML templates, textual data refining and article filtering, with conversion to PDF and other structured document formats. It can be tried in manual mode as a demonstration, with samples of periodic digest creation for the set of US, JP and UA sites at a volume of about 1-2K articles per 6-12 hours. Digest samples include the Table of Contents (TOC), the Announces (a list of article titles, each with a small image and publication date), and the article bodies with a large image, author and publication date;
  • Improved detection of images and publication dates, both for top US news sites and in general;
  • Refining of scraped textual content for better presentation in digests, based on print conventions such as punctuation cleanup, filtering and removal of duplicates;
  • Extended logging and tracing for all modules, with configurable and flexible management of debugging and state snapshots;
  • Full profiling of all modules and algorithms, with speed and throughput optimizations.
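As a rough illustration of the digest layout described above (a TOC, then the announcements, then the article bodies), here is a minimal Python sketch. It is not the DC service's code: the `Article` record, its field names and the `build_digest` function are assumptions made for this example, and images are left out for brevity.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Article:
    # Hypothetical record; field names are assumptions, not the DC service schema.
    title: str
    author: str
    published: str
    body: str


def build_digest(articles):
    """Render a digest page: table of contents, announcements, article bodies."""
    # Table of contents: one link per article, pointing at its body anchor.
    toc = "\n".join(
        f'<li><a href="#a{i}">{escape(a.title)}</a></li>'
        for i, a in enumerate(articles)
    )
    # Announcements: title plus publication date.
    announcements = "\n".join(
        f"<p>{escape(a.title)} ({escape(a.published)})</p>" for a in articles
    )
    # Bodies: title, author, publication date and the text itself.
    bodies = "\n".join(
        f'<h2 id="a{i}">{escape(a.title)}</h2>\n'
        f"<p>{escape(a.author)}, {escape(a.published)}</p>\n"
        f"<div>{escape(a.body)}</div>"
        for i, a in enumerate(articles)
    )
    return (
        "<html><body>\n"
        f"<ol>\n{toc}\n</ol>\n{announcements}\n{bodies}\n"
        "</body></html>"
    )
```

Calling `build_digest` with a list of `Article` objects returns a single HTML string; all text fields are escaped so that scraped content cannot break the markup.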

 

New Features
  • The Templates library of 150 US news media sites for Template scraping.
  • Simple and advanced modes of the demo form.
  • A fourth scraping type: “Multi Item” or “Product”.

 

Administration console:

  • Management of the Templates library of 150 US news media websites;
  • Management of the Templates library used by Template scraping for site configuration;
  • Import and export of sites;
  • Data collectors with scheduled tasks that periodically collect data from the DC service under a set of conditions, group it, and create archives and digests. Digests are accessible over HTTP through the user’s subdomain on the TR site, with email notifications, digest subscriptions, direct access to digests in HTML format, and more.

 

Improvements

Demo form for the TR website:

  • VTP tool: fixes and additions for correct loading of several sites with dynamic content; improvements to the publication date detection algorithm; fixes for the visualization of selections and the markup of selected areas;
  • A set of additions and redesigns of the request form view: added simple and advanced modes and parameter settings, including tabs, borders, icons, background colors, the URLs area, buttons, Alaska’a style tuning, and more;
  • Redevelopment of several control elements with a more robust and intuitive look (templates list, settings, scrapers list, captcha and “Submit” button, results tabs, error visualization, URLs area).

Administration console:

  • Sites search, view, edit, update, statistics, logs, processing options configuration.
  • Resources search, view, statistics, logs.
  • Users view, edit.

DC service

  • Extended logging and tracing for all modules, with configurable and flexible management of debugging and state snapshots.
  • Improved detection of images and publication dates, both for top US news sites and in general.
  • Profiling of all modules and algorithms, with speed and throughput optimizations.
  • Speed optimization of robots.txt compliance checks.
  • Complete refactoring of the module structure to generalize it and give applications a common life-cycle design.
  • Added two usage modes for the algorithms module: process-based and import-based. Both are integrated into the regular and real-time APIs.
  • Design, structure and throughput optimization of the real-time API and batch processing.
  • Designed and deployed a nightly build environment for Debian 7 and 8. Configured and integrated Jenkins projects for the full nightly build cycle: installing the HCE product packages and their dependency packages, running the complete functional black-box tests, then stopping and uninstalling all products and dependencies.
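
The changelog does not say how the robots.txt checks were sped up. One common approach, shown here purely as an assumption, is to parse each host's robots.txt once and reuse the parsed rules for every later URL on that host. The `RobotsCache` class and the `fetch_robots_txt` callable are hypothetical names for this sketch; only `urllib.robotparser.RobotFileParser` and `urllib.parse.urlsplit` come from the Python standard library.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Keep one parsed robots.txt per host so repeated checks skip re-parsing."""

    def __init__(self, fetch_robots_txt):
        # fetch_robots_txt(host) -> str is a hypothetical callable supplied by
        # the caller; it returns the robots.txt body for the given host.
        self._fetch = fetch_robots_txt
        self._parsers = {}

    def allowed(self, url, user_agent="*"):
        host = urlsplit(url).netloc
        parser = self._parsers.get(host)
        if parser is None:
            # First URL seen for this host: fetch and parse the rules once.
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(user_agent, url)
```

A crawler would call `allowed(url)` before each request; a production version would also need cache expiry and handling for hosts whose robots.txt cannot be fetched.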