Request JSON structure for RSS

Example JSON for an RSS URL:

{
    "id": 1,
    "crawlerType": 4,
    "maxIterations": "1",
    "items": [
        {
            "siteId": "0",
            "urlContentResponse": null,
            "siteObj": {
                "fetchType": 1,
                "id": "0",
                "uDate": null,
                "tcDate": null,
                "cDate": null,
                "resources": null,
                "iterations": null,
                "description": null,
                "urls": [
                ],
                "filters": [
                    {
                        "pattern": "http:\/\/(.*)",
                        "siteId": "0",
                        "type": 1
                    }
                ],
                "properties": {
                },
                "state": null,
                "priority": null,
                "maxURLs": null,
                "maxResources": null,
                "maxErrors": null,
                "maxResourceSize": null,
                "requestDelay": null,
                "httpTimeout": null,
                "errorMask": null,
                "errors": null,
                "urlType": 0,
                "contents": null,
                "processingDelay": null,
                "size": null,
                "avgSpeed": null,
                "avgSpeedCounter": null,
                "userId": null,
                "recrawlPeriod": null,
                "recrawlDate": null,
                "maxURLsFromPage": 20,
                "collectedURLs": 0,
                "updateType": 1
            },
            "urlObj": {
                "status": 2,
                "linksI": 0,
                "linksE": 0,
                "contentMask": 0,
                "processingTime": 0,
                "CDate": null,
                "mRateCounter": 0,
                "httpTimeout": 10000,
                "size": 0,
                "urlPut": null,
                "batchId": 0,
                "lastModified": null,
                "tagsCount": 0,
                "mRate": 0,
                "charset": "",
                "state": 0,
                "httpCode": 0,
                "priority": 0,
                "maxURLsFromPage": 20,
                "processingDelay": 0,
                "crawlingTime": 0,
                "type": 0,
                "processed": 0,
                "totalTime": 0,
                "siteSelect": 0,
                "contentType": "",
                "pDate": null,
                "errorMask": 0,
                "httpMethod": "get",
                "eTag": "",
                "siteId": "0",
                "freq": 0,
                "tcDate": null,
                "rawContentMd5": "",
                "crawled": 0,
                "UDate": null,
                "contentURLMd5": "",
                "requestDelay": 0,
                "depth": 0,
                "parentMd5": "",
                "urlUpdate": null,
                "tagsMask": 0,
                "urlMd5": "b94bf7be5a252a52cf2223d8432b0cca",
                "url": "http:\/\/feeds.bbci.co.uk\/news\/world\/europe\/rss.xml"
            },
            "urlPutObj": {
                "putDict": {
                },
                "urlMd5": "b94bf7be5a252a52cf2223d8432b0cca",
                "contentType": 0,
                "siteId": "0",
                "fileStorageSuffix": null,
                "criterions": null
            },
            "properties": {
                "DB_TASK_MODE": "RO",
                "HTTP_REDIRECTS_MAX": 5,
                "HTML_REDIRECTS_MAX": 5,
                "HTML_RECOVER": "0",
                "ROBOTS_MODE": "0",
                "PROCESSOR_NAME": "FEED_PARSER",
                "PROCESS_CTYPES": "text\/html,text\/xml,application\/rss+xml",
                "template": {
                    "templates": [
                        {
                            "output_format": {
                                "type": "rss",
                                "name": "json",
                                "header": "[\n{",
                                "items_header": "",
                                "item": "\n\"%tag_name%\":\"%tag_value%\",\n\"%tag_name%_extractor\":\"%tag_name%_%extractor_value%\"",
                                "items_footer": "",
                                "footer": "\n}\n]\n"
                            },
                            "tags": [
                            ]
                        }
                    ]
                }
            },
            "urlId": "b94bf7be5a252a52cf2223d8432b0cca"
        }
    ]
}

The general parameters, which are the same for all types of request JSON, were considered here. Most fields remain unchanged, but there are some differences. Compared with the JSON for the news format, the main differences are two new properties and a few changed parameters.

items.siteObj.maxURLsFromPage and items.urlObj.maxURLsFromPage must have the same value: this is how many links from each level you want to process;
There are also two new properties:
properties.PROCESSOR_NAME and properties.PROCESS_CTYPES, used instead of properties.PROCESSOR_PROPERTIES. These properties are specific to the RSS type; leave them as in the example.
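
The two rules above can be checked before a request is sent. The following is a minimal Python sketch and not part of TagsReaper itself: the function name check_rss_item and the trimmed-down sample_item are hypothetical, and the code assumes the request has already been loaded into a dict laid out like the example.

```python
def check_rss_item(item):
    """Return a list of problems found in one entry of the request's "items" list."""
    problems = []

    # Both limits must carry the same value.
    site_limit = (item.get("siteObj") or {}).get("maxURLsFromPage")
    url_limit = (item.get("urlObj") or {}).get("maxURLsFromPage")
    if site_limit != url_limit:
        problems.append(
            f"maxURLsFromPage differs: siteObj={site_limit!r}, urlObj={url_limit!r}"
        )

    # The RSS-specific properties should be present.
    properties = item.get("properties") or {}
    for key in ("PROCESSOR_NAME", "PROCESS_CTYPES"):
        if key not in properties:
            problems.append(f"missing properties.{key}")

    return problems


# Hypothetical, trimmed-down item used only to exercise the check.
sample_item = {
    "siteObj": {"maxURLsFromPage": 20},
    "urlObj": {"maxURLsFromPage": 20},
    "properties": {
        "PROCESSOR_NAME": "FEED_PARSER",
        "PROCESS_CTYPES": "text/html,text/xml,application/rss+xml",
    },
}
print(check_rss_item(sample_item))
```

An empty list means the item satisfies both rules; each string in a non-empty list names one thing to fix.
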
The output_format also changes slightly, because the returned fields are different:

json:
"output_format": {
                                "type": "news",
                                "name": "json",
                                "header": "[\n",
                                "items_header": "",
                                "item": "{\n\"pubdate\":\"%pubdate%\",\n\"title\":\"%title%\",\n\"description\":\"%description%\",\n\"media\":\"%media%\",\n\"author\":\"%author%\",\n\"dc_date\":\"%dc_date%\",\n\"link\":\"%link%\",\n\"keywords\":\"%keywords%\",\n\"content_encoded\":\"%content_encoded%\",\n\"html_lang\":\"%html_lang%\",\n\"pubdate_extractor\":\"%pubdate_extractor%\",\n\"title_extractor\":\"%title_extractor%\",\n\"description_extractor\":\"%description_extractor%\",\n\"media_extractor\":\"%media_extractor%\",\n\"author_extractor\":\"%author_extractor%\",\n\"dc_date_extractor\":\"%dc_date_extractor%\",\n\"link_extractor\":\"%link_extractor%\",\n\"keywords_extractor\":\"%keywords_extractor%\",\n\"content_encoded_extractor\":\"%content_encoded_extractor%\",\n\"html_lang_extractor\":\"%html_lang_extractor%\",\n\"crawler_time\":\"%crawler_time%\",\n\"scraper_time\":\"%scraper_time%\",\n\"errors_mask\":\"%errors_mask%\"\n}\n",
                                "items_footer": "",
                                "footer": "]\n"
 
                           }
html:
"output_format": {
                                "type": "rss",
                                "name": "html",
                                "header": "<!DOCTYPE html><head><title>Title<\/title><meta http-equiv=\"content-type\" content=\"text\/html; charset=UTF-8\"><\/head><body>\n",
                                "items_header": " <table>\n",
                                "item": "<tr><td>%tag_name%:<\/td><td>%tag_value%<\/td><\/tr>\n<tr><td>%tag_name%_extractor:<\/td><td>%tag_name%_%extractor_value%<\/td><\/tr>\n",
                                "items_footer": " <\/table>\n",
                                "footer": "<\/body><\/html>\n"
                            }
csv:
"output_format": {
                                "type": "rss",
                                "name": "csv",
                                "header": "",
                                "items_header": "",
                                "item": "\"%tag_value%\",\"%tag_name%_%extractor_value%\",",
                                "items_footer": "\n",
                                "footer": ""
                            }
text:
"output_format": {
                                "type": "rss",
                                "name": "text",
                                "header": "",
                                "items_header": "",
                                "item": "%tag_name%: %tag_value%\n%tag_name%_extractor: %tag_name%_%extractor_value%\n",
                                "items_footer": "",
                                "footer": ""
                            }
SQL:
"output_format": {
                                "type": "rss",
                                "name": "sql",
                                "header": "INSERT INTO my_table (my_tag, my_tag_extractor) VALUES \n",
                                "items_header": "",
                                "item": "(\"%tag_value%\",\"%tag_name%_%extractor_value%\"),",
                                "items_footer": "",
                                "footer": ";\n"
                            }
xml:
"output_format": {
                                "type": "rss",
                                "name": "xml",
                                "header": "<?xml version=\"1.0\"?>\n<response>\n",
                                "items_header": "  <item>\n",
                                "item": "    <%tag_name%><![CDATA[%tag_value%]]><\/%tag_name%>\n    <%tag_name%_extractor>%tag_name%_%extractor_value%<\/%tag_name%_extractor>\n",
                                "items_footer": "  <\/item>\n",
                                "footer": "<\/response>"
                            }
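
To make the role of the five template fields easier to see, here is a rough Python sketch of how an output_format could be filled in. It is an illustration only: the function render is hypothetical, and it assumes that header and footer are emitted once, items_header and items_footer once per record, and item once per tag, with plain placeholder substitution. The actual renderer may join or escape values differently. The record data and the extractor value "1" are made up.

```python
def render(output_format, records):
    """Assemble output text from an output_format dict.

    records is a list of dicts mapping tag name -> (tag value, extractor value).
    ASSUMPTION: header/footer are emitted once, items_header/items_footer once
    per record, and "item" once per tag; the real renderer may behave differently.
    """
    parts = [output_format["header"]]
    for record in records:
        parts.append(output_format["items_header"])
        for tag_name, (tag_value, extractor_value) in record.items():
            line = output_format["item"]
            line = line.replace("%tag_name%", tag_name)
            line = line.replace("%tag_value%", tag_value)
            line = line.replace("%extractor_value%", extractor_value)
            parts.append(line)
        parts.append(output_format["items_footer"])
    parts.append(output_format["footer"])
    return "".join(parts)


# The "text" output_format from the listing above.
text_format = {
    "type": "rss",
    "name": "text",
    "header": "",
    "items_header": "",
    "item": "%tag_name%: %tag_value%\n%tag_name%_extractor: %tag_name%_%extractor_value%\n",
    "items_footer": "",
    "footer": "",
}

# Hypothetical record: one tag with a made-up value and extractor id.
records = [{"title": ("Example headline", "1")}]
print(render(text_format, records))
```
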

To change the depth, as in the JSON for the news type, change the following parameters:
maxIterations and items.depth define the depth for the crawler and must have the same value; for example, for depth 2 both should be '2', and so on;
items.siteObj.filters.subject is the depth for the filter; for deeper crawling it must have the same value as maxIterations and items.depth.
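
The depth rule can be verified with a small helper. This is a hedged Python sketch: the function names are hypothetical, the paths are taken literally from the text above (maxIterations, items.depth, items.siteObj.filters.subject), and values are compared as strings because the example request stores maxIterations as a string. If your request keeps the depth elsewhere, adjust the paths accordingly.

```python
def depth_settings(request):
    """Collect every depth-related value named in the text, together with its path."""
    found = [("maxIterations", request.get("maxIterations"))]
    for i, item in enumerate(request.get("items", [])):
        found.append((f"items[{i}].depth", item.get("depth")))
        filters = (item.get("siteObj") or {}).get("filters") or []
        for j, flt in enumerate(filters):
            found.append((f"items[{i}].siteObj.filters[{j}].subject", flt.get("subject")))
    return found


def depth_mismatches(request):
    """Return the paths whose value is missing or differs from maxIterations."""
    settings = depth_settings(request)
    top = settings[0][1]
    if top is None:
        return ["maxIterations"]
    # Compare as strings so that "2" and 2 count as the same depth.
    return [path for path, value in settings[1:] if value is None or str(value) != str(top)]


# Hypothetical request fragment for depth 2 with a filter left at depth 1.
request = {
    "maxIterations": "2",
    "items": [
        {
            "depth": 2,
            "siteObj": {"filters": [{"pattern": "http://(.*)", "subject": "1"}]},
        }
    ],
}
print(depth_mismatches(request))
```
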

© 2015-2016 TagsReaper. All rights reserved.