Channel: Why is XMLFeedSpider failing to iterate through the designated nodes? - Stack Overflow

Why is XMLFeedSpider failing to iterate through the designated nodes?


I'm trying to parse through PLoS's RSS feed to pick up new publications. The RSS feed is located here.

Below is my spider:

    from scrapy.contrib.spiders import XMLFeedSpider


    class PLoSSpider(XMLFeedSpider):
        name = "plos"
        itertag = 'entry'
        allowed_domains = ["plosone.org"]
        start_urls = [
            ('http://www.plosone.org/article/feed/search'
             '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
        ]

        def parse_node(self, response, node):
            pass

This configuration produces the following log output (note the exception):

    $ scrapy crawl plos
    2015-02-06 00:19:08+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: plos)
    2015-02-06 00:19:08+0100 [scrapy] INFO: Optional features available: ssl, http11, boto
    2015-02-06 00:19:08+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'plos.spiders', 'SPIDER_MODULES': ['plos.spiders'], 'BOT_NAME': 'plos'}
    2015-02-06 00:19:08+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2015-02-06 00:19:08+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-02-06 00:19:08+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-02-06 00:19:08+0100 [scrapy] INFO: Enabled item pipelines:
    2015-02-06 00:19:08+0100 [plos] INFO: Spider opened
    2015-02-06 00:19:08+0100 [plos] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-02-06 00:19:08+0100 [scrapy] DEBUG: Telnet console listening on
    00:19:08+0100 [scrapy] DEBUG: Web service listening on
    00:19:09+0100 [plos] DEBUG: Crawled (200) <GET http://www.plosone.org/article/feed/search?unformattedQuery=*%3A*&sort=Date%2C+newest+first> (referer: None)
    2015-02-06 00:19:09+0100 [plos] ERROR: Spider error processing <GET http://www.plosone.org/article/feed/search?unformattedQuery=*%3A*&sort=Date%2C+newest+first>
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 638, in _tick
            taskObj._oneWorkUnit()
          File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
            result = next(self._iterator)
          File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 57, in <genexpr>
            work = (callable(elem, *args, **named) for elem in iterable)
        --- <exception caught here> ---
          File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 96, in iter_errback
            yield next(it)
          File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
            for x in result:
          File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
            return (_set_referer(r) for r in result or ())
          File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
            return (r for r in result or () if _filter(r))
          File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
            return (r for r in result or () if _filter(r))
          File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spiders/feed.py", line 61, in parse_nodes
            for selector in nodes:
          File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spiders/feed.py", line 87, in _iternodes
            for node in xmliter(response, self.itertag):
          File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/iterators.py", line 31, in xmliter
            yield Selector(text=nodetext, type='xml').xpath('//'+ nodename)[0]
        exceptions.IndexError: list index out of range
    2015-02-06 00:19:09+0100 [plos] INFO: Closing spider (finished)
    2015-02-06 00:19:09+0100 [plos] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 282,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 7590,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 2, 5, 23, 19, 9, 379574),
         'log_count/DEBUG': 3,
         'log_count/ERROR': 1,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'spider_exceptions/IndexError': 1,
         'start_time': datetime.datetime(2015, 2, 5, 23, 19, 8, 834428)}
    2015-02-06 00:19:09+0100 [plos] INFO: Spider closed (finished)

Changing itertag = "entry" to itertag = "//entry" removes the exception, but no items are scraped. I also tried using scrapy.log.msg to log a message from within parse_node, but nothing appears and no nodes are reported as having been scraped.

What am I doing wrong?


Following alecxe's advice, here is a spider with namespaces defined. The documentation is a bit skimpy, so I'm still not sure why my logging calls aren't showing up...

    from scrapy import log
    from scrapy.contrib.spiders import XMLFeedSpider


    class PLoSSpider(XMLFeedSpider):
        name = "plos"
        allowed_domains = ["plosone.org"]
        namespaces = [
            ('plos',
                ('http://www.plosone.org/article/feed/search'
                 '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
            )
        ]
        itertag = 'plos:entry'

    def parse_node(self, response, node):
        log.msg('*** PING ***')

And here is the output:

    $ scrapy crawl plos
    2015-02-06 18:33:01+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: plos)
    2015-02-06 18:33:01+0100 [scrapy] INFO: Optional features available: ssl, http11, boto
    2015-02-06 18:33:01+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'plos.spiders', 'SPIDER_MODULES': ['plos.spiders'], 'BOT_NAME': 'plos'}
    2015-02-06 18:33:01+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2015-02-06 18:33:02+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-02-06 18:33:02+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-02-06 18:33:02+0100 [scrapy] INFO: Enabled item pipelines:
    2015-02-06 18:33:02+0100 [plos] INFO: Spider opened
    2015-02-06 18:33:02+0100 [plos] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-02-06 18:33:02+0100 [scrapy] DEBUG: Telnet console listening on
    18:33:02+0100 [scrapy] DEBUG: Web service listening on
    18:33:02+0100 [plos] INFO: Closing spider (finished)
    2015-02-06 18:33:02+0100 [plos] INFO: Dumping Scrapy stats:
        {'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 2, 6, 17, 33, 2, 65414),
         'log_count/DEBUG': 2,
         'log_count/INFO': 7,
         'start_time': datetime.datetime(2015, 2, 6, 17, 33, 2, 60311)}
    2015-02-06 18:33:02+0100 [plos] INFO: Spider closed (finished)

It should further be noted that running scrapy shell "http://www.plosone.org/article/feed/search?unformattedQuery=*%3A*&sort=Date%2C+newest+first" followed by response.xpath('//entry') produces an empty list ([]). Yet, if you look at the raw XML data, you can see the <entry> tags plain as day. I'm at a complete loss, here...
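
For what it's worth, the empty result can be reproduced outside Scrapy. The sketch below is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than Scrapy's Selector, and SAMPLE_FEED is a hypothetical stand-in that assumes the real feed is an Atom document whose root declares the default namespace http://www.w3.org/2005/Atom (not confirmed here). Under that assumption, an unprefixed lookup finds nothing, and so does a prefix mapped to the feed URL as in my second spider; only a prefix mapped to the declared namespace URI finds the entries.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the feed. Assumption: the real feed is Atom and
# declares this default namespace on its root element.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry><title>First article</title></entry>
  <entry><title>Second article</title></entry>
</feed>"""

ATOM_NS = "http://www.w3.org/2005/Atom"  # assumed namespace URI
FEED_URL = ("http://www.plosone.org/article/feed/search"
            "?unformattedQuery=*%3A*&sort=Date%2C+newest+first")

root = ET.fromstring(SAMPLE_FEED)

# 1. Unprefixed name: only matches elements that are in no namespace.
unprefixed = root.findall(".//entry")

# 2. Prefix mapped to the feed URL (as in the second spider): the URI does
#    not match the one declared in the document, so nothing is found.
wrong_uri = root.findall(".//plos:entry", {"plos": FEED_URL})

# 3. Prefix mapped to the namespace URI the document actually declares.
right_uri = root.findall(".//atom:entry", {"atom": ATOM_NS})
titles = [e.find("atom:title", {"atom": ATOM_NS}).text for e in right_uri]

print(len(unprefixed), len(wrong_uri), len(right_uri))
print(titles)
```

If the assumption about the feed holds, this would suggest that the prefix in namespaces has to be paired with the namespace URI from the document itself, not with the URL the feed is fetched from. I have not verified that against the real feed or against Scrapy 0.24.4.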
