python - Scrapy unable to extract some data from website

- July 15, 2011

I am using Scrapy to crawl a page and am able to extract simple things such as the visible text. However, there are some texts that are not visible to the crawler, and they end up showing as blank spaces.

For instance, viewing the page source lets me see these fields:

https://www.dropbox.com/s/f056mffmuah6uu4/screenshot%202015-07-23%2018.23.32.png?dl=0

I've tried numerous times to access these fields through XPath and CSS selectors, but I have not been able to get them after any attempt.

When I try something like:

response.xpath('//text()').extract()

I am not able to see these fields in the text dump at all.

Would anyone have an idea why these fields are not visible to Scrapy? The website is: https://www.buzzbuzzhome.com/uc/units/houses/sapphire

In your spider, you need to make an additional XHR POST request to the https://www.buzzbuzzhome.com/bbhajax/development/unitpricehistory endpoint to get the price history, providing the necessary headers and POST parameters:

import json

import scrapy


class BuzzSpider(scrapy.Spider):
    name = 'buzzbuzzhome'
    allowed_domains = ['buzzbuzzhome.com']
    start_urls = ['https://www.buzzbuzzhome.com/uc/units/houses/sapphire']

    def parse(self, response):
        # Values the AJAX endpoint expects, taken from the initial page.
        # Site-specific names (element id, JSON keys, endpoint path) are
        # reproduced as given and may be case-sensitive on the live site.
        unit_id = response.xpath("//div[@id = 'unitdetails']/@data-unit-id").extract()[0]
        development_url = "uc"
        new_relic_id = response.xpath("//script[contains(., 'xpid')]").re(r'xpid:"(.*?)"')

        # Replay the XHR the browser makes to load the price history.
        params = {"developmenturl": development_url, "unitid": unit_id}
        yield scrapy.Request("https://www.buzzbuzzhome.com/bbhajax/development/unitpricehistory",
                             method="POST",
                             body=json.dumps(params),
                             callback=self.parse_history,
                             headers={
                                 "Accept": "*/*",
                                 "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36",
                                 "X-Requested-With": "XMLHttpRequest",
                                 "X-NewRelic-ID": new_relic_id,
                                 "Origin": "https://www.buzzbuzzhome.com",
                                 "Host": "www.buzzbuzzhome.com",
                                 "Content-Type": "application/json; charset=utf-8"
                             })

    def parse_history(self, response):
        # Each row of the returned fragment holds a date (title) and an event (text).
        for row in response.css("div.row"):
            title = row.xpath(".//div[@class='content-title']/text()").extract()[0].strip()
            text = row.xpath(".//div[@class='content-text']/text()").extract()[0].strip()

            print(title, text)

This prints:

05/25/2015 Unit listed as sold
12/18/2014 Unit listed for sale
11/24/2014 Unit price increased by 1.54% to $461,990
11/04/2014 Unit price increased by 6.81% to $454,990
10/02/2014 Unit price increased by 4.67% to $425,990
01/22/2014 Unit price increased by 2.52% to $406,990
12/06/2013 Unit listed for sale at $396,990
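If you want structured records rather than printed strings, one possible follow-up is to split each line into a date, a description and an optional price. The sketch below is only an illustration: the helper name `parse_history_line`, the regular expression and the sample lines are assumptions modelled on the output above, not part of the original spider or of the website's API.

```python
import re
from datetime import datetime

# Sample lines modelled on the printed output (title and text joined by a space).
SAMPLE_LINES = [
    "11/24/2014 Unit price increased by 1.54% to $461,990",
    "12/18/2014 Unit listed for sale",
    "12/06/2013 Unit listed for sale at $396,990",
]

# Assumed line shape: a leading MM/DD/YYYY date, free text,
# and an optional trailing dollar amount such as "$461,990".
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.*?)(?:\s+\$([\d,]+))?$")


def parse_history_line(line):
    """Split one history line into a date, a description and an optional price."""
    match = LINE_RE.match(line.strip())
    if match is None:
        # The line does not look like a price-history entry.
        return None
    date_text, description, price_text = match.groups()
    return {
        "date": datetime.strptime(date_text, "%m/%d/%Y").date(),
        "description": description,
        # Strip thousands separators before converting; None when no price is present.
        "price": int(price_text.replace(",", "")) if price_text else None,
    }


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(parse_history_line(line))
```

In a spider you would more likely yield such a dictionary from `parse_history` instead of printing, building it directly from `title` and `text` rather than from the joined string.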
