Scrape the web speedily, reliably, and simply with Scrapy

Asheesh Laroia (Eventbrite)
Python
Location: D136

Extracting data from the web is often error-prone, hard to test, and slow. Scrapy changes all of that.

In this talk, we consider two different types of web data retrieval – one that scrapes data out of HTML, and another that uses a RESTful API – and show how both can be improved by Scrapy.

Part I: Scraping without Scrapy

  • Web pages render into DOM nodes
  • Demonstrate a basic way to scrape a page: urllib2.urlopen() + lxml.html
  • Send the data somewhere with a synchronous call (a minimal sketch follows this list)
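
A minimal sketch of this no-framework approach (the URL, the CSS selector, and the collection endpoint are illustrative, not from the talk):

    import urllib
    import urllib2

    import lxml.html

    # Fetch the page with a blocking call and parse it into a DOM-like tree.
    html = urllib2.urlopen('http://example.com/talks/').read()
    doc = lxml.html.fromstring(html)

    # Scrape the bits we care about out of the HTML.
    titles = [el.text_content().strip()
              for el in doc.cssselect('h2.talk-title')]

    # "Send the data somewhere": another blocking call, this time a POST
    # to a hypothetical collection endpoint.
    urllib2.urlopen('http://example.com/api/titles',
                    data=urllib.urlencode({'titles': '\n'.join(titles)}))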

Part II: Importing Scrapy components for programmer sanity

  • Using scrapy.items.Item to define what you are scraping out
  • Using scrapy.spider.BaseSpider to clarify the code
  • Running spiders: You just got async for free
  • Discussion: What does async buy you? Quick benchmarks of 200 simultaneous connections with Scrapy and without.
  • Sending data out through the pipeline (an Item, a spider, and a pipeline are sketched after this list)
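
Roughly, the same scrape rebuilt from Scrapy's building blocks might look like this (a sketch against the BaseSpider-era API the outline refers to; module paths varied slightly between early releases, and the item fields, spider name, and URLs are illustrative):

    import json

    from scrapy.item import Item, Field
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector


    class TalkItem(Item):
        # Declaring fields up front makes explicit what you scrape out.
        title = Field()
        url = Field()


    class TalkSpider(BaseSpider):
        name = 'talks'
        start_urls = ['http://example.com/talks/']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for link in hxs.select('//h2[@class="talk-title"]/a'):
                item = TalkItem()
                item['title'] = link.select('text()').extract()[0]
                item['url'] = link.select('@href').extract()[0]
                # Yielded items flow into the pipeline; yielded Requests
                # would be scheduled and downloaded asynchronously.
                yield item


    class SendDataPipeline(object):
        # "Sending data out": Scrapy calls process_item() for every item
        # the spiders yield; this example just appends each one to a file.
        def process_item(self, item, spider):
            with open('talks.jsonl', 'a') as out:
                out.write(json.dumps(dict(item)) + '\n')
            return item

Scrapy's crawler then handles scheduling and downloading concurrently, which is where the "async for free" comes from.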

Part III: Everyone “loves” JavaScript

  • SpiderMonkey with Scrapy
  • Automating an entire Firefox instance with Selenium RC (a rough sketch follows)
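
With Selenium RC, the rough shape (using the RC-era Python client; the host, port, and URLs are placeholders) is to drive a real Firefox, let it run the page's JavaScript, and hand the rendered HTML back to your parsing code:

    from selenium import selenium  # the old Selenium RC client

    # Assumes a Selenium RC server is already running on localhost:4444.
    browser = selenium('localhost', 4444, '*firefox', 'http://example.com/')
    browser.start()
    browser.open('/page-built-by-javascript')
    browser.wait_for_page_to_load('30000')  # timeout in milliseconds

    # The DOM as Firefox sees it, after JavaScript has run; feed this to
    # lxml.html or to a Scrapy selector as if it were the raw page.
    rendered_html = browser.get_html_source()
    browser.stop()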

Part IV: Automated testing when using Scrapy

  • Why testing is hard with synchronous scrapers
  • How to run scrapy.spider.BaseSpiders from Python unittest
  • How to test offline by keeping a copy of the needed pages (sketched after this list)
  • (No synchronous calls, so tests run fast)
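
One way this can look (the fixture path, project layout, and spider are illustrative): build a scrapy.http.HtmlResponse from a saved copy of the page and call the spider's parse() callback on it directly, with no network in the loop.

    import unittest

    from scrapy.http import HtmlResponse

    from myproject.spiders.talks import TalkSpider  # hypothetical project layout


    class TalkSpiderTest(unittest.TestCase):
        def test_parse_extracts_titles(self):
            # Replay a page saved to disk instead of hitting the network.
            body = open('tests/fixtures/talks.html').read()
            response = HtmlResponse(url='http://example.com/talks/', body=body)

            items = list(TalkSpider().parse(response))

            self.assertTrue(items, 'expected at least one scraped item')
            self.assertTrue(items[0]['title'])


    if __name__ == '__main__':
        unittest.main()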

Part V: Improving a Wikipedia API client with Scrapy

  • Start with a synchronous API client
  • When the web service is down, watch it crash
  • Make it a “scrapy.spider”, and get automatic retry on failure
  • Configure the request scheduler so it doesn't hammer Wikipedia (example settings below)
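
Politeness and retries are mostly a matter of settings. An illustrative fragment (these are standard Scrapy setting names, though exact names and defaults have varied across versions, and the values are examples only):

    # settings.py (fragment)
    BOT_NAME = 'wikiclient'
    USER_AGENT = 'wikiclient (your-contact-address@example.com)'  # identify yourself

    # Don't hammer Wikipedia: space out requests and limit concurrency.
    DOWNLOAD_DELAY = 1.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 2

    # Retry transient failures instead of crashing the client.
    RETRY_TIMES = 3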

Conclusion: Asheesh’s rules for sane scraping

  • Separate downloading from parsing.
  • Maintain high test coverage.
  • Be explicit about what data you pass from the wild, wild web into your application code.
  • Coding with Scrapy gives you all of these, unlike other scraping libraries.
  • When Scrapy isn’t appropriate:
      • For short scripts, the verbose API can feel like a serious burden.
      • When you really want exceptions raised on failure.

Even if you use something else, you will love Scrapy’s documentation on scraping in general.


Asheesh Laroia

Eventbrite

Asheesh loves growing camaraderie among geeks. He chaired the Johns Hopkins Association for Computing Machinery and taught Python classes at Noisebridge, San Francisco’s hackerspace. He realizes that most of the work that makes projects successful is hidden underneath the surface.

He has volunteered his technical skills for the UN in Uganda, the EFF, and Students for Free Culture, and is a Developer in Debian. Until recently, he engineered software and scalability at Creative Commons in San Francisco; today, he works at OpenHatch as its project lead.
