Peter Bengtsson

Screenscraping by CSS

By: Peter Bengtsson, 6th of April 2009

A lot of developers know how to use an XML parser and manipulate a DOM tree. Some even know how to use BeautifulSoup to parse broken HTML into a tree they can search. But how many know how ridiculously easy it is to search that tree with CSS selectors? Using lxml.html and lxml.cssselect, here's a snippet that downloads www.fry-it.com and prints the text of every news link in six lines:

 >>> from lxml.html import parse
 >>> from lxml.cssselect import CSSSelector
 >>> selector = CSSSelector('div.newsitem h4 a')
 >>> page = parse('http://www.fry-it.com/').getroot()
 >>> for a in selector(page):
 ...     print repr(a.text)
 ... 
 'Fry-It is Gold Sponsor for Plone 2008 Conference'
 'MerbOutpost'
 u'Fry-IT aims to save Hampshire/Surrey CAMHS \xa310k per year'
 'Saint Gobain choose Fry-IT for custom development'
 'Great Ormond Street Hospital and ICH choose Fry-IT'
 'FlashVideo 0.8 released'
 'GE subsiduary chooses Fry-IT product'
 'London Deanery health libraries site'
 'Disability Now Managed Website'
 'United Nations development network created'

See how magically simple it is? It even gets the Unicode character right for £, and since calling the selector returns a plain list of elements, it's really easy to work with in a simple loop. Once the "masses" catch on to how simple this is, I think/hope we'll see more innovative and clever uses of screen-scraping.

I had to use this recently in a project where the HTML sometimes had to be downloaded from a URL and sometimes was provided directly as a big HTML string. Here's how I handled the string case:

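 >>> from StringIO import StringIO
 >>> from lxml import etree
 >>> parser = etree.HTMLParser()
 >>> # html holds the markup; encode unicode to a UTF-8 byte string first
 >>> if isinstance(html, unicode):
 ...     html = html.encode('utf8')
 ... 
 >>> page = etree.parse(StringIO(html), parser).getroot()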




Comment

Jan - 6th April 2009
Wow, very useful! I'll make sure to use this next time I need to parse stuff out of web pages in my scripts.

Zada - 23rd May 2009
Cool, very interesting stuff.
 


