10 March 2011

110% Easier Web Scraping With The Yahoo Query Language Library

I've spent the last month evaluating web scraping frameworks for Python and have found a few really good ones. The problem is that they demand too much time and effort from people who simply want to do some quick-and-dirty scraping for small personal projects.

Today, while surfing DZone, I found a library that I think suits the needs of most script writers much better: Yahoo Query Language (YQL).

"The Yahoo! Query Language is an expressive SQL-like language that lets you query, filter, and join data across Web services. With YQL, apps run faster with fewer lines of code and a smaller network footprint." -from the official site

In addition to the SQL-esque syntax for querying pages and services, you can use XPath selectors in your query statements, which makes it even easier to grab the data you want. And as always, you can use Firebug to retrieve a page element's XPath.
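To make that a little more concrete, here is a minimal sketch of how a query combining the SQL-like syntax with an XPath selector might be put together. The helper name `build_yql_url` is hypothetical, and the endpoint address and the `html` table with its `url` and `xpath` keys are my assumptions about the public YQL web service rather than anything taken from this post. The function only builds the request URL; it does not make a network call.

```python
from urllib.parse import urlencode

# Assumption: the public YQL REST endpoint (not stated in this post).
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(page_url, xpath, fmt="json"):
    """Return a request URL asking YQL for the nodes matching `xpath` on `page_url`.

    Hypothetical helper for illustration. It does no escaping, so `page_url`
    must not contain double quotes and `xpath` must not contain single quotes.
    """
    # Assumption: the `html` table takes a `url` key and an `xpath` key.
    query = "select * from html where url=\"{0}\" and xpath='{1}'".format(page_url, xpath)
    # urlencode percent-encodes the statement so it is safe in a query string.
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


if __name__ == "__main__":
    # The XPath here is the kind of expression you would copy out of Firebug.
    print(build_yql_url("http://example.com/", "//h1"))
```

Keeping the URL construction separate from the HTTP request makes it easy to paste the same statement into the YQL web console first and only wire up the fetching once the selector returns what you expect.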

You can get started working with YQL in several ways:
  • Use the YQL web console
  • Import yql into the Python REPL and have a go at it

Here are some links to get you started in developing web scraping utilities using YQL:

I'll most likely post some code samples later today to show off what I've been up to this morning with YQL.

3 comments:

  1. Scraping websites is usually pretty boring and annoying, but for some reason it always comes back. The Yahoo Query Language is an SQL-like language that helps us query, filter, and join data across Web services. Thanks a lot...

    Web Harvesting

  2. Thanks for sharing, might be useful in the future. Do you do web testing with tools like Selenium?

  3. I've only done web testing using Perl with WWW::Mechanize and LWP::UserAgent. Selenium looks like a good tool to add to my toolbox. Thanks!
