IronPython and Linq to XML Part 2: Screen Scraping

First, I need to convert the HTML list of Rock Band songs into a machine readable format. That means doing a little screen scraping. Originally, I used Beautiful Soup but I found that UnicodeDammit got confused on names like Blue Öyster Cult and Mötley Crüe. I’m guessing it’s broken because IronPython doesn’t have non-unicode strings.

Instead, I used SgmlReader to provide an XmlReader interface over the HTML, then queried that data via Linq to XML. I used the version of SgmlReader from MindTouch since they include a compiled binary and it seems to be the only active maintained version. I wrapped it all up in a function called load that loads HTML from either disk or the network (based on the URI scheme) into an XDocument.

def loadStream(streamreader):
  from System.Xml.Linq import XDocument
  from Sgml import SgmlReader

  reader = SgmlReader()
  reader.DocType = "HTML"
  reader.InputStream = streamreader
  return XDocument.Load(reader)

def load(url):
  from System import Uri
  from System.IO import StreamReader

  if isinstance(url, str):
    url = Uri(url)

  if url.Scheme == "file":
    from System.IO import File
    with File.OpenRead(url.LocalPath) as fs:
      with StreamReader(fs) as sr:
        return loadStream(sr)
  else:
    from System.Net import WebClient
    wc = WebClient()
    with wc.OpenRead(url) as ns:
      with StreamReader(ns) as sr:
        return loadStream(sr)

def parse(text):
  from System.IO import StringReader
  return loadStream(StringReader(text))

I call load, passing in the URL to the list of songs. The “official” Rock Band song page loads the actual content from a different page via AJAX, so I just load the actual list directly via my load function.

Once the HTML is loaded as an XDocument, I need a way to find the specific HTML nodes I was looking for. As I said earlier, XDocument uses Linq to XML – there is not other API for querying the XML tree. In the HTML, there’s a div tag with the id “content” that contains all the song rows as table row elements. I built a simple function that uses the LINQ Single method to find the tag by it’s id attribute value.

def FindById(node, id):
  def CheckId(n):
    a = n.Attribute('id')
    return a != None and a.Value == id

  return linq.Single(node.Descendants(), CheckId)

(Side note – I didn’t like the verbosity of the a != None and a.Value == id line of code, but XAttributes are not comparable by value. That is, I can’t write node.Attribute('id') == XAttribute('id', id). And writing ``node.Attribute('id').Value == id11 only works if every node has an id attribute. Not making XAttribute comparable by value seems like a strange design choice to me.)

LINQ to objects works just fine from IronPython, with a few caveats. First, IronPython doesn’t have extension methods, so you can’t chain calls together sequentially like you can in C#. So instead of collection.Where(…).Select(…), you have to write Select(Where(collection, …), …). Second, all the LINQ methods are generic, so you have to use the verbose list syntax (for example: Single[object] or Select[object,object]). Since Python doesn’t care about the generic types, I wrote a bunch of simple helper functions around the common LINQ methods that just use object as the generic type. Here are a few examples:

def Single(col, fun):
  return Enumerable.Single[object](col, Func[object, bool](fun))

def Where(col, fun):
  return Enumerable.Where[object](col, Func[object, bool](fun))

def Select(col, fun):
  return Enumerable.Select[object, object](col, Func[object, object](fun))

Once I have the content node, all the songs are in tr nodes beneath it. I wrote a function called ScrapeSong that transforms a song tr node into a Song object (which I’ll talk about in the next installment of this series). I use LINQ methods Select, OrderBy and ThenBy to provide me an enumeration of Song objects, ordered by date added (descending) than artist name.

def ScrapeSong(node):
  tds = list(node.Elements(xhtml.ns+'td'))
  anchor = list(tds[0].Elements(xhtml.ns+'a'))[0]

  title = anchor.Value
  url = anchor.Attribute('href').Value
  artist = tds[1].Value
  year = tds[2].Value
  genre = tds[3].Value
  difficulty = tds[4].Value
  _type = tds[5].Value
  added = DateTime.Parse(tds[6].Value)

  return Song(title, artist, added, url, year, genre, difficulty, _type)

songs = ThenBy(OrderByDesc(
          Select(content.Elements(xhtml.ns +'tr'), ScrapeSong),
          lambda s: s.added), lambda s: s.artist)

And that’s pretty much it. Next, I’ll iterate thru the list of songs and get the details I need from Zune’s catalog web services in order to write out a playlist that the Zune software will understand.

Comments:

import System, sys import linq linqs = {} for name in dir(linq): if not name.startswith('__'): linqs[name] = getattr(linq, name) class IpyLinq: def __init__(self, col): self.col = col def __iter__(self): return iter(self.col) def __str__(self): return '[%s]' % ', '.join( (str(v) for v in self) ) def __repr__(self): return str(self) def __getattr__(self, name): def decorator(*arg, **kws): self.col = linqFunc(self.col, *arg, **kws) return self linqFunc = linqs[name] return decorator if __name__ == '__main__': for x in IpyLinq([1, 2, 3, 4, 5]).Where(lambda x: x > 1).Where(lambda x: x < 5): print x

This comment was posted by Ada on December 04, 2008 at 2:48 AM.

My original implementation has a bug. class IpyLinq: def __init__(self, col): self.col = col def __iter__(self): return iter(self.col) def __str__(self): return '[%s]' % ', '.join( (str(v) for v in self) ) def __repr__(self): return str(self) def __getattr__(self, name): def decorator(*arg, **kws): result = linqs[name](self.col, *arg, **kws) if hasattr(result, '__iter__'): return IpyLinq(result) else: return result return decorator

This comment was posted by Ada on December 04, 2008 at 3:29 AM.

Series

Disclaimer

The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my opinion. Inappropriate comments will be deleted at the authors discretion.

IronPython and Linq to XML Part 2: Screen Scraping

Comments:

Categories

Tags

Series

Disclaimer