Passion * Technology * Ruthless Competence

Wednesday, November 26, 2008

IronPython and Linq to XML Part 2: Screen Scraping

First, I need to convert the HTML list of Rock Band songs into a machine-readable format. That means doing a little screen scraping. Originally, I used Beautiful Soup, but I found that UnicodeDammit got confused by names like Blue Öyster Cult and Mötley Crüe. I’m guessing it’s broken because IronPython doesn’t have non-unicode strings.

Instead, I used SgmlReader to provide an XmlReader interface over the HTML, then queried that data via Linq to XML. I used the version of SgmlReader from MindTouch since they include a compiled binary and it seems to be the only actively maintained version. I wrapped it all up in a function called load that loads HTML from either disk or the network (based on the URI scheme) into an XDocument.

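# 'with' needs this import on Python 2.5-level runtimes such as IronPython 2.0
from __future__ import with_statement

def loadStream(streamreader):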
  from System.Xml.Linq import XDocument
  from Sgml import SgmlReader
  
  reader = SgmlReader()
  reader.DocType = "HTML"
  reader.InputStream = streamreader
  return XDocument.Load(reader)
  
def load(url):
  from System import Uri
  from System.IO import StreamReader
  
  if isinstance(url, str):
    url = Uri(url)
  
  if url.Scheme == "file":
    from System.IO import File
    with File.OpenRead(url.LocalPath) as fs:
      with StreamReader(fs) as sr:
        return loadStream(sr)
  else:
    from System.Net import WebClient
    wc = WebClient()
    with wc.OpenRead(url) as ns:
      with StreamReader(ns) as sr:
        return loadStream(sr)

def parse(text):
  from System.IO import StringReader
  return loadStream(StringReader(text))

I call load, passing in the URL to the list of songs. The “official” Rock Band song page loads the actual content from a different page via AJAX, so I just load the actual list directly via my load function.

Once the HTML is loaded as an XDocument, I need a way to find the specific HTML nodes I’m looking for. As I said earlier, XDocument uses Linq to XML – there is no other API for querying the XML tree. In the HTML, there’s a div tag with the id “content” that contains all the song rows as table row elements. I built a simple function that uses the LINQ Single method to find the tag by its id attribute value.

def FindById(node, id):
  def CheckId(n):
    a = n.Attribute('id')
    return a != None and a.Value == id
  
  return linq.Single(node.Descendants(), CheckId)

(Side note – I didn’t like the verbosity of the “a != None and a.Value == id” line of code, but XAttributes are not comparable by value. That is, I can’t write “node.Attribute(‘id’) == XAttribute(‘id’, id)”. And writing “node.Attribute(‘id’).Value == id” only works if every node has an id attribute. Not making XAttribute comparable by value seems like a strange design choice to me.)
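One way to keep that null check out of the predicate is a small helper that returns the attribute's value or a default. The sketch below is only an illustration: attr_value and has_id are hypothetical names, and FakeElement and FakeAttribute are stand-ins that mimic just the Attribute(name) and Value members so the snippet runs without .NET. With a real XElement, only the two helper functions would be needed.

```python
class FakeAttribute(object):
  """Stand-in for XAttribute; only the Value member is modelled."""
  def __init__(self, value):
    self.Value = value

class FakeElement(object):
  """Stand-in for XElement; Attribute() returns None for a missing name."""
  def __init__(self, **attrs):
    self._attrs = dict((k, FakeAttribute(v)) for k, v in attrs.items())

  def Attribute(self, name):
    return self._attrs.get(name)

def attr_value(node, name, default=None):
  """Return the named attribute's Value, or default if the node lacks it."""
  a = node.Attribute(name)
  if a is None:
    return default
  return a.Value

def has_id(node, id):
  # Hypothetical one-line replacement for the body of CheckId.
  return attr_value(node, 'id') == id

print(has_id(FakeElement(id='content'), 'content'))
print(has_id(FakeElement(), 'content'))
```

The predicate itself shrinks to a single comparison, and nodes without an id attribute simply fail the match instead of raising.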

LINQ to Objects works just fine from IronPython, with a few caveats. First, IronPython doesn’t have extension methods, so you can’t chain calls together sequentially like you can in C#. So instead of collection.Where(…).Select(…), you have to write Select(Where(collection, …), …). Second, all the LINQ methods are generic, so you have to spell out the type arguments in the verbose square-bracket syntax (for example: Single[object] or Select[object,object]). Since Python doesn’t care about the generic types, I wrote a bunch of simple helper functions around the common LINQ methods that just use object as the generic type. Here are a few examples:

def Single(col, fun):
  return Enumerable.Single[object](col, Func[object, bool](fun))
  
def Where(col, fun):
  return Enumerable.Where[object](col, Func[object, bool](fun))
  
def Select(col, fun):
  return Enumerable.Select[object, object](col, Func[object, object](fun))
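
To see how the nested call style reads without a .NET runtime at hand, here is a rough pure-Python sketch of the same three helpers. These are stand-ins built on generators, not the Enumerable methods: in particular, Single here raises ValueError when there is not exactly one match, where the .NET method throws InvalidOperationException.

```python
def Where(col, fun):
  """Lazily yield the items for which fun(item) is true."""
  return (item for item in col if fun(item))

def Select(col, fun):
  """Lazily yield fun(item) for every item."""
  return (fun(item) for item in col)

def Single(col, fun):
  """Return the only matching item; raise if there are zero or several."""
  matches = [item for item in col if fun(item)]
  if len(matches) != 1:
    raise ValueError("expected exactly one match, found %d" % len(matches))
  return matches[0]

numbers = [1, 2, 3, 4, 5]

# The innermost call runs first, so this reads inside-out:
# filter the even numbers, then square them.
evens_squared = list(Select(Where(numbers, lambda n: n % 2 == 0),
                            lambda n: n * n))
only_three = Single(numbers, lambda n: n == 3)

print(evens_squared)
print(only_three)
```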

Once I have the content node, all the songs are in tr nodes beneath it. I wrote a function called ScrapeSong that transforms a song tr node into a Song object (which I’ll talk about in the next installment of this series). I use LINQ methods Select, OrderBy and ThenBy to provide me an enumeration of Song objects, ordered by date added (descending) then artist name.

def ScrapeSong(node):    
  tds = list(node.Elements(xhtml.ns+'td'))   
  anchor = list(tds[0].Elements(xhtml.ns+'a'))[0]   
     
  title = anchor.Value   
  url = anchor.Attribute('href').Value   
  artist = tds[1].Value   
  year = tds[2].Value   
  genre = tds[3].Value   
  difficulty = tds[4].Value   
  _type = tds[5].Value   
  added = DateTime.Parse(tds[6].Value)   
     
  return Song(title, artist, added, url, year, genre, difficulty, _type)   

songs = ThenBy(OrderByDesc(  
          Select(content.Elements(xhtml.ns +'tr'), ScrapeSong),   
          lambda s: s.added), lambda s: s.artist)
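
For comparison, the same ordering (newest first, then artist name) can be sketched in plain Python with two stable sorts. Everything here is illustrative: the Song tuple is a hypothetical three-field stand-in for the real Song class, order_songs is a made-up name, and the rows are placeholder data rather than actual Rock Band songs.

```python
from collections import namedtuple
from datetime import date

# Hypothetical, trimmed-down stand-in for the Song class.
Song = namedtuple('Song', ['title', 'artist', 'added'])

def order_songs(songs):
  """Order by date added (descending), then by artist (ascending)."""
  # Sort on the secondary key first; sorted() is stable, so the second
  # pass keeps artists in order among songs added on the same date.
  by_artist = sorted(songs, key=lambda s: s.artist)
  return sorted(by_artist, key=lambda s: s.added, reverse=True)

# Placeholder rows, not real catalog data.
songs = [
  Song('T1', 'Beta', date(2008, 11, 18)),
  Song('T2', 'Alpha', date(2008, 11, 25)),
  Song('T3', 'Alpha', date(2008, 11, 18)),
  Song('T4', 'Gamma', date(2008, 11, 25)),
]

ordered_titles = [s.title for s in order_songs(songs)]
print(ordered_titles)
```

The secondary key is sorted first because the last sort applied decides the primary order, which is the reverse of how OrderBy and ThenBy are written.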

And that’s pretty much it. Next, I’ll iterate through the list of songs and get the details I need from Zune’s catalog web services in order to write out a playlist that the Zune software will understand.

Posted By Harry Pierson at 5:16 PM Pacific Standard Time
IronPython | LINQ | Rock Band | XML | Zune
Thursday, December 04, 2008 2:48:01 AM (Pacific Standard Time, UTC-08:00)
import System, sys
import linq

linqs = {}
for name in dir(linq):
  if not name.startswith('__'):
    linqs[name] = getattr(linq, name)

class IpyLinq:
  def __init__(self, col):
    self.col = col

  def __iter__(self):
    return iter(self.col)

  def __str__(self):
    return '[%s]' % ', '.join( (str(v) for v in self) )

  def __repr__(self):
    return str(self)

  def __getattr__(self, name):
    def decorator(*arg, **kws):
      self.col = linqFunc(self.col, *arg, **kws)
      return self

    linqFunc = linqs[name]
    return decorator

if __name__ == '__main__':
  for x in IpyLinq([1, 2, 3, 4, 5]).Where(lambda x: x > 1).Where(lambda x: x < 5):
    print x
Thursday, December 04, 2008 3:29:41 AM (Pacific Standard Time, UTC-08:00)
My original implementation has a bug.

class IpyLinq:
  def __init__(self, col):
    self.col = col

  def __iter__(self):
    return iter(self.col)

  def __str__(self):
    return '[%s]' % ', '.join( (str(v) for v in self) )

  def __repr__(self):
    return str(self)

  def __getattr__(self, name):
    def decorator(*arg, **kws):
      result = linqs[name](self.col, *arg, **kws)
      if hasattr(result, '__iter__'):
        return IpyLinq(result)
      else:
        return result
    return decorator
Ada