Deserializing XML with IronPython

Now that I can stream process XML, the next logical step is to deserialize it into some type of object graph. As I said in my last post, there are at least three different DOM-esque options on the .NET platform as well as two in the Python library (xml.dom and xml.minidom)

However, anyone who’s ever programmed against the DOM knows just what a major PITA it is.

Instead, you could deserialize the XML into a custom object tree, based on the nodes in the XML stream. In .NET, there are at least two libraries for doing this: the old-school XmlSerializer as well as the new-fangled DataContractSerializer. In these libraries, the PITA comes in defining the static types with all the various custom attribute adornments you need to tell the deserializer how to do it’s job. Actually, if you’re defining your code first, all those adornments aren’t that big a deal. However, if you’re starting from the XML, especially XML with lots of different namespaces – like say my RSS feed – defining a static type for this gets old fast.

Of course, if you’re not using a statically typed language… 😉

One of the cool aspects of dynamic languages is the ability to easily generate new types on the fly. In Python, you can create a new type by calling the type function. Here’s an example of creating a new type for a XML node:

def create_type(node, parent):  
  return type(node.name, (parent,), {'xmlns':node.namespace})

Since I’m working with XML, I wanted to make sure I handled namespaces. Thus, I add the namespace to the class definition (the third parameter in the type function above). This lets me walk up to any arbitrary object created from an XML element and check it’s namespace.

I used this dynamic type creation functionality in my xml2py module, which I added to my IronPython SkyDrive folder. It leverages ipypulldom, so make sure you get both. The heart of the module is the xml2py function, which recursively iterates thru the node stream and builds the tree. Attributes and child elements become named attributes on the object, so I can write code that looks like this:

import xml2py  
rss = xml2py.parse('http://feeds.feedburner.com/Devhawk')  
for item in rss.channel.item:  
  print item.title

You see? No screwing around with childNodes or getAttribute here.

The basic processing loop of xml2py creates a new instance of a new type when it encounters a start element tag. It then collects all the attributes and children of that element, and adds them as attributes on the element object, using the name of the type as name of the attribute. If there are multiple children with the same type name, xml2py converts that attribute to a list of values. For example, in an RSS feed, there will be likely be many rss.channel.item elements. In xml2py, the item attribute of the channel object will be a list of item objects.

Since attributes and child elements are getting slotted together, I added a _nodetype attribute on each so I can later tell (if I care) if the value was originally an attribute or element. I haven’t written py2xml yet, but that might be important then.

I do one optimization for simple string elements like <foo>bar</foo>. In this case, I create a type that inherits from string (hence the need for the parent parameter in the create_type function above) and contains the string text. It still has the xmlns and _nodetype attributes, so I could write item.title.xmlns (which is empty since RSS is in the default namespace) or item.title._nodetype (which would be XmlNodeType.Element)

It’s not much code – about 100 lines of code split evenly between the xml2py function and the _type_factory object. Given that you usually see the same element in an XML stream over an over, I didn’t want to create multiple types for the same element. So _type_factory caches types in a dictionary so I can reuse them. One of the cool things is that it’s a callable type (i.e. it implements __call__ so I can use the instance like a function. I started by defining a xtype function that didn’t cache anything, but then later switched xtype to be a _type_factory instance, but none of my code that called xtype had to change!

One other quick note. If you put xml2py.py and ipypylldom.py in a folder, you can experiment with them by launching ipy -i xml2py. This runs xml2py.py as a script, but dumps you into the interactive console when you’re thru. It will run the little snippet of code above which runs xml2py on my FeedBurner feed, but then you can play around with the rss object and see what it contains. Be sure to check out the xmlns attribute for each object in the rss.channel.link list.

Stream Processing XML in IronPython

When it comes to processing XML, there are two basic approaches – load it all into memory at once or process it a node at a time. In the .NET world where I have spent most of the past ten years, those two models are represented by XmlDocument and XmlReader. There are alternatives to XmlDocument, such as XDocument and XPathDocument, but you get the idea.

Out in non-MSFT land, the same two basic models exist, however the de facto standard for stream based processing is SAX, the Simple API for XML. SAX is supported by many languages, including Python.

Personally, I’ve never been a fan of SAX’s event-driven approach. Pushing events makes total sense for a human driven UI, but I never understood why anyone thought that was a good idea for stream processing XML. I like XmlReader’s pull model much better. When you’re ready for the next node, just call Read() – no mucking about setting content handlers or handling node processing events.

Luckily, the Python standard library supports both approaches. It provides both a SAX based parser as well as a pull based parser called pulldom. Pulldom doc’s are fairly sparse, but Paul Prescod wrote a nice introduction. Here’s an example from Paul’s site (slightly modified):

from xml.dom import pulldom
nodes = pulldom.parse( "file.xml" )  
for (event,node) in nodes:  
    if event=="START_ELEMENT" and node.tagName=="table":  
        nodes.expandNode( node )

Actually, I like this better than XmlReader, since it provides the nodes in a list-like construct that appeals to the functional programmer in me. I’d like it even more if Python had a native pattern matching syntax – you know, like F# – but you can get similar results by chaining together conditionals with elif.

However, IronPython doesn’t support any of the XML parsing modules from Python’s standard library. They’re all based on a C-based python module called pyexpat which IronPython can’t load. ¹ I wanted a pulldom type model, so I decided to wrap XmlReader to provide a similar API and lets me write code like this:

import ipypulldom  
nodes = ipypulldom.parse( "sample.xml" )
for node in nodes:
  if node.nodeType==XmlNodeType.Element:
    print node.xname

There are a few differences from pulldom, but it’s basically the same model. I’m using the native .NET type XmlNodeType rather than a string to indicate the node type. Furthermore, I made the node type a property of the node, rather than a separate variable. I also didn’t implement expandNode, though doing so would be a fairly straightforward combination of XmlReader.ReadSubtree and XmlDocument.Load.

I stuck the code for ipypulldom up in a new folder on my Skydrive: IronPython Stuff. It’s fairly short – only about 45 lines of code. Feel free to use it if you need it.

The FePy project has a .NET port of pyexpat as part of their distribution, so I assume that lets you use the standard pulldom implementation in IPy. FePy looks really cool but I haven’t had time to dig into it yet.↩

Rare Insight Into the IPy Team

An email conversation between two members of the IPy team that you, dear reader, might find amusing.

From:Srivatsn Narayanan
Sent: Friday, May 02, 2008 4:00 PM
Subject: You dont need pinky fingers? 😄

http://twitter.com/DevHawk/statuses/802239164

From:Dino Viehland
Sent: Friday, May 02, 2008 11:30 PM

Wow, am I really that loud? 😄

This is a response to the idea that no design is complete until you can’t remove anything else without sacrificing the overall goals of the end result. Is the pinky really necessary? We can “cut that feature”, right? 😄

From: Srivatsn Narayanan
Sent: Friday, May 02, 2008 11:49 PM

If you cut off the pinkie, then the ring finger becomes the pinkie and so iteratively you will have to lose all of them 😃. If I remember right in Europe, you hold up the pinkie to order beer. So it’s quite important 😄

From: Dino Viehland
Sent: Friday, May 02, 2008 11:59 PM

Well I find it hard to argue with ordering beer, but I doubt the pinkie’s original use was to order beer. Therefore I’d describe this as an abuse of an existing API that might be better served by a new API specifically designed for this purpose. But I question even if a new API is necessary.

Drinking beer by definition involves a container in which the beer is consumed from. Therefore the container it’s self (or the lack thereof) might be able to serve the dual role for both consuming the beer and indicating the need for a refill. For example when drinking sake I’ve been told it is customary to tip the bottle on its side to indicate the desire for a refill. Are we really to expect that the Human API exposes every need in a direct fashion? For example should I have another appendage for “I need more bread sticks” and yet another for “Can I get a refill or on the water?”. I’d propose the “2 creams no sugar” appendage because that’s the way I take my coffee.

On the other hand the iterative problem does seem to be more relevant – but it presupposes a purpose for the pinky in the first place! But alas, I admit I’ve just returned from drinking beer, so this may not be the best argument 😃.

From: Srivatsn Narayanan
Sent: Saturday, May 03, 2008 12:21 AM

The pinkie is like a reserved field/custom field in a API. Every user community can come up with its own use for it but they have to devise the protocol themselves. Apparently different civilizations have. Wikipedia as usual has a list of those. So do you get your “2 creams no sugar” appendage – sure if u can agree on the semantics with the waiter. Seems like a PM call to me 😄

PS: What beer did u have? I should try that – seems effective 😄

From: Dino Viehland
Sent: Saturday, May 03, 2008 12:26 AM

I hadn’t quite considered it like this but I’d propose an alternate encoding – binary. With 4 fingers one can communicate up to 16 different which seems to cover most of the listed issues. Yes 5 gives you up to 32 messages but 16 should be enough for everyone 😃.

And they were various Hale’s Ales fresh from the brewery.

From: Srivatsn Narayanan
Sent: Saturday, May 03, 2008 12:31 AM

I wouldn’t want to hold up 0010 really (although it won’t be the middle finger anymore) 😄

From: Dino Viehland
Sent: Saturday, May 03, 2008 12:40 AM

Ahh, but let’s just call that one reserved – that leaves the end-user with 15 other messages! Personally I’d suggest we reserve the entire bit – that way we could use other codes like 1010 for future expansion of various standard messages. Urination seems popular on the pinky list so maybe we’d want to standardize that in the future – we of course need to look at uses of the other fingers before we come to this conclusion. Users are now left with 8 user-defined messages but that seems plenty. Also they can then use 1000 or 0001 (depending their endianness, a.k.a. left-handed or right-handed) for the now deprecated pinky-messages.

From: Srivatsn Narayanan
Sent: Saturday, May 03, 2008 12:55 AM

1000, 1100, 1110 and 1111 are actually used for counting numbers 1,2,3,4 😄 They are too obvious to be non-standardized. So they are out as well.

That cuts the user messages to 5 (third bit is reserved anyway). Some of the ones left are hard to make as well – try 0101 😄

IronPython 2.0 Beta 2

We pushed out the latest beta of IronPython 2.0 this morning. From the release notes:

We’re pleased to announce the release of IronPython 2.0 Beta 2. In addition to the usual bug fixes (~25 reported on CodePlex and ~50 reported internally), this release has been partially focused on improving the performance of IronPython, in particular startup perf. Another focus of this release was improving upon our traceback support which had regressed quite a bit in 2.0B1 and had largely been broken in the 2.0 Alphas. Our traceback support should now be superior to that of IronPython 1.1!

We’ve also made a minor change to our packaging by adding a Microsoft.Scripting.Core.dll in addition to the Microsoft.Scripting.dll that’s been around since the start of 2.0. We are doing this purely as an architectural layering cleanup. Microsoft.Scripting.Core contains DLR features that are essential to building dynamic languages. Microsoft.Scripting will contain language implementation helpers that can either be re-used (e.g., BigInts) or copied (possibly e.g., the default binder). This process is all about our work to get the DLR architecture right and shouldn’t have any noticeable IronPython impact except that there’s now one more DLL to include in any package.

As a consequence of the new DLL, the deprecated file IronPython2005.sln is broken. This is the last release that will include this .sln file in the source zip file. Of course the Visual Studio 2008 version of this file, IronPython.sln, still builds.

We’d like to thank everyone in the community who reported these: kevgu, oldman, christmas, brucec, scottw, fuzzyman, haibo, Seo Sanghyeon, grizlupo, J. Merrill, perhaps, antont, 05031972, Jason Ferrara, Matt Beckius, and Davy Mitchell.

The full release notes have details about the bugs we fixed. Congrats to the team and thanks again to the community members for their assistance.

Importing Static Methods with IPy

Like .NET, Python uses namespaces to avoid name collisions. However, the semantics are a bit different. If you want to use a type or function from a a given namespace in Python, you have to import it into your current scope. For example, if you want to use the Python datetime built-in module, you would import it into the current scope and use it like this:

import datetime
bush_last_day = datetime.date(2009,1,20)

Notice that when I import a Python module this way, it’s scoped into it’s namespace, which forces me to use the entire namespace scoped name to access the type. Of course, that gets tedious quickly, so Python provides a way to import a type from a specific namespace into your current scope like this:

from datetime import date
bush_last_day = date(2009,1,20)

With IronPython, you can do import .NET namespaces as well. Here’s that same code using the standard .NET DateTime class.

from System import DateTime
bush_last_day = DateTime(2009,1,20)

What I didn’t know is that you can import static methods & properties from .NET types into the current scope using the same syntax. Here’s an example:

from System.DateTime import Now

if Now >= bush_last_day:
    print 'celebrate'
else:
    print (bush_last_day - Now).Days, 'days left'

Being able to import a static method into the current scope is pretty convenient. Thanks to my teammate Jimmy for cluing me into this IPy feature.

One caveat though: in Python, you can import an entire namespace into your current scope. You can do that with .NET namespaces, but not with .NET types

from datetime import *         # this works

from System import *           # so does this

from System.DateTime import *  # this doesn’t work

Update: Michael Foord pointed out that if you import Now as I describe above, it places a DateTime object representing the time you imported it into local scope, rather than placing the underlying get_Now static method in local scope. So while DateTime.Now always returns a new value, Now never changes. Sounds like an IPy bug to me, but I’ll have to circle back with the team to be sure.

Series

Disclaimer

The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my opinion. Inappropriate comments will be deleted at the authors discretion.