Stream Processing XML in IronPython

When it comes to processing XML, there are two basic approaches – load it all into memory at once or process it a node at a time. In the .NET world where I have spent most of the past ten years, those two models are represented by XmlDocument and XmlReader. There are alternatives to XmlDocument, such as XDocument and XPathDocument, but you get the idea.

Out in non-MSFT land, the same two basic models exist; however, the de facto standard for stream-based processing is SAX, the Simple API for XML. SAX is supported by many languages, including Python.

Personally, I’ve never been a fan of SAX’s event-driven approach. Pushing events makes total sense for a human driven UI, but I never understood why anyone thought that was a good idea for stream processing XML. I like XmlReader’s pull model much better. When you’re ready for the next node, just call Read() – no mucking about setting content handlers or handling node processing events.
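To make the push model concrete, here is a minimal sketch of the SAX style using Python's standard xml.sax module. The TableCounter class and count_tables helper are hypothetical names made up for illustration; the point is only that the parser calls you, rather than you asking it for the next node.

```python
import xml.sax


class TableCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts <table> elements as events are pushed."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.tables = 0

    def startElement(self, name, attrs):
        # The parser invokes this callback; we never request a node ourselves.
        if name == "table":
            self.tables += 1


def count_tables(xml_bytes):
    """Parse a bytes document and return how many <table> elements it contains."""
    handler = TableCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.tables
```

Calling count_tables(b"<doc><table/><p/><table></table></doc>") should return 2. Note that all the state has to live on the handler object, which is exactly the bookkeeping the pull model avoids.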

Luckily, the Python standard library supports both approaches. It provides a SAX-based parser as well as a pull-based parser called pulldom. The pulldom docs are fairly sparse, but Paul Prescod wrote a nice introduction. Here’s an example from Paul’s site (slightly modified):

from xml.dom import pulldom
nodes = pulldom.parse( "file.xml" )  
for (event,node) in nodes:  
    if event=="START_ELEMENT" and node.tagName=="table":  
        nodes.expandNode( node )

Actually, I like this better than XmlReader, since it provides the nodes in a list-like construct that appeals to the functional programmer in me. I’d like it even more if Python had a native pattern matching syntax – you know, like F# – but you can get similar results by chaining together conditionals with elif.
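As a rough illustration of the elif chaining, here is a small sketch that dispatches on the (event, node) pairs coming out of pulldom. The summarize function and the labels it produces are hypothetical; only the pulldom calls and event constants come from the standard library.

```python
from xml.dom import pulldom


def summarize(xml_text):
    """Walk the event stream and classify each item with an if/elif chain."""
    summary = []
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT and node.tagName == "table":
            summary.append("table start")
        elif event == pulldom.START_ELEMENT:
            summary.append("element " + node.tagName)
        elif event == pulldom.CHARACTERS and node.data.strip():
            summary.append("text " + node.data.strip())
        # Everything else (end tags, document events, whitespace) is ignored.
    return summary
```

The order of the branches matters, just as the order of cases does in a pattern match: the more specific "table" test has to come before the general START_ELEMENT test.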

However, IronPython doesn’t support any of the XML parsing modules from Python’s standard library. They’re all based on a C-based Python module called pyexpat, which IronPython can’t load. [1] I wanted a pulldom-type model, so I decided to wrap XmlReader in a similar API that lets me write code like this:

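# XmlNodeType lives in System.Xml, so reference that assembly first
import clr
clr.AddReference("System.Xml")
from System.Xml import XmlNodeType

import ipypulldom
nodes = ipypulldom.parse( "sample.xml" )
for node in nodes:
  if node.nodeType==XmlNodeType.Element:
    print node.xname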

There are a few differences from pulldom, but it’s basically the same model. I’m using the native .NET type XmlNodeType rather than a string to indicate the node type. Furthermore, I made the node type a property of the node, rather than a separate variable. I also didn’t implement expandNode, though doing so would be a fairly straightforward combination of XmlReader.ReadSubtree and XmlDocument.Load.

I stuck the code for ipypulldom up in a new folder on my Skydrive: IronPython Stuff. It’s fairly short – only about 45 lines of code. Feel free to use it if you need it.


  1. The FePy project has a .NET port of pyexpat as part of their distribution, so I assume that lets you use the standard pulldom implementation in IPy. FePy looks really cool but I haven’t had time to dig into it yet.

Morning Coffee 150

  • Yesterday was the NHL trading deadline, and the Capitals were very busy. They obtained Huet from Montreal, Fedorov from Columbus and Cooke from Vancouver. Given they are fighting just to make the playoffs, going for three soon-to-be unrestricted free agents seems like an odd choice. However, the consensus (among my parents anyway) was that it’s critical to get this very young Caps team some playoff experience. Even if all three walk at season’s end, it’ll be worth it if the Caps make a playoff run. Besides, it’s not like we gave up much: an extra second round pick in ’09, a 19 year old defensive prospect (who was apparently 14th on the depth chart) and an underachieving winger.
  • Speaking of the Caps playoff chances, they are currently one and a half games back of the division leading Hurricanes and two games behind the current eighth seed Flyers. Yes, I rank hockey teams using baseball’s standings system. Otherwise, you have to talk about games in hand (i.e. the Caps are five points behind Carolina with two games in hand).
  • The Writers Guild ratified the new contract, so Hollywood labor strife is now officially behind us. At least until July when the actors may go on strike.
  • It seems like a slow week for Microsoft geek news, which is odd since WS08, VS08 and SQL08 all launch today. I’m guessing it’s the calm before the Mix storm next week.
  • After going dark for six months, Linq to XSD has been re-released to work with the RTM version of VS08. Scott Hanselman demonstrates Linq to XSD by applying it to OFX, an XML Schema he calls “goofy” but apparently helped develop. OFX uses derivation by restriction, which has no direct equivalent in C#, but Linq to XSD is able to translate between XML and objects without losing any of that type fidelity. Nice to know Linq to XSD can tolerate OFX’s level of goofiness, though I’m guessing most people use much more straightforward schemas.
  • Speaking of Linq, I discovered LINQPad via a comment on Rob Conery’s blog (which I found via DNK). It’s basically a code snippet IDE for C# 3.0 and VB9, and it also has built-in database connection support, so it can fulfil much the same role as SQL Management Studio. I only played with it for a few minutes, but I was really impressed. This is definitely going in my utilities folder. I wonder if they’re interested in supporting F#?
  • Not sure how I missed this, but you can get MSDN Magazine via the same Syndicated Client Experience as Architecture Journal. Unlike AJ, which is divided into issues, the MSDN Magazine client is divided into topics, which is harder to square with the physical magazine. On the other hand, since MSDN Mag has been around longer, perhaps topics + search is a better discovery mechanism.
  • Soma announces the Visual Studio Gallery, a repository of VS Extensions. It’s kinda cool, but the whole discovery mechanism is clunky. I might like to experiment with some free or even free trial products, but there’s no way to filter on cost so finding them is a hassle. Also, there’s no way for community members to vote, rate or comment on the products in any way.
  • Nick Malik can’t answer the question “how does Enterprise Architecture demonstrate value?” I could be snarky and say “it doesn’t”, but that’s only half the answer. It doesn’t, but it should. My opinion, since you asked Nick, is that EA fails to deliver value because it tries to control the uncontrollable. Trying to gain efficiency thru establishing standards and eliminating overlap via reuse are pipe dreams, though literally millions of $$$ have been poured into those sink-holes. There are a few areas where centrally funded infrastructure projects can solve big problems that individual projects can’t effectively tackle on their own. EA should focus their time there, where they can actually make a difference. Otherwise, they should stay out of the projects’ way.

Morning Coffee 146

  • The writers strike is officially over. Everyone goes back to work today. Thomas Cleaver has what I thought was the best post summarizing how the writers won. TV Guide has a rundown of how and when various shows will resume. I can’t wait to see Daily Show and Colbert Report tonight. Lost – aka the best show on TV – looks like it will be getting five more episodes (in addition to the eight shot before the strike).
  • Speaking of TV, Battlestar Galactica Fans: circle April 4th on your calendar.
  • Obama won all three “Potomac Primaries” yesterday, and is now the Democratic front-runner, though there’s a long way to go before the convention. Scott Adams of Dilbert fame has a great take on presidential experience – I’m guessing he’s an Obama fan.
  • In minor acquisition news, Microsoft is acquiring Caligari, makers of 3D modeling tool trueSpace. The Caligari folks are joining the Virtual Earth team, though I wonder what the XNA folks think of the acquisition. This isn’t the first 3D modeling product Microsoft ever acquired – we owned Softimage for four years in the ’90s.
  • Scott Hanselman and Tomas Restrepo both write about PowerShellPlus, which I saw the week before last @ Lang.NET. Scott really likes it, for both PS novices and gurus, but Tomas thinks the UI is busy, based on the screenshots. Personally, I’m not doing much PS work lately – occasional one off stuff, but that’s it – so it doesn’t seem worth the effort.
  • Speaking of Scott & Tomas, Scott also has a nice gallery of VS themes. I’m partial to Tomas’ Ragnarok Grey. Is there a VSThemesGallery.com site somewhere?
  • Still speaking of Scott, he points to the new ASP.NET Developer Wiki (beta). I poked around, but didn’t find anything shiny. I was very surprised that searching for “MVC” returned no results.
  • Speaking of MVC, Scott Guthrie has a rundown on what’s coming in the MIX preview release of ASP.NET MVC. Biggest news IMO is that it’s /bin deployable – i.e. you don’t need your hoster to do anything special to support MVC (assuming they already support ASP.NET 3.5). Also big news, they’re releasing the source so you can build and patch (and enhance?) it yourself.
  • Chris Tavares continues his ObjectBuilder series and Tomas continues his DLR Notes series. BTW, my F# based DLR experimentation continues, albeit slowly (frakking day job). Hope to be able to post on this soon.
  • One of the things driving my interest in F# is manycore. An interesting tangent to manycore is general purpose programming on graphics processing units (aka GPGPU). MS Research just released a new version of Accelerator, just such a GPGPU system. I personally haven’t played with it – I’ve been focused on writing parsers, not parallel code.
  • Is XQuery really “a promising technology of the future” as Don Box suggests? I see exactly zero demand or use for it in my day-to-day work. Of course, Don’s paid to build future platform goo, so maybe it is promising and Don’s afore-mentioned goo will leverage it, though I remain skeptical. As for XML being “Done like a well-cooked steak”, I’d say XML is like a great steak cooked perfectly, except it’s done exactly how you don’t like it. You can appreciate its quality, but you don’t really enjoy it as much as you could have.

Morning Coffee 116

“Looks like I picked the wrong week to stop sniffing glue”
Steve McCroskey, Airplane!

  • So it’s been a while since my last post. Just over a month, not including The F5 High, which wasn’t “original IP”. Frankly, I just stopped reading pretty much cold turkey. I wanted and needed to go heads down on day job stuff for a while. Since I haven’t been reading, Morning Coffee is going to be a little cold while I ramp back up.
  • The new NHL season is upon us, and the Caps are looking good so far. Obviously, they have the new uniforms, but they’re also out to a 2-0 start for the first time in five years. And in those two games, they’ve only allowed one goal and are 100% on the PK. It’s nice to see them start strong, but obviously there’s a long way to go. Here’s hoping they can stay strong all season.
  • Speaking of staying strong, the wheels that were rattling last week came off the Trojan bandwagon completely this week. I’m not sure it’s as big an upset as Appalachian State beating Michigan but it’s close. What happened to the team that scored 5 TD’s in a row on Nebraska?
  • Big news last week is that MSFT is going to release the source code to much of the .NET Framework. Scott Guthrie has the details. Frankly, between Rotor & Reflector, it wasn’t like you couldn’t see the source code anyway, so this seems like a no-brainer. But integrating it directly into the VS Debugging experience, that’s frakking brilliant.
  • I haven’t had a chance to install the new XML Schema Designer (Aug 07 CTP), but I was really impressed with this video. The XML Team blog has more details. However, I’m not sure what the ship vehicle is. The CTP installs on top of VS08 beta 2, but in the video they keep saying “a future version” of VS, implying that it’s not going to be in VS08.
  • Dare is spending some time investigating SSB. I think it’s interesting that some of the REST crowd are starting to see the need for durable messaging. Dare argues that the features and usage models are more important than wire protocol. As long as it’s standardized, I don’t care that much about the protocol. Several of the REST folks mentioned AMQP. I’ve got nothing against AMQP technically (frankly, I haven’t read the spec), but what does it say about durable messaging vendors (including MSFT) that a financial institution felt the need to drive an interoperable durable messaging specification?

DataReaders, LINQ to XML and Range Generation

I’m doing a bunch of database / XML stuff @ work, so I decided to use VS08 beta 2 so I can use LINQ. For reasons I don’t want to get into, I needed a way to convert arbitrary database rows, read using a SqlDataReader, into XML. LINQ to SQL was out, since the code has to work against arbitrary tables (i.e. I have no compile time schema knowledge). But LINQ to XML (formerly XLinq) helped me out a ton. Check out this example:

const string ns = "{http://some.sample.namespace.schema}";

while (dr.Read())
{
    XElement rowXml = new XElement(ns + tableName,
        from i in GetRange(0, dr.FieldCount)
        select new XElement(ns + dr.GetName(i), dr.GetValue(i)));
}

That’s pretty cool. The only strange thing in there is the GetRange method. I needed an easy way to build a range of integers from zero to the number of fields in the data reader. I wasn’t sure of any standard way, so I wrote this little two line function:

IEnumerable<int> GetRange(int min, int max)
{
    for (int i = min; i < max; i++)
        yield return i;
}

It’s simple enough, but I found it strange that I couldn’t find a standard way to generate a range with a more elegant syntax. Ruby has standard range syntax that looks like (1..10), but I couldn’t find the equivalent C#. Did I miss something, or am I really on my own to write a GetRange function?

Update: As expected, I missed something. John Lewicki pointed me to the static Enumerable.Range method that does exactly what I needed.