Hybrid App Debugging Aside – The DLR Hosting API

In my series on Hybrid App Debugging, I showed the following code for executing a Python file in a hybrid C#/IronPython app.

private void Window_Loaded(object sender, RoutedEventArgs e)
{
    ScriptEngine engine = Python.CreateEngine();
    ScriptScope  scope = engine.CreateScope();
    scope.SetVariable("items", lbThings.Items);
    engine.ExecuteFile("getthings.py", scope);
}

The DLR Hosting API has three distinct levels of functionality. As simple as this is, technically it’s level 2 since it’s using a ScriptEngine directly. If you wanted to use the simplest level 1 hosting API, you could use runtimes instead of engines and save a line of code.

private void Window_Loaded(object sender, RoutedEventArgs e)
{
    ScriptRuntime runtime = Python.CreateRuntime();
    runtime.Globals.SetVariable("items", lbThings.Items);
    runtime.ExecuteFile("getthings.py");
}

The ScriptRuntime version of ExecuteFile doesn’t include an overload that takes a ScriptScope like ScriptEngine does, so instead you add the items variable to the globals scope. However, this doesn’t automatically add the items object to every child scope – you have to explicitly import items into the local scope if you want to use it. So for Python, that means you need to add “import items” to the top of the getthings.py script. Nothing else changes.

Personally, I find DLR Hosting API Level 2 to be straightforward and easy enough to understand, so I tend to code to that level by default. I actually had to go read the doc to discover the ScriptRuntime.Globals property and talk to Dino about importing those variables into a local scope. However, I wanted to point out that nothing in my Hybrid App Debugging sample so far is really dependent on the level 2 API. If you just want to execute some Python files in the context of your C# application, you can stick with the simpler level 1 API if you want. You can even use lightweight debugging with the level 1 API – there’s an overload of the SetTrace extension method for ScriptRuntimes just as there is for ScriptEngines. Just something to keep in mind.

HawkCodeBox

Last month, I lamented the lack of extensibility of the WPF text box. While there are several vendors and at least one open source custom syntax highlighting text box, it still really bothers me how inextensible the basic WPF text box is. I just want to do a simple colorizing REPL – why is that so hard?

So instead of using any of those syntax highlighting text boxes, I decided to build my own using the approach Ken Johnson wrote about on Code Project. As I wrote before, it’s a hack – you set the text box’s foreground and background brushes to transparent so that you can override OnRender – but it works.

The big change I made from Ken’s code was to use the DLR TokenCategorizer instead of regular expressions to tokenize the code. TokenCategorizer is a service provided by the DLR hosting API that tokenizes a given script source for you. Here’s the code that colorizes the text in the text box.

var source = Engine.CreateScriptSourceFromString(this.Text);
var tokenizer = Engine.GetService<TokenCategorizer>();
tokenizer.Initialize(null, source, SourceLocation.MinValue);

var t = tokenizer.ReadToken();
while (t.Category != TokenCategory.EndOfStream)
{
    if (SyntaxMap.ContainsKey(t.Category))
    {
        ft.SetForegroundBrush(_syntaxMap[t.Category],
             t.SourceSpan.Start.Index, t.SourceSpan.Length);
    }

    t = tokenizer.ReadToken();
}

As you can see, I ask the engine for a TokenCategorizer, initialize it with the text box’s current contents, then iterate through the tokens, looking for ones in my SyntaxMap. If the token category is in the syntax map, I change the foreground brush for that span of formatted text (ft is a WPF FormattedText instance I created earlier in the method).

Of course, this approach isn’t very efficient – it re-colorizes the entire file on every change. It turns out that some DLR TokenCategorizers are restartable, so you can cache the tokenizer state at any point and then return later with a new TokenCategorizer instance and pick up tokenizing where you left off. With this approach you could, say, tokenize a line at a time, so that you only need to retokenize the line where the change occurred rather than the entire file. But only IronPython supports tokenizer restarting today, so I decided to take the easy way and simply re-colorize on every change.
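To make the line-at-a-time idea a bit more concrete, here is a minimal sketch in plain Python. It is not the DLR TokenCategorizer API: the LineTokenCache class and the toy TOKEN_RE pattern are names I made up for illustration, and the sketch assumes every line can be tokenized on its own. A real restartable tokenizer would also have to carry state across lines (for example, when a change lands inside a multi-line string).

```python
import re

# Toy token pattern, a stand-in for a real language tokenizer (assumption).
TOKEN_RE = re.compile(
    r"(?P<comment>#.*)"
    r"|(?P<string>'[^']*')"
    r"|(?P<number>\d+)"
    r"|(?P<name>[A-Za-z_]\w*)"
)


class LineTokenCache:
    """Caches tokens per line and only retokenizes lines whose text changed."""

    def __init__(self):
        self._lines = []      # text of each line as of the last update
        self._tokens = []     # cached token list for each line
        self.retokenized = 0  # counts how many lines were actually tokenized

    def _tokenize_line(self, text):
        self.retokenized += 1
        # Each token is (category, start column, length).
        return [(m.lastgroup, m.start(), m.end() - m.start())
                for m in TOKEN_RE.finditer(text)]

    def update(self, source):
        new_lines = source.split("\n")
        new_tokens = []
        for i, text in enumerate(new_lines):
            if i < len(self._lines) and self._lines[i] == text:
                new_tokens.append(self._tokens[i])  # unchanged: reuse cache
            else:
                new_tokens.append(self._tokenize_line(text))
        self._lines, self._tokens = new_lines, new_tokens
        return new_tokens


cache = LineTokenCache()
cache.update("x = 1\ny = 'a'")
tokens = cache.update("x = 1\ny = 'b'")  # only the second line is retokenized
```

Because the cache is keyed by line index, inserting or deleting a line shifts everything below it and forces those lines to be tokenized again. That is acceptable for a sketch, but a real editor would want something smarter.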

I named the project HawkCodeBox and I’ve published the source up on GitHub. It’s fairly simple, but of course the goal wasn’t to build the be-all-end-all text editor – other people in the VS team already have that job.

Invoking Python Functions from C# (Without Dynamic)

So I’ve compiled the Pygments package into a CLR assembly and loaded an embedded Python script, so now all that remains is calling into the functions in that embedded Python script. Turns out, this is the easiest step so far.

We’ll start with get_all_lexers and get_all_styles, since they’re nearly identical. Both functions are called once on initialization, take zero arguments and return a PythonGenerator (for you C# devs, a PythonGenerator is kind of like the IEnumerable that gets created when you yield return from a function). In fact, the only difference between them is that get_all_styles returns a generator of simple strings, while get_all_lexers returns a generator of PythonTuples, each containing the long name, a tuple of aliases, a tuple of filename patterns and a tuple of mime types. Here’s the implementation of the Languages property:

PygmentLanguage[] _languages;

public PygmentLanguage[] Languages
{
    get
    {
        if (_languages == null)
        {
            // Wait for the background initialization of _scope to finish
            _init_thread.Join();

            var f = _scope.GetVariable<PythonFunction>("get_all_lexers");
            var r = (PythonGenerator)_engine.Operations.Invoke(f);
            var languages_list = new List<PygmentLanguage>();
            foreach (PythonTuple o in r)
            {
                // o[0] is the long name, o[1] is the tuple of aliases
                languages_list.Add(new PygmentLanguage()
                    {
                        LongName = (string)o[0],
                        LookupName = (string)((PythonTuple)o[1])[0]
                    });
            }

            _languages = languages_list.ToArray();
        }

        return _languages;
    }
}

If you recall from my last post, I initialized the _scope on a background thread, so I first have to wait for the thread to complete. If I were using C# 4.0, I’d simply be able to run _scope.get_all_lexers, but since I’m not, I have to manually reach into the _scope and retrieve the get_all_lexers function via the GetVariable method. I can’t invoke the PythonFunction directly from C#. Instead, I have to use the Invoke method that hangs off _engine.Operations. I cast the return value from Invoke to a PythonGenerator and iterate over it to populate the array of languages.
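For comparison, here is roughly what that loop does when written in plain Python. This is only a sketch: sample_lexers is a hypothetical stand-in for get_all_lexers that yields two made-up entries in the tuple shape described above, and the dictionary keys simply mirror the LongName and LookupName properties from the C# code.

```python
def sample_lexers():
    # Hypothetical stand-in for pygments.lexers.get_all_lexers.
    # Each item: (long name, aliases, filename patterns, mime types).
    yield ("Python", ("python", "py"), ("*.py",), ("text/x-python",))
    yield ("C#", ("csharp", "c#"), ("*.cs",), ("text/x-csharp",))


def to_languages(lexers):
    languages = []
    for long_name, aliases, _filenames, _mimetypes in lexers:
        # Like the C# code, use the first alias as the lookup name.
        languages.append({"LongName": long_name, "LookupName": aliases[0]})
    return languages


languages = to_languages(sample_lexers())
```

The casts in the C# version do the same job as the tuple unpacking here: o[0] is the long name and ((PythonTuple)o[1])[0] is the first alias.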

If you’re working with dynamic languages from C#, the ObjectOperations instance that hangs off the ScriptEngine instance is amazingly useful. Dynamic objects can participate in a powerful but somewhat complex protocol for binding a wide variety of dynamic operation types. The DynamicMetaObject class supports twelve different Bind operations. But the DynamicMetaObject binder methods are designed to be used by language implementors. The ObjectOperations class lets you invoke them fairly easily from a higher level of abstraction.

The last Python function I call from C# is generate_html. Unlike get_all_lexers, generate_html takes three parameters and can be called multiple times. The Invoke method has a params argument so it can accept any number of additional parameters, but when I tried to call it I got a NotImplemented exception. It turns out that Invoke currently throws NotImplemented if it receives more than 2 parameters. Yes, we realize that’s kinda broken and we are looking to fix it. However, it turns out there’s another way that’s also more efficient for a function like generate_html that we are likely to call more than once. Here’s my implementation of GenerateHtml in C#.

Func<object, object, object, string> _generatehtml_function;

public string GenerateHtml(string code, string lexer, string style)
{
    if (_generatehtml_function == null)
    {
        _init_thread.Join();

        var f = _scope.GetVariable<PythonFunction>("generate_html");
        _generatehtml_function = _engine.Operations.ConvertTo
                           <Func<object, object, object, string>>(f);
    }

    return _generatehtml_function(code, lexer, style);
}

Instead of calling Invoke, I convert the PythonFunction instance into a delegate using Operations.ConvertTo which I then cache and call like any other delegate from C#. Not only does Invoke fail for more than two parameters, it creates a new dynamic call site every time it’s called. Since get_all_lexers and get_all_styles are each only called once, it’s no big deal. But you typically call generate_html multiple times for a block of source code. Using ConvertTo generates a dynamic call site as part of the delegate, so that’s more efficient than creating one on every call.

The rest of the C# code is fairly pedestrian and has nothing to do with IronPython, as all access to Python code is hidden behind GenerateHtml as well as the Languages and Styles properties.

So as I’ve shown in the last few posts, embedding IronPython inside a C# application – even before we get the new dynamic functionality of C# 4.0 – isn’t really all that hard. Of course, we’re always interested in ways to make it easier. If you’ve got any questions or suggestions, please feel free to leave a comment or drop me a line.

Embedding Python Scripts in C# Applications

Now that I’ve got Pygments and its dependencies packaged up in an easy-to-distribute assembly, I need to be able to call it from C#. However, if you pop open pygments.dll in Reflector, you’ll notice it’s not exactly intuitive to access. Lots of compiler generated names like pygments$12 and StringIO$64 in a type named DLRCachedCode. Clearly, this code isn’t intended to be used by anything except the IronPython runtime.

So we better create one of those IronPython runtime thingies.

As you can see in the layer diagram to the left, PygmentsCodeSource is split into two parts – a C# part and a Python part. The Python part is very simple – it just imports a couple of Pygments functions into the global namespace and defines a helper function that generates syntax highlighted HTML from a given block of code in a given language and style. Note the reference to the pygments assembly I described in my last post. Here’s the entire file:

import clr
clr.AddReference("pygments")

from pygments.lexers import get_all_lexers
from pygments.styles import get_all_styles

def generate_html(code, lexer_name, style_name):
  from pygments import highlight
  from pygments.lexers import get_lexer_by_name
  from pygments.styles import get_style_by_name
  from devhawk_formatter import DevHawkHtmlFormatter

  if not lexer_name: lexer_name = "text"
  if not style_name: style_name = "default"
  lexer = get_lexer_by_name(lexer_name)
  return highlight(code, lexer, DevHawkHtmlFormatter(style=style_name))

Instead of including this in the Pygments assembly, I embedded this file as a resource in my C# assembly. This way, I could use the standard DLR hosting APIs to create a script source and execute this code. I did have to build a concrete StreamContentProvider class to wrap the resource stream in, but otherwise, it’s pretty straightforward.

static ScriptEngine _engine;
static ScriptSource _source;

private void InitializeHosting()
{
    _engine = IronPython.Hosting.Python.CreateEngine();

    var asm = System.Reflection.Assembly.GetExecutingAssembly();
    var stream = asm.GetManifestResourceStream(
                   "DevHawk.PygmentsCodeSource.py");
    _source = _engine.CreateScriptSource(
                new BasicStreamContentProvider(stream),  
                "PygmentsCodeSource.py");
}

Once I got the engine and script source set up, all that remains is to set up a script scope to execute the script source in. For this specific application, it’s probably overkill to have a scope per instance – I think the syntax highlighting process is stateless, so a single scope should be easily shared across multiple PygmentsCodeSource instances. But I didn’t take any chances: I created a script scope per instance to execute the source in.

ScriptScope _scope;
Thread _init_thread;

public PygmentsCodeSource()
{
    if (_engine == null)
        InitializeHosting();

    _scope = _engine.CreateScope();

    _init_thread = new Thread(() => { _source.Execute(_scope); });
    _init_thread.Start();
}

You’ll notice that I’m executing the source in the scope on a background thread. That’s because it takes a while to execute, especially the first time. However, I don’t actually use the Python code until after the user types or copies a block of code into the UI and presses OK. In my experience, executing the Python code is typically finished by the time I get code into the box and press OK. I just need to make sure I add an _init_thread.Join guard anywhere I’m going to access the _scope to be sure the initialization is complete before I try to use it.
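The same start-early, join-before-use pattern can be sketched in plain Python with the threading module. The LazyScope class and the slow_init function are hypothetical names for illustration, and the short sleep just simulates a slow script execution.

```python
import threading
import time


class LazyScope:
    """Starts initialization on a background thread and joins before any use."""

    def __init__(self, init_fn):
        self._scope = {}
        self._init_thread = threading.Thread(target=init_fn,
                                             args=(self._scope,))
        self._init_thread.start()

    def get(self, name):
        # join() returns right away once the thread has finished, so it is
        # safe to use as a guard on every access.
        self._init_thread.join()
        return self._scope[name]


def slow_init(scope):
    # Hypothetical stand-in for executing the embedded script in the scope.
    time.sleep(0.1)
    scope["generate_html"] = lambda code: "<pre>%s</pre>" % code


scope = LazyScope(slow_init)          # returns immediately
html = scope.get("generate_html")("x = 1")  # blocks only if init is still running
```

The design choice is the same as in the C# constructor: pay the startup cost while the user is busy doing something else, and only block if they get to the scope before initialization does.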

In the next, and last, post in this small series we’ll see how to invoke Python functions in the _scope I initialized above from C#.

Compiling Python Packages into Assemblies

In looking at my hybrid IronPython / C# Windows Live Writer plugin, we’re going to start at the bottom with the Pygments package. Typically, a Python package is a physical on-disk folder that contains a collection of Python files (aka modules). And during early development of Pygments for WLWriter, that’s exactly how I used it. However, when it came time for deployment, I figured it would be much easier if I packaged up the Pygments package, my custom HTML formatter and the standard library modules that Pygments depends on into a single assembly.

IronPython ships with a script named pyc for compiling Python files into .NET assemblies. However, pyc is pretty much just a wrapper around the clr module’s CompileModules function. I wrote my own custom script to build the Pygments assembly from the files in the pygments and pygments_dependencies folders.

from System import IO
from System.IO.Path import Combine

def walk(folder):
  for file in IO.Directory.GetFiles(folder):
    yield file
  for folder in IO.Directory.GetDirectories(folder):
    for file in walk(folder): yield file

folder = IO.Path.GetDirectoryName(__file__)

pygments_files = list(walk(Combine(folder, 'pygments')))
pygments_dependencies = list(walk(Combine(folder,'pygments_dependencies')))

all_files = pygments_files + pygments_dependencies
all_files.append(IO.Path.Combine(folder, 'devhawk_formatter.py'))

import clr
clr.CompileModules(Combine(folder, "..\\external\\pygments.dll"), *all_files)

Most of this code is a custom implementation of walk. I have all the IronPython and DLR dlls including ipy.exe checked into my source tree, but I don’t have the standard library checked in, so I couldn’t rely on os.walk. Other than that, the code is pretty straightforward – collect a bunch of files in a list and call CompileModules.
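If the standard library is available to the build script, the hand-rolled walk can be replaced with os.walk. A minimal sketch (walk_files is a name I picked for illustration; the folder names are the ones from the script above):

```python
import os


def walk_files(folder):
    """Yield the path of every file under folder, recursing into subfolders."""
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            yield os.path.join(dirpath, name)


# Usage, assuming the same layout as the build script:
# pygments_files = list(walk_files(os.path.join(folder, 'pygments')))
```

Like the custom version, this yields every file it finds, not just .py files, so any filtering would still need to be added on top.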

The problem with this approach is that IronPython isn’t doing any kind of dependency checking when we compile the assembly. If you pass just the contents of the Pygments package into CompileModules, it will emit an assembly, but that assembly will still depend on some modules in the standard library. If those aren’t available, the Pygments assembly won’t load. I’d love to have an automatic tool to determine module dependencies, but since I didn’t have such a tool I used a brute-force, by-hand solution. I wrote a small script, test_compiled_pygments, to exercise the Pygments assembly. If there were any missing dependencies, test_compiled_pygments would throw an exception indicating the missing module. For each missing dependency, I copied over the missing module, recompiled the project and tried again. Lather, rinse, repeat. Not fun, but Pygments only depended on seven standard library modules so it didn’t end up taking that long.
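A first approximation of such a dependency tool can be sketched with the ast module from the standard library. This is not what I used: it assumes a current CPython 3 rather than IronPython, imported_modules is a hypothetical helper name, and a static scan like this misses any import performed dynamically at run time.

```python
import ast


def imported_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they stay inside the package.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


sample = "import re, os.path\nfrom pygments.lexers import get_all_lexers\nfrom . import util\n"
deps = imported_modules(sample)
```

Running this over every file in the package and subtracting the modules already being compiled would give a candidate list of standard library modules to copy over. The standard library’s modulefinder module takes a similar approach and also follows the imports it finds.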

So having gone down this path of compiling Python files into an assembly, would I do it again? For an application with an installer like this one, yes, no question. I added the Pygments assembly as a reference to my C# library and it got added to the installer automatically. That was much easier than managing all of the Pygments files and their dependencies in the installer project manually. Plus, I still would have had to manually figure out the dependencies unless I chose to include the entire standard library.

I will point out that the compiled Pygments assembly is the largest single file in my deployed solution. It clocks in at 2.25MB. That’s about twice the size of the Python files that I compiled it from. So clearly, I’m paying for the convenience of deploying a single file in space and maybe load time. ¹ I’m also paying in space for a private copy of IronPython and the DLR – the two IronPython and five DLR assemblies clock in around 3.16MB. In comparison, the actual Writer plugin assembly itself is only about 25KB! But for an installed desktop app like a WLWriter plugin, 5MB of assorted infrastructure isn’t worth worrying about compared to the hassle of ensuring a shared copy of IronPython is installed. I mean, even if you don’t know IronPython exists, you can still install and use Pygments for WLWriter. Simplifying the install process is easily worth 5MB in storage space on the user’s computer in my opinion.

Next up, we’ll look at the Python half of the PygmentsCodeSource component, which calls into this compiled Pygments library.

¹ I haven’t done it, but it would be interesting to compare the load time for the single larger pygments assembly vs. loading and parsing the Python files individually. If I had to guess, I’m thinking the single assembly would load faster even though it’s bigger since there’s less overhead (only loading one big file vs. lots of small ones) and you skip the parsing step. But that’s pure guesswork on my part.
