Advantages and Problems of Executable Data
Posted on April 7, 2008
Filed Under Computer Science
I recently read two essays that dealt with the “Data as a Program” idea. It’s a nice idea overall, but there have to be downsides, right? I sat down for a few hours and riffed on the idea, and thought of a few problems that would need to be overcome by anyone using the “data as a program” paradigm on a large scale. But first, I want to bring you up to speed on some of the high notes hit by the essays.
The Nature of Lisp
First up is “The Nature of Lisp” by Slava Akhmechet (passed along by my friend Steve). His essay is set up as an argument, aimed at the beginner, for using Lisp. I found it pretty effective, as I decided to learn Lisp instead of Ruby this summer. That’s right, Ruby, you’ll just have to wait.
The essay looks at the design of Ant, Java’s build tool. Ant is an XML-driven build system that lets you specify how your Java applications are built. In essence, Slava made the following points:
- Ant is slowly marching, one feature at a time, towards Turing completeness.
- The commands in Ant are being executed by something. It just happens to be Java.
- Any XML is trivially convertible to Lisp.
- Lisp has all of the same benefits as XML, but is a program and has extra benefits, including automatic data loading and metaprogramming facilities.
- Lisp can be extended to other domain specific languages, and to know how to modify data within the language, you only have to learn a single language: Lisp.
Developing the Ant language as an extension of Lisp is a nice idea, because Lisp solved all problems 154 years ago, back when cavemen wore bowties and were still allowed to ride dirigibles to work. As a result of decades of tweaking, Lisp is stable, fast, and elegant.
Storing your build scripts as Lisp is, in some respects, a very good solution to the problem. Since you are storing the data in what amounts to a Lisp data structure, tree commands can be executed on the data and transform it easily. This gives you the power of XSLT without any of the ugliness. Not only that, but since your data is already a program, the program can load itself into memory with a minimum of hassle. No more complicated parsing!
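To make the “XML is trivially convertible to Lisp” point concrete, here is a minimal sketch in Python (not Lisp, and not anything from Slava’s essay). The build fragment and the helper names `to_sexp`, `render`, and `rename_tag` are all made up for illustration; the only assumption is that an XML element maps onto a nested list of tag, attributes, and children.

```python
import xml.etree.ElementTree as ET

# Hypothetical Ant-style build fragment, invented for illustration.
BUILD_XML = """
<project name="demo">
  <target name="compile">
    <javac srcdir="src" destdir="build"/>
  </target>
</project>
"""

def to_sexp(elem):
    """Turn an XML element into a nested list: [tag, attributes, *children]."""
    return [elem.tag, dict(elem.attrib)] + [to_sexp(child) for child in elem]

def render(node):
    """Write the nested list out with Lisp-style parentheses."""
    tag, attrs, *children = node
    parts = [tag]
    parts += [f':{key} "{value}"' for key, value in attrs.items()]
    parts += [render(child) for child in children]
    return "(" + " ".join(parts) + ")"

def rename_tag(node, old, new):
    """A tiny tree transform: rename every `old` tag to `new`."""
    tag, attrs, *children = node
    new_tag = new if tag == old else tag
    return [new_tag, attrs] + [rename_tag(child, old, new) for child in children]

tree = to_sexp(ET.fromstring(BUILD_XML))
print(render(tree))
print(render(rename_tag(tree, "javac", "compile-java")))
```

The first line printed is `(project :name "demo" (target :name "compile" (javac :srcdir "src" :destdir "build")))`. Once the data is a plain tree, a transform like `rename_tag` is just an ordinary recursive function, which is roughly the job XSLT does for XML.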
The Emacs Problem
“The Emacs Problem” by Steve Yegge looks at the problem from the perspective of log parsing. Logs can be written as complicated text strings, or they can be written as XML. Both have their advantages, but in the end, having a standardized way to represent your data as a tree saves you the trouble of parsing a complex string into a tree before you can do complex manipulations. If your data is already in an easy-to-use tree, you can use tree traversal and modification commands, which gives you more power.
Since your log can easily be written as XML, your log can easily be written as Lisp. Since your log is Lisp, it loads itself into memory and you don’t need to spend any extra time parsing it. Since your data is a program, it can know how to modify itself.
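Here is a rough Python analogue of a log that “loads itself”. It is only a sketch: the log lines and field names are invented, and `ast.literal_eval` stands in for the Lisp reader. It reads literals and nothing else, so this is a narrower thing than the fully executable log entries Steve describes.

```python
import ast

# Hypothetical log lines, each written as a Python literal instead of free text.
LOG_LINES = [
    "{'level': 'ERROR', 'time': '2008-04-07T10:15:00', 'msg': 'disk full', 'tags': ['io', 'disk']}",
    "{'level': 'INFO', 'time': '2008-04-07T10:16:00', 'msg': 'retrying', 'tags': ['io']}",
]

def load_entries(lines):
    """Read each line with the language's own literal reader; no custom parser."""
    return [ast.literal_eval(line) for line in lines]

entries = load_entries(LOG_LINES)

# The entries are already structured, so queries are plain data manipulation.
errors = [entry["msg"] for entry in entries if entry["level"] == "ERROR"]
print(errors)  # ['disk full']
```

There are no regular expressions and no hand-written grammar here: the format of the log is the format of the language’s own data.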
Neat, huh?
Steve takes the idea a step further in his section “Beyond Logs”.
So Web pages have CSS, and JavaScript, and all this other hooey. It’s become so ugly that people don’t really write web pages anymore, not for production stuff. Nowadays people treat the morass of ancient, crufty Web technologies as a sort of assembly language. You write code to assemble your pages piecewise using PHP or XML/XSLT or Perl/Mason or Java/JSP or perhaps all of them in a giant ugly pipeline, which “compiles” down to an unreadable Web page format. Talk about fun!
The solution?
The whole nasty “configuration” problem becomes incredibly more convenient in the Lisp world. No more stanza files, apache-config, .properties files, XML configuration files, Makefiles — all those lame, crappy, half-language creatures that you wish were executable, or at least loaded directly into your program without specialized processing. I know, I know — everyone raves about the power of separating your code and your data. That’s because they’re using languages that simply can’t do a good job of representing data as code. But it’s what you really want, or all the creepy half-languages wouldn’t all evolve towards being Turing-complete, would they?
In fact, if you insist on code/data separation and you’re an advocate of OOP, then you’re talking out of both sides of your mouth. If your gut reaction to having log entries know how to transform or process themselves is “woah, that’s just wrong”, think again: you’re imposing a world-view on the problem that’s not consistent with your notions of data encapsulation and active objects. This world-view dates back to ancient Unix and pre-Unix days. But if you think about it, there’s no reason log entries or config files shouldn’t be executable and subclassable. It might be better.
And what about, oh, web pages? Or word-processor documents? Well, you figure it out.
Your webpage is a Lisp program. Your logfile is a Lisp program. All of your code is a Lisp program. Everything is a Lisp program. Hooray! Steve Yegge saved the internet!
Security in the Small
Everything we’ve discussed here is fine and dandy when you have your Lisp program read and execute Lisp data on your own Lisp machine. You can easily load log files, you can easily write log files, and you can easily manipulate log files. I want to look at another example, far more common than log files: web sites.
Let’s pretend the real world has two people, you and me. I ask you for your personal webpage, and you give me an executable file. It could be a webpage, but it could also be a horrible icky program that changes the contents of all of my Open Office documents to a copy of “Never Gonna Give You Up” by Rick Astley. I don’t trust you, reader.
OK, fine, realistically, we can’t let our web pages just do anything they feel like. After all, web pages come with Javascript, and Javascript can’t just wipe your hard drive the first time someone successfully cross-site scripts CNN. Nobody in their right mind would write a web browser that would just take an executable file that it’s been given and run it without even parsing it for harmful instructions.
… right?
“Of course you shouldn’t just execute a binary someone else gives you!” I hear you shouting. Who in their right mind would do that, right?
OK, so we know that web browsers should be written with a sandbox that prevents the evils of the world from doing whatever they want with your personal computer. But web browsers aren’t the only computer programs that use data. Let’s take Steve Yegge’s log file example. Let’s say that I write a program that uses log files. Since the data is code and knows how to modify itself, I can run it and have it load itself into memory. That’s fine.
Let’s say that my program runs your log file. I need to keep it in the same kind of sandbox that web browsers use for very similar reasons. However, there is a new problem here. The set of coders writing “data is code”-style programs isn’t restricted to people with famous names.
Do you remember that guy from college that failed the intro course three times and still got a degree? He’ll be writing programs that execute data without checking it. Do you remember that guy in middle school who was willing to put anything in his mouth for a nickel? He’s now a highly-paid consultant for your company, and he’ll be writing programs that execute data without checking it.
OK, so sandboxing is a problem that can be solved once, packaged as a module, and distributed. So can parsing, and yet many who don’t use automated parsing tools (and many who do) still screw it up.
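For a sense of what “checking the data before executing it” might look like, here is a minimal Python sketch of an allowlist evaluator for Lisp-style expressions. Everything in it (`ALLOWED`, `UntrustedCode`, `run`, the operation names) is hypothetical, and an allowlist alone is not a real sandbox: it does nothing about resource limits, and the reader does not handle unbalanced input.

```python
def tokenize(text):
    """Split an s-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Build nested lists from a flat token list (consumes the tokens)."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    try:
        return int(token)
    except ValueError:
        return token  # anything that is not an integer is a symbol

# Only these operations may run; everything else is refused.
ALLOWED = {
    "+": lambda *args: sum(args),
    "max": max,
    "count": lambda *args: len(args),
}

class UntrustedCode(Exception):
    """Raised when the data asks for something outside the allowlist."""

def evaluate(node):
    if isinstance(node, int):
        return node
    if isinstance(node, str):
        raise UntrustedCode(f"bare symbol not allowed: {node}")
    if not node:
        raise UntrustedCode("empty form")
    op, *args = node
    if not isinstance(op, str) or op not in ALLOWED:
        raise UntrustedCode(f"operation not allowed: {op}")
    return ALLOWED[op](*[evaluate(arg) for arg in args])

def run(text):
    return evaluate(parse(tokenize(text)))

print(run("(+ 1 2 (max 3 4))"))  # 7
try:
    run("(delete-files 1)")
except UntrustedCode as err:
    print("refused:", err)
```

The point of the sketch is how little it takes to get this wrong: forget the `op not in ALLOWED` check, or swap `evaluate` for the language’s built-in `eval`, and the log file is running whatever it likes.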
Errors in the Large
OK, so we now have a magical browser that won’t Rick-roll my whole hard drive when it requests a bad page. Now we can start sending our executable web pages to each other!
Hm, so you sent me a page that has an error on it. My browser has a few options.
- It can try to guess what you meant and display the page accordingly.
- It can refuse to show me the page at all and give me a compilation error.
- It can let me fix what is wrong with the file.
- It can remove the error and show me what’s left of the page.
Really, these are all the same options that we are left with in current HTML web browsers. Javascript for the web has had to warp to cope with the fact that Internet Explorer 6 supports a different flavor of Javascript than everyone else does. The same pressure could do things to new executable formats (Lisp included) that you don’t want to see happen in your lifetime.
Imagine that we end up with the IE6 of Lisp browsers. Not only does this browser fail to correctly display correct Lisp, but it also has an extensive quirks mode that is very forgiving: it lets you also use C-style curly brackets for alternating parentheses groups, for instance. You know, just so the programmer can easily see where they end. This is also the web browser that everyone has by default, so it is the one that the unwashed masses are using in order to sink their teeth into Lisp.
The attack could come in any number of flavors, including “embrace and extend” and “ignore the agreed-upon standard”. There may even be types of ignorance that I didn’t think of. I’d bet on it. So we now have what amounts to a defective, entrenched Lisp interpreter. Not what the crowd is looking for.
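The difference between “refuse the page” and “guess what you meant” is easy to sketch. The Python below is a made-up reader with a hypothetical `quirks` flag: strict mode raises on malformed input, while quirks mode accepts curly brackets as parentheses and silently closes any forms left open. None of this describes a real browser or Lisp implementation.

```python
class ParseError(Exception):
    """Raised by the strict reader when the page is malformed."""

def tokenize(text, quirks):
    if quirks:
        # Quirks mode: silently accept curly brackets as parentheses.
        text = text.replace("{", "(").replace("}", ")")
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(text, quirks=False):
    """Return the list of top-level forms; strict unless quirks=True."""
    stack = [[]]
    for token in tokenize(text, quirks):
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) == 1:
                if quirks:
                    continue  # ignore a stray ")"
                raise ParseError("unexpected )")
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(token)
    while len(stack) > 1:
        if not quirks:
            raise ParseError("missing )")
        done = stack.pop()  # quirks mode guesses where the form ends
        stack[-1].append(done)
    return stack[0]

# A hypothetical page with curly brackets and two missing closing parentheses.
page = "(page {title Hello} (body (p Hi)"

print(parse(page, quirks=True))
try:
    parse(page)
except ParseError as err:
    print("compilation error:", err)
```

The forgiving reader prints `[['page', ['title', 'Hello'], ['body', ['p', 'Hi']]]]` and the strict one reports `missing )`. Once enough pages depend on the forgiving behaviour, every other reader has to copy the guesses exactly, and the guesses become the standard.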
Conclusion
Storing data as code is a powerful technique. Lisp, for example, has done this successfully since its inception, and it helps give the language some of its power.
Unfortunately, the technique is imperfect when brought into the real world. Security is a huge concern because (no offense) I don’t trust you. We can’t just treat data as Lisp programs and expect that people won’t try to take advantage of the fact that the files will be executed.
Not only that, but if this idea hits the big time for something like a web format, the people who design it have to be very careful about its release. We don’t want another Martian Headsets situation.
Comments
3 Responses to “Advantages and Problems of Executable Data”
Haha, you don’t trust the poor reader? Which I guess is understandable, but if you could create such a trusted environment you could have such a fantastic amount of flexibility and power that tasks would be greatly simplified. But in all honesty, look at what Flash is. It’s essentially a binary that gets loaded by a plugin in your web browser.
So we already have binary in our browser and we’re trusting the plugin not to overstep its bounds because even though it’s in the web browser it does have more permissions than the browser itself.
It puts into context projects like Squeak, which aim to bring Smalltalk to the browser, and that certainly has its possibilities!
B-Daddy,
We already have 2, actually… Flash and Java. Flash does have the added advantage that its sandbox doesn’t let you touch the local filesystem while working as a web application, only when it is exported as a native binary.
I agree that there’s a lot of power behind the idea, and I touch on some of the nice things in my post.
However, security is easy to get, and hard to get right. Flash didn’t get it right on the first go. Even Live Search thinks so, and Live Search sucks. Not to mention the only virus I’ve found on my computer in the past 6 months was from a malicious Java applet ;). Flash and Java are some of the cornerstones of the web, built by some of the smartest people. Even they couldn’t get it right. What hope do the mere mortals have?
Yeah, there’s Flash, Java, and Shockwave too. The trend certainly seems to be leaning towards that method. And of course nothing will be perfect, but people will still flock towards it with open arms. Look at Win ME^2 (Vista) as a fantastic example of the sort of thing that people will use despite glaring flaws and certain compatibility issues.
You are very much correct, us mere mortals certainly don’t have much of a chance. Though if we gave each application its own virtual environment and you could only change/modify an environment inside of your own it could certainly have some sweet possibilities. Though with all the vectors of attack I’m sure that wouldn’t remain secure for too long.