Sunday, 22 June 2008

Writing your own XML parser is wrong

A lot of things in programming are left to personal taste. How you indent your code lines, how you manage your data structures. Most things have advantages and disadvantages, and everybody should come to their own decision.

Some things though violate the most basic rule of software design: 

Do not reinvent the wheel. 

Most developers agree with this rule, consider it a truism and do not try to reinvent task schedulers, file systems or window managers (unless that is exactly what they set out to do, as in Linux).

For some reason, there are certain sins that developers still seem attracted to. One of the cardinal sins is writing your own XML parser. I have seen this multiple times, and every attempt failed. In theory, writing the parser should be straightforward. In practice, all sorts of things go wrong.

It is my impression that the initial motivation is like this: People look at an XML file. They figure out that it is an easy text format with a tree structure. Then they say to themselves: This is an easy format to store my application information. Why use a parser written by someone else? I can do that! And then the disaster begins. 

What's the problem?

Using a protocol or seeing its output does not give you an impression of how complex its underlying technology really is. It's like school kids thinking that being a teacher must be an easy job, because you're just standing there for a few hours a week, and you have months of holidays every year. The difficulty of actually being a teacher is lost on them because they have never experienced the job from the other side.

People reinventing XML parsing (and ignoring the multitude of proven industry-strength parsers) always end up with the same result: Their parser is tied to a dumbed-down version of XML with one particular structure. Most pseudo-XML parsers get the basic tree part about right, but only for their specific purpose. Many show unpredictable behavior when parts of the XML are missing or duplicated. Character encodings are not taken into account ("We all use Windows encoding here in our files, so no need for anything else!"). Then comes a typical flaw: None of the self-made XML parsers are able to verify their input except in very basic situations. I once had to use a homebrew XML parser that would run into an endless loop when it got unexpected input. No one had tested that. Implementing support for XML Schema files would have shown the developer just how much he had not taken into account while writing the parser.
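To make the points about encodings and unexpected input concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The helper `load_names` and the sample documents are made up for illustration. Note that this only checks well-formedness; ElementTree does not validate against an XML Schema, so schema validation would need a different library.

```python
import xml.etree.ElementTree as ET

def load_names(data: bytes):
    """Hypothetical helper: return the text of every <name> element,
    or None if the input is not well-formed XML."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Missing or mismatched tags are reported as an error
        # instead of sending the program into an endless loop.
        return None
    return [el.text for el in root.iter("name")]

# The same document stored in two different encodings. The parser reads
# the encoding from the XML declaration; the caller does not decode anything.
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><people><name>Jörg</name></people>'
latin1_doc = TEMPLATE.format(enc="ISO-8859-1").encode("iso-8859-1")
utf8_doc = TEMPLATE.format(enc="UTF-8").encode("utf-8")

# Unexpected input: <name> is never closed, and a second root element follows.
broken_doc = b"<people><name>Ann</people>"
duplicated_root = b"<people/><people/>"

print(load_names(latin1_doc))
print(load_names(utf8_doc))
print(load_names(broken_doc))
print(load_names(duplicated_root))
```

Both encodings yield the same name, and both malformed inputs are rejected rather than half-parsed.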

Quick and dirty XML parsers violate the very idea behind XML: to offer a program-independent and exchangeable way to distribute hierarchical data. They are charlatans. The data they process appears to be XML. But whenever an XML feature beyond "it is a tree" is used, they fail. When the line endings consist of different characters, they fail. When a tag spans two lines, they fail. And so on, and so on.
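The "tag spans two lines" failure is easy to reproduce. The sketch below pits a hypothetical line-based pseudo-parser (`naive_items`, invented here as a typical example of the genre) against the parser from Python's standard library. The input is well-formed XML with Windows line endings and one start tag broken across two lines.

```python
import re
import xml.etree.ElementTree as ET

# Well-formed XML: CRLF line endings, attributes spread over two lines.
DOC = '<config>\r\n  <item name="a"\r\n        value="1"/>\r\n</config>'

def naive_items(text):
    """Hypothetical homebrew approach: assumes every tag sits on one line."""
    pattern = re.compile(r'<item name="(\w+)" value="(\w+)"/>')
    found = []
    for line in text.splitlines():
        match = pattern.search(line)
        if match:
            found.append(match.groups())
    return found

def real_items(text):
    """The same task using a real XML parser."""
    root = ET.fromstring(text)
    return [(item.get("name"), item.get("value")) for item in root.iter("item")]

print(naive_items(DOC))  # the item is silently missed
print(real_items(DOC))   # the item is found
```

The homebrew version does not even report an error. It simply returns nothing, which is arguably worse than crashing.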

The excuse here will immediately be: I'm writing the XML too, so I have control over how I use it. Great. But then you are using a custom data format, not XML. You do not care about interoperability. You just make it look like you do. Is that really what you want?

Implementing a complete XML parser is hard. It takes months if not years of concentrated development effort to build a good one, including painstakingly testing each feature. Just reading the Wikipedia page on XML should cure people of the urge to implement all this themselves when stable parsers are ready to download around the corner. Pick the one you like (I like libxml2 for C applications, for example, although its documentation is temperamental). Choose an event-driven or a tree-based parser. Use it. Enjoy the bliss of not having to fight with internal XML features and use the time for your application.
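For readers who have not met the two parser styles, here is a small sketch of the difference, using Python's standard library rather than libxml2 so that it runs without extra dependencies. The sample document and the names `TitleHandler`, `titles_event_driven` and `titles_tree` are invented for this example.

```python
import xml.sax
import xml.etree.ElementTree as ET

DOC = b"<library><book>Dune</book><book>Emma</book></library>"

class TitleHandler(xml.sax.ContentHandler):
    """Event-driven style: the parser calls these methods while reading."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # collects text while inside a <book>

    def startElement(self, name, attrs):
        if name == "book":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join later.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "book":
            self.titles.append("".join(self._buffer))
            self._buffer = None

def titles_event_driven(data):
    handler = TitleHandler()
    xml.sax.parseString(data, handler)
    return handler.titles

def titles_tree(data):
    """Tree style: parse everything first, then navigate the result."""
    root = ET.fromstring(data)
    return [book.text for book in root.findall("book")]

print(titles_event_driven(DOC))
print(titles_tree(DOC))
```

The event-driven version never holds the whole document in memory, which matters for large files. The tree version is shorter and usually the more comfortable choice for small configuration data.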

This article should also have cured you of writing your own shallow XML parser. If you think it must be easy, because after all it's just text dammit, please read Robert Cameron's article on shallow-parsing XML and use his parser grammar instead.

By the way: Exactly the same goes for any homebrew implementation of a complex standard that looks easy at first, but then makes you cringe at the details. Another common case is "I can implement an SMTP email send function in a day". Believe me, you will need more than a day. There are well-tested mail sender plugins and libraries out there. Use them instead.
