Sonntag, 22. Juni 2008

Writing your own XML parser is wrong

A lot of things in programming are left to personal taste. How you indent your code lines, how you manage your data structures. Most things have advantages and disadvantages, and everybody should come to their own decision.

Some things though violate the most basic rule of software design: 

Do not reinvent the wheel. 

Most developers agree to this rule, consider it a truism and do not try to re-invent task schedulers, file systems or window managers (except if that is exactly what they want to do, like e.g. in Linux). 

For some reason there are certain sins left that developer seem attracted to. One of the cardinal sins is writing your own XML parser. I have seen this multiple times, and all attempts had failed. In theory, writing the parser should be straight forward. In practice, all sorts of things go wrong.

It is my impression that the initial motivation is like this: People look at an XML file. They figure out that it is an easy text format with a tree structure. Then they say to themselves: This is an easy format to store my application information. Why use a parser written by someone else? I can do that! And then the disaster begins. 

What's the problem?

Using a protocol or seeing its output does not give you an impression of how complex its underlying technology really is. It's like school kids thinking that being a teacher must be an easy job, because you're just standing there for a few hours a week, and you have months of holidays every year. The difficulty of actually being a teacher is lost on them because they have never experienced the job from the other side.

People reinventing XML parsing (and ignoring the multitude of proven industry-strength parsers) always end up with the same result: Their parser is fixed to a certain, dumbed down version of XML of a certain structure. Most pseudo-XML parsers get the basic tree part about right, but only for their specific purpose. Many parsers show unpredictable behavior when parts of the XML are missing or duplicated. Character encodings are not taken account of ("We all use Windows encoding here in our files, so no need for anything else!") Then comes a typical flaw: None of the self-made XML parsers are able to verify their input except in very basic situations. Once I had to use a homebrew XML parser that would run into an endless loop when it got unexpected input. No one had tested that. Implementing support for XML schema files would have shown to the developer just how much he did not take into account while writing the parser.

Quick and dirty XML parsers violate the very idea behind XML: to offer a program-independent and exchangable way to distribute hierarchical data. They are charlatans. The data they process appears to be XML. But whenever one of the features above "it is a tree" of XML is used, they fail. When the line endings consist of different characters, they fail. When a tag spans two lines, they fail. And so on, and so on. 

The excuse here will immediately be: I'm writing the XML too, so I have control over how I use it. Great. But then you are using a custom data format, not XML. You do not care about interoperability. You just make it look like you do. Is that really what you want?

Implementing a complete XML parser is hard. It takes months if not years of a concentrated development effort to build a good one, including painstakingly testing each feature. Just reading the Wikipedia page on XML should cure people of implementing all this themselves when stable parsers are ready to download around the corner. Pick the one you like (I like libxml2 for C applications, for example, although its documentation is temperamental). Choose an event-driven or tree-driven parser. Use it. Enjoy the bliss of not having to fight with internal XML features and use the time for your application. 

This article should also have cured you of writing your own shallow XML parser. If you think it must be easy, because after all it's just text dammit, please read Robert Cameron's article on shallow-parsing XML and use his parser grammar instead.

By the way: Exactly the same goes for any homebrew implementation of complex standards that looks easy at first, but then makes you cringe at the details. A common other case is "I can implement an SMTP email send function in a day". Believe me, you will need more than a day. There are well-tested mail sender plugins and libraries out there. Use them instead.

Effective developers know how to search the net

Back in the old days, handbooks were the most important source of information for software developers. With your operating system you would get a massive amount of documentation that was meant to be comprehensive for every imaginable application to be. 

Things have changed slowly but surely. Today, an application developer not able to find online information about problems or errors encountered is in the way of productivity. He or she will be looking desperately for the explanation of an obscure API call or error message in old manuals when the solution could be found in seconds when using search engines. This is because error messages may be obscure to the single developer, but the net is so huge that somebody probably has stumbled upon it (and possibly found a good solution). Also, the paradoxon of manuals applies: The most difficult problems never appear in them, because they were simply not foreseeable to the API or system creator to be documented.

With the ever evolving APIs of today's libraries and operating systems, it is simply not economical to depend on printed documentation when things go bad. There are some examples of very good official documentation. The Java documentation is usually excellent, for example. Many parts of the MSDN are clear and precise, and tackle real world problems. But especially in open source land, documentation is usually done "in code" (cough) or very scarce and outdated. The quality of Linux man pages varies wildly. 

When an error message "libXYZ: error 59, dropped packets" pops up, it is the best way to simply dump the error message verbatim into a search engine. Usually, a lot of hits will show up. Many may just be desperate postings in online forums, but one out of ten may have received the appropriate answer. Problem solved. Good manuals may assist, yes (although don't forget the saying: you can't grep a dead tree). Reading bad manuals as well as communicating with support hotlines would be much more time-consuming and should just be a way of last resort.

I think it would be useful to not only ask for development qualifications in job interviews for software developers, but also let them find out the problem for a particularly weird error message. Just ask what their approach would be. If you have developers who can't solve problems with search engines, your competitor might have them, and thus have a much better productivity.

Logging: Structuring debug output

When we begin learning to program, we all write debugging output like 

  printf("found string\n"); 

or, in the parlance of our times,

  System.out.println("found string");

For a certain time, this sort of debugging output is enough. We remember where we put that output statement.

Things get harder when the code grows beyond a few hundred lines, or when more than one developer is working on it. Someone else just does not understand what train of thought the message comes from. Especially hard to understand are low-level control flow debugging outputs like AAAABACCDDD where each letter means that one specific branch of an if statement or other control flow statement has been taken (don't laugh, I've seen this sort of output in a release version of Acrobat Reader on Linux). The more your application grows, the more you are in need of a fundamental mechanism: Logging.

Logging should be the first thing you lay out before writing the first debug message of a code. A good log message consists of at least 5 elements per line of output:
  • WHEN was the message caused?
  • WHAT happened?
  • WHO caused the message (which object/method/function)?
  • WHICH THREAD ran the code?
  • HOW SEVERE was the message in the context of the task to accomplish?
(This only sounds like the rules when calling 911.)

Our little example from above may now look like this:

21 Jun 08 11:24:03.434 INFO [search_thread] TextSearch::searchTopic(): Found topic 'Jungle'

Let me explain why each element of this line is important.

The WHEN

As irrelevant as it may seem at first, the point of time a log message is caused is extremely important. Especially in multithreaded or distributed applications with the danger of race conditions (two or more threads or processes concurrently racing to use a data field / ressource) you have to know the exact point of time a thread tried to access something. Depending on your application, millisecond precision may be important. You can also detect time wasting in your code that way ("Gee, that call really takes 100ms? Why?")

Time runneth fast. Always time the log message at the very beginning of your log method. If you write all sort of code before (like string formatting, output preparation etc.), a context switch may already have cost you an unknown amount of time, and the "real" point of time of the log message is lost.

A very hard challenge is synchronizing log messages between different computers. Although NTP and time synching has been around for quite some time, synching of system clocks is still not precise to the millisecond. If you are on a fast local network, you can use a broadcast message over a network to cause a message at one point of time. Even if you have diverging clocks, you can then find out the leap between all clocks because the message was received at almost the same millisecond. 

The WHAT

All log mechanisms allow one or another sort of output formatting where the language does not offer on-the-fly string building like in Java. In C/C++ the usual way is offering printf-style formatting with varargs. A call may then look similar to a printf call:

log("Now looking for text %s",searchstring);

Good log messages are like good algorithms: They are clear to read and describe what is being done. Obscure log messages, like obscure algorithms, may look ingenious, but are usually a pain to read and to maintain. Be clear when formulating log messages. Neither be too short ("Value: "+acceleration) nor too chatty ("The algorithm is just now processing the value acceleration which is of value "+acceleration). One rule of thumb is: Values alone are confusing. Always attach the name of the variable to it, and a unit if possible ("acceleration=3.5 m/s")

The WHO

You definitively want to know which function or method is sending you a log message. In C/C++, you can automatically insert the function/method name by using compile-time directives like __PRETTY_FUNCTION__ (gcc) or __FUNCTION__ (Visual Studio compiler). These directives contain the current function (try printf(__PRETTY_FUNCTION__);) With another preprocessor wrapper like

#if defined(WIN32)
#define __PRETTY_FUNCTION__ __FUNCTION__
#endif
#define LL(l) Logging::l,__PRETTY_FUNCTION__

you can avoid to have to write the long directive in each log message. LL(loglevel) will evaluate to 'Logging::loglevel,__PRETTY_FUNCTION__' and thus fill two of the arguments of a call to the log function:

log(LL(SEVERE),'test');

would satisfy a log(loglevel,method,string) function with three arguments , and the current method or function name would be inserted automatically. (The Logging class has to have constant values for the log levels as static variables to make the example work of course.)

Java offers the java.util.Logging class where you have to enter the method name yourself. And there's LOG4J of course.

If you have to keep track of which object executes what method, use either its name (with uniquely named objects) or its unique pointer if you can get hold of it (this is easy in C/C++ but may be difficult in other languages).

The THREAD

Ten years ago most software was written as one long singular control flow. If the code did something, it was done in a strictly linear way. Only sometimes a worker thread was started so that the GUI could still be handled while calculating something.

Although there are some applications that still work in this peaceful way, most modern applications have an abundance of threads. In such an environment it is of key importance to know which thread caused a log message, especially when multiple threads of the same kind are running. If possible, use named threads and make sure that all thread names are unique. In Java and C# naming threads is a standard feature, while in pthreads, which many portable C/C++ applications use nowadays, you have to use TLS (thread local storage) to attach a name to a thread.

The SEVERITY

Log messages vary a lot in terms of importance. All serious log mechanisms take this into account and require an importance measure be attached to each message: the log level. Names of log levels vary, but the general layout is usually like this:

ERROR: a message that MUST be shown to the user, and which is fatal
WARNING: a message that something has gone wrong which the user should know. Non-fatal.
INFO: a message of an informative character that the user should know about.
DEBUG: messages only important to the developer. Filtered in production code.
TRACE: verbose debugging output that is usually annoying even to the developer except in special cases. Filtered in production code.

Log mechanisms vary in their number of loglevels. Some add a FATAL level above ERROR, others use more fine-grained log levels at the DEBUG or TRACE level. This is a designer's choice.

Most GUI applications also include logging, but they strictly hide it from the end-user. Then of course appropriate GUI action has to be taken to inform the user.

Design details of a logging mechanism

One beginner's mistake when building a log function is using methods in it that issue log messages themselves. Depending on the use of mutexes (none/recursive vs. non-recursive) this either causes a stack explosion by infinite recursion or a deadlock (can't lock the log mutex ever because the log method is locking it already) Be very careful when writing the log output/transmission function when you have no way of determining whether a logging caused it itself. A simple (yet not thread-safe) way to avoid recursion is using a static variable logging_in_progress that is set to true whenever a logging is just going on. If the log method is called while logging_in_progress is true, it immediately returns. In a multithreaded application, this behavior is of course not enough. We'll come to that in a minute.

The log queue

Multiple threads may cause a lot of messages at almost the same time. As a log mechanism usually has to use a mutex to protect the integrity of its output, it can become a needle eye in your code. One log message may be output on a terminal, delivered over network connections and entered into a database. Much too long for one thread to lock a mutex.

The technique to reduce the cost of sending log messages is using a special log thread for the actual output/transmission of log messages and a stub log() method that just appends log messages to a log queue. The log thread basically does two things in an endless loop: 
  • Look for log messages
  • If log messages in the queue, output/transmit them
and so on. Both methods, the log sender and the log receiver, need to be extremely carefully designed. The time spent locking the log queue mutex must be kept as low as possible at any cost. It should go without saying that the timing of the log message must be done in the log sender method, not in the log thread.

Extra care must be taken that the log thread does not starve because other threads take over all the CPU time. In massively multithreaded applications with hundreds of log messages every second, care has to be taken that the log queue does not grow infinitely and eats up all of the address space available to the application. A way to avoid trouble is dropping older log messages (possibly retaining ERROR and WARNING messages) in an overflow situation.

Instead of checking a logging_in_progress static variable, you have to make sure the log thread never queues in log messages in its own queue. A good way to ensure this is checking for the logging thread's unique name when using named threads.

The log queue is so important in multithreaded applications that, in my opinion, it should be declared a design pattern. Conversely, the non-queued logging should be declared an anti-pattern for multithreaded applications because it is a real time waster in all but the most basic situations (everything more complicated than just issuing a printf() statement).

Separating content from layout

For web pages, content management systems separate the content to be displayed and the actual layout. It is a good rule to do the same to log messages: Keep them in a structured way as long as possible, i.e. transform them to a flat string as late as you can. If your log method just builds a flat string, structured entry of log messages e.g. in a database becomes a real pain. If you use a log queue, make sure it uses structured data fields so you can easily process them in the log thread.

Conclusion

Logging is a key element in successful software debugging. It helps all developers and should be designed and used carefully. A good log mechanism implements log levels, thread debugging (where needed) and log message formatting. Log messages should be kept short and informative. Good logging output gives any developer a good understanding of what the code is trying to achieve and how it has succeeded achieving its goals.

Hello

Hi to everybody! 

This will be my place to publish articles about topics I find interesting in software development. To me, software development is both a craft and an art form. We still do not produce software, but we write software. Every piece of code we create shows our thinking and our personal style. 

In its best moments, the writing process of software flows uninhibited, and out comes beautifully styled code that is made to last. Experience shows that these flow moments are not so common. The reality of business makes compromises inevitable. How can we still maintain clarity and quality given the demands of today's fast-paced development schedules and ship-it-by-yesterday demands?

It is my wish that this blog may show ways how to stay on top of the development process. Many topics are of course controversional. We software developers love to argue what is the best editor/compiler/library/indenting style. Anyways, constructive criticism and comments are very welcome!