Sunday, February 3, 2008

Worse is Better in the form of Autosave

There's recently been some talk about how GLib is poorly designed software because g_malloc() abort()s when the underlying malloc implementation returns NULL (suggesting an OOM condition), and that it therefore should never be used to write real-world applications because the calling code doesn't get a chance to do proper error checking. (It was brought up, however, that you can actually use g_try_malloc() and/or plug your own malloc implementation in underneath g_malloc(), which could trivially notify the application that an OOM condition was hit before returning to g_malloc().)
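
To make the distinction concrete, here's a minimal sketch (the function name and buffer handling are mine, not GLib's) of what the caller sees: g_malloc() never returns NULL because it abort()s instead, while g_try_malloc() returns NULL and leaves the decision up to you:

#include <glib.h>

static gchar *
load_buffer (gsize size)
{
        gchar *buf;

        /* g_malloc (size) would abort() the entire process on failure;
         * g_try_malloc (size) returns NULL and lets the caller decide
         * how to recover, degrade, or report the error. */
        if (!(buf = g_try_malloc (size)))
                return NULL;

        return buf;
}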

While at first this argument seems correct and you begin to think "oh my god, the sky is falling", it's important to actually stop and think about the issue for a bit.

GLib was originally a utility library written as part of the Gtk+ widget toolkit in order to make its developers' lives easier. When designing a widget toolkit (Gtk+ in this case) for real-world programmers to use, simplicity is key. If your widget toolkit is hard to use because it offers a way to notify the application developer of every conceivable error condition, then no one will use it because it is "too hard".

What good is a library that is so hard to use that nobody uses it? It's no good, that's what.

The problem is that the idealists complaining about GLib's g_malloc() have only considered being able to check g_strdup()'s return for NULL, or maybe as far up as gtk_foo_new(), and have not considered that the rendering pipeline may need to allocate memory as it renders the widget. At that depth there may be no sane way to pass the OOM error back up to the application, because the top of the call stack may in fact be gtk_main() and not some callback function implemented in the application's own code.

The idealists argue that without the ability to check every malloc() for a NULL return, and to chain the error back up to a high enough level in the call stack to be handled properly, users could lose their unsaved document if the application is, for example, a word processor. They argue that a properly designed application will always handle every error condition and pass errors back up the stack, where an emergency buffer can be used to show the user an "Out of memory" error dialog and/or save all of the user's unsaved work.
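
For reference, the "emergency buffer" approach they have in mind usually looks something like the following rough sketch (the pool size and function names are made up): reserve a chunk of memory at startup, then release it once an allocation fails so the error path has something left to work with.

#include <stdlib.h>

/* Hypothetical emergency pool; the size is arbitrary. */
static void *emergency_pool = NULL;

static void
reserve_emergency_pool (void)
{
        emergency_pool = malloc (256 * 1024);
}

/* Called from the error path after an allocation has failed, to free
 * up enough memory to (hopefully) show a dialog or save the document. */
static void
release_emergency_pool (void)
{
        free (emergency_pool);
        emergency_pool = NULL;
}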

The problem with this school of thought is that the simple act of rendering your error dialog may require memory that you do not have available (if we are truly OOM, as opposed to simply being unable to allocate the ~4GB buffer that the application tried to allocate due to poor arithmetic).

There is one thing that they are correct on, however, and that is that losing a user's document is a Bad Thing(tm).

What they haven't considered, however, is that it's possible to prevent data loss without the need to implement their really complex OOM handling code:

I dub thee, auto-save.

Yes, I will assert that auto-save is our savior and that it is, in fact, the only feasible solution in what I affectionately refer to as: The Real World.

In the Real World, applications are built on top of other people's code that you do not control and do not have the time nor the luxury to audit; you simply have to trust that it works as advertised.

Let's imagine, for a minute, that you write a word processor application using some toolkit (other than Gtk+, obviously) that upholds your idealist design principles, in that it is designed in such a way as to be able to notify your application about OOM conditions that it experienced way down in the deep dark places of the rendering pipeline. And let's, for argument's sake, assume that your application is flawlessly written - because you are an idealist and thus your code is perfectly implemented in all aspects, obviously.

Now imagine that a user is using your word processor and the version of the widget toolkit (s)he's running has a bug in some error-handling path that gets hit unexpectedly: instead of the error being passed properly up the call stack, the toolkit's bug corrupts some memory and crashes the application.

Oops.

All the hard work you did, making sure that every possible error condition in your code is properly handled, never even comes into play, because the application crashed in a library you trusted to be implemented flawlessly - and so the user loses the document (s)he was writing.

Your effort was all for naught in this particular case.

What's the solution? Auto-save.

What have we learned from this? Auto-save is needed even if your toolkit is sufficiently designed to pass all errors (including OOM errors) back up the stack.

Once you've implemented auto-save, though, what are all those custom OOM-checks for each and every malloc() call in your application really worth?

Zilch.

So why not use something like g_malloc() at this point?

Once your system is OOM, the only reasonable thing you can do is save any state you don't already have saved and then abort the application (not necessarily using the abort() call). But if you already have all your important state pre-saved, then all you have left to do is shut down the application (because you don't have enough memory resources to continue running).

Where does Worse is Better come in, you ask?

Well, arguably, the auto-save approach isn't as ideal as implementing proper fallback code for every possible error condition.

Auto-save is, however, Better because it works in the Real World and is Good Enough: it achieves the goal of preventing the loss of your user's document(s), and it is far easier to implement, with a lot fewer points of failure.

Fewer points of failure means that it is a lot more likely to work properly. By using the auto-save approach, you can focus on making that one piece of code robust against every conceivable error condition with far less developer time and fewer resources, meaning you keep both cost and data loss down, which makes everyone happy.
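
To show just how little code this takes, here is a minimal sketch of a periodic auto-save using GLib's main loop; save_document(), the Document type, the recovery path, and the five-minute interval are all made up for illustration:

#include <glib.h>

typedef struct _Document Document;

/* Hypothetical: serializes the document to the given path. */
extern void save_document (Document *doc, const char *path);

static gboolean
autosave_cb (gpointer user_data)
{
        Document *doc = user_data;

        /* Write the current state to a recovery file.  If the process
         * later dies for any reason (OOM, a crash in a library, a power
         * outage), the user loses at most one interval's worth of work. */
        save_document (doc, "/tmp/document.autosave");

        return TRUE; /* keep the timeout installed */
}

/* Somewhere during application startup:
 *
 *     g_timeout_add (5 * 60 * 1000, autosave_cb, doc);
 */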

5 comments:

Simon Howard said...

Agreed. Nobody really wants to check every return from malloc(), and I dare say there are a lot of C programmers that don't even bother.

Besides, even if the ability to handle out-of-memory errors is something really vital to your application, I question the judgment of anyone who will dismiss the entire library based solely on the behaviour of g_malloc. It can be worked around.

Yevgen Muntyan said...

The bad thing is, it's hard (impossible, that is) to estimate how much memory you need for the UI, or how much memory you could reserve for non-UI stuff. Say you want to allocate a gazillion bytes to keep a BigThing. You do g_try_malloc(), it succeeds, and then the application aborts when the user opens a menu. And g_slice_alloc() will abort on OOM unconditionally, and no emergency pools will help there.

But I can't imagine g_list_prepend() which could fail :)

Anonymous said...

I agree with most of your comments and with the way g_malloc() and friends are implemented in glib. However, I would like to make two comments:

First, implementing auto-save is considerably more expensive than implementing emergency saves. If the application that you are developing has to deal with large files or with very complex data structures, then saving the data on a regular basis can consume a lot of processing power and/or memory. In many cases, it is easier to implement emergency saves: save whatever can be saved just before crashing (e.g. in a signal handler) but otherwise do not disturb the user with heavy background tasks. Alas, glib does not provide an easy way to catch all out-of-memory errors and then trigger the emergency saves. A SIGABRT handler is not an ideal solution because it could be triggered by things other than an out-of-memory condition. It would be nice to be able to register some kind of out-of-memory callback function without having to implement a whole set of wrapper functions with GMemVTable.
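
To be concrete, the wrapper-function approach I am trying to avoid looks roughly like the following sketch (emergency_save() is hypothetical, and g_mem_set_vtable() has to be the very first GLib call in the program):

#include <stdlib.h>
#include <glib.h>

/* Hypothetical: dumps whatever can still be saved to disk. */
extern void emergency_save (void);

static gpointer
saving_malloc (gsize n_bytes)
{
        gpointer mem = malloc (n_bytes);

        if (mem == NULL)
                emergency_save (); /* g_malloc() will abort() once we return NULL */

        return mem;
}

static gpointer
saving_realloc (gpointer mem, gsize n_bytes)
{
        gpointer p = realloc (mem, n_bytes);

        if (p == NULL && n_bytes != 0)
                emergency_save ();

        return p;
}

static void
saving_free (gpointer mem)
{
        free (mem);
}

static GMemVTable saving_vtable = {
        saving_malloc,
        saving_realloc,
        saving_free,
        NULL, /* calloc      */
        NULL, /* try_malloc  */
        NULL  /* try_realloc */
};

/* Must come before any other GLib call in main():
 *
 *     g_mem_set_vtable (&saving_vtable);
 */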

My second comment is about the way you describe the problem and the solution. Do you realize that it could also be read as a recommendation to switch to managed languages with proper exception handling? Errors happening deep in the code (including third-party libraries) can be caught with a try/catch block or similar constructs. So... are you suggesting that instead of using glib, we should all move to Java or C#? ;-)

Jeffrey Stedfast said...

Raphael:

As per point #1: yes, auto-save can be expensive (in terms of disk I/O); however, if your number one priority is never losing the user's data, then you have no other real choice: all the error checking in the world will not save your user's data if the power goes out, for example.

I'm not saying that error checking shouldn't also be done, but it is far from a reliable solution. Auto-save type solutions provide a much safer means of protecting your user's data... how often you invoke auto-save is up to you.

Just because you implement auto-save doesn't mean you can't also implement an emergency save in a signal handler. In fact, this is another excellent idea; however, it may be too late once a signal is raised... particularly if that signal is a SIGSEGV.
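
For what it's worth, about the most you can safely do from a handler is something like this rough sketch; the recovery buffer, its length, and the path are all hypothetical, and the data has to have been serialized into that buffer ahead of time, because by the time the signal arrives the heap may already be trashed:

#include <signal.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical globals the application keeps up to date with an
 * already-serialized copy of the document. */
static const char *recovery_buf;
static size_t      recovery_len;

static void
emergency_save_handler (int signum)
{
        /* Only async-signal-safe calls here: open, write, close, _exit. */
        int fd = open ("/tmp/document.recover", O_WRONLY | O_CREAT | O_TRUNC, 0600);

        (void) signum;

        if (fd != -1) {
                /* ignoring the return value; there's nothing left to do if it fails */
                write (fd, recovery_buf, recovery_len);
                close (fd);
        }

        _exit (1);
}

/* At startup:
 *
 *     signal (SIGSEGV, emergency_save_handler);
 *     signal (SIGABRT, emergency_save_handler);
 */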

My point was that 1) it is impractical to manually error-check every malloc() call and do proper error handling, passing the error up the call stack, in a large, complex GUI application, because doing so is very error-prone, and 2) it will never be able to handle all possible failure conditions, e.g. power outages, bugs in your own code, bugs in a library you built upon, or, heaven forbid, a bug in the hardware itself.

if (!(ptr = malloc (size))) {
        /* handle error */
}

will never be an end-all, fail-proof solution to never losing a user's data (even if the error condition is ENOMEM), no matter how much the idealists claim that it is, because software (and the hardware it runs on) is made by humans, which means that it is, by definition, imperfect.

As per point #2... I'm on Novell's Mono team. Does that answer your question? :)

(Hint: Yes, I believe that there are better languages than C in which to implement large complex applications.)

Anonymous said...

I agree apps shouldn't have to handle OOM. Well said.

Now, there's this cute little feature in OS X where the desktop (Aqua) will basically tell you when you are about to run out of memory, and I've seen several cases where apps would not crash, but just return an error. I don't know how they do it, nor do I care. But it's insanely useful and provides a true feeling of stability and elegance. Whatever this is, whatever the API, it can be done and has been done. Denying that is insanity.

Linux, or whatever *x desktop, just doesn't have that.

Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.