(Yes, it's been a while.) A typical problem: I have a logging mechanism that's used by a large application. The logger itself is a single thread that wakes up every n seconds or every m unwritten log messages, reads them from a queue, batches them together and sends them off to wherever they end up. Since this queue is shared across a multi-threaded application, I wanted to avoid having a lock statement in such a highly visible location. My solution seemed quite nifty at first: define a delegate for the method that actually writes to the queue, and have the externally visible method invoke the delegate asynchronously without waiting for the result. The idea is that instead of having the invoking thread wait for the lock to be acquired, at what is likely to be a point of high contention, we offload the task onto another thread and go on our merry way.
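
To make the two approaches concrete, here is a minimal sketch of what I mean. All names (AsyncLogQueue, LogDirect, LogAsync) are made up for illustration and this is not the production logger. It also assumes the .NET Framework, where delegates support BeginInvoke and run the target on a thread-pool thread; on .NET Core and later the same call throws PlatformNotSupportedException.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical names throughout: a sketch of the idea, not the real logger.
public static class AsyncLogQueue
{
    private delegate void EnqueueDelegate(string message);

    private static readonly object Sync = new object();
    private static readonly Queue<string> Pending = new Queue<string>();
    private static readonly EnqueueDelegate EnqueueAsync = Enqueue;

    // The method that actually takes the lock and writes to the shared queue.
    private static void Enqueue(string message)
    {
        lock (Sync)
        {
            Pending.Enqueue(message);
        }
    }

    // Direct version: the calling thread blocks until it acquires the lock.
    public static void LogDirect(string message)
    {
        Enqueue(message);
    }

    // Async version: hand the write to another thread and return immediately.
    // Assumes .NET Framework; Delegate.BeginInvoke is not supported on .NET Core.
    public static void LogAsync(string message)
    {
        // EndInvoke is passed as the completion callback so the async
        // result is cleaned up even though nobody waits for it.
        EnqueueAsync.BeginInvoke(message, EnqueueAsync.EndInvoke, null);
    }

    // Number of messages waiting for the logger thread.
    public static int Count
    {
        get { lock (Sync) { return Pending.Count; } }
    }

    public static void Main()
    {
        LogDirect("written by the calling thread");
        LogAsync("written by a thread-pool thread");
        Thread.Sleep(100); // crude wait so the pool thread gets a chance to run
        Console.WriteLine("pending messages: {0}", Count);
    }
}
```

Note that the lock doesn't go away in the async version. It just moves to whichever thread ends up running the delegate, so the caller no longer waits on it.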

Everything seemed wonderful, except I couldn't for the life of me find a good explanation of how BeginInvoke actually worked. I saw things like ThreadPool and Messages mentioned, but nothing concrete. So I decided to write a little test harness (code at the end of the article). This wonderful little harness did the following: a number of threads were created and launched almost simultaneously (in a loop). Each of these threads incremented an internal counter n times, and on every m-th iteration it also incremented a shared counter. The whole thing was timed both at the individual thread level and overall. I ran the test with 20 threads and a large enough number of iterations for my machine.
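
In outline, a harness like the one just described might look something like the sketch below. This is a cut-down illustration rather than the actual harness: the names (ContentionHarness, Run, BumpShared) and the iteration counts are placeholders I picked for the example, and the async variant again assumes the .NET Framework.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Illustrative sketch only; names and numbers are placeholders.
public static class ContentionHarness
{
    private static readonly object Sync = new object();
    private static long shared;

    // The contended operation: increment the shared counter under a lock.
    private static void BumpShared()
    {
        lock (Sync) { shared++; }
    }

    // Starts threadCount threads. Each increments a local counter `iterations`
    // times and calls `bump` on every `every`-th iteration.
    // Returns the overall elapsed time in milliseconds.
    public static long Run(int threadCount, int iterations, int every, Action bump)
    {
        shared = 0;
        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                var perThread = Stopwatch.StartNew();
                long local = 0;
                for (int i = 1; i <= iterations; i++)
                {
                    local++;
                    if (i % every == 0) bump();
                }
                perThread.Stop();
                Console.WriteLine("thread finished {0} increments in {1} ms",
                    local, perThread.ElapsedMilliseconds);
            });
        }

        var overall = Stopwatch.StartNew();
        foreach (var thread in threads) thread.Start();
        foreach (var thread in threads) thread.Join();
        overall.Stop();
        return overall.ElapsedMilliseconds;
    }

    public static void Main()
    {
        Action bump = BumpShared;

        // Direct version: each worker thread takes the lock itself.
        Console.WriteLine("direct: {0} ms", Run(20, 1000000, 100, bump));

        // Async version (.NET Framework only): queue the locked increment and move on.
        // The timer stops when the workers finish, which can be before every
        // queued increment has actually run.
        Console.WriteLine("async:  {0} ms",
            Run(20, 1000000, 100, () => bump.BeginInvoke(bump.EndInvoke, null)));
    }
}
```

Passing the shared write in as an Action keeps the loop identical for both runs, so the only thing that differs between them is how the locked increment gets invoked.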

First, we had the standard lock approach. The program ran for about 40 seconds and finished. Next, I ran the async version. After it went past the 5 minute mark, I killed it before it could finish. Uh-oh.

Alright, I thought, let's try a more realistic scenario - let's put the thread to sleep for 1ms when we do the shared write. Same 20 threads, but fewer iterations this time, to account for the fact that we're sleeping for part of the cycle. This time around, I got better results. The direct version ran for 20ms, while the async version ran for 22ms. Still not what I wanted to see. "Well, maybe it works better with higher contention," I thought, and upped the number of threads to 30. Re-running, I saw the direct version stay at around 20ms, while the async version jumped up to 30ms. There goes that idea.

So it turns out that while the concept is a nice one in theory, in practice it ends up being a bigger performance issue than its brute-force counterpart. In fact, the only time I saw a performance boost was when the operation within the lock was painful. When I put the thread to sleep within the lock, the async version went to 26 seconds, and the direct version shot up to 40. Understandable, really: the direct callers have to sit through every sleep while waiting for the lock, whereas the async callers hand off the work and keep going.

There are still a few variables out there. It seems to me that the main reason the async version was significantly slower was my test machine's limited ability to handle a large number of threads. With the first test, I saw the CPU shoot up to 100%. What's more interesting is that my dual-core system showed the second CPU idling - not something I expected.