This technical overview gives a few hints about FFMQ's behavior and the internal optimizations used to achieve its current performance.
After spending quite some time writing (and rewriting) this JMS queuer, I have to conclude that any
correctly written queuer implementation will (should!) be I/O bound at the end of the day.
There are a few other aspects that can be optimized to achieve maximum performance and horizontal
scalability (like fine-grained synchronization), but the bottom line is that speed differences between
queuers will be directly linked to the way they handle disk and network I/O operations.
Sending a Java message instance over a network or writing it to persistent storage requires the instance to be 'serialized'
to a binary form.
Java serialization comes to mind immediately, but as versatile and automated as it may be (it is reflection-based), it is far from optimal performance-wise.
Thus, using explicit serialization is a first step toward better performance:
java.io.DataInputStream/DataOutputStream implementations would be nice candidates
for this purpose, but they are not fully optimal either because of two limitations.
FFMQ therefore uses a custom serialization mechanism that handles serialization and deserialization in a more direct way.
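As an illustration, here is a minimal sketch of what explicit serialization looks like for a hypothetical message header (the class and its fields are invented for this example; this is not FFMQ's actual wire format):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal sketch of explicit serialization (not FFMQ's actual wire format).
// Each field is written in a fixed, known order instead of relying on
// reflection-based java.io.Serializable.
public class MessageHeader {
    private final String  messageId;   // hypothetical fields, for illustration only
    private final long    timestamp;
    private final int     priority;
    private final boolean persistent;

    public MessageHeader(String messageId, long timestamp, int priority, boolean persistent) {
        this.messageId = messageId;
        this.timestamp = timestamp;
        this.priority = priority;
        this.persistent = persistent;
    }

    // Explicitly serialize the header to a compact binary form
    public byte[] toBytes() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream(64);
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF(messageId);
        out.writeLong(timestamp);
        out.writeByte(priority);
        out.writeBoolean(persistent);
        out.flush();
        return buffer.toByteArray();
    }
}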
Even when using (memory-stored) non-persistent messages, there is a non-negligible latency induced by the network layer.
Under typical load, the network traffic of a queuer is essentially composed of messages and acknowledgements of these messages.
To improve performance, our objective is simple: reduce the number of round-trips between client and server.
Because in most critical environments JMS is used in a transacted way, client-to-server round-trips need to be synchronous (think of a commit operation, for example).
That means the actual queuer throughput will be directly limited by its commit latency, because a client won't send or receive more messages before the previous ones
are properly committed.
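To put a number on it: assuming, say, a 1 ms client-to-server round-trip time, a producer that commits synchronously after every message cannot exceed roughly 1,000 messages per second, no matter how fast the broker itself processes each message.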
A somewhat basic remote consumer implementation would be an 'active' implementation, or pull architecture.
In such an implementation, the consumer asks the server whether there is something for it and, if not,
sleeps until a server notification arrives, then asks the server again, and so on.
This kind of implementation is correct but wastes a lot of network round-trips.
A better implementation, as used by all the fastest queuers, is a 'passive' or push architecture: the server
decides to push messages to consumers automatically (even before the consumers actually try to receive them!).
It may even push several messages ahead to maximize bandwidth (this is usually called prefetching).
This approach has two immediate benefits: messages are already waiting locally when the consumer asks for them, and no round-trips are wasted polling a queue that has nothing to deliver.
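As a rough illustration of the consumer side of such a push architecture (this is a sketch of the general idea, not FFMQ's actual implementation; the class and method names are invented), prefetched messages can simply accumulate in a bounded local buffer that the connection layer fills as the server pushes them:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Rough sketch of a prefetching consumer (illustration only, not FFMQ's code).
// The network layer pushes messages into a bounded local buffer; receive()
// is then a purely local operation most of the time.
public class PrefetchingConsumer<M> {
    private final BlockingQueue<M> prefetchBuffer;

    public PrefetchingConsumer(int prefetchSize) {
        this.prefetchBuffer = new ArrayBlockingQueue<>(prefetchSize);
    }

    // Called by the connection thread when the server pushes a message.
    // Blocks if the buffer is full, which naturally throttles the flow.
    public void onMessagePushed(M message) throws InterruptedException {
        prefetchBuffer.put(message);
    }

    // Called by application code; no network round-trip unless the buffer is empty.
    public M receive(long timeout, TimeUnit unit) throws InterruptedException {
        return prefetchBuffer.poll(timeout, unit);
    }
}

A real protocol would negotiate the prefetch window with the server rather than block the connection thread, but the principle is the same: most receive() calls never touch the network.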
Concerning producers, some buffering can help save round-trips by batching messages together, or by batching messages together with acknowledgements. For example, imagine the following typical scenario: a producer sends a message and commits, in a loop. Instead of sending the message and then the commit separately, we may group the two operations into one, thus saving one round-trip per message, as sketched below.
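Written against the plain JMS API, the scenario looks like this; the batching optimization happens below this API, inside the client's protocol layer, so the application code does not change:

import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

// Typical transacted producer loop: one send() followed by one commit() per message.
// A naive client pays two network round-trips per iteration; a batching protocol
// can piggyback the commit on the same packet as the message.
public class TransactedProducerLoop {

    public static void sendAll(Connection connection, Queue queue, String[] payloads) throws JMSException {
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        try {
            MessageProducer producer = session.createProducer(queue);
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);

            for (String payload : payloads) {
                TextMessage message = session.createTextMessage(payload);
                producer.send(message);  // may be buffered client-side
                session.commit();        // must be synchronous: blocks until the server acknowledges
            }
        } finally {
            session.close();
        }
    }
}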
FFMQ puts these concepts into practice in its custom TCP-based protocol to reduce the number of packets sent back and forth between clients and the server.
For persistent messages, hard-disk I/O operations are a major bottleneck: not because of the amount of data
moved around (nowadays there are read/write buffers in the application, in the operating system and in the hardware),
but because of the low-level hard-disk flush/synchronization required by transacted sessions.
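To make that distinction concrete, here is a minimal sketch (not FFMQ's storage code) of what committing a persistent message ultimately boils down to; the cheap part is the buffered write, the expensive part is the force() call that waits for the data to physically reach the disk:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch (not FFMQ's storage code): appending a message to a store file.
// The write itself only lands in OS buffers; force(false) is what actually
// waits for the disk and dominates the cost of a transacted commit.
public class SyncedAppend {

    public static void appendAndSync(Path storeFile, byte[] serializedMessage) throws IOException {
        try (FileChannel channel = FileChannel.open(storeFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            channel.write(ByteBuffer.wrap(serializedMessage)); // cheap: buffered by the OS
            channel.force(false);                              // expensive: physical disk sync
        }
    }
}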
There are various ways to handle persistent storage.
When designing the FFMQ storage engine, the main constraints I had in mind were to preserve message ordering and to support constant-time removal of messages.
To satisfy these constraints, I merged two concepts: a doubly-linked list (which provides ordering and constant removal time) and a journaling-filesystem-like store (block-based storage with an allocation table).
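A very rough sketch of the combined idea follows (illustrative only; the class, field names and layout are assumptions, not FFMQ's actual on-disk format): each stored message occupies fixed-size blocks tracked by an allocation table, and previous/next links between entries form the doubly-linked list:

// Very rough sketch of the combined concept (illustrative only, not FFMQ's on-disk format).
// Messages occupy fixed-size blocks tracked by an allocation table, and each entry
// links to its neighbours so the store behaves like a doubly-linked list.
public class BlockStoreSketch {

    static class Entry {
        int previous = -1;     // index of the previous message's entry (-1 = head)
        int next = -1;         // index of the next message's entry (-1 = tail)
        boolean allocated;     // allocation-table flag for this entry's blocks
    }

    private final Entry[] allocationTable;

    public BlockStoreSketch(int entryCount) {
        allocationTable = new Entry[entryCount];
        for (int i = 0; i < entryCount; i++)
            allocationTable[i] = new Entry();
    }

    // Removing a message is a constant-time operation: unlink the entry and
    // mark it free; no other message has to be moved or rewritten.
    public void remove(int index) {
        Entry entry = allocationTable[index];
        if (entry.previous != -1)
            allocationTable[entry.previous].next = entry.next;
        if (entry.next != -1)
            allocationTable[entry.next].previous = entry.previous;
        entry.allocated = false;
        entry.previous = -1;
        entry.next = -1;
    }
}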
This design was optimal for most operations I needed, but it has the same usual limitations as a filesystem: a
fixed pre-allocated size (though FFMQ 3.x now supports auto-extension at runtime), a tendency to waste space (stored messages are padded to a whole number of blocks), and possible fragmentation.
The good news is that these problems have only a limited impact on real-world scenarios, making the whole solution quite efficient.
But what about disk sync and safety?