GPars performance test
We just added a REST interface for replicating data between servers. Parts of the service require us to GET a collection of URLs all at once and to POST an object to a collection of remote servers all at once. I thought this would be a great time to try out GPars. After all, you can’t get much easier pooling/multi-threaded support than:
```groovy
import groovyx.gpars.GParsPool

GParsPool.withPool {
    urlList.eachParallel { url ->
        // ...get/post with Jersey client...
    }
}
```
Some of my co-workers were concerned that this would create and destroy the pool data structures (specifically, Java threads) for every URL or server collection we submitted, and that this would add too much overhead. So I put together a few tests to see what GPars could give us using its simplest form of concurrency. First, an interesting tidbit from the GPars reference guide:
“While the GParsPool class relies on the jsr-166y Fork/Join framework and so offers greater functionality and better performance, the GParsExecutorsPool uses good old Java executors and so is easier to set up in a managed or restricted environment. It needs to be stated, however, that GParsPool performs typically much better than GParsExecutorsPool does.” (Section 3 intro)
and, from Groovy in Action v2:
“GParsPool does not create threads. Instead, it takes them from a fork/join thread pool of the jsr166y library, which is a candidate for inclusion in future Java versions. GPars uses this library extensively, especially its support for parallel arrays that are the basis for all parallel collection processing in GPars.” (Section 17.2)
If these statements are correct, then we shouldn’t have to worry about maintaining an existing pool and setting up a countdown latch of some form.
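For comparison, this is roughly the hand-rolled alternative: one long-lived pool reused across calls, plus a CountDownLatch so the caller blocks until every request in the batch is done. It's a minimal plain java.util.concurrent sketch, not GPars code; `fetchAll` is a hypothetical helper, and the `fetch` closure stands in for the Jersey GET/POST call.

```groovy
import java.util.concurrent.CountDownLatch
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Hypothetical helper: runs fetch on every item using an existing pool and
// blocks until all items finish. Results come back in input order.
// fetch stands in for a Jersey GET/POST call.
List fetchAll(ExecutorService pool, List items, Closure fetch) {
    Object[] results = new Object[items.size()]
    CountDownLatch latch = new CountDownLatch(items.size())
    items.eachWithIndex { item, int idx ->
        pool.execute {
            try {
                results[idx] = fetch(item)
            } finally {
                latch.countDown()   // count down even if fetch throws
            }
        }
    }
    // countDown/await give the happens-before edge needed to read results safely
    if (!latch.await(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException('Timed out waiting for parallel requests')
    }
    return results as List
}

// The pool is created once and reused for every batch, then shut down at exit
ExecutorService sharedPool = Executors.newFixedThreadPool(4)
try {
    def squares = fetchAll(sharedPool, [1, 2, 3]) { n -> n * n }
    println squares   // [1, 4, 9]
} finally {
    sharedPool.shutdown()
}
```

The pool’s lifecycle and the latch bookkeeping are exactly the parts that GParsPool.withPool hides.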
Tests
I threw together some quick-and-dirty tests:
- A series of mathematical operations (i.e. pure CPU work)
- Open and read in the text of 120 files, each about 1-4 KB
- Get the contents of a small web page (9 KB) hosted on a machine on the local network
Obviously, these were not meant to be a definitive test of all of GPars’s capabilities. I just wanted to see whether we could use the simplest form of GPars notation or whether we had to do something more complicated.
I ran each test in a loop with anywhere from 100 to 10,000 iterations, just to see whether the results changed significantly over longer runs. The results below are for the 5,000-iteration runs. The core bit of code I timed looked something like this:
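```groovy
import groovyx.gpars.GParsExecutorsPool

// data: the input collection (Files, URLs, or numbers, depending on the test)
int numLoops = 5000
double ms = 0
numLoops.times {
    long start = System.nanoTime()
    GParsExecutorsPool.withPool {
        List result = data.collectParallel {
            // (1..100).sum { i -> i ^ it }   // note: ^ is bitwise XOR in Groovy; ** is exponentiation
            // or
            // it.text.size()
        }
    }
    ms += (System.nanoTime() - start) / 1e6   // elapsed nanoseconds -> milliseconds
}
// Total time and throughput, in the same shape as the results below
println String.format('GParsExecutorsPool: %d in %.0f (%d/s)', numLoops, ms, Math.round(numLoops * 1000 / ms))
```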
where “it” was a File or URL (or a number for test #1).
I ran the test using regular sequential code (i.e. commenting out the withPool block and changing collectParallel to the normal collect), and then using GParsPool.withPool and GParsExecutorsPool.withPool. I also tried the GPars ParallelEnhancer class and the makeConcurrent() method, both of which let me use the normal collect call rather than collectParallel. For some reason, both approaches noticeably slowed down the collection processing. I did not dig into why that happens, but I suspect it comes from the additional overhead of the custom MetaClass handling.
Results
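These are the results on my Dell Precision M6400 running 32-bit Fedora 14. The JVM had 1.5G of RAM. Each line gives the total time in milliseconds for 5000 iterations, followed by the iterations per second.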
Test: Mathematical Operation
Normal: 5000 in 14594 (343/s)
GParsPool: 5000 in 12480 (401/s)
GParsExecutorsPool: 5000 in 15685 (319/s)
I reran this test with the timer started inside the withPool call, to isolate the overhead of creating the pool:
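```groovy
GParsExecutorsPool.withPool {
    // Start timing after the pool exists, so pool setup and teardown are excluded
    long start = System.nanoTime()
    List result = data.collectParallel {
        (1..100).sum { i -> i ^ it }   // ^ is bitwise XOR in Groovy
    }
    ms += (System.nanoTime() - start) / 1e6
}
```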
with these results:
GParsPool: 5000 in 11253 (444/s)
GParsExecutorsPool: 5000 in 13308 (376/s)
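So setting up the pool on every call definitely has some overhead, but even paying that cost, GParsPool is still faster than normal, single-threaded sequential execution, even on my little ol’ dual-core machine. GParsExecutorsPool, on the other hand, only beats sequential execution once pool setup is excluded from the timing.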
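The gap between the two runs gives a back-of-the-envelope figure for the per-call cost of building and tearing down the pool. The snippet below just does that arithmetic on the numbers above, under the loose assumption that the whole difference is pool setup rather than run-to-run noise:

```groovy
// Total ms for 5000 iterations: [timer outside withPool, timer inside withPool]
int iterations = 5000
Map<String, List<Integer>> runs = [
    GParsPool         : [12480, 11253],
    GParsExecutorsPool: [15685, 13308]
]

// Attribute the whole gap to pool setup/teardown (assumes no run-to-run noise)
Map<String, Double> setupMsPerCall = runs.collectEntries { name, times ->
    [(name): (times[0] - times[1]) / (double) iterations]
}

setupMsPerCall.each { name, cost ->
    println String.format('%s: ~%.2f ms per withPool call', name, cost)
}
// GParsPool: ~0.25 ms per withPool call
// GParsExecutorsPool: ~0.48 ms per withPool call
```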
Test: Open and Read Files
Normal: 5000 in 31726 (157/s)
GParsPool: 5000 in 24050 (207/s)
GParsExecutorsPool: 5000 in 25351 (197/s)
The results scale much as they did for the straight mathematical operation, except that here both pools beat sequential execution even with pool setup included.
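Expressed as speedups over the sequential run, the two local tests compare like this. Again, this is just arithmetic on the totals above, with pool setup included in every timing:

```groovy
// Total ms for 5000 iterations (timer outside withPool), from the results above
Map<String, Map<String, Integer>> results = [
    'Mathematical operation': [Normal: 14594, GParsPool: 12480, GParsExecutorsPool: 15685],
    'Open and read files'   : [Normal: 31726, GParsPool: 24050, GParsExecutorsPool: 25351]
]

// Speedup = sequential time / pooled time; above 1.0 means faster than Normal
Map<String, Double> speedups = [:]
results.each { test, times ->
    ['GParsPool', 'GParsExecutorsPool'].each { pool ->
        speedups["$test / $pool".toString()] = times.Normal / (double) times[pool]
    }
}

speedups.each { key, s -> println String.format('%-42s %.2fx', key, s) }
```

That comes to roughly 1.17x (GParsPool) and 0.93x (GParsExecutorsPool) for the mathematical test, and 1.32x and 1.25x for the file test, so GParsExecutorsPool is slower than sequential code in the pure-CPU test but faster in the file test.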
Test: Get small web page over local network
All three approaches produced essentially the same numbers; network latency was the deciding factor. Sorry for not having the exact metrics on this one.
Conclusion
So what does all this mean? I think it means that just using the simple GParsPool.withPool structure to iterate over a collection is perfectly fine for our needs. We could optimize a bit with different structures and a pre-existing pool, but it honestly won’t make a bit of difference in real performance given that network latency is the deciding factor for us. Your mileage may vary, especially if you are running an open server that has higher load requirements.
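If the per-call setup cost ever did matter (say, on a busier server), the “pre-existing pool” option could look something like this: a single fork/join pool created once and shared by every batch. This is a minimal plain-JDK sketch that swaps java.util.concurrent.ForkJoinPool in for GPars; `SharedPool` and `collectInParallel` are hypothetical names, not GPars API. Because `invokeAll` blocks until every task has finished, no separate latch is needed.

```groovy
import java.util.concurrent.Callable
import java.util.concurrent.ForkJoinPool

// Hypothetical holder for one pool that lives as long as the application
class SharedPool {
    // One worker per core; ForkJoinPool workers are daemon threads,
    // so the JVM can exit without an explicit shutdown
    static final ForkJoinPool POOL = new ForkJoinPool(Runtime.runtime.availableProcessors())

    // Applies work to every item on the shared pool and returns results in input order.
    // invokeAll blocks until all tasks complete, so no latch is needed.
    static List collectInParallel(List items, Closure work) {
        // Closure implements Callable; curry binds each item as the argument
        List<Callable> tasks = items.collect { item -> work.curry(item) }
        return POOL.invokeAll(tasks)*.get()
    }
}

// Each batch reuses the same pool instead of building a new one
println SharedPool.collectInParallel([1, 2, 3, 4], { n -> n * 10 })   // [10, 20, 30, 40]
println SharedPool.collectInParallel(['a', 'bb'], { s -> s.size() })  // [1, 2]
```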









