The most important piece of Unix philosophy for programs is "do one thing and do it well". If only more programs, and more libraries, followed that basic principle, and stopped bloating their code with runaway featuritis. As a real-life example of what to do, and what not to do, I take Proj4 and Geotools.
Proj4 is a C program/library that projects coordinates from one system to another. It doesn't do or include anything else -- e.g. it has no GUI and doesn't receive emails. The one thing that it does though, it does to exhaustion. It handles a laundry list of projections and conversions: indeed, the complete laundry list as far as I can tell (though I'm no expert). And it's easy to use from the command line or as an API.
Geotools, in contrast, is a bloated monster of a Java framework. It handles all things Geo. It can read shapefiles, csv files, gml files, mif and vpf (god knows what those are). It graphs, it visualizes, it projects and converts. And I venture to say it does no one thing particularly well.
I've only used it for projections, so perhaps I'm biased. You might point out that obviously Proj4 is going to be better at projections, 'cause that's all it does anyway, whereas Geotools has to do a bunch of things. I'd argue that the Unix "do one thing and do it well" is a superior, and more composable, design than the "do it all at once" alternative. I'd rather have a bunch of cohesive and exhaustive libraries that I can compose together to get the functionality I need, than a massive behemoth that I must take or leave.
In all fairness, this post is motivated more by my frustration with Geotools' weird plugin architecture than by my concerns for their code-base. I spent a considerable amount of time trying to figure out how their crazy service-registry/factory/class-loader contraption worked, and it honestly left me exhausted. They have attempted to make their massive framework modular by (wait for it) dividing their code into modules... but the way they tie it all together is absolutely horrendous. They have a home-baked auto-wiring solution, like what you'd get in Spring (but would never use 'cause they advise against it and it's bad anyway). If you drop a jar on the classpath that contains an implementation class, their service-registry/factory/class-loader contraption attempts to auto-magically suck it up and make it available. Obviously this only works under the best of circumstances, and requires a deep understanding of the contraption's inner workings if it is to be debugged or modified at all. There is a reason the Spring framework (and presumably other similar frameworks) avoid auto-wiring and have configuration files. If for no other reason than that a configuration file is more explicit: at least then you know which concrete classes are implementing your interfaces. In this case, I had to step through the initialization process with the debugger to find out what concrete classes I was getting. Ick!
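To make the contrast concrete, here is a minimal sketch of the two wiring styles. It does not use Geotools' own registry: `java.util.ServiceLoader` from the standard library stands in for classpath discovery, and the `Projector` interface, the `MercatorProjector` class and the `projector` configuration key are all hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.ServiceLoader;

public class WiringSketch {
    /** Hypothetical service interface, standing in for a projection factory. */
    public interface Projector {
        String name();
    }

    /** Hypothetical implementation of the interface above. */
    public static class MercatorProjector implements Projector {
        public String name() { return "mercator"; }
    }

    /**
     * Implicit wiring: returns the class names of whatever providers the
     * classpath happens to register under META-INF/services.
     */
    public static List<String> discovered() {
        List<String> found = new ArrayList<>();
        for (Projector p : ServiceLoader.load(Projector.class)) {
            found.add(p.getClass().getName());
        }
        return found;
    }

    /** Explicit wiring: the configuration names the concrete class. */
    public static Projector fromConfig(Properties config) {
        String className = config.getProperty("projector");
        try {
            return (Projector) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("projector", MercatorProjector.class.getName());
        System.out.println("configured: " + fromConfig(config).getClass().getName());
        // Empty unless some jar on the classpath registers a provider.
        System.out.println("discovered: " + discovered());
    }
}
```

With the explicit version, the concrete class can be read straight off the configuration. With the discovery version, the answer depends on which jars are on the classpath, so printing the result of `discovered()` at startup is one cheap alternative to stepping through initialization in a debugger.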
Of course, I'm sure they had some mangled reason to do what they did. It's just frustrating coming in from the outside and not seeing the benefits of this madness, whatever they are--or if they are.
Friday, August 8, 2008
C version
The C version of my Corpscon clone runs in about 3.9 seconds (for a 13 MB shape file), versus the Java version which takes between 6 and 10 seconds. I'm surprised that I get such a performance boost; I'd thought that Java performance would be closer to the C performance. Of course, a runtime that is roughly 50% (or even 150%) longer may not seem very large--and it's indisputable that both implementations are fast enough, but I've read so many benchmarks that put Java right up there with C and C++. I guess that's only for certain things, and not for others. And that's only when you turn on the -server flag, and let the JVM eat your memory.
What's also interesting is that the lowest Java runtime, the 6 seconds, is based on an NIO implementation of my program which was far more complicated than the corresponding C implementation. Whereas C does all its buffering/file-magic behind the scenes, and lets you fetch one char at a time penalty free, Java demands that you explicitly buffer everything, and that you call upon the complex NIO library to get C-level performance. And you don't even get it! NIO, you lied to me.
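For reference, the closest non-NIO analogue to C's one-char-at-a-time reads is probably a `BufferedInputStream` wrapped around the raw stream. The sketch below is not the Corpscon clone's code; the class name, the 64 KB buffer size and the newline-counting task are assumptions chosen only to show the explicit buffering step.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BufferedReadSketch {
    /**
     * Counts newline bytes by reading one byte at a time. The explicit
     * BufferedInputStream means most read() calls are served from memory
     * instead of going to the underlying stream each time.
     */
    public static long countNewlines(InputStream raw) {
        long count = 0;
        // 64 KB is an arbitrary buffer size picked for this sketch.
        try (InputStream in = new BufferedInputStream(raw, 64 * 1024)) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Reads the file named on the command line, or a small in-memory sample.
        InputStream source = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream("a\nb\nc\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(countNewlines(source));
    }
}
```

Whether this gets closer to the C timing than the NIO version is not something the sketch claims; it only shows where the buffering that C hides has to be spelled out in Java.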
I'm certainly not an IO expert: I don't know the secrets of cache and optimal buffering. But in C I don't have to be... it appears to be done for me, by the buffering in the C standard I/O library as far as I can tell. I'm sure I could extract even more performance if I knew what flags to set on GCC. I need to find a book on how caching works, and how I can utilize it to improve performance.
Another thing that might have influenced the C vs Java benchmark is that in C, I can use floats that are supported natively by the hardware, because C (as far as I know) does not guarantee consistent float behavior, unlike Java which does. Because C lets the hardware take care of the float stuff, it gets the best possible performance, whereas Java has to do software-level work to ensure consistency, and therefore loses performance. I need to get a profiler or something, to better understand the bottlenecks in my programs.
