A Java Ripper for the Google Programming Contest

I found myself interested in the Google Programming Contest. The supplied code for working with pre-parsed repositories (the ``ripper'') is written in C++, and I thought I'd be more comfortable working with Java code. So, I converted the code to Java. In case anyone else wants to play with it, I've made the code freely available.

I believe this code is covered by Google's license on the ripper source. I've included the same license as in the original.

Porting to Java simplified the ripper but slowed it down, at least partly because I decided to copy parsed bytes into Strings, which entails a lot of extra memory allocation. On my home system (750MHz Pentium III running RedHat Linux 6.2), running the URL extractor (--caturl) on the first 1000 documents (--stop_after 1000) from the 57MB sample repository took 1.5 seconds of user time with the C++ ripper and 3.3 seconds with the Java ripper.
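To illustrate where the extra allocation comes from, here is a minimal sketch. It is not taken from the ripper source: the class and method names (FieldViews, copyField, viewField) and the sample data are hypothetical. It only contrasts copying a field's bytes into a String with handing back a view that shares the underlying buffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; none of these names come from the ripper.
class FieldViews {
    // Copying approach: allocates a byte[] and a String for every field.
    static String copyField(ByteBuffer data, int offset, int length) {
        byte[] bytes = new byte[length];
        ByteBuffer view = data.duplicate(); // own position/limit, shared content
        view.position(offset);
        view.get(bytes, 0, length);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // View approach: no bytes are copied; the result shares data's storage.
    static ByteBuffer viewField(ByteBuffer data, int offset, int length) {
        ByteBuffer view = data.duplicate();
        view.limit(offset + length);
        view.position(offset);
        return view.slice();
    }

    public static void main(String[] args) {
        String url = "http://example.com/";
        ByteBuffer data = ByteBuffer.wrap(
            (url + " trailing bytes").getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(copyField(data, 0, url.length()));
        System.out.println(viewField(data, 0, url.length()).remaining());
    }
}
```

The view still allocates a small buffer object per field, but it avoids copying the bytes themselves. The trade-off is that callers then work with buffers rather than Strings, which is less convenient.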

I don't think this code can be used in a submitted entry, because it's based on java.nio, which was introduced in JDK 1.4. A note in google.public.programming-contest says that entries should run with 1.3 and seems to prohibit ``cutting-edge APIs that aren't present or functional in an older JDK.'' That appears to rule out nio for submitted entries, which is a shame, because nio is a really good fit for the structure of the ripper.
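As a rough sketch of the style of file access nio allows, the example below maps a file read-only and scans it byte by byte. It is not the ripper's code, and the repository format is not described here, so counting newline bytes is only a stand-in for real parsing. The names MappedScan and countLines are hypothetical. The try-with-resources syntax needs a newer JDK than 1.4; the FileChannel calls themselves date from the original nio API.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical illustration of scanning a memory-mapped file.
class MappedScan {
    // Counts newline bytes in the file without reading it into a byte[].
    static int countLines(File file) throws IOException {
        try (FileInputStream in = new FileInputStream(file);
             FileChannel channel = in.getChannel()) {
            ByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            int lines = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    lines++;
                }
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MappedScan <file>");
            return;
        }
        System.out.println(countLines(new File(args[0])));
    }
}
```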

Instructions:

Note: I fixed two bugs (long terms and, thanks to Dave Wilson, handling of EOF) and changed the posted version as of Wednesday, February 13, at 10am.

No warranty, no promises.


Paul Haahr