In short, WebSPHINX is a Java class library and interactive development environment for web crawlers.
Homepage: http://www-2.cs.cmu.edu/~rcm/websphinx/