FuzzyCat Tangles with Some Hairy HTML

10 February 2009
This entry is part 4 of 6 in the series Building FuzzyCat

I have an emerging mental image of FuzzyCat as a street-smart cat: not well-heeled at all, but a lean critter, adept at travelling both the high roads and low roads of the web, yet still the kind of cat you’d like to pull out of the rain, give a plate of milk, and let curl up on your lap near the fire, warm and fuzzy.

In my previous post in the FuzzyCat series, I revealed the FuzzyCat secret: the shortcut it uses to search libraries for books, which lets it include most libraries. In short, FuzzyCat screen-scrapes library catalogues (online public access catalogues, or OPACs), relying on standard OPAC design patterns. For example, nearly every OPAC offers a simple keyword search using a text input box and a search button. The prototype already does this for a small number of libraries, and will cover more soon.
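To make the "text box plus search button" pattern concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. It is an illustration under assumptions, not FuzzyCat's actual code: the names `SearchFormFinder` and `find_keyword_form` and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class SearchFormFinder(HTMLParser):
    """Record, for each form, its action, named text inputs, and submit buttons."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> seen so far
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = {"action": a.get("action") or "", "text": [], "submit": []}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            # An input with no type attribute behaves as a text input.
            kind = (a.get("type") or "text").lower()
            if kind == "text" and a.get("name"):
                self._current["text"].append(a["name"])
            elif kind in ("submit", "image"):
                self._current["submit"].append(a.get("name") or "")

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def find_keyword_form(html):
    """Return (action, text_field_name) for the first form with a text input
    and a submit button, or None if no form matches."""
    finder = SearchFormFinder()
    finder.feed(html)
    for form in finder.forms:
        if form["text"] and form["submit"]:
            return form["action"], form["text"][0]
    return None


# Hypothetical page: a login form followed by a simple search form.
SAMPLE = """
<form action="/login"><input type="password" name="pin"><input type="submit"></form>
<form action="/search"><input type="text" name="q"><input type="submit" value="Search"></form>
"""
result = find_keyword_form(SAMPLE)
```

With the sample page above, `result` is `("/search", "q")`: the login form is skipped because it has no text input. A real catalogue page would need more care (buttons built with `<button>` elements, several text boxes, and so on), which this sketch ignores.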

Not every OPAC is friendly territory. The OPAC of the Vermont Department of Libraries uses an array of buttons for different search types, rather than the more standard drop-down or other HTML select element. Each button is associated with a hidden input. Every input has the same name but a different value. When the form is submitted, the value that is sent, and therefore the search type, depends on which button was clicked. The default search type is an author search rather than a keyword search. This OPAC was designed for humans, and it works well enough for them, but it is a hairy tangle for a screen scraper like FuzzyCat.

The challenge here is not to figure out how the OPAC works; that’s obvious from inspecting the HTML. I could add a rule that detects this particular OPAC and processes it accordingly, but that would just be reinventing OpenURL technology. FuzzyCat is intended as a generic adapter that will work for multiple libraries. In this case, the solution is as follows:

  • Check if there is a set of hidden inputs with the same name. The name will differ from one OPAC to the next, but if multiple instances exist with the same name, FuzzyCat knows it is dealing with a set of buttons being used like a select element. (If a set does not exist, no further rules are needed.)
  • The ultimate purpose of the set of inputs does not really matter. It could indicate a search type, or it could indicate some other search subset. It could differ between OPACs. In any case, it is an option available to the human user, so it is a valid part of the search, and FuzzyCat includes it.
  • The value of each button does not really matter either. In all probability, the first button in the set is the preferred value so FuzzyCat uses it as the default. If no results are found using the default, the other values are tested in sequence.
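The three rules above can be sketched in a few lines of Python. This is a hedged illustration rather than FuzzyCat's own implementation: the function names, the `run_search` callback, and the sample HTML (including the field name `searchtype`) are all assumptions made up for the example, and only the first repeated set of inputs is considered.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Group the distinct values of hidden inputs by name, in document order."""

    def __init__(self):
        super().__init__()
        self.groups = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and (a.get("type") or "").lower() == "hidden" and a.get("name"):
            values = self.groups.setdefault(a["name"], [])
            value = a.get("value") or ""
            if value not in values:  # same name AND a different value
                values.append(value)


def option_sets(html):
    """Return {name: [values]} for hidden inputs that share a name but differ in value."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return {name: vals for name, vals in collector.groups.items() if len(vals) > 1}


def search_with_fallback(html, keyword_field, query, run_search):
    """Use the first value of the set as the default, then try the others in order.

    run_search is a caller-supplied (hypothetical) function that takes a dict
    of form parameters and returns a list of hits.
    """
    sets = option_sets(html)
    if not sets:
        # No button-style option set, so no further rules are needed.
        return run_search({keyword_field: query})
    name, values = next(iter(sets.items()))  # simplification: first set only
    for value in values:
        hits = run_search({keyword_field: query, name: value})
        if hits:
            return hits
    return []


# Hypothetical form in the style described above: one hidden input per button.
SAMPLE = """
<form action="/search">
  <input type="text" name="term">
  <input type="hidden" name="searchtype" value="author"><input type="button" value="Author">
  <input type="hidden" name="searchtype" value="keyword"><input type="button" value="Keyword">
  <input type="hidden" name="session" value="abc123">
</form>
"""

tried = []


def fake_search(params):
    """Stand-in for a real HTTP request: only the keyword search finds anything."""
    tried.append(params["searchtype"])
    return ["Moby Dick"] if params["searchtype"] == "keyword" else []


hits = search_with_fallback(SAMPLE, "term", "moby dick", fake_search)
```

Here the author search is tried first because it comes first in the page, it finds nothing, and the keyword search then returns the hit. The lone `session` input is ignored because its name is not repeated. Note that the sketch never asks what `searchtype` means, which is the point of the second rule.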

FuzzyCat is a test of an idea: that OPACs conform to a relatively small number of design patterns, and so most OPACs can be searched with a relatively simple algorithm. In Blink, Malcolm Gladwell describes “thin slicing”, the ability to make a complex decision quickly based on a little bit of key knowledge. Yeah, that’s what FuzzyCat does. Each time FuzzyCat scales a wall like this, I gain confidence that the idea will work. I will not trouble you with the details of each exception; for that, you can look at the source code.

