[OPLINTECH] Filters

Nathan Eady eady at galion.lib.oh.us
Thu Nov 6 13:55:04 EST 2008


"Sandy" <hartsesa at oplin.org> writes:

>    Cons?

Know this:  things *will* get blocked that shouldn't.

>    Accuracy?

Forget accuracy.  The problem is fundamentally unsolvable: first,
because content filtering is AI-complete; second, because new
information appears on the internet *FAR* faster than any filtering
company could even retrieve and index it, much less evaluate it in any
meaningful way; and third, because the requirements are generally not
very well defined.

Accuracy in content filtering is a pipe dream.  It won't be accurate.
This bears repeating: it won't be accurate.  It won't necessarily be
more accurate if it costs more, either.  (A solution that actually
*worked*, correctly, would be worth the entire US GDP, because it
would solve a lot of other problems as well, not least by eliminating
all language barriers.)  None of the available solutions are going to
be accurate.  Some will tend to err more on the side of overblocking,
or more on the side of underblocking, but realistically you're going
to see both kinds of errors.

What you want to concentrate on is the failure properties, and, in
particular, what happens when something is blocked that shouldn't be
and how much hassle that creates.  This, as far as I'm concerned, is
THE criterion you should use to judge the desirability of possible
candidate solutions.  

If there's a second-most-important criterion (other than price), it's
probably how hard it is to "fix" a discovered instance of
underblocking (i.e., quickly blacklist a site that should have been
blocked and wasn't).
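To make that second criterion concrete, here is a minimal sketch of
what "quickly blacklist a site" ought to look like from the admin's
side: one operation that takes effect immediately.  All the names and
the blocklist contents below are hypothetical; a real filtering
product exposes this through its own admin interface.

```python
# Minimal sketch of a domain blocklist with a quick "add" path.
# Hypothetical illustration, not any vendor's actual API.

from urllib.parse import urlparse

blocklist = {"example-bad-site.test"}

def is_blocked(url):
    """Return True if the URL's host, or any parent domain, is listed."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Check the host and every parent domain, e.g. a.b.c -> a.b.c, b.c, c
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))

def block_site(domain):
    """The 'fix underblocking' operation: one call, effective at once."""
    blocklist.add(domain)
```

The point of judging products on this axis is how many steps (and how
much vendor turnaround time) sit between "a patron found something
that should have been blocked" and the equivalent of that one-line
`block_site` call.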

An important gotcha that you want to watch out for, that has created a
lot of annoyance here, is the question of accessing a site by IP
address (rather than domain name).  On the one hand, if you allow this
universally for all sites, then there's not much point in implementing
the filtering, because the average bored third-grader will need about
a minute and a half to figure out how to get around the filter.  On
the other hand, if you disallow it universally for all sites, a lot of
stuff breaks that you would never expect, not because the sites don't
have a domain name, but because they don't assign a subdomain for
every auxiliary server.  Things like images, webmail attachments,
login, and so on are frequently hosted on an IP-address-only server,
even for fairly major sites (including Yahoo, Google, and MSN).  The
filtering solution we are using does everything based on domain name
and with regard to IP address gives us the choice between these two
extremes: we can block everything on the whole internet from being
accessed by IP address, or else not block anything from being accessed
by IP address.  This is, to my way of thinking, a fairly serious flaw,
and is ultimately responsible for the overwhelming majority of our
filtering-related patron complaints.
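The all-or-nothing IP policy described above can be sketched as
follows.  This is hypothetical logic written to illustrate the flaw,
not the actual product's code: because the filter only understands
domain names, a bare-IP request leaves it with a single global switch
and no per-site decision.

```python
# Sketch of a domain-only filter with an all-or-nothing bare-IP rule.
# Hypothetical illustration; domain names and the policy knob are invented.

import ipaddress
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"blocked.example"}
BLOCK_ALL_BARE_IPS = True   # the only knob: one setting for the whole internet

def is_bare_ip(host):
    """True if the host portion is a literal IP address, not a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def allowed(url):
    host = urlparse(url).hostname or ""
    if is_bare_ip(host):
        # No per-site decision is possible here: either every bare-IP
        # request is blocked (breaking the nameless image/login/attachment
        # servers mentioned above), or none are (a trivial filter bypass).
        return not BLOCK_ALL_BARE_IPS
    return host not in BLOCKED_DOMAINS
```

With the knob set to block, a request like
`http://203.0.113.7/image.png` fails even when the server belongs to a
perfectly legitimate site; set it the other way and the bored
third-grader wins.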

>    Has anyone been able to "get around" the filters?

Fundamentally this will always be possible in one way or another.  I
haven't seen anyone do it here (our setup is based on a transparent
http proxy), but on the other hand we weren't having a significant
problem with people displaying inappropriate content *before* we
implemented the filtering, either.  We implemented it mainly because
the director we had at the time wanted to apply for E-Rate grants.

-- 
Nathan Eady
Galion Public Library


