Evaluating Web Filters: A Practical Approach
Eyas S. Al-HAJERY
<alhajery@kacst.edu.sa>
Badr Al BADR <badr@kacst.edu.sa>
King Abdulaziz City for Science and Technology
Saudi Arabia
Abstract
The Internet is becoming a significant source of all
types of information to all people. This has made Internet
censorship a major and controversial issue. While many
people believe that the use of content filtering products is
against free speech, there are others, especially parents
and librarians, who are concerned about the negative effects
of Internet pornography on minors. Many libraries and
schools are mandating filtered access to the Internet. In
this paper, we evaluate the performance of six major
Internet content filtering products: SmartFilter, WebSense,
CyberPatrol, SurfWatch, N2H2, and I-Gear. The performance
evaluation is based on a log of several tens of thousands of
uniform resource locators (URLs), which has been collected
from an Internet service provider (ISP). This ISP provides
unfiltered Internet access to several thousands of
customers. Several performance measures have been
investigated to compare the performance of these products.
These measures include the blocking rate of pornographic
materials and the false alarm rate (the blocking rate of
nonpornographic material). Furthermore, the method we
propose to evaluate the filters has the added advantage of
being practical.
One of the biggest complaints people have about the
Internet concerns the proliferation of pornography. To guard
minors and conservative communities from pornography, many
products appeared in the market with the goal of filtering
Internet access and hence restricting access to pornographic
sites. Another important application for Internet filtering
is resource management, where an organization wishes to
ensure that its Internet connection is properly used for
legitimate business activities during office hours, so
nonbusiness sites are blocked.
The problem is defined as follows: assuming that an
organization has its own definition of what is suitable and
what is not, the organization must find the filter that best
satisfies its needs by allowing access to the maximum number
of sites it deems suitable while at the same time blocking
the maximum number of sites it deems unsuitable.
This work attempts to measure the effectiveness of
several commercial filters in blocking access to
pornography. Filter effectiveness is measured as minimizing
the blocking of suitable sites and maximizing the blocking
of unsuitable sites. It is important to note that this work
focuses on filtering Web-based traffic (which is in fact the
vast majority of Internet traffic).
It is believed that the results of this work would
benefit organizations (e.g., schools, public libraries)
wanting to deploy pornography filtering software. The
contribution of this work is not only in assessing filter
effectiveness but also in outlining a practical procedure by
which to test filtering software. The procedure can be
applied to new filters, for different filtering objectives,
or with different suitability standards in mind.
Usually, to assess a detection task the decisions of the
detector are compared to the actual identities of the
objects to detect. In our case, the objects to detect are
URLs of pornographic Web objects. This means that a set of
Web pages is needed, with each page being prelabeled as
pornographic or not. Hence a major contribution of this work
was in evaluating the performance of the filters without
having to manually label all URLs in the test set as
pornographic or not, by using voting among filters and
taking unanimous decisions as absolute labels. The saving
in the labeling step was on the order of 67%, as we had to
manually label only one third of the URLs.
The work relies on the principles of statistical decision
theory in evaluating the filters. The major components of
this work are the following:
- Modeling user requests by finding a set of URLs that
mimics the requests of the target user community. This
constitutes the test set on which each filter is
tested.
- Automatically running all URLs through each filter and
checking whether or not the filter blocks the URL.
- Labeling each URL in the test set as pornographic or
not, based on the unanimous decisions of all filters, and
when the filters disagree then by manual checking.
- Statistically analyzing the results for each filter
after counting the number of misdetections (pornographic
sites that were not blocked) and the number of false
alarms (nonpornographic sites that were blocked) and
assessing the operation
Filtering software blocks content in two primary ways:
blocking by URL and blocking by the content of retrieved
pages.
- URL-based blocking: Using this method, the
filtering software employs a "black list" of unwanted
URLs. The list is normally classified into different
categories (e.g., sex, drugs, cults, gambling). A user is
given the ability to choose the categories that he or she
wants to block. Also, most of the address-based blocking
software provides the capability to augment the black list
with additional URLs the user wishes to block.
Furthermore, users can exempt URLs from the black list.
The list should be updated periodically to include new
URLs and remove inactive URLs.
As an alternative approach to the black list, some
filtering software uses a "white list." The user is
permitted to access only URLs that are included in the
white list. This is intended mainly for school students or
closed communities.
- Content-based blocking: The filtering software
analyzes the content of the retrieved pages to check for
unwanted patterns. The simple method of this type of
blocking is to block by words. The filter will block
retrieved content if it encounters a word that matches its
list of banned words. More sophisticated software will
employ some artificial intelligence algorithms to analyze
the retrieved content.
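As a rough illustration of URL-based blocking with
categories, user additions, and exemptions, a minimal Python
sketch follows. The host names, the `BLACK_LIST` mapping, and
the `is_blocked` helper are assumptions made for
illustration; they are not taken from any of the products
discussed here.

```python
from urllib.parse import urlsplit

# Hypothetical black list mapping host names to categories.
BLACK_LIST = {
    "adult.example.com": "sex",
    "casino.example.net": "gambling",
}


def is_blocked(url, blocked_categories, extra_blocked=(), exempt=()):
    """Return True if the URL's host should be blocked."""
    host = (urlsplit(url).hostname or "").lower()
    if host in exempt:          # user exemptions override the list
        return False
    if host in extra_blocked:   # user additions to the black list
        return True
    return BLACK_LIST.get(host) in blocked_categories


# Only the "sex" category is selected, so the gambling site passes.
print(is_blocked("http://adult.example.com/index.html", {"sex"}))
print(is_blocked("http://casino.example.net/", {"sex"}))
```

A white list could be sketched the same way by inverting the
final test, so that only listed hosts are retrieved.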
Content-based blocking is widely criticized for its
ineffectiveness. A block on the word "breast" might block
pages about breast cancer. Moreover, sites in different
languages are hard to detect. Newer image-based content
filters are emerging, but they have yet to gain widespread
acceptance. Address-based blocking is preferred since it is
less prone to errors. However, it is more expensive because
of the overhead incurred in the frequent updating of the
black list.
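The weakness of word-based blocking can be seen in a small
Python sketch. The banned-word list and the
`blocked_by_words` helper are hypothetical and far simpler
than a commercial filter; the point is only to show how a
false alarm and a misdetection can arise.

```python
import re

BANNED_WORDS = {"breast"}  # illustrative list, not any vendor's


def blocked_by_words(page_text, banned=BANNED_WORDS):
    """Block the page if any banned word appears in its text."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return not banned.isdisjoint(words)


# A health page is blocked: a false alarm.
health_page = "Early screening improves breast cancer survival rates."
print(blocked_by_words(health_page))

# Text in another language slips past an English word list.
print(blocked_by_words("Contenido solo para adultos"))
```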
Filtering software can also be classified based on its
location within the network into two classes: client
based and server based. In client-based filtering
software, the filter resides at the client side. It
interacts with browsers installed on the client machine to
employ filtering functionality while a person surfs the
Internet. Because it is installed on the client machine, it
is considered to be voluntary. The client can choose to
uninstall it.
In server-based filters, on the other hand, the filter is
installed on a server within a network. It is managed by the
network administrator; therefore, filtering can be forced
upon all network clients. These filters are widely used in
corporations and large organizations. The filter can be a
plug-in to a known Web proxy (e.g., Netscape, Microsoft, or
Apache) or a stand-alone proxy.
In this paper, the performance of six filtering products
was evaluated. All selected filters use the black list
technique. Furthermore, all filters except N2H2 are server
based. At the time of the experiment, N2H2 Inc. did not
distribute its software but provided filtering solutions for
ISPs.
Table 1 shows the filtering products, vendor names, and
product versions.
Table 1. Filtering products evaluated
| Filter Name | Vendor Name | Version |
| SmartFilter | SecureComputing Co. | SmartFilter for Netscape Proxy |
| SurfWatch | SurfWatch Software Inc. | Professional Edition |
| WebSense | WebSense Inc. | 3.01 |
| I-Gear | Symantec Co. | I-Gear for Solaris |
| CyberPatrol | The Learning Company | 2.10 |
| N2H2 | N2H2 Inc. | N2H2 for ISPs |
The first step in testing the filters was to construct a
sufficient and representative test set of Web pages or URLs
that adequately mimics the target user population. The
target user population in this case is assumed to be the
casual home user accessing the Internet through a dial-up
connection to a public ISP. For that purpose, the test set was chosen
to be a large set of 54,681 page requests (URLs) from actual
users, registered in the proxy log of an ISP. The data was
collected during a 24-hour period during the summer of 1998.
At that particular ISP, Internet access was provided
through a proxy that cached frequently requested pages.
However, the proxy did not block access to any sites. As a
side effect of using the proxy, a log was automatically
produced that specified for each user request the
destination URL (the address of the page that was requested)
among lots of other detailed information. The URLs were
collected from the proxy log and were used after removing
all source IP addresses (the IP of the requestor of the URL)
and other unnecessary data. (It should be noted that a URL
addresses a Web object and not a whole page, so pages with
multiple objects such as images would have multiple URLs in
the test set.)
Two data sets were prepared: (1) the original set
with all URLs and (2) the distinct set, which is the
data set after removing all duplicate URLs (URLs that were
requested more than once during the data collection period)
and keeping only distinct URLs. The size of this set was
40,100, meaning that more than 25% of the log is duplicated.
Note that the second test set is a proper subset of the
first and hence does not require any extra effort in
performing the experiment. Only the analysis stage is
affected. Error statistics on the original set are more
indicative of the user populations because misdetections or
false alarms in a URL requested multiple times will be
reflected in the final error rate. On the contrary, the
distinct set lists each URL only once, and hence an error
will be counted only once.
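Deriving the distinct set from the original log can be
sketched in a few lines of Python. The toy log and the
`distinct_urls` helper are assumptions for illustration; the
two set sizes at the end are the ones reported above.

```python
def distinct_urls(log_urls):
    """Keep the first occurrence of each URL, preserving log order."""
    return list(dict.fromkeys(log_urls))


# Toy log: the first URL was requested twice.
log = ["http://a.example/", "http://b.example/x.gif", "http://a.example/"]
distinct = distinct_urls(log)
print(len(log), len(distinct))

# Share of the real log made up of repeated requests.
original_size, distinct_size = 54_681, 40_100
duplicate_share = 1 - distinct_size / original_size
print(f"{duplicate_share:.1%}")
```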
First we will describe the experiments performed on the
original set and analyze the results. In the next sections,
we will address the distinct set.
In this step all URLs in the data set were run through
each of the filters to determine each filter's particular
decision about each URL.
The test machine was a Sun Ultra 10 that runs Solaris
2.6h. Netscape proxy server version 3.5 was installed on the
test machine. The experiment was done between December 1998
and January 1999.
We can describe the steps of the experiment as follows:
- Trial versions of I-Gear, WebSense, SmartFilter,
SurfWatch, and CyberPatrol were installed on the test
machine. All except I-Gear are plug-ins to the Netscape
proxy server. Since N2H2 is a service rather than
distributed software we were not able to get a copy of the
software. However, a dedicated server was set up for our
experiment at N2H2 Inc.
- All filters were configured to block only sex-related
categories.
- A script was written to run each filtering product
through the whole data set to determine the set of blocked
URLs within the test set and the remaining URLs (i.e., the
set of retrieved URLs).
- For each URL, an indication of whether it was blocked
or retrieved by each particular filter was entered into a
database.
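The last two steps can be sketched as follows. This is not
the script used in the experiment: the stand-in filter
functions, the `record_decisions` helper, and the table
layout are assumptions. In a real run, each filter function
would request the URL through the filtering proxy and report
whether the request was refused.

```python
import sqlite3


def record_decisions(urls, filters, conn):
    """Store one row per (URL, filter) pair with the block decision."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS decisions "
        "(url TEXT, filter TEXT, blocked INTEGER)"
    )
    for url in urls:
        for name, is_blocked in filters.items():
            conn.execute(
                "INSERT INTO decisions VALUES (?, ?, ?)",
                (url, name, int(is_blocked(url))),
            )
    conn.commit()


# Stand-in filters (hypothetical); real ones would query a proxy.
filters = {
    "FilterA": lambda url: "adult" in url,
    "FilterB": lambda url: ".xxx" in url,
}
conn = sqlite3.connect(":memory:")
record_decisions(["http://adult.example/", "http://news.example/"],
                 filters, conn)

# Number of URLs blocked by each filter.
query = ("SELECT filter, SUM(blocked) FROM decisions "
         "GROUP BY filter ORDER BY filter")
for row in conn.execute(query):
    print(row)
```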
The result of running all the filters on the data set is
summarized in the following table. The total number of URLs
tested was 54,681. Table 2 shows the number of URLs blocked
by each filtering product and the percentage of total URLs
that each product blocked. As can be seen from the table,
the filters agree to a certain extent on the number of URLs
that are blocked from among the total test set.
Table 2. Number of URLs blocked by each filter
| Filtering Product | Number of URLs Blocked | % of Total |
| SmartFilter | 22,642 | 41% |
| SurfWatch | 24,917 | 46% |
| WebSense | 22,901 | 42% |
| I-Gear | 18,171 | 33% |
| CyberPatrol | 23,578 | 43% |
| N2H2 | 23,161 | 42% |
As can be seen from the table, the number of blocked URLs
falls within a small interval for all filtering products
except I-Gear.
The next step was to use the filter decisions as a basis
for labeling each URL as pornographic or not. The method
used here was conceptually simple but saved a lot of effort
practically. The method was to trust the unanimous decisions
reached by the filters. So, all URLs with unanimous "block"
decisions were considered to be pornographic, while all URLs
with unanimous "retrieve" decisions were considered to be
nonpornographic.
The remaining URLs with different decisions were labeled
manually. All URLs with different filter decisions were
manually checked and labeled to be pornographic or not based
on the usual U.S. cultural standards for pornography.
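The labeling rule can be sketched in Python as follows. The
`label_by_unanimity` helper and the example URLs are
hypothetical; each list holds one block decision per filter,
with True meaning "blocked".

```python
def label_by_unanimity(decisions):
    """Label URLs on which all filters agree; collect the rest."""
    labels, needs_manual_check = {}, []
    for url, votes in decisions.items():
        if all(votes):
            labels[url] = "pornographic"
        elif not any(votes):
            labels[url] = "nonpornographic"
        else:
            needs_manual_check.append(url)
    return labels, needs_manual_check


decisions = {
    "http://a.example/": [True] * 6,
    "http://b.example/": [False] * 6,
    "http://c.example/": [True, True, False, True, True, True],
}
labels, needs_manual_check = label_by_unanimity(decisions)
print(labels)
print(needs_manual_check)
```

Only the URLs returned in the second list need to be
inspected by a person.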
Table 3 summarizes the labels. The first two rows show
the number of URLs unanimously blocked and retrieved,
respectively; the third row shows the URLs with different
decisions, which were manually labeled. The final two rows
show the summary of labels of all URLs in the database,
where the pornographic row includes sites unanimously agreed
to by the filters plus the URLs deemed pornographic from the
manual check from among the URLs with different filter
decisions.
Table 3. Summary of URL labels for the original set
| | Number of URLs | % of Total |
| Unanimously blocked | 12,072 | 22% |
| Unanimously retrieved | 24,027 | 44% |
| Different decisions | 18,582 | 34% |
| Pornographic | 23,955 | 44% |
| Nonpornographic | 30,726 | 56% |
From the table we see that two thirds of the URLs were
labeled by unanimous filter decision, while one third had to
be labeled manually, since the filters agreed on most of the
URLs.
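The labeling saving follows directly from the counts in
Table 3. The short Python check below uses only those
counts; the variable names are ours.

```python
# Counts taken from Table 3 (original set).
unanimously_blocked = 12_072
unanimously_retrieved = 24_027
different_decisions = 18_582

total = unanimously_blocked + unanimously_retrieved + different_decisions
manual_share = different_decisions / total   # labeled by hand
saving = 1 - manual_share                    # labeled by voting

print(total)
print(f"manual: {manual_share:.0%}, saved: {saving:.0%}")
```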
The last step in this evaluation is to analyze the error
rates of each filter. In any detection task, two types of
error are possible. In our case they are (1) the filter not
blocking a pornographic URL, which is called a misdetection
error, and (2) the filter blocking a nonpornographic URL,
which is called a false alarm error. The misdetection rate
for each filter was calculated as the conditional
probability that a URL was not blocked by the filter given
that it was labeled pornographic. The false alarm rate for
each filter was calculated as the conditional probability
that a URL was blocked by the filter given that it was
labeled nonpornographic.
(Probability of misdetection by filter) =
(Probability of pornographic URL but not blocked by filter)
/ (Probability of pornographic URL)
(Probability of false alarm by filter) =
(Probability of nonpornographic URL but blocked by filter) /
(Probability of nonpornographic URL)
The probability of pornographic URL is calculated
as the number of pornographic URLs divided by the total
number of URLs in the set. The probability of
pornographic URL but not blocked by filter is calculated
as the number of pornographic URLs that were not blocked by
the filter divided by the total number of URLs in the set.
The other probabilities are calculated similarly.
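The two formulas can be written out as a short Python
sketch. The `error_rates` helper and the toy data are
assumptions for illustration. The function follows the
definitions above literally, computing each joint
probability over the whole set before dividing.

```python
def error_rates(is_porn, is_blocked):
    """Return (misdetection rate, false alarm rate) for one filter.

    is_porn and is_blocked map each URL to a bool. Both classes
    (pornographic and nonpornographic) must be present.
    """
    n = len(is_porn)
    p_porn = sum(is_porn.values()) / n
    p_porn_not_blocked = sum(
        is_porn[u] and not is_blocked[u] for u in is_porn) / n
    p_nonporn_blocked = sum(
        (not is_porn[u]) and is_blocked[u] for u in is_porn) / n
    return p_porn_not_blocked / p_porn, p_nonporn_blocked / (1 - p_porn)


# Toy data: 4 pornographic URLs (1 missed), 6 others (1 blocked).
is_porn = {f"u{i}": i < 4 for i in range(10)}
is_blocked = {f"u{i}": i in (0, 1, 2, 4) for i in range(10)}
miss, false_alarm = error_rates(is_porn, is_blocked)
print(f"misdetection={miss:.2f} false_alarm={false_alarm:.2f}")
```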
In calculating the error rate of each filter, we give
equal weight to each of the two types of errors. Table 4
shows the errors of each filter.
Table 4. Error analyses of filters
| Filtering Product | Misdetection | False Alarm | Error Rate |
| SmartFilter | 15% | 7% | 11% |
| SurfWatch | 12% | 7% | 10% |
| WebSense | 17% | 9% | 13% |
| I-Gear | 36% | 10% | 23% |
| CyberPatrol | 16% | 7% | 11% |
| N2H2 | 14% | 7% | 11% |
As can be seen from the table, all products have error
rates that are close to one another, except one. The values
range from 10% to 13% for the top five filters. In this
experiment, SurfWatch turned out to be the filter with the
lowest error rate, but it was closely trailed by
SmartFilter, CyberPatrol, and N2H2. Figure 1 shows the
results graphically.
Figure 1. Error rates for the original set
Similar processing and analysis were done on the distinct
set. It is important to note that the most
resource-consuming tasks in the experiment (running the
filters on the URLs and manually labeling URLs) did not need
to be repeated for this data set.
The filter decisions were taken from the trial runs on
the original data set. The result of running all the filters
on the distinct set is summarized in the following table.
The total number of URLs tested was 40,100. Table 5 shows
the number of URLs blocked by each filtering product and the
percentage of URLs that were blocked by each filtering
product. As can be seen from the table, the filters agree to
a certain extent on the number of URLs that are blocked from
among the total test set.
Table 5. Number of URLs blocked by each filter for the distinct data set
| Filtering Product | Number of URLs Blocked | % of Total |
| SmartFilter | 17,629 | 44% |
| SurfWatch | 18,836 | 47% |
| WebSense | 18,441 | 46% |
| I-Gear | 13,483 | 34% |
| CyberPatrol | 17,849 | 45% |
| N2H2 | 18,354 | 46% |
As can be seen from the table, the number of blocked URLs
falls within a small interval for all filtering products
except I-Gear.
Table 6 summarizes the labels. The first two rows show
the number of URLs unanimously blocked and retrieved,
respectively; the third row shows the URLs with different
decisions, which were manually labeled. The final two rows
show the summary of labels of all URLs in the database.
Table 6. Summary of URL labels for the distinct data set
| | Number of URLs | % of Total |
| Unanimously blocked | 9,757 | 24% |
| Unanimously retrieved | 17,611 | 44% |
| Different decisions | 12,732 | 32% |
| Pornographic | 18,451 | 46% |
| Nonpornographic | 21,649 | 54% |
The ratios here are similar to those of the original data
set in Table 3.
Table 7 shows the errors of each filter for the distinct
data set.
Table 7. Error analysis of filters for the distinct data set
| Filtering Product | Misdetection | False Alarm | Error Rate |
| SmartFilter | 13% | 4% | 8% |
| SurfWatch | 12% | 11% | 11% |
| WebSense | 12% | 7% | 10% |
| I-Gear | 36% | 7% | 21% |
| CyberPatrol | 15% | 9% | 12% |
| N2H2 | 11% | 1% | 6% |
As can be seen from the table, the false alarm rates here
are in general lower than those for the original set. N2H2
takes the lead with the fewest false alarm errors for
distinct URLs. Figure 2 shows the results graphically.
Figure 2. Error rates for the distinct set
We presented a methodology to study and compare the
performance of Web-based filtering products that use the
black list approach. This methodology has the advantage of
eliminating about two thirds of the time-consuming work
required to prelabel all the URLs. It is based on taking the
unanimous
decisions of filters as absolute labels, meaning that any
URL blocked by all filters is considered to be pornographic.
The risk in this case is when all six filters agree in
error, which we assume to be a remote possibility (the
obvious case is when the same URL now points to different
content than it did at the time of evaluation by the filter
producer). The labeling effort could be reduced further by
accepting a majority vote among the filters instead of
insisting on unanimous decisions. This, however, could
increase the labeling errors.
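A majority-vote variant of the labeling rule might look like
the Python sketch below. It was not part of the experiment;
the `label_by_majority` helper and its `min_agree` threshold
are assumptions. The threshold should exceed half the number
of filters so that the two labels cannot both qualify.

```python
def label_by_majority(votes, min_agree):
    """votes: one bool per filter (True = blocked).

    Returns a label, or None if a manual check is still needed.
    """
    blocks = sum(votes)
    if blocks >= min_agree:
        return "pornographic"
    if len(votes) - blocks >= min_agree:
        return "nonpornographic"
    return None


five_of_six = [True, True, True, True, True, False]
split = [True, True, True, False, False, False]
print(label_by_majority(five_of_six, min_agree=5))
print(label_by_majority(split, min_agree=5))
# With min_agree equal to the number of filters, the rule
# reduces to the unanimous one.
print(label_by_majority(five_of_six, min_agree=6))
```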
Another contribution of the work is in formalizing a
performance metric for evaluating filters. The metric is
based on estimating the rates of two types of errors:
(1) the filter not blocking a pornographic URL, which is
called a misdetection error, and (2) the filter blocking a
nonpornographic URL, which is called a false alarm error.
The misdetection rate for each filter was calculated as the
conditional probability that a URL was not blocked by the
filter given that it was labeled pornographic. The false
alarm rate for each filter was calculated as the conditional
probability that a URL was blocked by the filter given that
it was labeled nonpornographic. In
our experiments we gave equal weight to both types of
errors. But it is very easy to give more weight to the
misdetection rate, for instance, favoring filters that err
on the side of blocking nonpornographic sites.
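One way to express such a weighting is a weighted mean of
the two rates, sketched below in Python. That the combined
error rate is a mean of the two rates is our assumption (it
appears consistent with the reported tables to within
rounding). The `weighted_error` helper and the 0.8 weight
are illustrative only.

```python
def weighted_error(misdetection, false_alarm, w_miss=0.5):
    """Combine the two error rates; w_miss weights misdetections."""
    return w_miss * misdetection + (1 - w_miss) * false_alarm


# SurfWatch rates on the original set: 12% and 7%.
equal = weighted_error(0.12, 0.07)
strict = weighted_error(0.12, 0.07, w_miss=0.8)
print(f"{equal:.3f} {strict:.3f}")
```

Raising `w_miss` penalizes filters that let pornographic
URLs through more heavily than filters that overblock.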
As to the results of the experiment, we can see that most
of the well-known filtering products agree on most of the
sites. For two thirds of the URLs in the test set, the
filters reached unanimous decisions either to block or to
retrieve.
Furthermore, the ratio of blocked URLs to the whole set was
close for all filters (between 41% and 46%, except for one
filter). For the error rates we can see that in general the
misdetection rates are higher than the false alarm rates.
This is to be expected given the numerous sites on the
Internet and the negative media consequences of blocking a
nonpornographic site erroneously. The error rates here too
fall within a small interval (10% to 13%), except for one
filter. When comparing the results of the error rates of the
original set with those of the distinct set, one finds that
the distinct set is associated with a smaller error rate
with some of the filters. This could be explained by saying
that the URLs that are in error for that filter are repeated
multiple times in the test set.
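The effect of repeated URLs on the error rate can be shown
with a toy example in Python. The request list and the
`false_alarm_rate` helper are hypothetical; one wrongly
blocked URL is requested three times.

```python
def false_alarm_rate(requests, nonporn, blocked):
    """Share of nonpornographic requests that the filter blocked."""
    relevant = [u for u in requests if u in nonporn]
    return sum(u in blocked for u in relevant) / len(relevant)


requests = ["a", "a", "a", "b", "c"]   # "a" is requested three times
nonporn = {"a", "b", "c"}
blocked = {"a"}                        # the filter wrongly blocks "a"

original = false_alarm_rate(requests, nonporn, blocked)
distinct = false_alarm_rate(list(dict.fromkeys(requests)),
                            nonporn, blocked)
print(f"original={original:.2f} distinct={distinct:.2f}")
```

The same single mistake weighs three times as much in the
original set as in the distinct set.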
In summary, the major filtering products are more or less
close in their error rate, hovering around the 10% range.
For parents, libraries, and schools, this matters a great
deal: roughly one in ten URL requests will be handled
incorrectly. This means
that other methods should be investigated to augment the
filters, such as image-based filters or better content-based
filters.
The authors would like to thank their colleagues who
assisted them in performing the experiment, particularly
Rayed Al-Fayez, Muhammad Al-Korbi, and Waleed Al-Oriny.