HTML filtering of user-supplied content
Scenario
Filtering HTML tags in user-entered data is an important aspect of all
web-based systems. It serves both to avoid security vulnerabilities & to give
the site administrator control over what is displayed on the site. In
all WAF-based applications I’ve worked on we’ve relied on the fact that
XSL transformers will automatically escape HTML tags unless you specifically
set the ‘disable-output-escaping’ attribute on the <xsl:value-of> tag.
While this has the virtue of being simple & very safe by default, its
crude on/off action is increasingly becoming a source of problems,
particularly with CMS content items.
For an idea of how it’s hurting, consider the following situation:
- The Agenda content type has a number of text attributes:
  - Location
  - Attendees
  - Subject
  - Contact
  - Summary
  - Body text
  Typically, the XSL only allows tags in the ‘body text’ attribute
  to be rendered – everything else is escaped.
- When entering some non-Western-European languages, text reads from
  right-to-left instead of left-to-right. To achieve this in HTML
  you need to add the dir="rtl" attribute to your tags. If a particular field
  in a stylesheet only ever holds a single language, then this can
  be done in the XSL stylesheet; however, there are occasions when a
  single block of text has multiple languages interspersed. In this
  case, the user must enter <span dir="rtl"> tags themselves.
The combination of these two points creates a problem, because we only
want to allow HTML in certain fields, but we need to enter tags
in any field to set the text direction.
The only way out of this is to change the XSLT so that all fields
allow any HTML tag to be rendered, which in turn implies we need to
filter HTML tags in user-entered data.
Use cases
Before considering how to filter HTML, let’s enumerate a few use cases:
- Allow the <span> tag with the dir="rtl" attribute
- Allow any block or inline tag with the dir="rtl" attribute
- Allow any tag, but no onXXX event attributes
- Allow any tag in the HTML-4.0 Strict DTD.
- Disallow any <font> and <blink> tags.
- Allow any inline markup.
- Disallow tables.
There is also the question of what you do to the text when encountering
a forbidden tag. There are two plausible actions:
- Strip out the tag, leaving the content
- Strip out the tag, including the content
The former is applicable in situations where you know the content of
the tag is safe: eg, when stripping <font>…some text…</font> tags, you
ought to let ‘…some text…’ pass through. The latter is applicable
when stripping something like an <applet> tag.
Algorithm design
So, a reasonable approach to filtering HTML would go something like this (a rough Java sketch follows the list):
1) Build up a rule set:
   - Set a default ‘tag’ policy – one of:
     - allow – don’t touch tag
     - deny – remove tag & content
     - filter – remove tag, leaving content
   - Set a default ‘attribute’ policy – one of:
     - allow
     - deny
   - Create a hash mapping ‘tag’ to ‘policy’ for all tags with a
     non-default policy
   - For each tag, create a hash mapping ‘attribute’ to ‘policy’ for
     all attributes with a non-default policy
2) Tokenise the HTML document, building up a parse tree, matching
   opening & closing tags. Also fill in any closing tags that were
   omitted, eg typically </p>, </li>
3) Traverse the parse tree. When encountering a tag, apply the rules:
   - if the tag is allowed
     - filter out any attributes which are denied
     - output the opening tag
     - process sub-tags (if any)
     - output the closing tag (if any)
   - if the tag is denied
     - skip the opening tag
     - skip sub-tags (if any)
     - skip the closing tag
   - if the tag is filtered
     - skip the opening tag
     - process sub-tags (if any)
     - skip the closing tag (if any)
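To make this concrete, here is a rough Java sketch of steps 1) and 3) – the rule set and the traversal. It assumes step 2) is delegated to a real parser; the Node class is just a stand-in for whatever tree that parser produces, and all the names are invented for illustration:

    import java.util.*;

    public class HtmlFilter {
        public enum Policy { ALLOW, DENY, FILTER }

        // Step 1: the rule set - default policies plus per-tag and
        // per-attribute overrides, held in hashes as described above
        private final Policy defaultTagPolicy;
        private final boolean defaultAttrAllowed;
        private final Map<String, Policy> tagPolicy = new HashMap<>();
        private final Map<String, Map<String, Boolean>> attrPolicy = new HashMap<>();

        public HtmlFilter(Policy defaultTagPolicy, boolean defaultAttrAllowed) {
            this.defaultTagPolicy = defaultTagPolicy;
            this.defaultAttrAllowed = defaultAttrAllowed;
        }

        public void setTagPolicy(String tag, Policy policy) {
            tagPolicy.put(tag, policy);
        }

        public void setAttrPolicy(String tag, String attr, boolean allowed) {
            attrPolicy.computeIfAbsent(tag, k -> new HashMap<>()).put(attr, allowed);
        }

        // Stand-in for the parse tree a real parser (step 2) would hand us:
        // a node is either a text node, or an element with attributes & children.
        public static class Node {
            final String tag;   // null for text nodes
            final String text;  // null for elements
            final Map<String, String> attrs = new LinkedHashMap<>();
            final List<Node> children = new ArrayList<>();

            Node(String tag, String text) { this.tag = tag; this.text = text; }
            static Node elem(String tag) { return new Node(tag, null); }
            static Node text(String text) { return new Node(null, text); }
        }

        // Step 3: traverse the tree, applying the policies to each element.
        public String filter(Node document) {
            StringBuilder out = new StringBuilder();
            for (Node child : document.children)
                emit(child, out);
            return out.toString();
        }

        private void emit(Node node, StringBuilder out) {
            if (node.tag == null) {        // text node - a real filter would
                out.append(node.text);     // also re-escape &, < and > here
                return;
            }
            Policy policy = tagPolicy.getOrDefault(node.tag, defaultTagPolicy);
            switch (policy) {
            case DENY:                     // skip opening tag, content & closing tag
                return;
            case FILTER:                   // skip the tags but keep the content
                for (Node child : node.children)
                    emit(child, out);
                return;
            case ALLOW:                    // emit the tag, minus denied attributes
                out.append('<').append(node.tag);
                for (Map.Entry<String, String> attr : node.attrs.entrySet())
                    if (attrAllowed(node.tag, attr.getKey()))
                        out.append(' ').append(attr.getKey())
                           .append("=\"").append(attr.getValue()).append('"');
                out.append('>');
                for (Node child : node.children)
                    emit(child, out);
                out.append("</").append(node.tag).append('>');
            }
        }

        private boolean attrAllowed(String tag, String attr) {
            Map<String, Boolean> overrides = attrPolicy.get(tag);
            Boolean allowed = (overrides == null) ? null : overrides.get(attr);
            return (allowed != null) ? allowed : defaultAttrAllowed;
        }
    }

Note that ‘deny’ removes the content along with the tag because for something like <applet> the body is as unwanted as the tag itself – exactly the distinction between the two plausible actions discussed earlier.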
The only potentially difficult bit is step 2) – tokenising the HTML
and building a syntax tree. Crucial features for such a parser
are:
- Thread safe (we can be serving many requests at once)
- Efficient (ie fast at parsing large amounts of data)
- Character set aware (at least UTF-8)
For Java a suitable candidate is HTMLParser, while in Perl there is
HTML::Tree.
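Given a parse tree from one of those, a rule set covering a couple of the earlier use cases might be configured like this (using the hypothetical HtmlFilter sketched above):

    // strip unknown tags but keep their content; deny all attributes by default
    HtmlFilter filter = new HtmlFilter(HtmlFilter.Policy.FILTER, false);
    filter.setTagPolicy("span", HtmlFilter.Policy.ALLOW);   // allow <span dir="rtl">
    filter.setAttrPolicy("span", "dir", true);
    filter.setTagPolicy("applet", HtmlFilter.Policy.DENY);  // drop <applet> plus content

Since the rule set is only mutated at setup time, a single instance can then be shared read-only across request threads, which covers the thread-safety requirement above.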
Integrating with applications
Now that we have the basics of the HTML filter worked out, there
is a question of integrating it with applications. There are three
possibilities:
a) Filter the data in the form submission
b) Throw a validation error in the form submission if forbidden markup is found
c) Filter the data when generating the output (ie XML or HTML in a JSP/CGI)
In most cases, a) and/or b) are the best approaches since they
catch the problem at the earliest stage. Indeed it may be best
to use a combination of both (a rough sketch follows the list):
- Default action is to just throw a validation error, giving the
  user a chance to fix up their data (this is nice for letting the
  user deal with typos).
- If they click the ‘cleanup HTML’ checkbox, then automatically
  strip all remaining invalid tags.
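A minimal sketch of that combination in a form handler, reusing the hypothetical HtmlFilter from earlier (AgendaFormValidator, the ‘cleanupHtml’ parameter name and parseToTree are all invented for illustration – the actual parsing is left to a real HTML parser):

    import javax.servlet.http.HttpServletRequest;

    public class AgendaFormValidator {
        private final HtmlFilter filter;

        public AgendaFormValidator(HtmlFilter filter) {
            this.filter = filter;
        }

        /**
         * Returns the (possibly cleaned up) field value, or null to signal
         * a validation error that the user should fix by hand.
         */
        public String validateField(HttpServletRequest request, String field) {
            String value = request.getParameter(field);
            if (value == null)
                return "";                        // field absent - nothing to check
            String cleaned = filterHtml(value);
            if (cleaned.equals(value))
                return value;                     // no forbidden markup present
            if (request.getParameter("cleanupHtml") != null)
                return cleaned;                   // user opted in to auto-stripping
            return null;                          // default: throw validation error
        }

        // Stand-in for "tokenise into a tree, then run the filter over it".
        private String filterHtml(String html) {
            return filter.filter(parseToTree(html));
        }

        private HtmlFilter.Node parseToTree(String html) {
            throw new UnsupportedOperationException("left to a real HTML parser");
        }
    }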
The final thought is how to decide on the filtering rule sets.
Again, a one-size-fits-all approach is probably too restrictive.
For example, when using the Article content type in CMS, it is
conceivable that role A (normal authors) should be allowed a
limited set of HTML, but role B (the organization web team) be
allowed arbitrary HTML. Thus there is a case for providing the
site administrator with the means to specify different filtering
rules per role.
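A sketch of what that per-role selection could boil down to, again assuming the HtmlFilter sketched earlier (the class and role names are invented):

    import java.util.*;

    public class RoleFilterRegistry {
        private final Map<String, HtmlFilter> filtersByRole = new HashMap<>();
        private final HtmlFilter fallback;

        public RoleFilterRegistry(HtmlFilter fallback) {
            this.fallback = fallback;
        }

        public void register(String role, HtmlFilter filter) {
            filtersByRole.put(role, filter);
        }

        // eg "author" gets a restrictive rule set, "webteam" a permissive one
        public HtmlFilter forRole(String role) {
            return filtersByRole.getOrDefault(role, fallback);
        }
    }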
In a weird coincidence, just a couple of weeks after posting this I came across the newly launched html_scrub, written by Scott McKellar for Groklaw, which does almost exactly what I described.