Coding tips

Friday, May 28, 2004

HTML filtering of user supplied content

Scenario

Filtering HTML tags in user-entered data is an important aspect of all web-based systems. It serves both to avoid security vulnerabilities and to give the site administrator control over what is displayed on the site. In all WAF-based applications I've worked on, we've relied on the fact that XSL transformers will automatically escape HTML tags unless you specifically set the 'disable-output-escaping' attribute on the <xsl:value-of> tag. While this has the virtue of being simple and very safe by default, its crude on/off action is increasingly becoming a source of problems, particularly with CMS content items.

For an idea of how it's hurting, consider the following situation:

  • The Agenda content type has a number of text attributes:
    • Location
    • Attendees
    • Subject
    • Contact
    • Summary
    • Body text
    Typically, the XSL only allows tags in the 'body text' attribute to be rendered - everything else is escaped.
  • When entering some non-Western-European languages, text reads from right to left instead of left to right. To achieve this in HTML you need to add the dir="rtl" attribute to your tags. If a particular field in a stylesheet only ever holds a single language, then this can be done in the XSL stylesheet; however, there are occasions when a single block of text has multiple languages interspersed. In this case, the user must enter <span dir="rtl"> tags themselves.

The combination of these two points creates a problem, because we only want to allow HTML in certain fields, but we need to enter tags in any field to set the text direction.

The only way out of this is to change the XSLT so that all fields allow any HTML tag to be rendered, which in turn implies we need to filter HTML tags in user-entered data.

Use cases

Before considering how to filter HTML, let's enumerate a few use cases:

  • Allow the <span> tag with the dir="rtl" attribute
  • Allow any block or inline tag with the dir="rtl" attribute
  • Allow any tag, but no onXXX event attributes
  • Allow any tag in the HTML-4.0 Strict DTD.
  • Disallow any <font> and <blink> tags.
  • Allow any inline markup.
  • Disallow tables.

There is also the question of what you do to the text when encountering a forbidden tag. There are two plausible actions:

  • Strip out the tag, leaving the content
  • Strip out the tag, including the content

The former is applicable in situations where you know the content of the tag is safe; e.g., when stripping <font>...some text...</font> tags, you ought to let '...some text...' pass through. The latter is applicable when stripping something like an <applet> tag.

Algorithm design

So, a reasonable approach to filtering HTML would go something like this:

  1. Build up a rule set:
    • Set a default 'tag' policy - one of:
      1. allow - don't touch tag
      2. deny - remove tag & content
      3. filter - remove tag, leaving content
    • Set a default 'attribute' policy - one of:
      1. allow
      2. deny
    • Create a hash mapping 'tag' to 'policy' for all tags with a non-default policy
    • For each tag, create a hash mapping 'attribute' to 'policy' for all attributes with a non-default policy
  2. Tokenise the HTML document, building up a parse tree and matching opening and closing tags. Also fill in any closing tags that were omitted, e.g. typically </p>, </li>
  3. Traverse the parse tree. When encountering a tag, apply the rules
    1. if the tag is allowed
      • filter out any attributes which are denied
      • output the opening tag
      • process sub-tags (if any)
      • output the closing tag (if any)
    2. if the tag is denied
      • skip the opening tag
      • skip sub-tags (if any)
      • skip the closing tag
    3. if the tag is filtered
      • skip the opening tag
      • process sub-tags (if any)
      • skip the closing tag (if any)
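
The rules above can be sketched in a few lines of Python. This is only a rough illustration under some assumptions: it uses the standard library's event-driven html.parser rather than building a full parse tree, it does not fill in omitted closing tags, and the names RuleSet, HtmlFilter and filter_html are made up for this example. Comments and doctype declarations are simply dropped.

```python
from dataclasses import dataclass, field
from html import escape
from html.parser import HTMLParser

ALLOW, DENY, FILTER = "allow", "deny", "filter"

# Elements that never have a closing tag, so they must not start a skipped region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


@dataclass
class RuleSet:
    default_tag_policy: str = FILTER
    default_attr_policy: str = DENY
    tag_policies: dict = field(default_factory=dict)   # tag -> policy
    attr_policies: dict = field(default_factory=dict)  # tag -> {attribute -> policy}

    def tag_policy(self, tag):
        return self.tag_policies.get(tag, self.default_tag_policy)

    def attr_allowed(self, tag, attr):
        per_tag = self.attr_policies.get(tag, {})
        return per_tag.get(attr, self.default_attr_policy) == ALLOW


class HtmlFilter(HTMLParser):
    def __init__(self, rules):
        super().__init__(convert_charrefs=True)
        self.rules = rules
        self.out = []
        self.skip_tag = None   # name of the denied tag currently being skipped
        self.skip_depth = 0    # nesting depth of that tag

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        policy = self.rules.tag_policy(tag)
        if policy == DENY:
            if tag not in VOID_TAGS:
                self.skip_tag, self.skip_depth = tag, 1
        elif policy == ALLOW:
            parts = [tag]
            for name, value in attrs:
                if self.rules.attr_allowed(tag, name):
                    if value is None:
                        parts.append(name)
                    else:
                        parts.append('%s="%s"' % (name, escape(value, quote=True)))
            self.out.append("<" + " ".join(parts) + ">")
        # FILTER: drop the tag itself, the content is still processed

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag not in VOID_TAGS and self.rules.tag_policy(tag) == ALLOW:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if self.skip_tag is None:
            # Text arrives unescaped, so escape it again on the way out.
            self.out.append(escape(data, quote=False))


def filter_html(text, rules):
    parser = HtmlFilter(rules)  # one parser per call, no shared state between requests
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


rules = RuleSet(tag_policies={"span": ALLOW, "applet": DENY},
                attr_policies={"span": {"dir": ALLOW}})
html_in = ('<span dir="rtl" onclick="x()">abc</span> '
           '<font color="red">some text</font><applet code="A.class">fallback</applet>!')
print(filter_html(html_in, rules))
# prints: <span dir="rtl">abc</span> some text!
```

One consequence of skipping the tree-building step: a denied tag that is never closed swallows the rest of the input. That errs on the safe side, but a real implementation would want the parse tree described in step 2.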

The only potentially difficult bit is step 2: tokenizing the HTML and building a syntax tree. Crucial features for such a parser are:

  • Thread safe (we can be serving many requests at once)
  • Efficient (ie fast at parsing large amounts of data)
  • Character set aware (at least UTF-8)

For Java, a suitable candidate is HTMLParser, while in Perl there is HTML::Tree.

Integrating with applications

Now that we have the basics of the HTML filter worked out, there is a question of integrating it with applications. There are three possibilities:

  1. Filter the data in the form submission
  2. Throw a validation error in the form submission if forbidden markup is found.
  3. Filter the data when generating the output (ie XML or HTML in a JSP/CGI)

In most cases, 1) and/or 2) are the best approaches since they catch the problem at the earliest stage. Indeed it may be best to use a combination of both:

  • Default action is to just throw a validation error, giving the user a chance to fix up their data. (This is nice for letting the user deal with typos.)
  • If they click the 'cleanup HTML' checkbox, then automatically strip all remaining invalid tags.
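
As a rough sketch of that combined behaviour, assuming a simple allow-list of tag names and a hypothetical strip callable that does the actual cleanup (none of these names come from a real framework):

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every opening tag that is not in the allowed set."""

    def __init__(self, allowed):
        super().__init__()
        self.allowed = allowed
        self.forbidden = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed and tag not in self.forbidden:
            self.forbidden.append(tag)


def find_forbidden_tags(text, allowed):
    collector = TagCollector(allowed)
    collector.feed(text)
    collector.close()
    return collector.forbidden


def process_submission(text, allowed, cleanup, strip):
    """Return (text, errors).

    `cleanup` stands for the 'cleanup HTML' checkbox; `strip` is a
    hypothetical function that removes the forbidden markup.
    """
    forbidden = find_forbidden_tags(text, allowed)
    if not forbidden:
        return text, []
    if cleanup:
        return strip(text), []
    return text, ["Tag <%s> is not allowed here" % tag for tag in forbidden]
```

With cleanup unset, the form would redisplay with one error message per forbidden tag; with it set, the stripped text is accepted without errors.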

The final thought is how to decide on the filtering rule sets. Again, a one-size-fits-all approach is probably too restrictive. For example, when using the Article content type in CMS, it is conceivable that role A (normal authors) should be allowed a limited set of HTML, but role B (the organization web team) should be allowed arbitrary HTML. Thus there is a case for providing the site administrator with the means to specify different filtering rules per role.
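
One way this might look, purely as an illustration: the role names and tag lists below are invented, and falling back to the restrictive rule set for unknown roles is an assumption of this sketch rather than anything the CMS prescribes.

```python
# Hypothetical role names and tag lists, for illustration only.
RESTRICTED = {
    "default": "filter",
    "tags": {"p": "allow", "em": "allow", "strong": "allow", "applet": "deny"},
}
UNRESTRICTED = {"default": "allow", "tags": {}}

ROLE_RULES = {
    "author": RESTRICTED,      # role A: normal authors
    "web_team": UNRESTRICTED,  # role B: the organization web team
}


def rules_for_role(role):
    """Return the rule set for a role, falling back to the restrictive one."""
    return ROLE_RULES.get(role, RESTRICTED)


def tag_policy(role, tag):
    """Look up the policy for a tag, using the role's default if it is not listed."""
    rules = rules_for_role(role)
    return rules["tags"].get(tag, rules["default"])
```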

http://berrange.com/coding/tips

1 Comment(s)

Blogger Daniel said...

In a weird coincidence, just a couple of weeks after posting this I came across the newly launched html_scrub, written by Scott McKellar to do almost exactly what I described for Groklaw.

12:59 PM 
