Browsing the articles on Linux Mafia Knowledgebase I came across a link to The Parable of the Languages. Having suffered XML for the past
three years while using CCM I couldn’t help but be moved ;-)
XML! Exclaimed C++. What are you doing here? You’re not a programming language.
Tell that to the people who use me, said XML.
…snip..
And yet, all I am is a simple little markup, from humble origins. It’s a burden, being XML.
At that XML sighed, and the other languages, moved by its plight gathered around…
…and tromped that little XML into the dirt. Yes, into the very dirt at their feet. Basic tromped, and C++ tromped, and Java cleaned and tromped and cleaned again, and COBOL tried to throw a kick at XML’s head but fell over on its cane. Even LISP pulled itself out of the pond to throw loopy hands around XML’s throat, but only managed to choke its ownself.
And each language could be heard to mumble as it tromped and tromped and tromped, with complete and utter glee:
Have to parse XML, eh? Have to have an XML API, eh? Have to work with SOAP and XML-RPC and RSS and RDF, eh?
Well parse this, you little markup asshole.
A common misconception amongst developers is that GNU Arch is hard to learn and
use. The Linux Journal has just published an excellent article illustrating how to accomplish common CVS tasks using Arch.
‘00000000’. According to this story (also featured on /.), the 8-digit launch code to protect against a rogue missile launch was left at this trivial setting because commanders didn’t want procedure to get in the way of action during a wartime situation. Carrying on with the scary but true theme, I came across this comment in the same /. article:
A flight attendant invited me to a party a few years back, and it was mostly pilots and flight attendants at the party. All getting sloshed, of course – pilot and flight attendants DRINK. Since most airline pilots started their careers in the military I got to spend a lot of the evening listening to ‘war’ stories.
One pilot I talked to used to copilot one of the two big planes (747s?) that they send up that can launch all the missiles remotely in case NORAD gets knocked out. He told a story about how they would run all these drills where they would scramble, get in the air immediately, and then get transmitted codes from the ground. They would unscramble the codes as “do not launch” and then return to base without transmitting anything to the silos, drill over.
According to him, on one of these sorties they received the “launch” code in error. So they asked the ground to repeat the transmission. Which they did, and it was the same. So they took a chance, broke protocol and radioed the ground to tell them that they had just sent the “launch” codes, and did they really want them transmitted along to the silos? Of course the ground told them to cease and return to base.
Scary truth or drunken bravado? Who knows.
Scenario
Filtering HTML tags in user-entered data is an important aspect of all
web-based systems. It serves both to avoid security vulnerabilities & to allow
the site administrator control over what is displayed in the site. In
all WAF based applications I’ve worked on we’ve relied on the fact that
XSL transformers will automatically escape HTML tags unless you specifically
set the ‘disable-output-escaping’ attribute on the <xsl:value-of> tag.
While this has the virtue of being simple & very safe by default, its
crude on/off action is increasingly becoming a source of problems,
particularly with CMS content items.
For an idea of how it’s hurting, consider the following situation:
we only want to allow HTML in certain fields, but we need to enter
tags in any field to set the text direction. The combination of
these two requirements creates a problem.
The only way out of this is to change the XSLT so that all fields
allow any HTML tag to be rendered. Which in turn implies we need to
filter HTML tags in user entered data.
Use cases
Before considering how to filter HTML, let’s enumerate a few use cases:
- Allow tag with ‘rtl’ attribute
- Allow any block or inline tag with ‘rtl’ attribute
- Allow any tag, but no onXXX event tags
- Allow any tag in the HTML-4.0 Strict DTD.
- Disallow any <font> and <blink> tags.
- Allow any inline markup.
- Disallow tables.
There is also the question of what you do to the text when encountering
a forbidden tag. There are two plausible actions:
- Strip out the tag, leaving the content
- Strip out the tag, including the content
The former is applicable in situations where you know the content of
the tag is safe, eg, when stripping <font>…some text…</font> you
ought to let ‘…some text…’ pass through. The latter is applicable
when stripping something like an <applet> tag.
Algorithm design
So, a reasonable approach to filtering HTML would go something like this:
- Build up a rule set:
- Set a default ‘tag’ policy – one of:
- allow – don’t touch tag
- deny – remove tag & content
- filter – remove tag, leaving content
- Set a default ‘attribute’ policy – one of:
- allow
- deny
- Create a hash mapping ‘tag’ to ‘policy’ for all
tags with a non-default policy
- For each tag, create a hash mapping ‘attribute’ to
‘policy’ for all attributes with a non-default
policy
- Tokenise the HTML document, building up a parse tree,
matching opening & closing tags. Also fill in any closing
tags that were omitted, eg typically </p>, </li>
- Traverse the parse tree. When encountering a tag, apply
the rules
- if the tag is allowed
- filter out any attributes which are denied
- output the opening tag
- process sub-tags (if any)
- output the closing tag (if any)
- if the tag is denied
- skip the opening tag
- skip sub-tags (if any)
- skip the closing tag
- if the tag is filtered
- skip the opening tag
- process sub-tags (if any)
- skip the closing tag (if any)
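The rule set and traversal above can be sketched in Java. This is only an illustration of the technique: the Policy enum, the chainable rule() builder and the crude regex tokeniser are all assumptions made for the example, and a real implementation would build a proper parse tree with a library rather than scan with a regex.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of the allow/deny/filter policy applied over an HTML token stream.
// A DENY tag drops itself and everything inside it; a FILTER tag drops
// only the tag itself, letting its content through.
public class HtmlFilter {
    enum Policy { ALLOW, DENY, FILTER }

    private final Policy defaultPolicy;
    private final Map<String, Policy> tagPolicy = new HashMap<String, Policy>();

    HtmlFilter(Policy defaultPolicy) { this.defaultPolicy = defaultPolicy; }

    HtmlFilter rule(String tag, Policy p) { tagPolicy.put(tag, p); return this; }

    private Policy policyFor(String tag) {
        Policy p = tagPolicy.get(tag.toLowerCase());
        return p != null ? p : defaultPolicy;
    }

    // Matches a tag (group 1: closing slash, group 2: tag name) or a text run.
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+");

    public String filter(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> denied = new ArrayDeque<String>(); // open DENY tags
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group();
            if (m.group(2) == null) {                // text node
                if (denied.isEmpty()) out.append(token);
                continue;
            }
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = token.endsWith("/>");
            String tag = m.group(2).toLowerCase();
            if (!denied.isEmpty()) {                 // inside a denied subtree
                if (tag.equals(denied.peek())) {
                    if (closing) denied.pop();
                    else if (!selfClosing) denied.push(tag);
                }
                continue;
            }
            switch (policyFor(tag)) {
                case ALLOW:  out.append(token); break; // keep tag and content
                case FILTER: break;                    // drop tag, keep content
                case DENY:   if (!closing && !selfClosing) denied.push(tag); break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        HtmlFilter f = new HtmlFilter(Policy.ALLOW)
            .rule("font", Policy.FILTER)     // strip tag, leave content
            .rule("applet", Policy.DENY);    // strip tag and content
        System.out.println(f.filter(
            "<p><font>some text</font><applet>evil</applet></p>"));
        // prints: <p>some text</p>
    }
}
```

Note the sketch deliberately ignores comments, attributes and malformed nesting, which is exactly where a real parser earns its keep.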
The only potentially difficult bit is the second step: tokenising
the HTML and building a syntax tree. Crucial features for such a
parser are:
- Thread safe (we can be serving many requests at once)
- Efficient (ie fast at parsing large amounts of data)
- Character set aware (at least UTF-8)
For Java a suitable candidate is HTMLParser, while in Perl there is HTML::Tree.
Integrating with applications
Now that we have the basics of the HTML filter worked out, there
is a question of integrating it with applications. There are three
possibilities:
- Filter the data in the form submission
- Throw a validation error in the form submission if
forbidden markup is found.
- Filter the data when generating the output (ie XML or HTML in a JSP/CGI)
In most cases, the first two are the best approaches since they
catch the problem at the earliest stage. Indeed it may be best
to use a combination of both:
- Default action is to throw a validation error,
giving the user a chance to fix up their data (this is
nice for letting the user deal with typos).
- If they click the ‘cleanup HTML’ checkbox, then automatically
strip all remaining invalid tags.
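As a sketch of this two-step form handling, the following is one way it might look, assuming a hypothetical validate() entry point; the regex here is a crude stand-in for the real filter from the previous section:

```java
import java.util.regex.*;

// Reject forbidden markup by default; auto-strip it only when the
// user has ticked a "clean up HTML" checkbox on the form.
public class SubmissionCheck {
    static final Pattern FORBIDDEN =
        Pattern.compile("</?(font|blink|applet)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the (possibly cleaned) text, or throws when cleanup
    // was not requested and forbidden markup is present.
    static String validate(String text, boolean cleanupRequested) {
        Matcher m = FORBIDDEN.matcher(text);
        if (!m.find()) return text;                  // nothing forbidden
        if (!cleanupRequested)
            throw new IllegalArgumentException("Forbidden markup: " + m.group());
        return m.replaceAll("");                     // strip invalid tags
    }

    public static void main(String[] args) {
        System.out.println(validate("a <font>b</font>", true));
        // prints: a b
        try { validate("a <blink>b</blink>", false); }
        catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```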
The final thought is how to decide on the filtering rule sets.
Again, a one-size-fits-all approach is probably too restrictive.
For example, when using the Article content type in CMS, it is
conceivable that role A (normal authors) should be allowed a
limited set of HTML, but role B (the organization web team) be
allowed arbitrary HTML. Thus there is a case for providing the
site administrator with the means to specify different filtering
rules per role.
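Per-role rules could be as simple as a map from role name to allowed tag set. The role names and tag lists below are invented for illustration; in practice the site administrator would supply them through configuration:

```java
import java.util.*;

// Sketch of per-role filtering rules: each role maps to the set of
// tags it may use, with null meaning "arbitrary HTML allowed".
public class RoleRules {
    static final Map<String, Set<String>> ALLOWED = new HashMap<String, Set<String>>();
    static {
        ALLOWED.put("author",
            new HashSet<String>(Arrays.asList("p", "em", "strong", "a")));
        ALLOWED.put("webteam", null);   // organization web team: anything goes
    }

    static boolean tagAllowed(String role, String tag) {
        Set<String> tags = ALLOWED.get(role);
        return tags == null || tags.contains(tag.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(tagAllowed("author", "table"));   // prints: false
        System.out.println(tagAllowed("webteam", "table"));  // prints: true
    }
}
```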