HTML filtering of user supplied content

Posted: May 28th, 2004 | Filed under: Coding Tips | 1 Comment »

Scenario

Filtering HTML tags in user entered data is an important aspect of all
web based systems. It serves both to avoid security vunerabilities & allow
the site administrator control over what is displayed in the site. In
all WAF based applications I’ve worked on we’ve relied on the fact that
XSL transformers will automatically escape HTML tags unless you specifically
set the ‘disable-output-escaping’ attribute on the <xsl:value-of> tag.
While this has the virtue of being simple & very safe by default, its
crude on / off action is increasingly becoming a source of problems,
particulary with CMS content items.

For an idea of how its hurting, consider the following situation:

The Agenda content type has a number of text attributes:
- Location
- Attendees
- Subject
- Contact
- Summary
- Body text
Typically, the XSL only allows tags in the ‘body text’ attribute
to be rendered – everything else is escaped.
When entering some non-west european languages, text reads from
right-to-left instead of left-to-right. To achieve this in HTML
you need to add the dir=”rtl” to you tags. If a particular field
in a stylesheet only ever holds a single language, then this can
be done in the XSL stylesheet, however, there are occasions when a
single block of text has multiple languages interspersed. In this
case, the user must enter tags themselves.

The combination of these two points creates a problem, because we only
want to allow HTML in certain fields, but we need to enter tags
in any field to set the text direction.

The only way out of this is to change the XSLT so that all fields
allow any HTML tag to be rendered. Which in turn implies we need to
filter HTML tags in user entered data.

Use cases

Before considering how to filter HTML, lets enumerate a few use cases:

Allow tag with ‘rtl’ attribute
Allow any block or inline tag with ‘rtl’ attribute
Allow any tag, but no onXXX event tags
Allow any tag in the HTML-4.0 Strict DTD.
Disallow any and <blink> tags.
Allow any inline markup.
Disallow tables.

There is also the question of what you do to the text when encountering
a forbidden tag. There are two plausible actions:

Strip out the tag, leaving the content
Strip out the tag, including the content

The former is applicable for situations where you know the content of
the tag is safe, eg, stripping …some text… tags, you
ought to let ‘….some text…’ pass through. The latter is applicable
when stripping something like an <applet> tag.

Algorith design

So, a reasonable approach to filtering HTML would go something like this:

Build up a rule set:
- Set a default ‘tag’ policy – one of:
  1. allow – don’t touch tag
  2. deny – remove tag & content
  3. filter – remove tag, leaving content
- Set a default ‘attribute’ policy – one of:
  1. allow
  2. deny
- Create a hash mapping ‘tag’ to ‘policy’ for all
  tags with a non-default policy
- For each tag, create a hash mapping ‘attribute’ to
  ‘policy’ for all attributes with a non-default
  policy
Tokenise the HTML document, building up a parse tree,
matching opening & closing tags. Also fill in any closing
tags that were ommitted, eg typically , </li>
Traverse the parse tree. When encountering a tag, apply
the rules
1. if the tag is allowed
  - filter out any attributes which are denied
  - output the opening tag
  - process sub-tags (if any)
  - output the closing tag (if any)
2. if the tag is denied
  - skip the opening tag
  - skip sub-tags (if any)
  - skip the closing tag
3. if the tag is filtered
  - skip the opening tag
  - process sub-tags (if any)
  - skip the closing tag (if any)

The only potentially difficult bit is 2) tokenizing the HTML
and building a syntax tree. Crucial features for such a parser
are

Thread safe (we can be serving many requests at once)
Efficient (ie fast at parsing large amounts of data)
Character set aware (at least UTF-8)

For Java a suitable candidate is HTMLParser
while in Perl there is HTML::Tree

Integrating with applications

Now that we have the basics of the HTML filter worked out, there
is a question of integrating it with applications. There are three
possibilities:

Filter the data in the form submission
Throw a validation error in the from submission if
forbidden markup is found.
Filter the data when generating the output (ie XML or HTML in a JSP/CGI)

In most cases, a) and/or b) are the best approaches since they
catch the problem at the earliest stage. Indeed it may be best
to use a combination of both:

Default action is to just throw validation error,
giving the user a chance to fixup their data. (this is
nice for letting the user deal with typos).
If they click the ‘cleanup HTML’ checkbox, then automatically
strip all remaining invalid tags.

The final thought is how to decide on the filtering rule sets.
Again, a one size fits all approach is probably too restrictive.
For example, when using the Article content type in CMS, it is
conceivable that role A (normal authors) should be allowed a
limited set of HTML, but role B (the organization web team) be
allowed arbitrary HTML. Thus there is a case for providing the
site administrator with the means to specify different filtering
rules per role.

SCM on Linux

Posted: May 28th, 2004 | Filed under: Uncategorized | No Comments »

First some links to pages comparing and contrasting various version control systems on Linux

And now some links with technical details of various packages

(Near?) zero overhead debugging with the IBM JVM

Posted: May 10th, 2004 | Filed under: Coding Tips | 2 Comments »

Scenario

Web application running in a production environment is having
periodic OutOfMemory / extreme CPU load issues.
The server is not responding to HTTP requests.
‘ps -ef’ shows a couple Java proceses (out of 100’s total)
consuming 99% of CPU time.
The problem recurrs with a frequency of between 6 and 24 hours.
There is, as yet, no known procedure to reproduce the problem
on demand.

Questions

Some questions to which I want answers:

What Java threads correspond to each UNIX process
The stack trace for every Java thread
What locks are held / being waited on
What objects are allocated

Tools

The IBM JVM comes with a number of custom, built-in tools for
collecting the information required to answer these questions.
Unlike the standard tools in the Java Platform Debugging
Architecture, these IBM-specific facilities impose (near?) zero
performance overhead on the JVM, so can be left permanently
enabled on production systems. What is more, these tools free,
quick and easy to use.

There is extensive (think 400+ pages long) documentation
available from the IBM JVM diagnosis docs site. What follows
is a brief overview of some of the details…

Java Dump (aka Javacore)

The javadump tool produces text files that contain diagnostic
information related to the JVM and the application running
within. This facility is enabled by default. A javadump can
be manually triggered by sending SIG_QUIT to the root process
of the JVM. The dump file will be created in first of the
following locations which is writable:

The directory specified in IBM_JAVACOREDIR env variable
The JVM’s current working directory
The directory specified in TMPDIR env variable
The /tmp directory

A dump file will be generated automatically if a native exception
occurrs in the JVM. If the IBM_JAVADUMP_OUTOFMEMORY env variable
is set, then a dump will also be generated when heapspace is
exhausted. The file name will be of the form javacore_[pid].[time].txt

Some of the information included in the dump file is:

The OS memory info (eg from /proc/meminfo)
User resource limits
Signal handlers
Environment variables
OS memory maps (eg from /proc/[pid]/maps)
List of all object monitors being waited on
List of all system locks (and what thread, if any,
is holding them).
List of all threads, their Java names & their UNIX
process ids.
Java stack trace for each thread
Native OS stack trace for each thread
List of all classloaders
List of all loaded classes

This is all output in a wellformatted text files which makes for
fairly easy human reading & trivial Perl scripting.

Heap dump

The heapdump tool produces text files that contain a dump of
all objects currently allocated on the heap and a list of
references between objects. This facility is disabled by default,
but can be enabled by setting the env variable IBM_HEAP_DUMP=true
prior to starting the JVM. When enabled, a heapdump is generated
at the same time as the javadump, eg by sending a SIG_QUIT signal.

The filename will be of the form heapdump_[pid].[time].txt While
plain text, the format of this file isn’t directly interesting to
humans, but when processed with the tool ‘heaproot’ you can get
some interesting information:

Objects table – list of objects in the heap sorted
by size, address, name, subtree size, many more…
Types table – list of classes with objects in heap
sorted by total size, instance count, name
Gaps table – list of free spaces in the heap
Object info – list of references to & from an object
Paths/routes – find a route between two objects (if
one exists).

JFormat

The JFormat tool is a TUI/GUI for extracting & displaying
information from UNIX core dumps. To enable core dumps there
are two things to check before starting the JVM:

Remove any resource limits ‘ulimit -c unlimited’
Change into a writable directory

Now to generate a core dump, simply send the root process
a SIG_ABRT signal.

Since core dumps are operating system specific, you must
now process it with the jextract tool to create a platform
independant core dump.

   $JAVA_HOME/jre/bin/jextract [path-to-native-core-dump]

Then you can launch the jformat gui with

   $JAVA_HOME/bin/jformat -g

The JFormat GUI provides menus to pull out all sorts of
useful information, which I won’t go into now, other than
to say there is alot of overlap with the Java Dump and
Heap Dump metrics.

Example

Having set the IBM_HEAPDUMP=true env variable & restarted the
servers I just had to wait for the performance slowdown to
recurr (which indicated the garbage collector was starting to
thrash). I now sent a SIGQUIT to the root process generating
a heapdump file about 850 MB in size. This was then loaded into
the HeapRoots analysis program:

   java -jar /tmp/HR205.jar heapdump3632.1083842645.txt

After parsing the file heaproots shows a few basic statistics
about the dump:

  # Objects            : 7,727,411
  # Refs               : 12,807,015
  # Unresolved refs    : 31
  Heap usage           : 315,413,344

The first thing to do is invoke the ‘process’ command. This resolves
all references between the objects, detects roots, calculates total
size metrics per class & sub-tree.

  Pure roots           : 174,088
  ... which reach      : 7,706,600
  Artificial roots     : 1,531
  Softlinks followed   : 16,014

Now, its ready to answer some interesting queries. Lets see what
the biggest tree roots are. This is done by running the ‘dt’
command – ‘Dump from roots, sort by total size’. It prompts for
some values such as max depth to print out, min size threshold,
then produces something looking like

<0> [270,167,808] 0x100a2ef8 [72] java/lang/Thread
<1>   [270,167,680] 0x10053c70 [168] com/caucho/java/CompilingClassLoader
<2>     [270,164,496] 0x10053d18 [168] com/caucho/util/DirectoryClassLoader
          - 14 children of 0x10053d18 too deep.
<2>     {270,167,680} 0x10053c70 [168] com/caucho/java/CompilingClassLoader
<2>     <270,163,200 parent:0x10053d18> 0x10053dc0 [168] com/caucho/java/CompilingClassLoader
<2>     <270,163,200 parent:0x10053d18> 0x10053dc0 [168] com/caucho/java/CompilingClassLoader
        - 13 children of 0x10053c70 filtered.
      - 4 children of 0x100a2ef8 filtered.
<0> [5,234,712] 0x16a85b90 [447,296] array of java/lang/Object
      - 111820 children of 0x16a85b90 filtered.

Unsurprisingly the root with highest memory consumption is the top level
thread of the JVM. Not so enlightening. I figure the memory leak is
probably in CCM code, so lets find the biggest object in a package whose
name begins with ‘com/arsdigita/’. For this we use the ‘Objects dump, sort
by total size’ command (‘ot com/arsdigita/’) again this prompts for some
more optional filters, but accepting the defaults we get a list of big
objects:

  Addr       Size       Root-owner Parent     Total-size  Name
-------------------------------------------------------------------------------
N 0x284ffb48 256        0x100a2ef8 0x284ffc88 224,107,816 class com/arsdigita/templating/Templating
N 0x220d3458 96         0x100a2ef8 0x22bc0ee8 4,585,520   com/arsdigita/london/theme/util/ThemeDevelopmentFileManager
N 0x1cbc24b0 32         0x100a2ef8 0x1e18a5a8 2,460,464   com/arsdigita/templating/XSLTemplate
N 0x23180b28 80         0x100a2ef8 0x22bc0ee8 2,164,688   com/arsdigita/london/search/RemoteSearcher
N 0x1bf215e0 56         0x100a2ef8 0x1bf2a850 1,844,152   com/arsdigita/persistence/Session
N 0x1bf2ab98 40         0x100a2ef8 0x1bf215e0 1,770,096   com/arsdigita/persistence/TransactionContext
N 0x1affda20 16         0x100a2ef8 0x1aa4d388 1,681,056   com/arsdigita/persistence/DataQueryImpl$UnaliasMapper
N 0x1affc508 96         0x100a2ef8 0x1affda20 1,681,040   com/arsdigita/persistence/DataAssociationCursorImpl
N 0x1a985ff0 56         0x100a2ef8 0x1a98e8e8 1,138,696   com/arsdigita/bebop/PageState

So, the class object for ‘com.arsdigita.templating.Templating’ is taking
up 224 MB of the heap. A quick look at the code shows it has a static
HashMap storing XSLTemplate objects with no size bounding. To confirm this
is the culprit we go back tot he ‘Dump from roots, sort by total size’ command,
this time starting display from the object at address 0x284ffb48:

<0> [224,107,816] 0x284ffb48 [256] class com/arsdigita/templating/Templating
<1>   [224,092,408] 0x28500840 [56] java/util/HashMap
<2>     [224,092,352] 0x1cdf9160 [1,552] array of java/util/HashMap$Entry
<3>       [4,288,800] 0x2dbfa500 [32] java/util/HashMap$Entry
<4>         [3,216,568] 0x1bf5cc00 [32] java/util/HashMap$Entry
              - 3 children of 0x1bf5cc00 too deep.
<4>         [1,069,960] 0x2c824ca8 [32] com/arsdigita/templating/XSLTemplate
              - 4 children of 0x2c824ca8 too deep.
            - 1 child of 0x2dbfa500 filtered.
<3>       [3,216,584] 0x263c8ff8 [32] java/util/HashMap$Entry
<4>         [2,144,416] 0x3d00ade8 [32] java/util/HashMap$Entry
              - 3 children of 0x3d00ade8 too deep.
<4>         [1,069,912] 0x237796b0 [32] com/arsdigita/templating/XSLTemplate
              - 4 children of 0x237796b0 too deep.
            - 1 child of 0x263c8ff8 filtered.
<3>       [3,216,568] 0x19319d58 [32] java/util/HashMap$Entry
<4>         [2,144,400] 0x33ddd918 [32] java/util/HashMap$Entry
              - 3 children of 0x33ddd918 too deep.
<4>         [1,069,912] 0x1864f950 [32] com/arsdigita/templating/XSLTemplate
              - 4 children of 0x1864f950 too deep.
            - 1 child of 0x19319d58 filtered.

[snipped for another 10 pages of output]
<1>   {224,107,816} 0x284ffb48 [256] class com/arsdigita/templating/Templating
      - 15 children of 0x284ffb48 filtered.
There were 362 objects expanded.

So, from this its clear that the problem is a huge HashMap in the Templating
class containing around one hundred XSLTemplate objects, each around 1 MB in
size. So, the moral of the story – if you implement a static cache in a server,
make sure you:

benchmark the average size of objects in the cache
decide how much memory can be set aside to it
use results from a) & b) to calculate a good max entry
count for the cache before invoking an purge algorithm.

Coding tips

Posted: May 10th, 2004 | Filed under: Uncategorized | No Comments »

I’ve started a new blog for coding tips; the first posting deals with the issue of zero overhead debugging in IBM JDK (ie, not using the -Xdebug JVM flag)!

Euro-English

Posted: May 10th, 2004 | Filed under: Uncategorized | 1 Comment »

When working in a multi-national company such as Red Hat, there are always problems with understanding the languages spoken by all employees. Recognising the problem this poses to business, the European Commission has taken steps to standardize on English across Europe:

The European Commission has just announced an agreement whereby English
will be the official language of the European nation rather than German
which was the other possibility. As part of the negotiations, Her
Majesty’s Government conceded that English spelling had some room for
improvement and has accepted a 5-year phase-in plan that would become
known as “Euro-English”.

In the first year, “s” will replace the soft “c”. Sertainly, this will
make the sivil servants jump with joy. The hard “c” will be dropped in
favour of the “k”.
This should klear up konfusion, and keyboards kan have one less letter.

There will be growing publik enthusiasm in the sekond year when the
troublesome “ph” will be replaced with the “f”. This will make words
like fotograf 20% shorter.

In the 3rd year, publik akseptanse of the new spelling kan be expected
to reach the stage where more komplikated changes are possible.
Governments will enkourage the removal of double letters which have
always ben a deterent to akurate speling.
Also, al wil agre that the horibl mes of the silent “e” in the languag
is disgrasful and it should go away.

By the 4th yer peopl wil be reseptiv to steps such as replasing “th”
with “z”and “w” with “v”. During ze fifz yer, ze unesesary “o” kan be
dropd from vords kontaining “ou” and after ziz fifz yer, ve vil hav a
reil sensibl riten styl..
Zer vil be no mor trubl or difikultis and evrivun vil find it ezi tu
understand ech oza. Ze drem of a united urop vil finali kum tru !

NB, in case anyone was wondering, this is a joke. I hope!

Daniel P. Berrangé

Writing about open source software, virtualization & more

HTML filtering of user supplied content

Scenario

Use cases

Algorith design

Integrating with applications

SCM on Linux

(Near?) zero overhead debugging with the IBM JVM

Scenario

Questions

Tools

Java Dump (aka Javacore)

Heap dump

JFormat

Example

Coding tips

Euro-English

Pages

Sites

Categories

Browse Posts

Projects

Recent Posts