It has been just over a year & 1/2 since I first blogged about DTrace suggesting that a similar tool would be very valuable to the Linux community. Well after a few long email threads, it turned out that a significant number of people within Red Hat agreed with this assessment and so in partnership with IBM and Intel the SystemTAP project came into life at the start of 2005. Starting with the previously developed KProbes dynamic instrumentation capability a huge amount of work has been done building out a high level language and runtime for safely, efficiently & reliably probing the kernel. It has seen a limited ‘technology preview’ in RHEL-4, and with its inclusion in the forthcoming Fedora Core 5 it will be exposed to a much wider community of users & potential developers.
On the very same day as Dave Jones was looking at the Fedora boot process via static kernel instrumentation, I was (completely co-incidentally ) playing around using SystemTAP to instrument the boot process. The probe I wrote looked at file opens, process fork/execve to enable a hierarchical view of startup to be pieced together. A simplified version of the script looked like:
global indent
function timestamp() {
return string(gettimeofday_ms()) . indent[pid()] . " "
}
function proc() {
return string(pid()) . " (" . execname() . ")"
}
function push(pid, ppid) {
indent[pid] = indent[ppid] . " "
}
function pop(pid) {
delete indent[pid]
}
probe kernel.function("sys_clone").return {
print(timestamp() . proc() . " forks " . string(retval()). "\n")
push(retval(), pid())
}
probe kernel.function("do_execve") {
print(timestamp() . proc() . " execs " . kernel_string($filename) . "\n")
}
probe kernel.function("sys_open") {
if ($flags & 1) {
print(timestamp() . proc() . " writes " . user_string($filename) . "\n")
} else {
print(timestamp() . proc() . " reads " . user_string($filename) . "\n")
}
}
probe kernel.function("do_exit") {
print(timestamp() . proc() . " exit\n")
pop(pid())
}
A few tricks later it was running during boot, and having analysed the results with Perl one can display a summary of how many files each init script opened
1 init read 90 write 30 running...
251 init read 29 write 0 run 23.08s
252 rc.sysinit read 1035 write 45 run 22.91s
274 start_udev read 355 write 128 run 15.10s
286 start_udev read 91 write 0 run 1.90s
287 MAKEDEV read 91 write 0 run 1.88s
291 udevstart read 177 write 124 run 3.95s
614 usleep read 2 write 0 run 1.05s
649 udev-stw.modules read 84 write 5 run 1.23s
701 dmraid read 111 write 0 run 1.07s
748 rc read 235 write 13 run 14.57s
753 S10network read 111 write 16 run 2.85s
833 S12syslog read 44 write 3 run 0.43s
844 S25netfs read 87 write 1 run 1.51s
861 S55cups read 31 write 2 run 1.70s
878 S55sshd read 52 write 1 run 0.86s
892 S97messagebus read 31 write 2 run 0.44s
900 S98NetworkManager read 96 write 10 run 0.58s
910 S98NetworkManagerDispatcher read 92 write 3 run 0.67s
921 S98avahi-daemon read 29 write 0 run 0.41s
929 S98haldaemon read 31 write 2 run 4.20s
955 S99local read 17 write 1 run 0.16s
There are so many other interesting ways to analyse the data collected at boot which I don’t have space for in this blog, so I’ve put all the information (including how to run SystemTAP during boot) up on my Red Hat homepage
The Perl DBI module provides a uniform API to access relational databases. Thanks to Perl’s data type model using it is considerably easier than, say JDBC in the Java world. In common with most database access APIs, the code is split into two bits, the generic infrastructure is in the DBI module, while backends for each database are in the various DBD modules:
The first task in any program is to connect to the server. This is done with the DBI->connect method. To simplify error handling, its good practice to turn on ‘RaiseError’ option, and turn off ‘PrintError’ and ‘AutoCommit’. Then by wrapping the entire unit of work in an eval we get safe transaction commit/rollback without the need to check method return status on each DBI call.
use DBI;
my $db;
eval {
$db = DBI->connect("DBI:Pg:dbname=mydb;host=myhost",
$username, $password, {
RaiseError => 1,
PrintError => 0,
AutoCommit => 0
});
...do some work with the db...
};
if ($@) {
if ($db) {
$db->rollback;
}
die $@;
}
$db->commit;
$db->disconnect;
The next task is issue statements to the DB. Following common practice DBI allows placeholders to be used in SQL, which are substituted with real values at execution time. If the underlying DB doesn’t support placeholders, DBI will emulate them. For maximum performance its advisable to use prepared statement handles, and again DBI will emulate this feature if the underlying driver does not support it.
my $sth1 = $db->prepare_cached("INSERT INTO foo (bar) values (?)");
$sth1->execute($bar);
my $sth2 = $db->prepare_cached("UPDATE foo SET bar = ? WHERE wizz = ?");
$sth2->execute($bar, $wizz);
my $sth3 = $db->prepare_cached("DELETE foo WHERE bar LIKE ?");
$sth3->execute($bar);
The final common task is retrieving data from the DB. There are a number of ways to get data back, but the simplest to code is to bind variables to each return parameter and then call ‘fetchrow’.
my $sth4 = $db->prepare_cached("SELECT bar, wizz FROM foo where bar > ?");
$sth4->execute(20);
my ($bar, $wizz);
$sth4->bind_columns(\$bar, \$wizz);
while ($sth4>fetchrow) {
print "Got $bar $wizz\n";
}
Putting this together a complete example looks like
use DBI;
my $db;
eval {
$db = DBI->connect("DBI:Pg:dbname=mydb;host=myhost",
$username, $password, {
RaiseError => 1,
PrintError => 0,
AutoCommit => 0
});
my $sth1 = $db->prepare_cached("INSERT INTO foo (bar) values (?)");
$sth1->execute($bar);
my $sth2 = $db->prepare_cached("UPDATE foo SET bar = ? WHERE wizz = ?");
$sth2->execute($bar, $wizz);
my $sth3 = $db->prepare_cached("DELETE foo WHERE bar LIKE ?");
$sth3->execute($bar);
my $sth4 = $db->prepare_cached("SELECT bar, wizz FROM foo where bar > ?");
$sth4->execute(20);
my ($bar, $wizz);
$sth4->bind_columns(\$bar, \$wizz);
while ($sth4->fetchrow) {
print "Got $bar $wizz\n";
}
};
if ($@) {
if ($db) {
$db->rollback;
}
die $@;
}
$db->commit;
$db->disconnect;
Scenario
Filtering HTML tags in user entered data is an important aspect of all
web based systems. It serves both to avoid security vunerabilities & allow
the site administrator control over what is displayed in the site. In
all WAF based applications I’ve worked on we’ve relied on the fact that
XSL transformers will automatically escape HTML tags unless you specifically
set the ‘disable-output-escaping’ attribute on the <xsl:value-of> tag.
While this has the virtue of being simple & very safe by default, its
crude on / off action is increasingly becoming a source of problems,
particulary with CMS content items.
For an idea of how its hurting, consider the following situation:
The combination of these two points creates a problem, because we only
want to allow HTML in certain fields, but we need to enter tags
in any field to set the text direction.
The only way out of this is to change the XSLT so that all fields
allow any HTML tag to be rendered. Which in turn implies we need to
filter HTML tags in user entered data.
Use cases
Before considering how to filter HTML, lets enumerate a few use cases:
- Allow tag with ‘rtl’ attribute
- Allow any block or inline tag with ‘rtl’ attribute
- Allow any tag, but no onXXX event tags
- Allow any tag in the HTML-4.0 Strict DTD.
- Disallow any <font> and <blink> tags.
- Allow any inline markup.
- Disallow tables.
There is also the question of what you do to the text when encountering
a forbidden tag. There are two plausible actions:
- Strip out the tag, leaving the content
- Strip out the tag, including the content
The former is applicable for situations where you know the content of
the tag is safe, eg, stripping <font>…some text…</font> tags, you
ought to let ‘….some text…’ pass through. The latter is applicable
when stripping something like an <applet> tag.
Algorith design
So, a reasonable approach to filtering HTML would go something like this:
- Build up a rule set:
- Set a default ‘tag’ policy – one of:
- allow – don’t touch tag
- deny – remove tag & content
- filter – remove tag, leaving content
- Set a default ‘attribute’ policy – one of:
- allow
- deny
- Create a hash mapping ‘tag’ to ‘policy’ for all
tags with a non-default policy
- For each tag, create a hash mapping ‘attribute’ to
‘policy’ for all attributes with a non-default
policy
- Tokenise the HTML document, building up a parse tree,
matching opening & closing tags. Also fill in any closing
tags that were ommitted, eg typically </p>, </li>
- Traverse the parse tree. When encountering a tag, apply
the rules
- if the tag is allowed
- filter out any attributes which are denied
- output the opening tag
- process sub-tags (if any)
- output the closing tag (if any)
- if the tag is denied
- skip the opening tag
- skip sub-tags (if any)
- skip the closing tag
- if the tag is filtered
- skip the opening tag
- process sub-tags (if any)
- skip the closing tag (if any)
The only potentially difficult bit is 2) tokenizing the HTML
and building a syntax tree. Crucial features for such a parser
are
- Thread safe (we can be serving many requests at once)
- Efficient (ie fast at parsing large amounts of data)
- Character set aware (at least UTF-8)
For Java a suitable candidate is HTMLParser
while in Perl there is HTML::Tree
Integrating with applications
Now that we have the basics of the HTML filter worked out, there
is a question of integrating it with applications. There are three
possibilities:
- Filter the data in the form submission
- Throw a validation error in the from submission if
forbidden markup is found.
- Filter the data when generating the output (ie XML or HTML in a JSP/CGI)
In most cases, a) and/or b) are the best approaches since they
catch the problem at the earliest stage. Indeed it may be best
to use a combination of both:
- Default action is to just throw validation error,
giving the user a chance to fixup their data. (this is
nice for letting the user deal with typos).
- If they click the ‘cleanup HTML’ checkbox, then automatically
strip all remaining invalid tags.
The final thought is how to decide on the filtering rule sets.
Again, a one size fits all approach is probably too restrictive.
For example, when using the Article content type in CMS, it is
conceivable that role A (normal authors) should be allowed a
limited set of HTML, but role B (the organization web team) be
allowed arbitrary HTML. Thus there is a case for providing the
site administrator with the means to specify different filtering
rules per role.
Scenario
- Web application running in a production environment is having
periodic OutOfMemory / extreme CPU load issues.
- The server is not responding to HTTP requests.
- ‘ps -ef’ shows a couple Java proceses (out of 100’s total)
consuming 99% of CPU time.
- The problem recurrs with a frequency of between 6 and 24 hours.
- There is, as yet, no known procedure to reproduce the problem
on demand.
Questions
Some questions to which I want answers:
- What Java threads correspond to each UNIX process
- The stack trace for every Java thread
- What locks are held / being waited on
- What objects are allocated
Tools
The IBM JVM comes with a number of custom, built-in tools for
collecting the information required to answer these questions.
Unlike the standard tools in the Java Platform Debugging
Architecture, these IBM-specific facilities impose (near?) zero
performance overhead on the JVM, so can be left permanently
enabled on production systems. What is more, these tools free,
quick and easy to use.
There is extensive (think 400+ pages long) documentation
available from the IBM JVM diagnosis docs site. What follows
is a brief overview of some of the details…
Java Dump (aka Javacore)
The javadump tool produces text files that contain diagnostic
information related to the JVM and the application running
within. This facility is enabled by default. A javadump can
be manually triggered by sending SIG_QUIT to the root process
of the JVM. The dump file will be created in first of the
following locations which is writable:
- The directory specified in IBM_JAVACOREDIR env variable
- The JVM’s current working directory
- The directory specified in TMPDIR env variable
- The /tmp directory
A dump file will be generated automatically if a native exception
occurrs in the JVM. If the IBM_JAVADUMP_OUTOFMEMORY env variable
is set, then a dump will also be generated when heapspace is
exhausted. The file name will be of the form javacore_[pid].[time].txt
Some of the information included in the dump file is:
- The OS memory info (eg from /proc/meminfo)
- User resource limits
- Signal handlers
- Environment variables
- OS memory maps (eg from /proc/[pid]/maps)
- List of all object monitors being waited on
- List of all system locks (and what thread, if any,
is holding them).
- List of all threads, their Java names & their UNIX
process ids.
- Java stack trace for each thread
- Native OS stack trace for each thread
- List of all classloaders
- List of all loaded classes
This is all output in a wellformatted text files which makes for
fairly easy human reading & trivial Perl scripting.
Heap dump
The heapdump tool produces text files that contain a dump of
all objects currently allocated on the heap and a list of
references between objects. This facility is disabled by default,
but can be enabled by setting the env variable IBM_HEAP_DUMP=true
prior to starting the JVM. When enabled, a heapdump is generated
at the same time as the javadump, eg by sending a SIG_QUIT signal.
The filename will be of the form heapdump_[pid].[time].txt While
plain text, the format of this file isn’t directly interesting to
humans, but when processed with the tool ‘heaproot’ you can get
some interesting information:
- Objects table – list of objects in the heap sorted
by size, address, name, subtree size, many more…
- Types table – list of classes with objects in heap
sorted by total size, instance count, name
- Gaps table – list of free spaces in the heap
- Object info – list of references to & from an object
- Paths/routes – find a route between two objects (if
one exists).
JFormat
The JFormat tool is a TUI/GUI for extracting & displaying
information from UNIX core dumps. To enable core dumps there
are two things to check before starting the JVM:
- Remove any resource limits ‘ulimit -c unlimited’
- Change into a writable directory
Now to generate a core dump, simply send the root process
a SIG_ABRT signal.
Since core dumps are operating system specific, you must
now process it with the jextract tool to create a platform
independant core dump.
$JAVA_HOME/jre/bin/jextract [path-to-native-core-dump]
Then you can launch the jformat gui with
$JAVA_HOME/bin/jformat -g
The JFormat GUI provides menus to pull out all sorts of
useful information, which I won’t go into now, other than
to say there is alot of overlap with the Java Dump and
Heap Dump metrics.
Example
Having set the IBM_HEAPDUMP=true env variable & restarted the
servers I just had to wait for the performance slowdown to
recurr (which indicated the garbage collector was starting to
thrash). I now sent a SIGQUIT to the root process generating
a heapdump file about 850 MB in size. This was then loaded into
the HeapRoots analysis program:
java -jar /tmp/HR205.jar heapdump3632.1083842645.txt
After parsing the file heaproots shows a few basic statistics
about the dump:
# Objects : 7,727,411
# Refs : 12,807,015
# Unresolved refs : 31
Heap usage : 315,413,344
The first thing to do is invoke the ‘process’ command. This resolves
all references between the objects, detects roots, calculates total
size metrics per class & sub-tree.
Pure roots : 174,088
... which reach : 7,706,600
Artificial roots : 1,531
Softlinks followed : 16,014
Now, its ready to answer some interesting queries. Lets see what
the biggest tree roots are. This is done by running the ‘dt’
command – ‘Dump from roots, sort by total size’. It prompts for
some values such as max depth to print out, min size threshold,
then produces something looking like
<0> [270,167,808] 0x100a2ef8 [72] java/lang/Thread
<1> [270,167,680] 0x10053c70 [168] com/caucho/java/CompilingClassLoader
<2> [270,164,496] 0x10053d18 [168] com/caucho/util/DirectoryClassLoader
- 14 children of 0x10053d18 too deep.
<2> {270,167,680} 0x10053c70 [168] com/caucho/java/CompilingClassLoader
<2> <270,163,200 parent:0x10053d18> 0x10053dc0 [168] com/caucho/java/CompilingClassLoader
<2> <270,163,200 parent:0x10053d18> 0x10053dc0 [168] com/caucho/java/CompilingClassLoader
- 13 children of 0x10053c70 filtered.
- 4 children of 0x100a2ef8 filtered.
<0> [5,234,712] 0x16a85b90 [447,296] array of java/lang/Object
- 111820 children of 0x16a85b90 filtered.
Unsurprisingly the root with highest memory consumption is the top level
thread of the JVM. Not so enlightening. I figure the memory leak is
probably in CCM code, so lets find the biggest object in a package whose
name begins with ‘com/arsdigita/’. For this we use the ‘Objects dump, sort
by total size’ command (‘ot com/arsdigita/’) again this prompts for some
more optional filters, but accepting the defaults we get a list of big
objects:
Addr Size Root-owner Parent Total-size Name
-------------------------------------------------------------------------------
N 0x284ffb48 256 0x100a2ef8 0x284ffc88 224,107,816 class com/arsdigita/templating/Templating
N 0x220d3458 96 0x100a2ef8 0x22bc0ee8 4,585,520 com/arsdigita/london/theme/util/ThemeDevelopmentFileManager
N 0x1cbc24b0 32 0x100a2ef8 0x1e18a5a8 2,460,464 com/arsdigita/templating/XSLTemplate
N 0x23180b28 80 0x100a2ef8 0x22bc0ee8 2,164,688 com/arsdigita/london/search/RemoteSearcher
N 0x1bf215e0 56 0x100a2ef8 0x1bf2a850 1,844,152 com/arsdigita/persistence/Session
N 0x1bf2ab98 40 0x100a2ef8 0x1bf215e0 1,770,096 com/arsdigita/persistence/TransactionContext
N 0x1affda20 16 0x100a2ef8 0x1aa4d388 1,681,056 com/arsdigita/persistence/DataQueryImpl$UnaliasMapper
N 0x1affc508 96 0x100a2ef8 0x1affda20 1,681,040 com/arsdigita/persistence/DataAssociationCursorImpl
N 0x1a985ff0 56 0x100a2ef8 0x1a98e8e8 1,138,696 com/arsdigita/bebop/PageState
So, the class object for ‘com.arsdigita.templating.Templating’ is taking
up 224 MB of the heap. A quick look at the code shows it has a static
HashMap storing XSLTemplate objects with no size bounding. To confirm this
is the culprit we go back tot he ‘Dump from roots, sort by total size’ command,
this time starting display from the object at address 0x284ffb48:
<0> [224,107,816] 0x284ffb48 [256] class com/arsdigita/templating/Templating
<1> [224,092,408] 0x28500840 [56] java/util/HashMap
<2> [224,092,352] 0x1cdf9160 [1,552] array of java/util/HashMap$Entry
<3> [4,288,800] 0x2dbfa500 [32] java/util/HashMap$Entry
<4> [3,216,568] 0x1bf5cc00 [32] java/util/HashMap$Entry
- 3 children of 0x1bf5cc00 too deep.
<4> [1,069,960] 0x2c824ca8 [32] com/arsdigita/templating/XSLTemplate
- 4 children of 0x2c824ca8 too deep.
- 1 child of 0x2dbfa500 filtered.
<3> [3,216,584] 0x263c8ff8 [32] java/util/HashMap$Entry
<4> [2,144,416] 0x3d00ade8 [32] java/util/HashMap$Entry
- 3 children of 0x3d00ade8 too deep.
<4> [1,069,912] 0x237796b0 [32] com/arsdigita/templating/XSLTemplate
- 4 children of 0x237796b0 too deep.
- 1 child of 0x263c8ff8 filtered.
<3> [3,216,568] 0x19319d58 [32] java/util/HashMap$Entry
<4> [2,144,400] 0x33ddd918 [32] java/util/HashMap$Entry
- 3 children of 0x33ddd918 too deep.
<4> [1,069,912] 0x1864f950 [32] com/arsdigita/templating/XSLTemplate
- 4 children of 0x1864f950 too deep.
- 1 child of 0x19319d58 filtered.
[snipped for another 10 pages of output]
<1> {224,107,816} 0x284ffb48 [256] class com/arsdigita/templating/Templating
- 15 children of 0x284ffb48 filtered.
There were 362 objects expanded.
So, from this its clear that the problem is a huge HashMap in the Templating
class containing around one hundred XSLTemplate objects, each around 1 MB in
size. So, the moral of the story – if you implement a static cache in a server,
make sure you:
- benchmark the average size of objects in the cache
- decide how much memory can be set aside to it
- use results from a) & b) to calculate a good max entry
count for the cache before invoking an purge algorithm.