Using SystemTAP for dynamically instrumented the Linux kernel

Posted: January 27th, 2006 | Filed under: Coding Tips, Uncategorized | No Comments »

It has been just over a year & 1/2 since I first blogged about DTrace suggesting that a similar tool would be very valuable to the Linux community. Well after a few long email threads, it turned out that a significant number of people within Red Hat agreed with this assessment and so in partnership with IBM and Intel the SystemTAP project came into life at the start of 2005. Starting with the previously developed KProbes dynamic instrumentation capability a huge amount of work has been done building out a high level language and runtime for safely, efficiently & reliably probing the kernel. It has seen a limited ‘technology preview’ in RHEL-4, and with its inclusion in the forthcoming Fedora Core 5 it will be exposed to a much wider community of users & potential developers.

On the very same day as Dave Jones was looking at the Fedora boot process via static kernel instrumentation, I was (completely co-incidentally ) playing around using SystemTAP to instrument the boot process. The probe I wrote looked at file opens, process fork/execve to enable a hierarchical view of startup to be pieced together. A simplified version of the script looked like:

global indent

function timestamp() {
  return string(gettimeofday_ms()) . indent[pid()] . " "
}

function proc() {
  return string(pid()) . " (" . execname() . ")"
}

function push(pid, ppid) {
  indent[pid] = indent[ppid] . "  "
}

function pop(pid) {
  delete indent[pid]
}

probe kernel.function("sys_clone").return {
  print(timestamp() . proc() . " forks " . string(retval()). "\n")
  push(retval(), pid())
}

probe kernel.function("do_execve") {
  print(timestamp() . proc() . " execs " . kernel_string($filename) . "\n")
}

probe kernel.function("sys_open") {
  if ($flags & 1) {
    print(timestamp() . proc() . " writes " . user_string($filename) . "\n")
  } else {
    print(timestamp() . proc() . " reads " . user_string($filename) . "\n")
  }
}

probe kernel.function("do_exit") {
  print(timestamp() . proc() . " exit\n")
  pop(pid())
}

A few tricks later it was running during boot, and having analysed the results with Perl one can display a summary of how many files each init script opened

 1    init                                        read  90 write  30 running...
   251  init                                      read  29 write   0 run 23.08s
     252  rc.sysinit                              read 1035 write  45 run 22.91s
       274  start_udev                            read 355 write 128 run 15.10s
         286  start_udev                          read  91 write   0 run 1.90s
           287  MAKEDEV                           read  91 write   0 run 1.88s
         291  udevstart                           read 177 write 124 run 3.95s
         614  usleep                              read   2 write   0 run 1.05s
       649  udev-stw.modules                      read  84 write   5 run 1.23s
       701  dmraid                                read 111 write   0 run 1.07s
   748  rc                                        read 235 write  13 run 14.57s
     753  S10network                              read 111 write  16 run 2.85s
     833  S12syslog                               read  44 write   3 run 0.43s
     844  S25netfs                                read  87 write   1 run 1.51s
     861  S55cups                                 read  31 write   2 run 1.70s
     878  S55sshd                                 read  52 write   1 run 0.86s
     892  S97messagebus                           read  31 write   2 run 0.44s
     900  S98NetworkManager                       read  96 write  10 run 0.58s
     910  S98NetworkManagerDispatcher             read  92 write   3 run 0.67s
     921  S98avahi-daemon                         read  29 write   0 run 0.41s
     929  S98haldaemon                            read  31 write   2 run 4.20s
     955  S99local                                read  17 write   1 run 0.16s

There are so many other interesting ways to analyse the data collected at boot which I don’t have space for in this blog, so I’ve put all the information (including how to run SystemTAP during boot) up on my Red Hat homepage

Thoughts on creating better RPMs

Posted: August 23rd, 2005 | Filed under: Coding Tips | No Comments »

Despite the impression given from endless flame wars about the merits of Debian packages vs RPMs, the biggest problems with RPM are not technical, but rather quality control. With its very long release cycles, extensive / comprehensive testing, and higher than average ability of its maintainers, the debian packages one encounters are very well produced indeed. Somewhat a victim of its own success & popularity, RPMs are being produced by a large number of third parties, many of whom only have passing familiarity with Linux, let along RPM packaging guideliens. As an unfortunate result, there are a large number of poorly packaged RPMs floating around the net. The even more unfortunate thing is that if just a few simple guidelines were followed, the situation would improve dramatically. So, without further ado, here are some pointers.

Group names
Only use group names which are defined in /usr/share/doc/rpm-[VERSION]/GROUPS – don’t make up new groups
Installation prefix
Most software should install into the regular /usr hierarchy, and definitely not into /usr/local, or /usr/local/[APP] which are for *unpackaged* software. Since RPM manages all installed files, there is no need to worry about files from different apps being installed side-by-side. RPM knows what belongs to which app and will ensure clean uninstallation of all files, and prevent installation of two apps with conflicting files. If a private installation location is required, then use /opt/APP-NAME/VERSION, which both separates out the app into its own heirarchy & also allows multiple versions to live in harmony.
Files list
The %files section MUST list all files belonging to the application. RPM uses this files list for many purposes

  • to ensure that two applications don’t try to install the same file
  • to ensure complete removal of all files upon uninstallation
  • to verify installed files for changes, from modification time, and ownership right down to md5 sums.

Having a %post section which, for example, extracts a ZIP / tar.gz archive of files for the application completely bypasses these important features of RPM, dramatically lowering the value of having the application packaged in an RPM at all.

Init scripts
If the application includes init scripts, make sure they are registered with chkconfig, but don’t turn on the service by default – policy decisions such as whether a daemon should start on boot are the realm of the system administrator, not the software distributor. This is especially important if the application listens on any network ports.
Patch releases
After an initial release of software it may be neccessary to distribute a patch update. The approach for this is to take the original source RPM, add one or more patches in the spec file, increment the release number & then rebuild the bina4ry RPM.

30 second guide to Perl DBI

Posted: August 5th, 2004 | Filed under: Coding Tips | No Comments »

The Perl DBI module provides a uniform API to access relational databases. Thanks to Perl’s data type model using it is considerably easier than, say JDBC in the Java world. In common with most database access APIs, the code is split into two bits, the generic infrastructure is in the DBI module, while backends for each database are in the various DBD modules:

The first task in any program is to connect to the server. This is done with the DBI->connect method. To simplify error handling, its good practice to turn on ‘RaiseError’ option, and turn off ‘PrintError’ and ‘AutoCommit’. Then by wrapping the entire unit of work in an eval we get safe transaction commit/rollback without the need to check method return status on each DBI call.

  use DBI;

  my $db;

  eval {
    $db = DBI->connect("DBI:Pg:dbname=mydb;host=myhost",
                       $username, $password, {
			  RaiseError => 1,
			  PrintError => 0,
			  AutoCommit => 0
                       });

      ...do some work with the db...
  };
  if ($@) {
     if ($db) {
       $db->rollback;
     }
     die $@;
  }
  $db->commit;
  $db->disconnect;

The next task is issue statements to the DB. Following common practice DBI allows placeholders to be used in SQL, which are substituted with real values at execution time. If the underlying DB doesn’t support placeholders, DBI will emulate them. For maximum performance its advisable to use prepared statement handles, and again DBI will emulate this feature if the underlying driver does not support it.

    my $sth1 = $db->prepare_cached("INSERT INTO foo (bar) values (?)");
    $sth1->execute($bar);

    my $sth2 = $db->prepare_cached("UPDATE foo SET bar = ? WHERE wizz = ?");
    $sth2->execute($bar, $wizz);

    my $sth3 = $db->prepare_cached("DELETE foo WHERE bar LIKE ?");
    $sth3->execute($bar);

The final common task is retrieving data from the DB. There are a number of ways to get data back, but the simplest to code is to bind variables to each return parameter and then call ‘fetchrow’.

    my $sth4 = $db->prepare_cached("SELECT bar, wizz FROM foo where bar > ?");
    $sth4->execute(20);

    my ($bar, $wizz);
    $sth4->bind_columns(\$bar, \$wizz);
    while ($sth4>fetchrow) {
        print "Got $bar $wizz\n";
    }

Putting this together a complete example looks like

  use DBI;

  my $db;

  eval {
    $db = DBI->connect("DBI:Pg:dbname=mydb;host=myhost",
                       $username, $password, {
			  RaiseError => 1,
			  PrintError => 0,
			  AutoCommit => 0
                       });

    my $sth1 = $db->prepare_cached("INSERT INTO foo (bar) values (?)");
    $sth1->execute($bar);

    my $sth2 = $db->prepare_cached("UPDATE foo SET bar = ? WHERE wizz = ?");
    $sth2->execute($bar, $wizz);

    my $sth3 = $db->prepare_cached("DELETE foo WHERE bar LIKE ?");
    $sth3->execute($bar);

    my $sth4 = $db->prepare_cached("SELECT bar, wizz FROM foo where bar > ?");
    $sth4->execute(20);

    my ($bar, $wizz);
    $sth4->bind_columns(\$bar, \$wizz);
    while ($sth4->fetchrow) {
        print "Got $bar $wizz\n";
    }
  };
  if ($@) {
     if ($db) {
       $db->rollback;
     }
     die $@;
  }
  $db->commit;
  $db->disconnect;

HTML filtering of user supplied content

Posted: May 28th, 2004 | Filed under: Coding Tips | 1 Comment »

Scenario

Filtering HTML tags in user entered data is an important aspect of all
web based systems. It serves both to avoid security vunerabilities & allow
the site administrator control over what is displayed in the site. In
all WAF based applications I’ve worked on we’ve relied on the fact that
XSL transformers will automatically escape HTML tags unless you specifically
set the ‘disable-output-escaping’ attribute on the <xsl:value-of> tag.
While this has the virtue of being simple & very safe by default, its
crude on / off action is increasingly becoming a source of problems,
particulary with CMS content items.

For an idea of how its hurting, consider the following situation:

  • The Agenda content type has a number of text attributes:
    • Location
    • Attendees
    • Subject
    • Contact
    • Summary
    • Body text

    Typically, the XSL only allows tags in the ‘body text’ attribute
    to be rendered – everything else is escaped.

  • When entering some non-west european languages, text reads from
    right-to-left instead of left-to-right. To achieve this in HTML
    you need to add the dir=”rtl” to you tags. If a particular field
    in a stylesheet only ever holds a single language, then this can
    be done in the XSL stylesheet, however, there are occasions when a
    single block of text has multiple languages interspersed. In this
    case, the user must enter <span dir=”rtl”> tags themselves.

The combination of these two points creates a problem, because we only
want to allow HTML in certain fields, but we need to enter tags
in any field to set the text direction.

The only way out of this is to change the XSLT so that all fields
allow any HTML tag to be rendered. Which in turn implies we need to
filter HTML tags in user entered data.

Use cases

Before considering how to filter HTML, lets enumerate a few use cases:

  • Allow tag with ‘rtl’ attribute
  • Allow any block or inline tag with ‘rtl’ attribute
  • Allow any tag, but no onXXX event tags
  • Allow any tag in the HTML-4.0 Strict DTD.
  • Disallow any <font> and <blink> tags.
  • Allow any inline markup.
  • Disallow tables.

There is also the question of what you do to the text when encountering
a forbidden tag. There are two plausible actions:

  • Strip out the tag, leaving the content
  • Strip out the tag, including the content

The former is applicable for situations where you know the content of
the tag is safe, eg, stripping <font>…some text…</font> tags, you
ought to let ‘….some text…’ pass through. The latter is applicable
when stripping something like an <applet> tag.

Algorith design

So, a reasonable approach to filtering HTML would go something like this:

  1. Build up a rule set:
    • Set a default ‘tag’ policy – one of:
      1. allow – don’t touch tag
      2. deny – remove tag & content
      3. filter – remove tag, leaving content
    • Set a default ‘attribute’ policy – one of:
      1. allow
      2. deny
    • Create a hash mapping ‘tag’ to ‘policy’ for all
      tags with a non-default policy
    • For each tag, create a hash mapping ‘attribute’ to
      ‘policy’ for all attributes with a non-default
      policy
  2. Tokenise the HTML document, building up a parse tree,
    matching opening & closing tags. Also fill in any closing
    tags that were ommitted, eg typically </p>, </li>
  3. Traverse the parse tree. When encountering a tag, apply
    the rules

    1. if the tag is allowed
      • filter out any attributes which are denied
      • output the opening tag
      • process sub-tags (if any)
      • output the closing tag (if any)
    2. if the tag is denied
      • skip the opening tag
      • skip sub-tags (if any)
      • skip the closing tag
    3. if the tag is filtered
      • skip the opening tag
      • process sub-tags (if any)
      • skip the closing tag (if any)

The only potentially difficult bit is 2) tokenizing the HTML
and building a syntax tree. Crucial features for such a parser
are

  • Thread safe (we can be serving many requests at once)
  • Efficient (ie fast at parsing large amounts of data)
  • Character set aware (at least UTF-8)

For Java a suitable candidate is HTMLParser
while in Perl there is HTML::Tree

Integrating with applications

Now that we have the basics of the HTML filter worked out, there
is a question of integrating it with applications. There are three
possibilities:

  1. Filter the data in the form submission
  2. Throw a validation error in the from submission if
    forbidden markup is found.
  3. Filter the data when generating the output (ie XML or HTML in a JSP/CGI)

In most cases, a) and/or b) are the best approaches since they
catch the problem at the earliest stage. Indeed it may be best
to use a combination of both:

  • Default action is to just throw validation error,
    giving the user a chance to fixup their data. (this is
    nice for letting the user deal with typos).

  • If they click the ‘cleanup HTML’ checkbox, then automatically
    strip all remaining invalid tags.

The final thought is how to decide on the filtering rule sets.
Again, a one size fits all approach is probably too restrictive.
For example, when using the Article content type in CMS, it is
conceivable that role A (normal authors) should be allowed a
limited set of HTML, but role B (the organization web team) be
allowed arbitrary HTML. Thus there is a case for providing the
site administrator with the means to specify different filtering
rules per role.

(Near?) zero overhead debugging with the IBM JVM

Posted: May 10th, 2004 | Filed under: Coding Tips | 2 Comments »

Scenario

  • Web application running in a production environment is having
    periodic OutOfMemory / extreme CPU load issues.
  • The server is not responding to HTTP requests.
  • ‘ps -ef’ shows a couple Java proceses (out of 100’s total)
    consuming 99% of CPU time.
  • The problem recurrs with a frequency of between 6 and 24 hours.
  • There is, as yet, no known procedure to reproduce the problem
    on demand.

Questions

Some questions to which I want answers:

  • What Java threads correspond to each UNIX process
  • The stack trace for every Java thread
  • What locks are held / being waited on
  • What objects are allocated

Tools

The IBM JVM comes with a number of custom, built-in tools for
collecting the information required to answer these questions.
Unlike the standard tools in the Java Platform Debugging
Architecture, these IBM-specific facilities impose (near?) zero
performance overhead on the JVM, so can be left permanently
enabled on production systems. What is more, these tools free,
quick and easy to use.

There is extensive (think 400+ pages long) documentation
available from the IBM JVM diagnosis docs site. What follows
is a brief overview of some of the details…

Java Dump (aka Javacore)

The javadump tool produces text files that contain diagnostic
information related to the JVM and the application running
within. This facility is enabled by default. A javadump can
be manually triggered by sending SIG_QUIT to the root process
of the JVM. The dump file will be created in first of the
following locations which is writable:

  1. The directory specified in IBM_JAVACOREDIR env variable
  2. The JVM’s current working directory
  3. The directory specified in TMPDIR env variable
  4. The /tmp directory

A dump file will be generated automatically if a native exception
occurrs in the JVM. If the IBM_JAVADUMP_OUTOFMEMORY env variable
is set, then a dump will also be generated when heapspace is
exhausted. The file name will be of the form javacore_[pid].[time].txt

Some of the information included in the dump file is:

  • The OS memory info (eg from /proc/meminfo)
  • User resource limits
  • Signal handlers
  • Environment variables
  • OS memory maps (eg from /proc/[pid]/maps)
  • List of all object monitors being waited on
  • List of all system locks (and what thread, if any,
    is holding them).
  • List of all threads, their Java names & their UNIX
    process ids.
  • Java stack trace for each thread
  • Native OS stack trace for each thread
  • List of all classloaders
  • List of all loaded classes

This is all output in a wellformatted text files which makes for
fairly easy human reading & trivial Perl scripting.

Heap dump

The heapdump tool produces text files that contain a dump of
all objects currently allocated on the heap and a list of
references between objects. This facility is disabled by default,
but can be enabled by setting the env variable IBM_HEAP_DUMP=true
prior to starting the JVM. When enabled, a heapdump is generated
at the same time as the javadump, eg by sending a SIG_QUIT signal.

The filename will be of the form heapdump_[pid].[time].txt While
plain text, the format of this file isn’t directly interesting to
humans, but when processed with the tool ‘heaproot’ you can get
some interesting information:

  • Objects table – list of objects in the heap sorted
    by size, address, name, subtree size, many more…
  • Types table – list of classes with objects in heap
    sorted by total size, instance count, name
  • Gaps table – list of free spaces in the heap
  • Object info – list of references to & from an object
  • Paths/routes – find a route between two objects (if
    one exists).

JFormat

The JFormat tool is a TUI/GUI for extracting & displaying
information from UNIX core dumps. To enable core dumps there
are two things to check before starting the JVM:

  • Remove any resource limits ‘ulimit -c unlimited’
  • Change into a writable directory

Now to generate a core dump, simply send the root process
a SIG_ABRT signal.

Since core dumps are operating system specific, you must
now process it with the jextract tool to create a platform
independant core dump.

   $JAVA_HOME/jre/bin/jextract [path-to-native-core-dump]

Then you can launch the jformat gui with

   $JAVA_HOME/bin/jformat -g

The JFormat GUI provides menus to pull out all sorts of
useful information, which I won’t go into now, other than
to say there is alot of overlap with the Java Dump and
Heap Dump metrics.

Example

Having set the IBM_HEAPDUMP=true env variable & restarted the
servers I just had to wait for the performance slowdown to
recurr (which indicated the garbage collector was starting to
thrash). I now sent a SIGQUIT to the root process generating
a heapdump file about 850 MB in size. This was then loaded into
the HeapRoots analysis program:

   java -jar /tmp/HR205.jar heapdump3632.1083842645.txt

After parsing the file heaproots shows a few basic statistics
about the dump:

  # Objects            : 7,727,411
  # Refs               : 12,807,015
  # Unresolved refs    : 31
  Heap usage           : 315,413,344

The first thing to do is invoke the ‘process’ command. This resolves
all references between the objects, detects roots, calculates total
size metrics per class & sub-tree.

  Pure roots           : 174,088
  ... which reach      : 7,706,600
  Artificial roots     : 1,531
  Softlinks followed   : 16,014

Now, its ready to answer some interesting queries. Lets see what
the biggest tree roots are. This is done by running the ‘dt’
command – ‘Dump from roots, sort by total size’. It prompts for
some values such as max depth to print out, min size threshold,
then produces something looking like

<0> [270,167,808] 0x100a2ef8 [72] java/lang/Thread
<1>   [270,167,680] 0x10053c70 [168] com/caucho/java/CompilingClassLoader
<2>     [270,164,496] 0x10053d18 [168] com/caucho/util/DirectoryClassLoader
          - 14 children of 0x10053d18 too deep.
<2>     {270,167,680} 0x10053c70 [168] com/caucho/java/CompilingClassLoader
<2>     <270,163,200 parent:0x10053d18> 0x10053dc0 [168] com/caucho/java/CompilingClassLoader
<2>     <270,163,200 parent:0x10053d18> 0x10053dc0 [168] com/caucho/java/CompilingClassLoader
        - 13 children of 0x10053c70 filtered.
      - 4 children of 0x100a2ef8 filtered.
<0> [5,234,712] 0x16a85b90 [447,296] array of java/lang/Object
      - 111820 children of 0x16a85b90 filtered.

Unsurprisingly the root with highest memory consumption is the top level
thread of the JVM. Not so enlightening. I figure the memory leak is
probably in CCM code, so lets find the biggest object in a package whose
name begins with ‘com/arsdigita/’. For this we use the ‘Objects dump, sort
by total size’ command (‘ot com/arsdigita/’) again this prompts for some
more optional filters, but accepting the defaults we get a list of big
objects:

  Addr       Size       Root-owner Parent     Total-size  Name
-------------------------------------------------------------------------------
N 0x284ffb48 256        0x100a2ef8 0x284ffc88 224,107,816 class com/arsdigita/templating/Templating
N 0x220d3458 96         0x100a2ef8 0x22bc0ee8 4,585,520   com/arsdigita/london/theme/util/ThemeDevelopmentFileManager
N 0x1cbc24b0 32         0x100a2ef8 0x1e18a5a8 2,460,464   com/arsdigita/templating/XSLTemplate
N 0x23180b28 80         0x100a2ef8 0x22bc0ee8 2,164,688   com/arsdigita/london/search/RemoteSearcher
N 0x1bf215e0 56         0x100a2ef8 0x1bf2a850 1,844,152   com/arsdigita/persistence/Session
N 0x1bf2ab98 40         0x100a2ef8 0x1bf215e0 1,770,096   com/arsdigita/persistence/TransactionContext
N 0x1affda20 16         0x100a2ef8 0x1aa4d388 1,681,056   com/arsdigita/persistence/DataQueryImpl$UnaliasMapper
N 0x1affc508 96         0x100a2ef8 0x1affda20 1,681,040   com/arsdigita/persistence/DataAssociationCursorImpl
N 0x1a985ff0 56         0x100a2ef8 0x1a98e8e8 1,138,696   com/arsdigita/bebop/PageState

So, the class object for ‘com.arsdigita.templating.Templating’ is taking
up 224 MB of the heap. A quick look at the code shows it has a static
HashMap storing XSLTemplate objects with no size bounding. To confirm this
is the culprit we go back tot he ‘Dump from roots, sort by total size’ command,
this time starting display from the object at address 0x284ffb48:

<0> [224,107,816] 0x284ffb48 [256] class com/arsdigita/templating/Templating
<1>   [224,092,408] 0x28500840 [56] java/util/HashMap
<2>     [224,092,352] 0x1cdf9160 [1,552] array of java/util/HashMap$Entry
<3>       [4,288,800] 0x2dbfa500 [32] java/util/HashMap$Entry
<4>         [3,216,568] 0x1bf5cc00 [32] java/util/HashMap$Entry
              - 3 children of 0x1bf5cc00 too deep.
<4>         [1,069,960] 0x2c824ca8 [32] com/arsdigita/templating/XSLTemplate
              - 4 children of 0x2c824ca8 too deep.
            - 1 child of 0x2dbfa500 filtered.
<3>       [3,216,584] 0x263c8ff8 [32] java/util/HashMap$Entry
<4>         [2,144,416] 0x3d00ade8 [32] java/util/HashMap$Entry
              - 3 children of 0x3d00ade8 too deep.
<4>         [1,069,912] 0x237796b0 [32] com/arsdigita/templating/XSLTemplate
              - 4 children of 0x237796b0 too deep.
            - 1 child of 0x263c8ff8 filtered.
<3>       [3,216,568] 0x19319d58 [32] java/util/HashMap$Entry
<4>         [2,144,400] 0x33ddd918 [32] java/util/HashMap$Entry
              - 3 children of 0x33ddd918 too deep.
<4>         [1,069,912] 0x1864f950 [32] com/arsdigita/templating/XSLTemplate
              - 4 children of 0x1864f950 too deep.
            - 1 child of 0x19319d58 filtered.

[snipped for another 10 pages of output]
<1>   {224,107,816} 0x284ffb48 [256] class com/arsdigita/templating/Templating
      - 15 children of 0x284ffb48 filtered.
There were 362 objects expanded.

So, from this its clear that the problem is a huge HashMap in the Templating
class containing around one hundred XSLTemplate objects, each around 1 MB in
size. So, the moral of the story – if you implement a static cache in a server,
make sure you:

  1. benchmark the average size of objects in the cache
  2. decide how much memory can be set aside to it
  3. use results from a) & b) to calculate a good max entry
    count for the cache before invoking an purge algorithm.