A common question from developers/users new to lbvirt, is to wonder what the benefit of using libvirt is, over directly scripting/developing against QEMU/KVM. This blog posting attempts to answer that, by outlining features exposed in the libvirt QEMU/KVM driver that would not be automatically available to users of the lower level QEMU command line/monitor.
“All right, but apart from the sanitation, medicine, education, wine, public order, irrigation, roads, the fresh water system and public health, what have the Romans ever done for us?”
Insurance against QEMU/KVM being replaced by new $SHINY virt thing
Linux virtualization technology has been through many iterations. In the beginning there was User Mode Linux, which was widely used by many ISPs offering Linux virtual hosting. Then along came Xen, which was shipped by many enterprise distributions and replaced much usage of UML. QEMU has been around for a long time, but when KVM was added to the Linux kernel and integrated with QEMU, it became the new standard for Linux host virtualization, often replacing the usage of Xen. Application developers may think
“QEMU/KVM is the Linux virtualization standard for the future, so why do we need a portable API?”
To think this way is ignoring the lessons of history, which are that every virtualization technology that has existed in Linux thus far has been replaced or rewritten or obsoleted. The current QEMU/KVM userspace implementation may turn out to be the exception to the rule, but do you want to bet on that? There is already a new experimental KVM userspace application being developed by a group of LKML developers which may one day replace the current QEMU based KVM userspace. libvirt exists to provide application developers an insurance policy should this come to pass, by isolating application code from the specific implementation of the low level userspace infrastructure.
Long term stable API and XML format
The libvirt API is an append-only API, so once added an existing API wil never be removed or changed. This ensures that applications can expect to operate unchanged across multiple major RHEL releases, even if the underlying virtualization technology changes the way it operates. Likewise the XML format is append-only, so even if the underlying virt technology changes its configuration format, apps will continue to work unchanged.
Automatic compliance with changes in KVM command line syntax “best practice”
As new releases of KVM come out, the “best practice” recommendations for invoking KVM processes evolve. The libvirt QEMU driver attempts to always follow the latest “best practice” for invoking the KVM version it finds
Example: Disk specification changes
- v1: Original QEMU syntax:
-hda filename
- v2: RHEL5 era KVM syntax
-drive file=filename,if=ide,index=0
- v3: RHEL6 era KVM syntax
-drive file=filename,id=drive-ide0-0-0,if=none \
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide-0-0-0
- v4: (possible) RHEL-7 era KVM syntax:
-blockdev ...someargs... \
-device ide-drive,bus=ide.0,unit=0,blockdev=bdev-ide0-0-0,id=ide-0-0-0
Isolation against breakage in KVM command line syntax
Although QEMU attempts to maintain compatibility of command line flags across releases, there may been cases where options have been removed / changed. This has typically occurred as a result of changes which were in the KVM branch, and then done differently when the code was merged back into the main QEMU GIT repository. libvirt detects what is required for the QEMU binary and uses the appropriate supported syntax:
Example: Boot order changes for virtio.
- v1: Original QEMU syntax (didn’t support virtio)
-boot c
- v2: RHEL5 era KVM syntax
-drive file=filename,if=virtio,index=0,boot=on
- v3: RHEL-6.0 KVM syntax
-drive file=filename,id=drive-virtio0-0-0,if=none,boot=on \
-device virtio-blk-pci,bus=pci0.0,addr=1,drive=drive-virtio0-0-0,id=virtio-0-0-0
- v4: RHEL-6.1 KVM syntax (boot=on was removed in the update, changing syntax wrt RHEL-6.0)
-drive file=filename,id=drive-virtio0-0-0,if=none,bootindex=1 \
-device virtio-blk-pci,bus=pci0.0,addr=1,drive=drive-virtio0-0-0,id=virtio-0-0-0
Transparently take advantage of new features in KVM
Some new features introduced in KVM can be automatically enabled / used by the libvirt QEMU driver without requiring changes to applications using libvirt. Applications thus benefit from these new features without expending any further development effort
Example: Stable machine ABI in RHEL-6.0
The application provided XML requests a generic ‘pc’ machine type. libvirt automatically queries KVM to determine the supported machines types and canonicalizes ‘pc’ to the stable versioned machine type ‘pc-0.12’ and preserves this in the XML configuration. Thus even if host is updated to KVM 0.13, existing guests will continue to run with the ‘pc-0.12’ ABI, and thus avoid potential driver breakage or guest OS re-activation. It also makes it is possible to migrate the guest between hosts running different versions of KVM.
Example: Stable PCI device addressing in RHEL-6.0
The application provided XML requests 4 PCI devices (balloon, 2 disks & a NIC). libvirt automatically assigns PCI addresses to these devices and ensures that every time the guest is launched the exact same PCI addresses are used for each device. It will also manage PCI addresses when doing PCI device hotplug and unplug, so the sequence boot; unplug nic; migrate will result in stable PCI addresses on the target host.
Automatic compliance with changes in KVM monitor syntax best practice
In the same way that KVM command line argument “best practice” changes between releases, so does the monitor “best practice”. The libvirt QEMU drivers aims to always be in compliance with the “best practice” recommendations for the version of KVM being used.
Example: monitor protocol changes
- v1: Human parsable monitor in RHEL5 era or earlier KVM
- v2: JSON monitor in RHEL-6.0
- v3: JSON monitor, with HMP passthrough in RHEL-6.1
Isolation against breakage in KVM monitor commands
Upstream QEMU releases have generally avoided changing monitor command syntax, but the KVM fork of QEMU has not been so lucky when merging changes back into mainline QEMU. Likewise, OS distros sometimes introduce custom commands in their branches.
Example: Method for setting VNC/SPICE passwords
- v1: RHEL-5 using the ‘change’ command only for VNC, or ‘set_ticket’ for SPICE
'change vnc 123456'
'set_ticket 123456'
- v2: RHEL-6.0 using the (non-upstream) ‘__com.redhat__set_password’ for SPICE or VNC
{ "execute": "__com.redhat__set_password", "arguments": { "protocol": "spice", "password": "123456", "expiration": "60" } }
- v4: RHEL-6.1 using the upstream ‘set_password’ and ‘expire_password’ commands
{ "execute": "set_password", "arguments": { "protocol": "spice", "password": "123456" } }
{ "execute": "expire_password", "arguments": { "protocol": "spice", "time": "+60" } }
Security isolation with SELinux / AppArmour
When all guests run under the same user ID, any single KVM process which is exploited by a malicious guest, will result in compromise of all guests running on the host. If the management software runs as the same user ID as the KVM processes, the management software can also be exploited.
sVirt integration in libvirt, ensures that every guest is assigned a dedicated security context, which strictly confines what resources it can access. Thus even if a guest can read/write files based on matching UID/GID, the sVirt layer will block access, unless the file / resource was explicitly listed in the guest configuration file.
Security isolation using POSIX DAC
A additional security driver uses traditional POSIX DAC capabilities to isolate guests from the host, by running as an unprivileged UID and GID pair. Future enhancements will run each individual guest under a dedicated UID.
Security isolation using Linux container namespaces
Future will take advantage of recent advances in Linux container namespace capabilities. Every QEMU process will be placed into a dedicate PID namespace, preventing QEMU seeing any processes on the system, should it be exploited. A dedicated network namespace will block access to all host network devices, only allow access to TAP device FDs, preventing an exploited guest from making arbitrary network connections to outside world. When QEMU gains support for a ‘fd:’ disk protocol, a dedicated filesystem namespace will provide a very secure chrooted environment where it can only use file descriptors passed in from libvirt. UID/GID namespaces will allow separation of QEMU UID/GID for each process, even though from the host POV all processes will be under the same UID/GID.
Avoidance of shell code for most operations
Use of the shell for invoking management commands is susceptible to exploit by users by providing data which gets interpreted as shell metadata characters. It is very hard to get escaping correctly applied, so libvirt has an advanced set of APIs for invoking commands which avoid all use of the shell. They are also able to guarantee that no file descriptors from the management layer leak down into the QEMU process where they could be exploited.
Secure remote access to management APIs
The libvirt local library API can be exposed over TCP using TLS, x509 certificates and/or Kerberos. This provides secure remote access to all KVM management APIs, with parity of functionality vs local API usage. The secure remote access is a validated part of the common criteria certifications
Operation audit trails
All operations where there is an association between a virtual machine and a host resource will result in one or more audit records being generated via the Linux auditing subsystem. This provides a clear record of key changes in the virtualization state. This functionality is typically a mandatory requirement for deployment into government / defence / financial organizations.
Security certification for Common Criteria
libvirt’s QEMU/KVM driver has gone through Common Criteria certification. This provides assurance for the host management virtualization stack, which again, is typically a mandatory requirement for deployment into government / defense related organizations. The core components in this certification are the sVirt security model and the audit subsystem integration
Integration with cgroups for resource control
All guests are automatically placed into cgroups. This allows control of CPU schedular priority between guests, block I/O bandwidth tunables, memory utilization tunables. The devices cgroups controller, provides a further line of defence blocking guest access to char & block devices on the host in the unlikely event that another layer of protection fails. The cpu accounting group will also enable querying of the per-physical CPU utilization for guests, not available via any traditional /proc file.
Integration with libnuma for memory/cpu affinity
QEMU does not provide any native support for controlling NUMA affinity via its command line. libvirt integrates with libnuma and/or sched_setaffinity for memory and CPU pinning respectively
CPU compatibility guarantees upon migration
libvirt directly models the host and guest CPU feature sets. A variety of policies are available to control what features become visible to the guest, and strictly validate compatibility of host CPUs across migration. Further APIs are provided to allow apps to query CPU compatibility ahead of time, and to enable computation of common feature sets between CPUs. No comparable API exists elsewhere.
Host network filtering of guest traffic
The libvirt network filter APIs allow definition of a flexible set of policies to control network traffic seen by guests. The filter configurations are automatically translated into a set of rules in ebtables, iptables and
ip6tables. The rules are automatically created when the guest starts and torn down when it stops. The rules are able to automatically learn the guest IP address through ARP traffic, or DHCP responses from a (trusted) DHCP server. The data model is closely aligned with the DMTF industry spec for virtual machine network filtering
Libvirt ships with a default set of filters which can be turned on to provide ‘clean’ traffic from the guest. ie it blocks ARP spoofing and IP spoofing by the guest which would otherwise DOS other guests on the network
Secure migration data transport
KVM does not provide any secure migration capability. The libvirt KVM driver adds support for tunnelling migration data over a secure libvirtd managed data channel
Management of encrypted virtual disks for guests
An API set for managing secret keys enables a secure model for deploying guests to untrusted shared storage. By separating management of the secret keys from the guest configuration, it is possible for the management app to ensure a guest can only be started by an authorized user on an authorized host, and prevent restarts once it shuts down. This leverages the Qcow2 data encryption capabilities and is the only method to securely deploy guest if the storage admins / network is not trusted.
Seamless migration of SPICE clients
The libvirt migration protocol automatically takes invokes the KVM graphics clients migration commands at the right point in the migration process to relocate connected clients
Integration with arbitrary lock managers
The lock manager API allows strict guarantees that a virtual machine’s disks can only be in use by once process in the cluster. Pluggable implementations enable integration with different deployment scenarios. Implementations available can use Sanlock, POSIX locks (fcntl), and in the future DLM (clustersuite). Leases can be explicitly provided by the management application, or zero-conf based on the host disks assigned to the guests
Host device passthrough
Integration between host device management APIs and KVM device assignment allows for safe passthrough of PCI devices from the host. libvirt automatically ensures that the device is detached from the host OS before guest startup, and re-attached after shutdown (or equivalent for hotplug/unplug). PCI topology validation ensures that the devices can be safely reset, with FLR, Power management reset, or secondary bus reset.
Host device API enumeration
APIs for querying what devices are present on the host. This enables applications wishing to perform device assignment, to determine which PCI devices are available on the host and how they are used by the host OS. eg to detect case where user asked to assign a device that is currently used as the host eth0 interface.
NPIV SAN management
The host device APIs allow creation and deletion of NPIV SCSI interfaces. This in turns provides access to sets of LUNs, whcih can be associated with individual guests. This allows the virtual SCSI HBA to “follow” the guest where ever it may be running
Core dump integration with “crash”
The libvirt core dump APIs allows a guest memory dump to be obtained and saved into a format which can then be analysed by the Linux crash tool.
Access to guest dmesg logs even for crashed guests
The libvirt API for peeking into guest memory regions allows the ‘virt-dmesg’ tool to extract recent kernel log messages from a guest. This is possible even when the guest kernel has crashed and thus no guest agent is otherwise available.
Cross-language accessibility of APIs.
Provides an stable API which allows KVM to be managed from C, Perl, Python, Java, C#, OCaml and PHP. Components of the management stack are not tied into all using the same language
Zero-conf host network access
Out of the box libvirt provides a NAT based network which allows guests to have outbound network access on a host, whether running wired, or wireless, or VPN, without needing any administrator setup ahead of time. While QEMU provides the ‘SLIRP’ networking mode, this has very low performance and does not integrate with the networking filters capabilities libvirt also provides, nor does it allow guests to communicate with each other
Storage APIs for managing common host storage configurations
Few APIs existing for managing storage on Linux hosts. libvirt provides a general purpose storage API which allows use of NFS shares; formatting & use of local filesystems; partitioning of block devices; creation of LVM volume groups and creation of logical volumes within them; enumeration of LUNs on SCSI HBAs; login/out of ISCSI servers & access to their LUNs; automatic discovery of NFS exports, LVM volume groups and iSCSI targets; creation and cloning of non-raw disks formats, optionally including encryption; enumeration of multipath devices.
Integration with common monitoring tools
Plugins are available for munin, nagios and collectd to enable performance monitoring of all guests running on a host / in a network. Additional 3rd party monitoring tools can easily integrate by using a readonly libvirt connection to ensure they don’t interfere with the functional operation of the system.
Expose information about virtual machines via SNMP
The libvirt SNMP agent allows information about guests on a host to be exposed via SNMP. This allows industry standard tools to report on what guests are running on each host. Optionally SNMP can also be used to control guests and make changes
Expose information & control of host via CIM
The libvirt CIM agent exposes information about, and allows control of, virtual machines, virtual networks and storage pools via the industry standard DMTF CIM schema for virtualization.
Expose information & control of host via AMQP/QMF
The libvirt AMQP agent (libvirt-qpid) exposes information about, and allows control of, virtual machines, virtual networks and storage pools via the QMF object modelling system, running on top of AMQP. This allows data center wide view & control of the state of virtualization hosts.
The libvirt library uses a number of approaches to versioning to cover various usage scenarios:
- Software release versions
- Package manager versions (RPM)
- pkgconfig version
- libtool version
- ELF library soname/version
- ELF symbol version
The goal of libvirt is to never change the library ABI, in other words, once an API is added, struct declared, macro declared, etc it can never be changed. If an API was found flawed, the existing API may be annotated as deprecated, but it will never be removed. Instead a new alternative API is added.
Each new software release has an associated 3 component version number, eg 0.6.3, 0.8.0, 0.8.1. There is no strict policy on the significance of each component. At time of writing, the major component has never been changed. The minor component is changed when there is significant new functionality added to the library (usually a major collection of new APIs). The macro component is changed in each release and reset to zero when the minor component changes. So from an application developer’s point of view, if they want to use a particular API they look for the software release version that introduced the new API.
The text that follows often uses libvirt as an example. The principals / effects described apply to any ELF library and are not specific to libvirt.
Simple compile time version checking
Applications building against libvirt will make use of the pkgconfig tool to probe for existance of libvirt. The software release version is used, unchanged, as the pkgconfig version number. Thus if an application knows the API it wants was introduced in version 0.8.3, it will typically perform a compile time check like
$ pkg-config --atleast-version=0.8.3 libvirt \
&& echo "Found" || echo "Missing"
Or in autoconf macros
PKG_CHECK_MODULES([LIBVIRT], [libvirt >= 0.8.3])
This will result in compiler/linker flags that look something like
-lvirt
Or
-I/home/berrange/usr/include -L/home/berrange/usr/lib -lvirt
When the compiler links the application to libvirt, it will take the libvirt ELF soname and embed it in the application. The libvirt ELF soname is simply “libvirt.so.0” and since libvirt policy is to maintain ABI, it will never change.
libtool also includes library versioning magic that is intended to allow for version checks that are independent of the software release version. Few libraries ever fully make use of this capability since it is poorly understood, easy to get wrong and doesn’t integrate with the rest of the toolchain / OS distro. libvirt sets libtool version based on the package release version. It won’t be discussed further.
Simple runtime version checking
When an application is launched the ELF loader will attempt to resolve the linked libraries and verify their current soname matches that embedded in the application. Since libvirt will never break ABI, the soname check will always succeed. It should be clear that the soname check is a very weak versioning scheme. The application may have been linked against version 0.8.3, but the installed version it is run against could be 0.6.0 and this mismatch would not be caught by the soname check.
Installation version checking
The software release version is also used unchanged for the package manager version field. In RPM packages there is an extra ‘release’ component appended to the version for extra fine checking.
Thus the package manager can be used to provide stronger runtime version guarantee, by checking the full version number at installation time. The aforementioned example application would typically want to include a dependency
Requires: libvirt >= 0.8.3
The RPM package manager, however, also has support for automatic dependency extraction. For this it will extract the soname from any ELF libraries the application links against and secretly add statements of the form
Requires: libvirt.so.0()(64bit)
The libvirt RPM itself also gains secret statements of the form
Provides: libvirt.so.0()(64bit)
Since this is based on the soname though, these automatic dependencies are a very weak versioning scheme.
Symbol versioning
Most linkers will provide library developers with a way to control which symbols are exported for use by applications. This may be done via annotations in the source file, or more often via a standalone “linker script” which simply whitelists symbols to be exported.
The GNU ELF linker has support for doing more than simply whitelisting symbols. It allows symbols to be grouped together and a version number attached to each group. A group can also be annotated as being a child of another group. In this manner you can create trees of versioned symbols, although most common libraries will only use one path without branching.
The libvirt library uses the software release version to group symbols which were added in that release. The top levels of the libvirt tree are logically arranged like:
+------------------------+
| LIBVIRT_0.0.3 |
+------------------------+
| virConnectClose |
| virConnectOpen |
| virConnectListDomains |
| .... |
+------------------------+
|
V
+------------------------+
| LIBVIRT_0.0.5 |
+------------------------+
| virDomainGetUUID |
| virConnectLookupByUUID |
+------------------------+
|
V
+------------------------+
| LIBVIRT_0.1.0 |
+------------------------+
| virInitialize |
| virNodeGetInfo |
| virDomainReboot |
| .... |
+------------------------+
|
V
....
Runtime version checking with version symbols
When a library is built normally, the ELF headers will contain a list of plain symbol names:
$ eu-readelf -s /usr/lib64/libvirt.so | grep 'GLOBAL DEFAULT' | awk '{print $8}'
virConnectOpen
virConnectClose
virDomainGetUUID
....
When a library is built with versioned symbols though, the name is mangled to include a version string:
$ eu-readelf -s /usr/lib64/libvirt.so | grep 'GLOBAL DEFAULT' | awk '{print $8}'
virConnectOpen@@LIBVIRT_0.0.3
virConnectClose@@LIBVIRT_0.0.3
virDomainGetUUID@@LIBVIRT_0.0.5
....
Similarly when an application is built against a normal library, the ELF headers will contain a list of plain symbol names that it uses:
$ eu-readelf -s /usr/bin/virt-viewer | grep 'GLOBAL DEFAULT' | awk '{print $8}'
virConnectOpenAuth
virDomainFree
virDomainGetID
And as expected, when built against an library with versioned symbols, the application ELF headers will contained symbols mangled to include a version string:
$ eu-readelf -s /usr/bin/virt-viewer | grep 'GLOBAL DEFAULT' | awk '{print $8}'
virConnectOpenAuth@LIBVIRT_0.4.0
virDomainFree@LIBVIRT_0.0.3
virDomainGetID@LIBVIRT_0.0.3
When the application is launched, the linker is now able to perform strict version checking. For each symbol ‘FOO’
- If application listed an unversioned 'FOO'
- If library has an unversioned 'FOO'
=> Success
- Else library has a versioned 'FOO'
=> Success
- Else application listed a versioned 'FOO'
- If library has an unversioned 'FOO'
=> Fail
- Else library has a versioned 'FOO'
- If versions of 'FOO' match
=> Success
- Else versions don't match
=> Fail
Notice that an application with unversioned symbols will work with a versioned libary. A versioned application will not work with an unversioned library. This allows for versioned symbols to be added to be introduced to an existing unversioned library without causing an ABI compatibility problem.
With an application and library both using versioned symbols, the compile time requirement the developer declared to pkgconfig with “libvirt >= 0.8.3” is now effectively encoded in the binary via symbol versioning. Thus a fairly strict check can be performed at runtime to ensure the correct library is installed.
Installation version checking with versioned symbols
Remember that the RPM automatic dependencies extractor will find the ELF soname in libraries/applications. When using versioned symbols, it can go one step further and find the symbol version strings too.
The libvirt RPM will thus automatically get extra versioned dependencies
$ rpm -q libvirt-client --provides | grep libvirt
libvirt.so.0()(64bit)
libvirt.so.0(LIBVIRT_0.0.3)(64bit)
libvirt.so.0(LIBVIRT_0.0.5)(64bit)
libvirt.so.0(LIBVIRT_0.1.0)(64bit)
libvirt.so.0(LIBVIRT_0.1.1)(64bit)
libvirt.so.0(LIBVIRT_0.1.4)(64bit)
libvirt.so.0(LIBVIRT_0.1.5)(64bit)
An application will also automatically get extra data based on the versions of the symbols it links against
$ rpm -q virt-viewer --requires | grep libvirt
libvirt.so.0()(64bit)
libvirt.so.0(LIBVIRT_0.0.3)(64bit)
libvirt.so.0(LIBVIRT_0.0.5)(64bit)
libvirt.so.0(LIBVIRT_0.1.0)(64bit)
libvirt.so.0(LIBVIRT_0.4.0)(64bit)
libvirt.so.0(LIBVIRT_0.5.0)(64bit)
This example shows that this virt-viewer cannot run against this libvirt, since the libvirt library is missing versions 0.4.0 and 0.5.0
Thus, there is now rarely any need for an application developer to manually list dependencies against the libvirt library. ie they can remove any line like
Requires: libvirt >= 0.8.3
RPM will determine the minimum version automatically. This is good, because application developers will often forget to update these manually stated deps. The only possible exception to this, is where the
application needs to avoid a implementation bug in a particular version of libvirt and so request a version that is newer than that indicated by the symbol versions. In practice this isn’t very useful for libvirt, since most libvirt drivers have a client server model and the RPM dependency only validates the client, not the server.
The perils of backporting APIs with versioned symbols
It is fairly common when maintaining software in Linux distributions to want to backport code from a newer release, rather than rebase the entire package. Most maintainers will only attempt this for bug fixes or security fixes, but sometimes there is demand to backport new features, and even new APIs. From the descriptions shown above it should be clear that there are a number of serious downsides associated with backporting new APIs.
Impact on compile time pkgconfig checks
Application developers look at the version where a symbol was introduced and encode that in a pkgconfig check at compile time. If an OS distribution backports a symbol to an earlier version, the application developer has to now create OS-distro specific checks for the presence of the backported API. This is a retrograde step for the application developer, because the primary reason for using pkgconfig is that it is OS-distro *independent*
Next consider what happens to the libvirt.so symbols when an API is backported. There are three possible approaches. Keep the original symbol version, or adopt the symbol version of the release being backported to, or invent a new version. All approaches ultimately fail.
Impact of maintaining the new version with a backport
Consider backporting an API from 0.5.0, to a 0.4.0 release. If the original symbol version is maintained the runtime checks by the ELF library loader will still succeed normally. There is an impact on the install time package manager checks though. Backporting a single API from 0.5.0 release will result in the libvirt-client library gaining
Provides: libvirt.so.0(LIBVIRT_0.5.0)(64bit)
Even if there are many other APIs in 0.5.0 that are not backported. An application which uses an API from the 0.5.0 relase will gain
Requires: libvirt.so.0(LIBVIRT_0.5.0)(64bit)
This application can now be installed against either 0.4.0 + backported API, or 0.5.0. RPM is unable to correctly check requirements at application install time, because it cannot distinguish the two scenarios. The API backport has broken the RPM versioning.
Impact of adopting the older version with a backport
The other option for backporting is thus to adopt the version of the release being backported to. ie change the backported API’s symbol version string from
virEventRegisterImpl@LIBVIRT_0.5.0
To
virEventRegisterImpl@LIBVIRT_0.4.0
The problem here is that an application which was built against a normal libvirt binary will have a requirement for
virEventRegisterImpl@LIBVIRT_0.5.0
Attempting to run this application against the libvirt.so with the backported API will fail, because the linker will see that “LIBVIRT_0.4.0” is not the same as “LIBVIRT_0.5.0”.
All libvirt applications now have to be specially compiled for this OS distro’s libvirt to pick up correct symbol versions. This doesn’t solve the core problem, it merely reverses it. The application binary is now incompatible with a normal libvirt, making it impossible to test it against upstream libvirt releases without rebuilding the application too.
In addition the RPM version checking has once again been broken. An application that needs the backported symbol will have
Requires: libvirt.so.0(LIBVIRT_0.4.0)(64bit)
There is no way for RPM to validate whether this means the plain ‘0.4.0’ release, or the 0.4.0 + backport release.
Changing the symbol version when doing the backport has in effect change the ABI of the library in this particular OS distro.
What is worse, is that this changed symbol version has to be preserved in every single future release of the OS distro. ie applications from RHEL-5 expect to be able to run against and libvirt from RHEL-6, RHEL-7, etc without recompilation.
So changing the symbol version has all the downsides of not changing the symbol version, with an additional permanent maintenance burden of patching versions for every future OS release.
Impact of inventing a new version with a backport
For libraries which only target the Linux ELF loader and no other OS platform there is one other possible trick. The ELF linker script does not require a single linear progression of versioned symbol groups. The tree can have arbitrary branches, provided it remains a DAG. A symbol can also appear in multiple places at once.
This could hypothetically be leveraged during a backport of an API, by introducing a branch in the version tree at which the symbol appears. So instead of the symbol being either
virEventRegisterImpl@LIBVIRT_0.5.0
virEventRegisterImpl@LIBVIRT_0.4.0
It could be (eg based off the RPM release where it was backported).
virEventRegisterImpl@LIBVIRT_0.4.0_1
The RPM will thus gain
Provides: libvirt.so.0(LIBVIRT_0.4.0_1)(64bit)
And applications built against this backport will gain
Requires: libvirt.so.0(LIBVIRT_0.4.0_1)(64bit)
As compared to the first option of maintaining the original symbol version LIBVIRT_0.5.0, the RPM version checking will continue to operate correctly.
This approach, however, still has all the permanent package maintenance problems of the second approach. ie the patch which adds a ‘LIBVIRT_0.4.0_1’ version has to be maintained for every single future release of the OS distro, even once the backport of the actual code has gone. It also means that applications built against the backported library won’t work with an upstream build and vica-versa.
Conclusion for backporting symbols
It has been demonstrated that whatever approach is taken, backporting a API in a library with version symbols, will break either RPM version checking or cause incompatibilities in ELF loader version checking or both. It also requires applications to write OS distro specific checks. For a library that intends to maintain permanent API/ABI stability for application developers this is unacceptable.
The conclusion is clear: never backport APIs from *any* library, especially not one that makes any use of symbol versioning.
If a new API is required, then the entire package should be rebased to the version which includes the desired API.
A few months on from the first release, I’ve polished off the latest changes in Entangle and pushed out a new release version 0.2.0. This is primarily a bug fix and code maintenance release, with the following important changes
- Better compatibility with cameras which don’t support event notifications, allowing use of capture functionality
- Rewrote the preview mode to actually be useful, providing a continuously updating live view until ready to capture
- Directly offers to unmount the camera if it is locked by the GVFS gphoto2 plugin preventing its use for capture.
- Replaces custom plugin code with use of libpeas.
- Vastly improved the error reporting and logging to make diagnosis of camera problems easier
- Fixed a nasty infinite loop in the camera capture code where a camera event could keep resetting the event timeout
- Fixed a crash in accessing udev properties
It is already built for Fedora 14 and on its way into updates-testing, Fedora 15 will have to wait due to broken build roots caused by mutually conflicting upstart/systemd packages:-( Alternatively grab the source code from the download area. For the next release I’m aiming to improve the full screen mode to give access to some of the capture & config controls, and generally improve the UI for camera configuration controls.
In learning about virtual keyboard handling for my previous post, I spent alot of time researching just what scan code & key code sets are used at each point in the stack. I trawled across many websites and source code trees to discover this information, so I figure I should write it down in case I forget it all in 12 months time when I next get a keyboard handling bug report. I can’t guarantee I’ve got all the details here correct, so if you spot mistakes, leave a comment.
IBM XT keyboard
In the beginning the PC was invented. This has made a lot of people very angry and has been widely regarded as a bad move. The IBM XT scan codes are first set encountered. Most of the scan codes were a just a single byte, with the high bit clear for a make event and set for a break event. The keycodes don’t correspond at all to ASCII values, rather they were assigned incrementally across rows. As an example, the key in the 4th row, second column (labelled A on a us layout) has value 0x1e for make code and 0x9e for break code. Keys on the numeric keypad would generate multi-byte sequences, where the leading byte is “0xe0” (sometimes referred to as the grey code, I guess because the keypad was a darker grey plastic)
IBM AT & PS/2 keyboards
The IBM AT keyboard was incompatible with the XT at a wire protocol level. In terms of scan codes though, it can support upto 3 sets. The first set is the original XT scan codes. The second set is a new AT scan codes. The third set is so called PS/2 scan codes. In all cases though, the second set is the default and the first & third sets may not even be supported. There is no resemblance between AT and XT scan codes. The use of the high bit for break code was abandoned. The same scan code is used by make and break, but in the break case it is prefixed by a byte 0xf0. As an example, the key in the 4th row, second column (labelled A on a us layout) has the value 0x1c. Again the keys on the numeric keyboard generated multi-byte sequences with a leading “0xe0”.
For added fun, regardless of what scan code set the keyboard is generating the i8042 keyboard controller will, by default, provide the OS with XT scan codes unless told otherwise. This was for backwards compatible with software which only knows about the previous XT scan code set.
Over time Microsoft introduced the Windows keys, then Internet keys arrived followed by all sorts of other function keys on laptops. The extended 0xe0 (“grey code”) sequence space was used for these new scan codes. Of course these keys needed scan codes in both the XT and AT sets to be defined. So for example, the left Windows key got the XT make/break scan codes 0xe0+0x5c and 0xe0+0xdc and the AT make/break scan codes 0xe0+0x27 and 0xe0+0xf0+0x27.
USB keyboards
The introduction of USB input devices, allowed for a clean break with the past, so XT/AT scan codes are no more in the USB world. The USB HID spec defines a standard 16 bit scan code set for all compliant keyboards to use. The USB HID spec gives scan codes names based on the US ASCII key cap labels. Thus the US ASCII key labelled ‘A’ has HID page ID 0x07 and HID usage ID 0x04. Where both the XT/AT sets and USB HID sets support a common key it is possible to perform a lossless mapping in either direction. There are, however, some keys in the USB HID scan code set for which there isn’t a XT/AT scan code.
Linux internals
The Linux internal event subsystem has defined a standard set of key codes that are hardware independant, able to represent any scan code from any type of keyboard whether AT, XT or USB. There are names assigned to the key codes based on the common US ASCII key cap labels. The key codes are defined in /usr/include/linux/input.h. For an example ‘#define KEY_A 30’. For convenience, key codes 0-88 map directly to XT scan codes 0-88. All the low level hardware drivers for AT, XT, USB keyboards, etc have to translate from scan codes to the Linux key codes when queueing key events for dispatch by the input subsystem. In “struct input_event”, the “code” field contains the key code while the “value” field indicates the state (press/release). Mouse buttons are also represented as key codes in this set.
Linux console
The Linux console (eg /dev/console or /dev/tty*) can operate in several different modes.
- RAW: turns linux keycodes back into XT scancodes
- MEDIUMRAW: an encoding of the linux key codes. key codes < 127 generated directly, with high bit set for a release event. key codes >= 127 are generated as multi-byte sequences. The leading byte is 0x0 or 0x1 depending on whether a key press or release. The second byte contains the low 7 bits and third byte contains the high 7 bits. Both second and third bytes always have the 8th bit set. This allows room for future expansion upto 16384 different key codes
- ASCII: the ascii value associated with the key code + current modifiers, according to the loaded keymap
- UNICODE: the UTF-8 byte sequence associated with the key code + current modifiers, according to the loaded keymap
The state of the current console can be seen and changed using the “kbd_mode” command line tool. Changing the mode is not recommended except to switch between ASCII & UNICODE modes
Linux evdev
The evdev driver is a new way of exposing input events to userspace, bypassing the traditional console/tty devices. The “struct input_event” data is provided directly to userspace without any magic encodings. Thus apps using evdev will always be operating with the Linux key code set initially. They can of course convert this into the XT scan codes if desired, in exactly the same way that the console device already does when in RAW mode.
Xorg XKB
In modern Xorg servers, XKB is in charge of all keyboard event handling. Unusually, XKB does not actually define any standard key code set. At least not numerically. In the XKB configuration files there are logical names for every key, based on their position. Some names may be obvious ‘<TAB>’, while others are purely reflecting a physical row/column location ‘<AE01>’. Each Xorg keyboard driver is free to define whatever key code set it desires. The driver must provide a mapping from its key code values to the XKB key code names (in /usr/share/X11/xkb/keycodes/). There is then a mapping from XKB key code names to key symbols (in /usr/share/X11/xkb/keysyms). The problem with this is that the key code seen in the XKeyEvent can come from an arbitrary set of which the application is not explicitly aware. Three common key code sets from Xorg will now be mentioned.
Xorg kbd driver
The traditional “kbd” driver uses the traditional Linux console/tty devices for receiving input, configuring the devices in RAW mode. Thus it is accepting XT scan codes initially. The kdb keycodes are defined in its source code at src/atKeynames.h For scan codes below 89, a key code is formed simply by adding 8. Scan codes above that have a set of re-mapping rules that can’t be quickly illustrated here. The mapping is reverseable though, given a suitable mapping table.
Xorg evdev driver
The new “evdev” driver of course uses the new Linux evdev devices for receiving input, as Linux key codes. It forms its scan codes simply by adding 8 to the Linux key code. This is trivially reverseable.
Xorg XQuartz driver
The Xorg build for OS-X, known as XQuartz, does not use either of the previously mentioned drivers. Instead it has a OS-X specific keyboard driver that receives input directly from the OS-X graphical toolkit. It receives OS-X virtual key codes and produces X key codes simply by adding 8. This is again trivially reversable.
RFB protocol extended key event
The RFB protocol extension defined by GTK-VNC and QEMU allows for providing a key code in addition to the key symbol for all keypress/release events sent over the wire. The key codes are defined to be based on a simple encoding of XT scan code set. Single byte XT scan codes 0-127 are sent directly. Two-byte scan codes with the grey code (0xe0) are sent as single bytes with the high bit set. eg 0x1e is sent as 0x1e, while 0xe0+0x1e is sent as 0x9e (ie 0x1e | 0x80). The press/release state is sent as a separate value.
QEMU internals
QEMU’s internal input subsystem for keyboard events uses raw XT scan codes directly. The hardware device emulation maps to other scan code sets as required. It is no coincidence that the RFB protocol extension for sending key codes is a trivial mapping to QEMU scan codes.
Microsoft Windows
The Windows GDI defines a standard set of virtual key codes that are independent of any particular keyboard hardware.
Apple OS-X
OS-X defines a standard set of virtual key codes that are based on the original Apple extended keyboard scan codes.
SDL
The SDL_KeyboardEvent struct contains a “scancode” field. This provides the operating system dependant key code corresponding to the key event. This is easily interpreted on OS-X/Windows, but on Linux X11 this requires knowledge of what keyboard driver is used by Xorg.
GTK/GDK
The GDKKeyEvent struct contains a “hardware_keycode” field. This provides the operating system dependant key code corresponding to the key event. This is easily interpreted on OS-X/Windows, but on Linux X11 this requires knowledge of what keyboard driver is used by Xorg.
Future ideas
The number of different scan code and key code sets is really impressive / depressing / horrific. The good news is that given the right information, it is possible to map between them all fairly easily, in a lossless fashion. The bad news is that the data for performing such mappings is hidden across many web pages and many, many source trees. It would be great to have a single point of reference providing all the scan/key code sets, along with a tool that can generate the mapping tables & example code to convert between any 2 sets.
As a general rule, people using virtual machines have only one requirement when it comes to keyboard handling: any key they press should generate the same output in the host OS and the guest OS. Unfortunately, this is a surprisingly difficult requirement to satisfy. For a long time when using either Xen or KVM, to get workable keyboard handling it was necessary to configure the keymap in three places, 1. the VNC client OS, 2. the guest OS, 3. QEMU itself. The third item was a particular pain because it meant that a regular guest OS user would need administrative access to the host to change the keymap of their guest. Not good for delegation of control. In Fedora 11 we introduced a special VNC extension which allowed us to remove the need to configure keymaps in QEMU, so now it is merely necessary to configure the guest OS to match the client OS. One day when we get a general purpose guest OS agent, we might be able to automatically set the guest OS keymap to match client OS, whenever connecting via VNC, removing the last manual step. This post aims to give background on how keyboards work, what we done in VNC to improve the guest keyboard handling and what problems we still have.
Keyboard hardware scan codes
Between the time of pressing the physical key and text appearing on the screen, there are several steps in processing the input with data conversions along the way. The keyboard generates what are known as scan codes, a sequence of one or more bytes, which uniquely identifies the physical key and whether it is a make or break (press or release) event. Scan codes are invariant across all keyboards of the same type, regardless of what label is printed on the key. In the PC world, IBM defined the first scan code set with their IBM XT, followed later by AT scan codes and PS/2 scan codes. Other manufacturers adopted the IBM scan codes for their own products to ensure they worked out of the box. In the USB world, the HID specification defines the standard scan codes that manufacturers must use.
Operating system key codes
For operating systems wanted to support more than one different type of keyboard, scan codes are not a particularly good representation. They are also often unwieldly as a result of encoding both the key & its make/break state into the same byte(s). Thus operating systems typically define their own standard set of key codes, which is able to represent any possible keys on all known keyboards. They will also track the make/break state separately, now using press/release or up/down as terminology. Thus the first task of the keyboard driver is to convert from the hardware specific scan codes to the operating system specific key code. This is an easily reverseable, lossless mapping.
Display / toolkit key symbols & modifiers
Key codes still aren’t a concept that is particularly useful for (most) applications, which would rather known what user’s intended symbol was, rather than the physical key. Thus the display service (X11, Win32 GUI, etc) or application toolkit (GTK) define what are known as key symbols. To convert from key codes to key symbols, a key map is required for the language specific keyboard layout. The key map declares modifier keys (shift, alt, control, etc) and provides a list of key symbols that are associated with each key code. This mapping is only reverseable if you know the original key map. This is also a lossy mapping, because it is possible for several different key codes to map to the same key symbol.
Considering an end-to-end example, starting with the user pressing the key in row 4, column 2 which is labelled ‘A’ in a US layout, XT compatible keyboard. The operating system keyboard driver receives XT scan code 0x1e, which it converts to Linux key code 30 (KEY_A), Xorg server keyboard driver further converts to X11 key symbol 0x0061 (XK_a), GTK toolkit converts this to GDK key symbol 0x0061 (GDK_a), and finally the application displays the character ‘a’ in the text entry field on screen. There are actually a couple of conversions I’ve left out here, specifically how X11 gets the key codes from the OS and how X11 handles key codes internally, which I’ll come back to another time.
The problem with virtualization
For 99.99% of applications all these different steps / conversions are no problem at all, because they are only interested in text entry. Virtualization, as ever, introduces fun new problems where ever it goes. The problem occurs at the interface between the host virtual desktop client and the hardware emulation. The virtual desktop may be a local fat client using a toolkit like SDL, or it may be a remote network client using a protocol like VNC, RFB or SPICE. In both cases, a naive virtual desktop client will be getting key symbols with their key events. The hardware emulation layer will usually want to provide something like a virtualizated PS/2 or USB keyboard. This implies that there needs to be a conversion from key symbols back to hardware specific scan codes. Remember a couple of paragraphs ago where it was noted that the key code -> key symbol conversion is lossy. That is a now a big problem. In the case of a network client it is even worse, because the virtualization host does not even know what language specific keymap was used.
Faced with these obstacles, the initial approach QEMU took was to just add a command line parameter ‘-k $KEYMAP’. Without this parameter set, it will assume the virtuall desktop client is using a US layout, otherwise it will use the specified keymap. There is still the problem that many key codes can map to the same key symbol. It is impossible to get around this problem – QEMU just has to pick one of the many possible reverse mappings & use it. This means hat, even if the user configures matching keymaps on their client, QEMU and the guest OS, there may be certain keys that will never work in the guest OS. There is also a burden for QEMU to maintain a set of keymaps for every language layout. This set is inevitably incomplete.
The solution with virtualization
Fortunately, there is a get out of jail free card available. When passing key events to an application, all graphical windowing systems will provide both the key symbol and layout independent key code. Remember from many paragraphs earlier that scan code -> key code conversion is lossless and easily reverseable. For local fat clients, the only stumbling block is knowing what key code set is in use. Windows and OS-X both define a standard virtual key set, but sadly X11 does not :-( Instead the key codes are specific to the X11 keyboard driver that is in use, with a Linux based Xorg this is typically ‘kbd’ or ‘evdev’. QEMU has to use heuristics on X11 to decide which key codes it is receiving, but at least once this is identified, the conversion back to scan codes is trivial.
For the VNC network client though, there was one additional stumbling block. The RFB protocol encoding of key events only includes the key symbol, not the key code. So even though the VNC client has all the information the VNC server needs, there is no way to send it. Leading upto the development of Fedora 11, the upstream GTK-VNC and QEMU communities collaborated to define an official extension to the RFB protocol for an extended key event, that includes both key symbol and key code. Since every windowing system has its own set of key codes & the RFB needs to be platform independent, the protocol extension defined that the keycode set on the wire will be a special 32-bit encoding of the traditional XT scancodes. It is not a coincidence that QEMU already uses a 32-bit encoding of traditional XT scan codes internally :-) With this in place the RFB client, merely has to identify what operating system specific key codes it is receiving and then apply the suitable lossless mapping back to XT scancodes.
Considering an end-to-end example, starting with the user pressing the key in row 4, column 2 which is labelled ‘A’ in a US layout, XT compatible keyboard. The operating system keyboard driver receives XT scan code 0x1e, which it converts to Linux key code 30 (KEY_A), Xorg server keyboard driver further converts to X11 key symbol 0x0061 (XK_a) and an evdev key code, GTK toolkit converts the key symbol to GDK key symbol 0x0061 (GDK_a) but passes the evdev key code unchanged. The VNC client converts the evdev key code to the RFB key code and sends it to the QEMU VNC server along with the key symbol. The QEMU VNC server totally ignores the key symbol and does the no-op conversion from the RFB key code to the QEMU key code. The QEMU PS/2 emulation converts the QEMU keycode to either the XT scan code, AT scan code or PS/2 scan code, depending on how the guest OS has configured the keyboard controller. Finally the conversion mentioned much earlier takes place in the guest OS and the letter ‘a’ appears in a text field in the guest. The important bit to note is that although the key symbol was present, it was never used in the host OS or remote network client at any point. The only conversions performed were between scan codes and key codes all of which were lossless. The user is able to generate all the same key sequences in the guest OS as they can in the host OS. The user is happy their keyboard works as expected; the virtualization developer is happy at lack of bug reports about broken keyboard handling.