Nova and its use of Oslo Incubator Guru Meditation Reports
This blog post describes an error reporting / troubleshooting feature added to Nova a while back, which people are probably not generally aware of.
One of the early things I worked on in the Nova libvirt driver was integrating support for the libvirt event notification system. This allowed Nova to be notified when instances are shut down by the guest OS administrator, instead of having to wait for the background power state sync task to run. Supporting libvirt events was theoretically a simple concept, but in practice there was a nasty surprise. The key issue was that we needed a native thread running the libvirt event loop, while the rest of Nova uses green threads. The code kept hitting bizarre deadlocks, which were eventually traced to use of the python logging APIs in the native thread. Since eventlet had monkeypatched the thread mutex primitives, the logging APIs couldn't be safely called from the native thread, as they'd try to obtain a green mutex from a native thread context.
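The workaround visible in the example report later in this post is that the native thread does nothing that could touch a monkeypatched primitive: it merely pokes a byte down a pipe to wake a green thread, which then does the real event dispatch (logging included) in green thread context. Here is a minimal stdlib sketch of that hand-off pattern – no eventlet or libvirt required, and all names are illustrative rather than Nova's actual code:

```python
import os
import threading

# Pipe used to signal "an event is pending" from native to green side.
notify_recv, notify_send = os.pipe()
events = []

def native_event_loop():
    # Stand-in for the thread running libvirt.virEventRunDefaultImpl():
    # it must not call logging or other monkeypatched APIs, so it only
    # queues the event and writes one byte to the pipe.
    events.append("instance stopped")
    os.write(notify_send, b"!")

t = threading.Thread(target=native_event_loop)
t.start()
t.join()

# The dispatcher (a green thread in Nova) blocks reading the pipe; here
# a plain blocking read stands in for the eventlet trampoline. All the
# work that needs green mutexes happens on this side.
os.read(notify_recv, 1)
print("dispatched:", events.pop(0))
```

The pipe is the only shared primitive between the two worlds, and `os.write`/`os.read` are safe to call from either side.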
Eventlet has a concept of a backdoor port, which lets you connect to the python process using telnet and get an interactive python prompt. After turning this on, I got a stack trace of all green and native threads and eventually figured out the problem, which was great. Sadly the eventlet backdoor is not something anyone sane would ever enable out of the box on production systems – you don't really want to allow remote command execution to anyone who can connect to a TCP port :-) Another debugging option is to use Python's native debugger, but this is again something you have to enable ahead of time, so admins won't have it enabled out of the box on production systems. It is possible to connect to a running python process with GDB and get a crude stack trace, but that's not great either, as it requires the python debuginfo libraries to be installed. It would be possible to build an administrative debugging API for Nova using the REST API, but that only works if the REST API & message bus are working – not something that's going to be much use when Nova itself has deadlocked the entire python interpreter.
After this debugging experience I decided to propose something I've found valuable on previous complex systems I've worked on: a facility that allows an admin to trigger a live error report. Crucially, this facility must not require any kind of deployment setup tasks, and thus be guaranteed available at all times – especially on production systems, where debugging options are limited by the need to maintain service to users. I called the proposal the "Guru Meditation Report", in reference to old Amiga crash reports. I did a quick proof of concept to test the idea, and Solly Ross then turned it into a complete feature for OpenStack, adding it to the Oslo Incubator in the openstack.common.report namespace and integrating it with Nova. This shipped with the Icehouse release of Nova.
Service integration & usage
Integration into projects is quite straightforward: the openstack-common.conf file needs to list the relevant modules to import from oslo-incubator

$ grep report openstack-common.conf
module=report
module=report.generators
module=report.models
module=report.views
module=report.views.json
module=report.views.text
module=report.views.xml
then each service process needs to initialize the error report system. This just requires a single import line and method call from the main method

$ cat nova/cmd/nova-compute
...snip...
from nova.openstack.common.report import guru_meditation_report as gmr
...snip...
def main():
    gmr.TextGuruMeditation.setup_autorun(version)
    ...run eventlet service or whatever...
The setup_autorun method installs a signal handler connected to SIGUSR1, which will dump an error report to stderr when triggered.
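As a rough illustration of what such a handler involves, here is a simplified stdlib-only sketch – not the actual oslo-incubator code, which also gathers config, package, process and green thread data:

```python
import signal
import sys
import traceback

def dump_report(signum=None, frame=None):
    # Walk every native thread's current stack frame and write a crude
    # plain text report to stderr, roughly like the Threads section of
    # a Guru Meditation Report.
    report = ["==== Threads ===="]
    for thread_id, stack in sys._current_frames().items():
        report.append("------ Thread #%d ------" % thread_id)
        report.extend(line.rstrip() for line in traceback.format_stack(stack))
    sys.stderr.write("\n".join(report) + "\n")

# Installing it on SIGUSR1 means no deployment setup is needed; the
# admin just sends the signal whenever a report is wanted.
signal.signal(signal.SIGUSR1, dump_report)
```

Because the handler only reads interpreter state and writes to stderr, it works even when the service's REST API or message bus connection is wedged.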
So from Icehouse onwards, if any Nova process is misbehaving, you can simply run something like
$ kill -USR1 `pgrep nova-compute`
to get a detailed error report of the process state sent to stderr. On RHEL-7 / Fedora systems this data should end up in the systemd journal for that service. On other systems you may be lucky enough for the init service to have redirected stderr to a log file, or unlucky enough to have it sent to /dev/null. Did I mention that the systemd journal is a really excellent feature when troubleshooting service problems :-)
Error report information
In the oslo-incubator code there are 5 standard sections defined for the error report:
- Config – a dump of all configuration settings loaded by oslo.config – useful because the config settings loaded in memory don't necessarily match what is currently stored in /etc/nova/nova.conf on disk – e.g. the admin may have modified the config and forgotten to reload the services.
- Package – information about the software package, as passed into the setup_autorun method previously. This lets you know the openstack release version number, the vendor who packaged it, and any vendor specific version info such as the RPM release number. This is again key, because what's installed on the host currently may not match the version that's actually running. You can't always trust admins to give you correct info when reporting bugs, so having the software report this itself is more reliable.
- Process – information about the running process including the process ID, parent process ID, user ID, group ID and scheduler state.
- Green Threads – stack trace of every eventlet green thread currently in existence
- Native Threads – stack trace of every native thread currently in existence
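For illustration, the data in the Process section can be gathered with plain stdlib calls; this is an assumption about roughly how, not the actual report code (which also reports scheduler state, for which there is no portable stdlib equivalent):

```python
import os

# Collect the process identity details named in the Process section.
process_info = {
    "pid": os.getpid(),
    "parent_pid": os.getppid(),
    "uid": os.getuid(),
    "gid": os.getgid(),
}
print("Process %(pid)d (under %(parent_pid)d) "
      "[ uid: %(uid)d, gid: %(gid)d ]" % process_info)
```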
The report framework is modular, so it is possible to register new generator functions which add further data to the error report. This is useful if there is application specific data worth including that would not otherwise be suitable for inclusion in oslo-incubator directly. The data model is separated from the output formatting code, so it is possible to output the report in a number of different data formats. The reports which get sent to stderr use a crude plain text format, but it is possible to have reports generated in XML, JSON, or another completely custom format.
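To make the model/view separation concrete, here is a hypothetical miniature of the design – the class and function names are invented for this sketch and are not the oslo-incubator API:

```python
import json

def package_generator():
    # A generator produces a data model (here just a dict), with no
    # knowledge of how it will be rendered.
    return {"product": "OpenStack Nova", "vendor": "OpenStack Foundation"}

def text_view(title, model):
    # Views render the same model in different formats...
    lines = ["==== %s ====" % title]
    lines += ["%s = %s" % (k, v) for k, v in sorted(model.items())]
    return "\n".join(lines)

def json_view(title, model):
    # ...so swapping text for JSON needs no change to the generators.
    return json.dumps({title: model})

class Report(object):
    def __init__(self):
        self.sections = []  # (title, generator) pairs

    def register_section(self, title, generator):
        # Services register extra generators to add custom sections.
        self.sections.append((title, generator))

    def render(self, view):
        return "\n".join(view(title, gen()) for title, gen in self.sections)

report = Report()
report.register_section("Package", package_generator)
print(report.render(text_view))
```

Registering an application specific section is then just another `register_section` call at service startup.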
Future improvements
Triggering from a UNIX signal and printing to stderr is a very simple and reliable approach that we can guarantee will almost always work, no matter what operational state the OpenStack deployment as a whole is in. It should not be considered the only possible approach though. I can see that it may be desirable to also wire this up to the RPC messaging bus, so a cloud admin can remotely generate an error report for a service and get the response back over the message bus in an XML or JSON format. This wouldn't replace the SIGUSR1 based stderr dumps, but rather augment them, as we want to retain the ability to trigger reports even if the rabbitmq bus connection is dead/broken for some reason.
AFAIK, this error report system is only wired up into the Nova project at this time. It is desirable to bring this feature over to projects like Neutron, Cinder, Glance and Keystone too, so it can be considered an OpenStack-wide standard system for admins to collect data for troubleshooting. As explained above, this is no more difficult than adding the modules to openstack-common.conf and then adding a single method call to the service startup main method. Those projects might like to register extra error report sections to provide further data, but that's by no means required for initial integration.
Having error reports triggered on demand by the admin is nice, but I think there is also value in having error reports triggered automatically in response to unexpected error conditions. For example if a RPC request to boot a new VM instance fails, it could be desirable to save a detailed error report, rather than just having an exception hit the logs with no context around it. In such a scenario you would extend the error report generator so that the report included the exception information & stack trace, and also include the headers and/or payload of the RPC request that failed. The error report would probably be written to a file instead of stderr, using JSON or XML. Tools could then be written to analyse error reports and identify commonly recurring problems.
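As a sketch of how such an automatic trigger might look – the file location and field names here are invented for illustration, not an existing Nova feature:

```python
import json
import os
import tempfile
import traceback

def save_error_report(exc, context):
    # Persist a structured report instead of just logging the exception:
    # the traceback plus whatever request context the caller supplies
    # (e.g. RPC headers/payload).
    report = {
        "context": context,
        "exception": repr(exc),
        "traceback": traceback.format_exc(),
    }
    path = os.path.join(tempfile.gettempdir(), "error-report.json")
    with open(path, "w") as f:
        json.dump(report, f, indent=2)  # JSON so tools can analyse it later
    return report

# Simulate an RPC handler failing while booting an instance.
try:
    raise RuntimeError("boot of instance failed")
except RuntimeError as exc:
    report = save_error_report(exc, {"rpc_method": "run_instance"})

print(report["exception"])
```

A fleet of such files is what would let tooling spot commonly recurring problems across many failures.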
Even with the current level of features for the error report system, it has proved its worth in facilitating the debugging of a number of problems in Nova, where things like the eventlet backdoor or python debugger were impractical to use. I look forward to its continued development and broader usage across openstack.
Example report
What follows below is an example error report from the nova-compute service running on one of my development hosts. Notice that oslo.config configuration parameters that were declared with the 'secret' flag have their values masked. This is primarily aimed at preventing passwords from making their way into the error report, since the expectation is that users may attach these reports to public bugs.
========================================================================
==== Guru Meditation ====
========================================================================
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
========================================================================
==== Package ====
========================================================================
product = OpenStack Nova
vendor = OpenStack Foundation
version = 2015.1
========================================================================
==== Threads ====
========================================================================
------ Thread #140157298652928 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157307045632 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157734876928 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140158288500480 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140158416287488 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140158424680192 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/home/berrange/src/cloud/nova/nova/virt/libvirt/host.py:113 in _native_thread
    `libvirt.virEventRunDefaultImpl()`
/usr/lib64/python2.7/site-packages/libvirt.py:340 in virEventRunDefaultImpl
    `ret = libvirtmod.virEventRunDefaultImpl()`
------ Thread #140157684520704 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140158296893184 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140158305285888 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157709698816 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140158322071296 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157726484224 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140158330464000 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157332223744 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157701306112 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157692913408 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157315438336 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140158537955136 ------
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:346 in run
    `self.wait(sleep_time)`
/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py:85 in wait
    `presult = self.do_poll(seconds)`
/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py:62 in do_poll
    `return self.poll.poll(seconds)`
------ Thread #140158313678592 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157718091520 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140157323831040 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
------ Thread #140158338856704 ------
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/eventlet/tpool.py:72 in tworker
    `msg = _reqq.get()`
/usr/lib64/python2.7/Queue.py:168 in get
    `self.not_empty.wait()`
/usr/lib64/python2.7/threading.py:339 in wait
    `waiter.acquire()`
========================================================================
==== Green Threads ====
========================================================================
------ Green Thread ------
/usr/bin/nova-compute:9 in <module>
    `load_entry_point('nova==2015.1.dev352', 'console_scripts', 'nova-compute')()`
/home/berrange/src/cloud/nova/nova/cmd/compute.py:74 in main
    `service.wait()`
/home/berrange/src/cloud/nova/nova/service.py:444 in wait
    `_launcher.wait()`
/home/berrange/src/cloud/nova/nova/openstack/common/service.py:187 in wait
    `status, signo = self._wait_for_exit_or_signal(ready_callback)`
/home/berrange/src/cloud/nova/nova/openstack/common/service.py:170 in _wait_for_exit_or_signal
    `super(ServiceLauncher, self).wait()`
/home/berrange/src/cloud/nova/nova/openstack/common/service.py:133 in wait
    `self.services.wait()`
/home/berrange/src/cloud/nova/nova/openstack/common/service.py:473 in wait
    `self.tg.wait()`
/home/berrange/src/cloud/nova/nova/openstack/common/threadgroup.py:145 in wait
    `x.wait()`
/home/berrange/src/cloud/nova/nova/openstack/common/threadgroup.py:47 in wait
    `return self.thread.wait()`
/usr/lib/python2.7/site-packages/eventlet/greenthread.py:175 in wait
    `return self._exit_event.wait()`
/usr/lib/python2.7/site-packages/eventlet/event.py:121 in wait
    `return hubs.get_hub().switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`
------ Green Thread ------
No Traceback!
------ Green Thread ------
/usr/lib/python2.7/site-packages/eventlet/green/thread.py:40 in __thread_body
    `func(*args, **kwargs)`
/usr/lib64/python2.7/threading.py:784 in __bootstrap
    `self.__bootstrap_inner()`
/usr/lib64/python2.7/threading.py:811 in __bootstrap_inner
    `self.run()`
/usr/lib64/python2.7/threading.py:764 in run
    `self.__target(*self.__args, **self.__kwargs)`
/usr/lib/python2.7/site-packages/qpid/selector.py:126 in run
    `rd, wr, ex = select(self.reading, self.writing, (), timeout)`
/usr/lib/python2.7/site-packages/eventlet/green/select.py:83 in select
    `return hub.switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`
------ Green Thread ------
/usr/lib/python2.7/site-packages/eventlet/greenthread.py:214 in main
    `result = function(*args, **kwargs)`
/home/berrange/src/cloud/nova/nova/openstack/common/service.py:492 in run_service
    `done.wait()`
/usr/lib/python2.7/site-packages/eventlet/event.py:121 in wait
    `return hubs.get_hub().switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`
------ Green Thread ------
/usr/lib/python2.7/site-packages/eventlet/greenthread.py:214 in main
    `result = function(*args, **kwargs)`
/home/berrange/src/cloud/nova/nova/virt/libvirt/host.py:124 in _dispatch_thread
    `self._dispatch_events()`
/home/berrange/src/cloud/nova/nova/virt/libvirt/host.py:228 in _dispatch_events
    `_c = self._event_notify_recv.read(1)`
/usr/lib64/python2.7/socket.py:380 in read
    `data = self._sock.recv(left)`
/usr/lib/python2.7/site-packages/eventlet/greenio.py:464 in recv
    `self._trampoline(self, read=True)`
/usr/lib/python2.7/site-packages/eventlet/greenio.py:439 in _trampoline
    `mark_as_closed=self._mark_as_closed)`
/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py:162 in trampoline
    `return hub.switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`
------ Green Thread ------
/usr/lib/python2.7/site-packages/eventlet/tpool.py:55 in tpool_trampoline
    `_c = _rsock.recv(1)`
/usr/lib/python2.7/site-packages/eventlet/greenio.py:325 in recv
    `timeout_exc=socket.timeout("timed out"))`
/usr/lib/python2.7/site-packages/eventlet/greenio.py:200 in _trampoline
    `mark_as_closed=self._mark_as_closed)`
/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py:162 in trampoline
    `return hub.switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`
------ Green Thread ------
/usr/lib/python2.7/site-packages/eventlet/greenthread.py:214 in main
    `result = function(*args, **kwargs)`
/usr/lib/python2.7/site-packages/oslo_utils/excutils.py:92 in inner_func
    `return infunc(*args, **kwargs)`
/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py:93 in _executor_thread
    `incoming = self.listener.poll()`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:121 in poll
    `self.conn.consume(limit=1, timeout=timeout)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py:755 in consume
    `six.next(it)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py:685 in iterconsume
    `yield self.ensure(_error_callback, _consume)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py:602 in ensure
    `return method()`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py:669 in _consume
    `timeout=poll_timeout)`
<string>:6 in next_receiver
    (source not found)
/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py:674 in next_receiver
    `if self._ecwait(lambda: self.incoming, timeout):`
/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py:50 in _ecwait
    `result = self._ewait(lambda: self.closed or predicate(), timeout)`
/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py:580 in _ewait
    `result = self.connection._ewait(lambda: self.error or predicate(), timeout)`
/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py:218 in _ewait
    `result = self._wait(lambda: self.error or predicate(), timeout)`
/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py:197 in _wait
    `return self._waiter.wait(predicate, timeout=timeout)`
/usr/lib/python2.7/site-packages/qpid/concurrency.py:59 in wait
    `self.condition.wait(timeout - passed)`
/usr/lib/python2.7/site-packages/qpid/concurrency.py:96 in wait
    `sw.wait(timeout)`
/usr/lib/python2.7/site-packages/qpid/compat.py:53 in wait
    `ready, _, _ = select([self], [], [], timeout)`
/usr/lib/python2.7/site-packages/eventlet/green/select.py:83 in select
    `return hub.switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`
------ Green Thread ------
/home/berrange/src/cloud/nova/nova/openstack/common/loopingcall.py:90 in _inner
    `greenthread.sleep(-delay if delay < 0 else 0)`
/usr/lib/python2.7/site-packages/eventlet/greenthread.py:34 in sleep
    `hub.switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`
------ Green Thread ------
/usr/lib/python2.7/site-packages/eventlet/greenthread.py:214 in main
    `result = function(*args, **kwargs)`
/home/berrange/src/cloud/nova/nova/openstack/common/loopingcall.py:133 in _inner
    `greenthread.sleep(idle)`
/usr/lib/python2.7/site-packages/eventlet/greenthread.py:34 in sleep
    `hub.switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`
========================================================================
==== Processes ====
========================================================================
Process 27760 (under 27711) [ run by: berrange (1000), state: running ]
========================================================================
==== Configuration ====
========================================================================
cells:
  bandwidth_update_interval = 600
  call_timeout = 60
  capabilities = hypervisor=xenserver;kvm os=linux;windows
  cell_type = compute
  enable = False
  manager = nova.cells.manager.CellsManager
  mute_child_interval = 300
  name = nova
  reserve_percent = 10.0
  topic = cells
cinder:
  cafile = None
  catalog_info = volumev2:cinderv2:publicURL
  certfile = None
  cross_az_attach = True
  endpoint_template = None
  http_retries = 3
  insecure = False
  keyfile = None
  os_region_name = None
  timeout = None
conductor:
  manager = nova.conductor.manager.ConductorManager
  topic = conductor
  use_local = False
  workers = None
database:
  backend = sqlalchemy
  connection = ***
  connection_debug = 0
  connection_trace = False
  db_inc_retry_interval = True
  db_max_retries = 20
  db_max_retry_interval = 10
  db_retry_interval = 1
  idle_timeout = 3600
  max_overflow = None
  max_pool_size = None
  max_retries = 10
  min_pool_size = 1
  mysql_sql_mode = TRADITIONAL
  pool_timeout = None
  retry_interval = 10
  slave_connection = ***
  sqlite_db = nova.sqlite
  sqlite_synchronous = True
  use_db_reconnect = False
  use_tpool = False
default:
  allow_migrate_to_same_host = True
  allow_resize_to_same_host = True
  allow_same_net_traffic = True
  amqp_auto_delete = False
  amqp_durable_queues = False
  api_paste_config = /etc/nova/api-paste.ini
  api_rate_limit = False
  auth_strategy = keystone
  auto_assign_floating_ip = False
  backdoor_port = None
  bandwidth_poll_interval = 600
  bindir = /usr/bin
  block_device_allocate_retries = 60
  block_device_allocate_retries_interval = 3
  boot_script_template = /home/berrange/src/cloud/nova/nova/cloudpipe/bootscript.template
  ca_file = cacert.pem
  ca_path = /home/berrange/src/cloud/data/nova/CA
  cert_manager = nova.cert.manager.CertManager
  client_socket_timeout = 900
  cnt_vpn_clients = 0
  compute_available_monitors = nova.compute.monitors.all_monitors
  compute_driver = libvirt.LibvirtDriver
  compute_manager = nova.compute.manager.ComputeManager
  compute_monitors =
  compute_resources = vcpu
  compute_stats_class = nova.compute.stats.Stats
  compute_topic = compute
  config-dir = None
  config-file = /etc/nova/nova.conf
  config_drive_format = iso9660
  config_drive_skip_versions = 1.0 2007-01-19 2007-03-01 2007-08-29 2007-10-10 2007-12-15 2008-02-01 2008-09-01
  console_host = mustard.gsslab.fab.redhat.com
  console_manager = nova.console.manager.ConsoleProxyManager
  console_topic = console
  consoleauth_manager = nova.consoleauth.manager.ConsoleAuthManager
  consoleauth_topic = consoleauth
  control_exchange = nova
  create_unique_mac_address_attempts = 5
  crl_file = crl.pem
  db_driver = nova.db
  debug = True
  default_access_ip_network_name = None
  default_availability_zone = nova
  default_ephemeral_format = ext4
  default_flavor = m1.small
  default_floating_pool = public
  default_log_levels = amqp=WARN amqplib=WARN boto=WARN glanceclient=WARN iso8601=WARN keystonemiddleware=WARN oslo.messaging=INFO qpid=WARN requests.packages.urllib3.connectionpool=WARN routes.middleware=WARN sqlalchemy=WARN stevedore=WARN suds=INFO urllib3.connectionpool=WARN websocket=WARN
  default_notification_level = INFO
  default_publisher_id = None
  default_schedule_zone = None
  defer_iptables_apply = False
  dhcp_domain = novalocal
  dhcp_lease_time = 86400
  dhcpbridge = /usr/bin/nova-dhcpbridge
  dhcpbridge_flagfile = /etc/nova/nova.conf
  dmz_cidr =
  dmz_mask = 255.255.255.0
  dmz_net = 10.0.0.0
  dns_server =
  dns_update_periodic_interval = -1
  dnsmasq_config_file =
  ebtables_exec_attempts = 3
  ebtables_retry_interval = 1.0
  ec2_listen = 0.0.0.0
  ec2_listen_port = 8773
  ec2_private_dns_show_ip = False
  ec2_strict_validation = True
  ec2_timestamp_expiry = 300
  ec2_workers = 2
  enable_new_services = True
  enabled_apis = ec2 metadata osapi_compute
  enabled_ssl_apis =
  fake_call = False
  fake_network = False
  fatal_deprecations = False
  fatal_exception_format_errors = False
  firewall_driver = nova.virt.libvirt.firewall.IptablesFirewallDriver
  fixed_ip_disassociate_timeout = 600
  fixed_range_v6 = fd00::/48
  flat_injected = False
  flat_interface = p1p1
  flat_network_bridge = br100
  flat_network_dns = 8.8.4.4
  floating_ip_dns_manager = nova.network.noop_dns_driver.NoopDNSDriver
  force_config_drive = always
  force_dhcp_release = True
  force_raw_images = True
  force_snat_range =
  forward_bridge_interface = all
  gateway = None
  gateway_v6 = None
  heal_instance_info_cache_interval = 60
  host = mustard.gsslab.fab.redhat.com
  image_cache_manager_interval = 2400
  image_cache_subdirectory_name = _base
  injected_network_template = /home/berrange/src/cloud/nova/nova/virt/interfaces.template
  instance_build_timeout = 0
  instance_delete_interval = 300
  instance_dns_domain =
  instance_dns_manager = nova.network.noop_dns_driver.NoopDNSDriver
  instance_format = [instance: %(uuid)s]
  instance_name_template = instance-%08x
  instance_usage_audit = False
  instance_usage_audit_period = month
  instance_uuid_format = [instance: %(uuid)s]
  instances_path = /home/berrange/src/cloud/data/nova/instances
  internal_service_availability_zone = internal
  iptables_bottom_regex =
  iptables_drop_action = DROP
  iptables_top_regex =
  ipv6_backend = rfc2462
  key_file = private/cakey.pem
  keys_path = /home/berrange/src/cloud/data/nova/keys
  keystone_ec2_insecure = False
  keystone_ec2_url = http://10.33.8.112:5000/v2.0/ec2tokens
  l3_lib = nova.network.l3.LinuxNetL3
  linuxnet_interface_driver = nova.network.linux_net.LinuxBridgeInterfaceDriver
  linuxnet_ovs_integration_bridge = br-int
  live_migration_retry_count = 30
  lockout_attempts = 5
  lockout_minutes = 15
  lockout_window = 15
  log-config-append = None
  log-date-format = %Y-%m-%d %H:%M:%S
  log-dir = None
  log-file = None
  log-format = None
  logging_context_format_string = %(asctime)s.%(msecs)03d %(color)s%(levelname)s %(name)s [%(request_id)s %(user_name)s %(project_name)s%(color)s] %(instance)s%(color)s%(message)s
  logging_debug_format_suffix = from (pid=%(process)d) %(funcName)s %(pathname)s:%(lineno)d
  logging_default_format_string = %(asctime)s.%(msecs)03d %(color)s%(levelname)s %(name)s [-%(color)s] %(instance)s%(color)s%(message)s
  logging_exception_prefix = %(color)s%(asctime)s.%(msecs)03d TRACE %(name)s %(instance)s
  max_age = 0
  max_concurrent_builds = 10
  max_header_line = 16384
  max_local_block_devices = 3
  maximum_instance_delete_attempts = 5
  memcached_servers = None
  metadata_host = 10.33.8.112
  metadata_listen = 0.0.0.0
  metadata_listen_port = 8775
  metadata_manager = nova.api.manager.MetadataManager
  metadata_port = 8775
  metadata_workers = 2
  migrate_max_retries = -1
  mkisofs_cmd = genisoimage
  monkey_patch = False
  monkey_patch_modules = nova.api.ec2.cloud:nova.notifications.notify_decorator nova.compute.api:nova.notifications.notify_decorator
  multi_host = True
  multi_instance_display_name_template = %(name)s-%(count)d
  my_block_storage_ip = 10.33.8.112
  my_ip = 10.33.8.112
  network_allocate_retries = 0
  network_api_class = nova.network.api.API
  network_device_mtu = None
  network_driver = nova.network.linux_net
  network_manager = nova.network.manager.FlatDHCPManager
  network_size = 256
  network_topic = network
  networks_path = /home/berrange/src/cloud/data/nova/networks
  non_inheritable_image_properties = bittorrent cache_in_nova
  notification_driver =
  notification_topics = notifications
  notify_api_faults = False
  notify_on_state_change = None
  novncproxy_base_url = http://10.33.8.112:6080/vnc_auto.html
  null_kernel = nokernel
  num_networks = 1
  osapi_compute_listen = 0.0.0.0
  osapi_compute_listen_port = 8774
  osapi_compute_workers = 2
  ovs_vsctl_timeout = 120
  password_length = 12
  pci_alias =
  pci_passthrough_whitelist =
  periodic_enable = True
  periodic_fuzzy_delay = 60
  policy_default_rule = default
  policy_dirs = policy.d
  policy_file = policy.json
  preallocate_images = none
  project_cert_subject = /C=US/ST=California/O=OpenStack/OU=NovaDev/CN=project-ca-%.16s-%s
  public_interface = br100
  publish_errors = False
  pybasedir = /home/berrange/src/cloud/nova
  qpid_heartbeat = 60
  qpid_hostname = 10.33.8.112
  qpid_hosts = 10.33.8.112:5672
  qpid_password = ***
  qpid_port = 5672
  qpid_protocol = tcp
  qpid_receiver_capacity = 1
  qpid_sasl_mechanisms =
  qpid_tcp_nodelay = True
  qpid_topology_version = 1
  qpid_username =
  quota_cores = 20
  quota_driver = nova.quota.DbQuotaDriver
  quota_fixed_ips = -1
  quota_floating_ips = 10
  quota_injected_file_content_bytes = 10240
  quota_injected_file_path_length = 255
  quota_injected_files = 5
  quota_instances = 10
  quota_key_pairs = 100
  quota_metadata_items = 128
  quota_ram = 51200
  quota_security_group_rules = 20
  quota_security_groups = 10
  quota_server_group_members = 10
  quota_server_groups = 10
  reboot_timeout = 0
  reclaim_instance_interval = 0
  remove_unused_base_images = True
  remove_unused_original_minimum_age_seconds = 86400
  report_interval = 10
  rescue_timeout = 0
  reservation_expire = 86400
  reserved_host_disk_mb = 0
  reserved_host_memory_mb = 512
  resize_confirm_window = 0
  resize_fs_using_block_device = False
  resume_guests_state_on_host_boot = False
  rootwrap_config = /etc/nova/rootwrap.conf
  routing_source_ip = 10.33.8.112
  rpc_backend = qpid
  rpc_conn_pool_size = 30
  rpc_response_timeout = 60
  rpc_thread_pool_size = 64
  run_external_periodic_tasks = True
  running_deleted_instance_action = reap
  running_deleted_instance_poll_interval = 1800
  running_deleted_instance_timeout = 0
scheduler_available_filters = nova.scheduler.filters.all_filters scheduler_default_filters = AvailabilityZoneFilter ComputeCapabilitiesFilter ComputeFilter ImagePropertiesFilter RamFilter RetryFilter ServerGroupAffinityFilter ServerGroupAntiAffinityFilter scheduler_manager = nova.scheduler.manager.SchedulerManager scheduler_max_attempts = 3 scheduler_topic = scheduler scheduler_weight_classes = nova.scheduler.weights.all_weighers security_group_api = nova send_arp_for_ha = True send_arp_for_ha_count = 3 service_down_time = 60 servicegroup_driver = db share_dhcp_address = False shelved_offload_time = 0 shelved_poll_interval = 3600 shutdown_timeout = 60 snapshot_name_template = snapshot-%s ssl_ca_file = None ssl_cert_file = None ssl_key_file = None state_path = /home/berrange/src/cloud/data/nova sync_power_state_interval = 600 syslog-log-facility = LOG_USER tcp_keepidle = 600 teardown_unused_network_gateway = False tempdir = None transport_url = None until_refresh = 0 update_dns_entries = False use-syslog = False use-syslog-rfc-format = False use_cow_images = True use_forwarded_for = False use_ipv6 = False use_network_dns_servers = False use_project_ca = False use_single_default_gateway = False use_stderr = True user_cert_subject = /C=US/ST=California/O=OpenStack/OU=NovaDev/CN=%.16s-%.16s-%s vcpu_pin_set = None vendordata_driver = nova.api.metadata.vendordata_json.JsonFileVendorData verbose = True vif_plugging_is_fatal = True vif_plugging_timeout = 300 virt_mkfs = vlan_interface = vlan_start = 100 vnc_enabled = True vnc_keymap = en-us vncserver_listen = 127.0.0.1 vncserver_proxyclient_address = 127.0.0.1 volume_api_class = nova.volume.cinder.API volume_usage_poll_interval = 0 vpn_flavor = m1.tiny vpn_image_id = 0 vpn_ip = 10.33.8.112 vpn_key_suffix = -vpn vpn_start = 1000 wsgi_default_pool_size = 1000 wsgi_keep_alive = True wsgi_log_format = %(client_ip)s "%(request_line)s" status: %(status_code)s len: %(body_length)s time: %(wall_seconds).7f xvpvncproxy_base_url = 
http://10.33.8.112:6081/console ephemeral_storage_encryption: cipher = aes-xts-plain64 enabled = False key_size = 512 glance: allowed_direct_url_schemes = api_insecure = False api_servers = http://10.33.8.112:9292 host = 10.33.8.112 num_retries = 0 port = 9292 protocol = http guestfs: debug = False keymgr: api_class = nova.keymgr.conf_key_mgr.ConfKeyManager libvirt: block_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED, VIR_MIGRATE_NON_SHARED_INC checksum_base_images = False checksum_interval_seconds = 3600 connection_uri = cpu_mode = none cpu_model = None disk_cachemodes = disk_prefix = None gid_maps = glusterfs_mount_point_base = /home/berrange/src/cloud/data/nova/mnt hw_disk_discard = None hw_machine_type = None image_info_filename_pattern = /home/berrange/src/cloud/data/nova/instances/_base/%(image)s.info images_rbd_ceph_conf = images_rbd_pool = rbd images_type = default images_volume_group = None inject_key = False inject_partition = -2 inject_password = False iscsi_iface = None iscsi_use_multipath = False iser_use_multipath = False live_migration_bandwidth = 0 live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED live_migration_uri = qemu+ssh://berrange@%s/system mem_stats_period_seconds = 10 nfs_mount_options = None nfs_mount_point_base = /home/berrange/src/cloud/data/nova/mnt num_aoe_discover_tries = 3 num_iscsi_scan_tries = 5 num_iser_scan_tries = 5 qemu_allowed_storage_drivers = quobyte_client_cfg = None quobyte_mount_point_base = /home/berrange/src/cloud/data/nova/mnt rbd_secret_uuid = None rbd_user = None remove_unused_kernels = False remove_unused_resized_minimum_age_seconds = 3600 rescue_image_id = None rescue_kernel_id = None rescue_ramdisk_id = None rng_dev_path = None scality_sofs_config = None scality_sofs_mount_point = /home/berrange/src/cloud/data/nova/scality smbfs_mount_options = smbfs_mount_point_base = 
/home/berrange/src/cloud/data/nova/mnt snapshot_compression = False snapshot_image_format = None snapshots_directory = /home/berrange/src/cloud/data/nova/instances/snapshots sparse_logical_volumes = False sysinfo_serial = auto uid_maps = use_usb_tablet = False use_virtio_for_bridges = True virt_type = kvm volume_clear = zero volume_clear_size = 0 volume_drivers = aoe=nova.virt.libvirt.volume.LibvirtAOEVolumeDriver fake=nova.virt.libvirt.volume.LibvirtFakeVolumeDriver fibre_channel=nova.virt.libvirt.volume.LibvirtFibreChannelVolumeDriver glusterfs=nova.virt.libvirt.volume.LibvirtGlusterfsVolumeDriver gpfs=nova.virt.libvirt.volume.LibvirtGPFSVolumeDriver iscsi=nova.virt.libvirt.volume.LibvirtISCSIVolumeDriver iser=nova.virt.libvirt.volume.LibvirtISERVolumeDriver local=nova.virt.libvirt.volume.LibvirtVolumeDriver nfs=nova.virt.libvirt.volume.LibvirtNFSVolumeDriver quobyte=nova.virt.libvirt.volume.LibvirtQuobyteVolumeDriver rbd=nova.virt.libvirt.volume.LibvirtNetVolumeDriver scality=nova.virt.libvirt.volume.LibvirtScalityVolumeDriver sheepdog=nova.virt.libvirt.volume.LibvirtNetVolumeDriver smbfs=nova.virt.libvirt.volume.LibvirtSMBFSVolumeDriver wait_soft_reboot_seconds = 120 xen_hvmloader_path = /usr/lib/xen/boot/hvmloader osapi_v3: enabled = True extensions_blacklist = extensions_whitelist = oslo_concurrency: disable_process_locking = False lock_path = /home/berrange/src/cloud/data/nova rdp: enabled = False html5_proxy_base_url = http://127.0.0.1:6083/ remote_debug: host = None port = None serial_console: base_url = ws://127.0.0.1:6083/ enabled = False listen = 127.0.0.1 port_range = 10000:20000 proxyclient_address = 127.0.0.1 spice: agent_enabled = True enabled = False html5proxy_base_url = http://10.33.8.112:6082/spice_auto.html keymap = en-us server_listen = 127.0.0.1 server_proxyclient_address = 127.0.0.1 ssl: ca_file = None cert_file = None key_file = None upgrade_levels: baseapi = None cells = None compute = None conductor = None console = None consoleauth = 
None network = None scheduler = None workarounds: disable_libvirt_livesnapshot = True disable_rootwrap = False
You may also try faulthandler, which works even if Python is stuck ;-) Since it works at a lower level, it cannot provide as much information as your own code. You can use faulthandler with a timeout (watchdog) or a signal, so it makes sense to use it in production.
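For reference, a minimal sketch of the faulthandler approach (stdlib only; the temp-file destination and 60-second timeout here are just illustrative choices):

```python
import faulthandler
import signal
import tempfile

# Illustrative log destination -- any object with a real fileno() works,
# e.g. an already-open log file; stderr is the default.
log = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)

# On SIGUSR1, dump the Python tracebacks of *all* threads (green or native)
# to the log, with no debugger and no backdoor port needed.
faulthandler.register(signal.SIGUSR1, file=log, all_threads=True)

# Watchdog variant: if the process is still stuck after 60 seconds,
# dump the tracebacks automatically, and again every 60 seconds.
faulthandler.dump_traceback_later(60, repeat=True, file=log)

# A dump can also be requested directly from code:
faulthandler.dump_traceback(file=log, all_threads=True)

faulthandler.cancel_dump_traceback_later()  # stop the watchdog again
```

As the comment says, the output is lower level than a Guru Meditation Report: just file/line stack traces per thread, with no config or package information.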
This is really interesting! I didn’t know about such a feature and I think it could be very useful.
However, when I try it on my Juno testbed (based on Ubuntu 14.04), it doesn’t work… I’ve created /etc/nova/openstack-common.conf and restarted nova-compute, then ran kill -USR1, but nothing is printed on standard error (nor written to the logfile).
Is openstack-common.conf to be placed in /etc/nova or somewhere else? Is there an option to set in nova.conf to enable reporting?
The note about openstack-common.conf is referring to what developers have to do to enable it in other OpenStack projects. It is not something you do as an admin in production. It is unconditionally enabled in Nova in production, and the report just gets sent to stderr, assuming your init system has not redirected stderr to /dev/null.
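For the curious: that openstack-common.conf is the oslo-incubator sync configuration read by the incubator’s update.py tooling when copying modules into a project tree, not a runtime config file. From memory, enabling the report module for a project looks roughly like this (the exact module list varies per project):

```ini
# openstack-common.conf -- consumed by oslo-incubator's update.py
# when syncing code, never read by the running service
[DEFAULT]
# incubator modules to copy into the project tree
module=report
# python package that receives the copied code
base=nova
```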
When running under mod_wsgi, it doesn’t work. For example, I start ceilometer-api under mod_wsgi, so it doesn’t run as its own standalone process but inside an existing mod_wsgi daemon. In this case, running kill -USR1 kills the subprocess when using mod_wsgi’s daemon mode.
What can we do to deal with this case? Thanks.
Reading from https://code.google.com/p/modwsgi/wiki/ConfigurationDirectives#WSGIRestrictSignal it seems mod_wsgi needs USR1 for its own purposes, so the obvious thing would be to make OpenStack use a different signal, e.g. USR2, or indeed any one of the realtime signals – IIRC there are about 32 realtime signals available for application use, e.g. kill -RTMIN+5 $PID
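As a sketch of that idea in plain Python – this uses a stand-in traceback-dumping handler, not the oslo report code itself, and SIGRTMIN availability depends on the platform:

```python
import signal
import sys
import traceback

def dump_report(signum, frame):
    # Stand-in for a guru-meditation-style handler: dump the stack
    # of the interrupted frame to stderr on demand.
    traceback.print_stack(frame, file=sys.stderr)

# SIGUSR1 is claimed by mod_wsgi for its own purposes, so register
# alternatives instead:
signal.signal(signal.SIGUSR2, dump_report)            # kill -USR2 $PID

# Realtime signals are POSIX/Linux-specific, hence the guard:
if hasattr(signal, "SIGRTMIN"):
    signal.signal(signal.SIGRTMIN + 5, dump_report)   # kill -RTMIN+5 $PID
```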
Hi, I have done some tests with this. It still doesn’t work. Apache starts a main process for the application, and in WSGIDaemonProcess mode requests are handled by a subprocess. We cannot add the guru meditation report to the main process, and if we add it to the application itself, then when a request comes in it reports an error:
ValueError: signal only works in main thread
1. Move the guru meditation report setup into ceilometer.
2. I made the config based on http://docs.openstack.org/developer/ceilometer/install/mod_wsgi.html
The app.wsgi file contains the following code:
from ceilometer import service
from ceilometer.api import app
from ceilometer.openstack.common.report import guru_meditation_report as gmr
from ceilometer import version
# Initialize the oslo configuration library and logging
service.prepare_service([])
application = app.load_app()
log_dir = "/var/log/apache2/ceilometer_error.log"
gmr.TextGuruMeditation.setup_autorun(version, service_name=None, log_dir=log_dir, signum=None)
3. Set WSGIRestrictSignal Off in the /etc/apache2/mods-available/wsgi.conf
4. service apache2 restart
root@tralon-2:/var/www/ceilometer# ps -ef | grep wsgi:ceilometer
stack 27605 27602 0 11:41 ? 00:00:00 (wsgi:ceilometer- -k start
stack 27606 27602 0 11:41 ? 00:00:00 (wsgi:ceilometer- -k start
root 29142 15894 0 11:50 pts/1 00:00:00 grep --color=auto wsgi:ceilometer
5. Send a request:
curl -g -i -X 'GET' 'http://192.168.86.90:8777/v2/meters' -H 'User-Agent: ceilometerclient.openstack.common.apiclient' -H 'X-Auth-Token: 699d63abb60346d89fa8bd02c74804ac'
It returns a 500 error.
The log looks like this:
2015-05-21 11:51:37.686045 mod_wsgi (pid=27606): Exception occurred processing WSGI script '/var/www/ceilometer/app.wsgi'.
2015-05-21 11:51:37.686093 Traceback (most recent call last):
2015-05-21 11:51:37.686120 File "/var/www/ceilometer/app.wsgi", line 30, in <module>
2015-05-21 11:51:37.686194 gmr.TextGuruMeditation.setup_autorun(version, service_name=None, log_dir=log_dir, signum=None)
2015-05-21 11:51:37.686213 File "/opt/stack/ceilometer/ceilometer/openstack/common/report/guru_meditation_report.py", line 126, in setup_autorun
2015-05-21 11:51:37.686378 lambda sn, tb: cls.handle_signal(
2015-05-21 11:51:37.686457 ValueError: signal only works in main thread
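That ValueError comes from CPython itself: signal.signal() may only be called from the main thread, and in mod_wsgi daemon mode the WSGI script is executed in a worker thread. A minimal reproduction outside mod_wsgi:

```python
import signal
import threading

errors = []

def install_handler():
    try:
        # signal.signal() is only allowed in the main thread; this is
        # exactly what setup_autorun() trips over under mod_wsgi.
        signal.signal(signal.SIGUSR2, lambda signum, frame: None)
    except ValueError as exc:
        errors.append(str(exc))

t = threading.Thread(target=install_handler)
t.start()
t.join()
# 'errors' now holds the same "signal only works in main thread"
# message seen in the mod_wsgi log above
```

This is why moving the setup_autorun() call into app.wsgi cannot help in daemon mode: the signal registration itself never happens in the main thread.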