Fix: ust: UST communication can return -EAGAIN
Observed issue
==============
The following scenario leads to an abort on event creation. The
problem manifests itself when an application is unresponsive. Note that
the default timeout for UST communication is 5 seconds.
# Start an instrumented app
./app
gdb lttng-sessiond
# Put a breakpoint on ustctl_create_event.
lttng create my_session
lttng enable-event -u -a
lttng start
# The breakpoint should hit. Do not continue.
kill -s SIGSTOP $(pgrep app)
# Continue lttng-sessiond.
# lttng-sessiond will abort.
Note that this is not expected behaviour for UST. An expected
communication failure with a single app should not invalidate the
complete channel, compromise its setup, or result in an abort.
Note that a similar scenario at the following ustctl call sites also
leads to the failure of a single app causing error reporting and/or
error propagation to upper-level objects.
Problematic callsites:
ustctl_set_exclusion
ustctl_set_filter
ustctl_disable_channel
These callsites are also fixed by this patch.
Cause
=====
For an unresponsive application, EAGAIN is returned and is treated as an
"unknown" hard error.
In this particular case the abort() call was introduced by commit
88e3c2f5610b9ac89b0923d448fee34140fc46fb [1]. It is not clear whether
this is a leftover from a debugging session, since this is the only call
site where an abort is issued on communication failure via ustctl.
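To illustrate how an unresponsive peer surfaces as EAGAIN, here is a
minimal sketch. It assumes the communication timeout is enforced as a
socket receive timeout (SO_RCVTIMEO); the helper name
recv_from_silent_peer is hypothetical and is not part of lttng-tools or
lttng-ust.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait for one byte from a peer that never writes.
 * Returns the number of bytes read, or a negative errno value.
 * timeout_ms must be > 0 (a zero timeval means "block forever").
 */
static int recv_from_silent_peer(int timeout_ms)
{
	int sv[2];
	int ret;
	char byte;
	ssize_t len;
	struct timeval tv = {
		.tv_sec = timeout_ms / 1000,
		.tv_usec = (timeout_ms % 1000) * 1000,
	};

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		return -errno;
	}

	/* Bound the time recv() may block on this end of the pair. */
	if (setsockopt(sv[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
		ret = -errno;
		goto end;
	}

	/* The peer (sv[1]) stays silent, like a SIGSTOP'ed application. */
	len = recv(sv[0], &byte, sizeof(byte), 0);
	ret = len < 0 ? -errno : (int) len;
end:
	(void) close(sv[0]);
	(void) close(sv[1]);
	return ret;
}
```

With a silent peer, the call is expected to give up once the timeout
expires and report -EAGAIN (EWOULDBLOCK has the same value on Linux),
which is the value a caller must not mistake for a hard error.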
Solution
========
Handle EAGAIN coming from ustctl_* and treat it the same way a
dying application is handled. The only minor difference is that we WARN
on communication timeout. Although not the most useful thing for a CLI
client, it could help users of lttng-sessiond in timeout situations.
Most call sites already handled "unknown" errors correctly. For those
call sites we simply end up providing more information about the
timeout instead of mentioning that "-11" was returned.
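The handling described above can be sketched as follows. This is an
illustration only, not the code of this patch: the enum, both function
names, and the fprintf() calls (stand-ins for the sessiond logging
macros) are hypothetical, and it assumes a dead or timed-out app is
reported as success to the caller so that the failure is not propagated.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical classification of a ustctl_* return value. */
enum ust_comm_status {
	UST_COMM_OK,
	UST_COMM_APP_DEAD,
	UST_COMM_APP_TIMEOUT,
	UST_COMM_ERROR,
};

static enum ust_comm_status classify_ustctl_ret(int ret)
{
	if (ret >= 0) {
		return UST_COMM_OK;
	}

	switch (ret) {
	case -EPIPE:
		/* The application closed its end: it is dying or dead. */
		return UST_COMM_APP_DEAD;
	case -EAGAIN:
		/* The application did not answer within the timeout. */
		return UST_COMM_APP_TIMEOUT;
	default:
		return UST_COMM_ERROR;
	}
}

/*
 * Returns 0 when the per-app failure must not be propagated to the
 * upper-level object, or the original negative value on a hard error.
 */
static int handle_ustctl_ret(int ret, int pid, int sock)
{
	switch (classify_ustctl_ret(ret)) {
	case UST_COMM_OK:
		return ret;
	case UST_COMM_APP_DEAD:
		fprintf(stderr, "debug: application is dead: pid = %d, sock = %d\n",
				pid, sock);
		return 0;
	case UST_COMM_APP_TIMEOUT:
		/* Same outcome as a dead app, but louder. */
		fprintf(stderr, "warning: communication timed out: pid = %d, sock = %d\n",
				pid, sock);
		return 0;
	default:
		fprintf(stderr, "error: ret = %d, pid = %d, sock = %d\n",
				ret, pid, sock);
		return ret;
	}
}
```

For example, handle_ustctl_ret(-EAGAIN, pid, sock) logs a timeout
warning and returns 0, whereas an unrelated failure such as -EINVAL is
still logged as an error and returned unchanged.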
Note that the reclamation of the "app" is handled by the poll loop and
ust_app_unregister, since the socket is shut down by lttng-ust internally
on error, including EAGAIN.
Note that the application will try to register itself again with
lttng-sessiond based on its configuration.
Known drawbacks
===============
None
Note
====
Some logging call sites used the ppid of the app instead of the pid.
Those have been changed to pid.
References
==========
[1] https://github.com/lttng/lttng-tools/commit/88e3c2f5610b9ac89b0923d448fee34140fc46fb
Fixes: #1384
Change-Id: If364b5d48e7fd2b664276a0fb1b7eec2c45ed683
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>