From: Jonathan Rajotte Date: Thu, 9 Dec 2021 20:14:26 +0000 (-0500) Subject: Fix: sessiond: assert on lttng_ht_add_unique_str on ltt_sessions_ht_by_name X-Git-Url: http://drtracing.org/?a=commitdiff_plain;h=0e8b1d49c46e95dccaf0fa0fb5ad952643c60584;p=lttng-tools.git Fix: sessiond: assert on lttng_ht_add_unique_str on ltt_sessions_ht_by_name Observed issue ============== The lttng-sessiond asserts with the following backtrace on lttng create: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007ffff7ab5859 in __GI_abort () at abort.c:79 #2 0x00007ffff7ab5729 in __assert_fail_base (fmt=0x7ffff7c4b588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5555556ab0a6 "node_ptr == &node->node", file=0x5555556ab085 "hashtable.c", line=298, function=) at a #3 0x00007ffff7ac6f36 in __GI___assert_fail (assertion=assertion@entry=0x5555556ab0a6 "node_ptr == &node->node", file=file@entry=0x5555556ab085 "hashtable.c", line=line@entry=298, function=function@entry=0x5555556ab380 <__PRETTY_FUNCTIO #4 0x000055555560be44 in lttng_ht_add_unique_str (ht=, node=0x7fffe0026c58) at hashtable.c:298 #5 0x000055555558fb6a in add_session_ht (ls=0x7fffe0024970) at session.c:372 #6 session_create (name=, uid=1000, gid=1000, out_session=out_session@entry=0x7fffedfddbd8) at session.c:1308 #7 0x000055555559b219 in cmd_create_session_from_descriptor (creds=, creds=, home_path=, descriptor=) at cmd.c:3040 #8 cmd_create_session (cmd_ctx=cmd_ctx@entry=0x7fffedfe5fa0, sock=, return_descriptor=return_descriptor@entry=0x7fffedfddd68) at cmd.c:3176 #9 0x00005555555cc341 in process_client_msg (sock_error=0x7fffedfddd10, sock=0x7fffedfddd0c, cmd_ctx=0x7fffedfe5fa0) at client.c:2177 #10 thread_manage_clients (data=) at client.c:2742 #11 0x00005555555c5fff in launch_thread (data=0x55555571b780) at thread.c:66 #12 0x00007ffff7c8b609 in start_thread (arg=) at pthread_create.c:477 #13 0x00007ffff7bb2293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 The issue can be reproduced with modifications to the rotation thread code and the following scenario: $ lttng create test $ lttng enable-event -u -a $ lttng start run any app just so that we have a complete valid session. (might not be necessary) $ lttng destroy --no-wait $ lttng create test ^ Should assert here. The diff to be applied: diff --git a/src/bin/lttng-sessiond/rotation-thread.cpp b/src/bin/lttng-sessiond/rotation-thread.cpp index ac149c845..c11f068ed 100644 --- a/src/bin/lttng-sessiond/rotation-thread.cpp +++ b/src/bin/lttng-sessiond/rotation-thread.cpp @@ -565,6 +565,8 @@ int handle_job_queue(struct rotation_thread_handle *handle, { int ret = 0; + sleep(5); + for (;;) { struct ltt_session *session; struct rotation_thread_job *job; Note that the initial report for this issue was on a system under load for which the `lttng destroy` completion check failed and a `lttng create` was performed. As of today the exact reason why the completion check failed is not known. Still we can "fix" the race leading to the lttng-sessiond assertion considering a user might use the `--no-wait` variant of `lttng destroy` and could easily end up in this situation. Cause ===== Note: all `lttng create` commands have the same session name passed as argument. On `lttng destroy` the ltt_session object is flagged as destroyed (ltt_session::destroyed). The removal of the object from the hash table (ltt_sessions_ht_by_name) will be performed during the `session_release` which is driven by the session refcount. A reference on the `ltt_session` object is held for the rotation initiated by the `lttng destroy` command. The rotation is enqueued by the rotation thread. At this point the system is busy and the rotation thread does not run. We simulate this with a `sleep(5)` during the `handle_job_queue`. The `lttng destroy --no-wait` returns. If the `--no-wait` option is not passed the `lttng destroy` command will work as expected and wait for completion. We can SIGINT the `lttng destroy` command and perform a `lttng create` yielding the same backtrace. On `lttng create`, `session_create` validates that the name does not conflict with an existing session using `session_find_by_name`. It is important to note that `session_find_by_name` discriminates also on the `session->destroyed` flag (introduced by [1]). The `ltt_sessions_ht_by_name` hash table was introduced by [2] to remove the need to lock the session list to sample a session id during the queueing of actions to be executed related to a trigger. The assumption was made that, during the creation phase, the session would always be unique in that hash table based on its name. This is simply not true since multiple sessions with the same name can coexist as long as only a single one is marked as "not destroyed". This is an important concept due to the refcounting of the object and the feature relying on the lifetime of the object (i.e rotation). This is mostly valid when talking about the global session list. Solution ======== Move the hash table removal earlier during the release of the session object. Move the removal from `del_session_ht`, which is done during the `session_release` function, to the `lttng_session_destroy` function. It is safe to do so since currently the only user of that hash table (the action executor) does not care much about destroyed session at that point. This ensures that we maintain the uniqueness property of the key (name) for that hash table on insertion. The alternative was to expose an hash table that could contain duplicates and force the handling of a set on all lookups. Known drawbacks ========= None. References ========== [1] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=e32d7f274604b77bcd83c24994e88df3761ed658 [2] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=e1bbf98908a6399f39a9a8bc95bd8b59cecaa816 Signed-off-by: Jonathan Rajotte Signed-off-by: Jérémie Galarneau Change-Id: I2f1d0d6c04ee7210166e9847a850afbe6eaa7609 --- diff --git a/src/bin/lttng-sessiond/session.c b/src/bin/lttng-sessiond/session.c index 2d95aa268..fd7a0aae8 100644 --- a/src/bin/lttng-sessiond/session.c +++ b/src/bin/lttng-sessiond/session.c @@ -398,7 +398,7 @@ end: } /* - * Remove a ltt_session from the ltt_sessions_ht_by_id/name. + * Remove a ltt_session from the ltt_sessions_ht_by_id. * If empty, the ltt_sessions_ht_by_id/name HTs are freed. * The session list lock must be held. */ @@ -415,10 +415,6 @@ static void del_session_ht(struct ltt_session *ls) ret = lttng_ht_del(ltt_sessions_ht_by_id, &iter); assert(!ret); - iter.iter.node = &ls->node_by_name.node; - ret = lttng_ht_del(ltt_sessions_ht_by_name, &iter); - assert(!ret); - if (ltt_sessions_ht_empty()) { DBG("Empty ltt_sessions_ht_by_id/name, destroying hast tables"); ltt_sessions_ht_destroy(); @@ -1065,8 +1061,23 @@ void session_put(struct ltt_session *session) */ void session_destroy(struct ltt_session *session) { + int ret; + struct lttng_ht_iter iter; + assert(!session->destroyed); session->destroyed = true; + + /* + * Remove immediately from the "session by name" hash table. Only one + * session is expected to exist with a given name for at any given time. + * + * Even if a session still technically exists for a little while longer, + * there is no point in performing action on a "destroyed" session. + */ + iter.iter.node = &session->node_by_name.node; + ret = lttng_ht_del(ltt_sessions_ht_by_name, &iter); + assert(!ret); + session_put(session); }