2022-02-01 15:22:30 Two problems.
- If a stop action panics (in `_terminate_facet`), the Facet is dropped before its outbound
handles are removed. With the code as it stands, this leaks assertions (!!).
- The logic for removing an outbound handle seems to be running in the wrong facet context???
(See `f.outbound_handles.remove(&handle)` in the cleanup actions
- I think I need to remove the for_myself mechanism
- and add some callbacks to run only on successful commit
2022-02-02 12:12:33 This is hard.
Here's the current implementation:
- assert
- inserts into outbound_handles of active facet
- adds cleanup action describing how to do the retraction
- enqueues the assert action, which
- calls e.assert()
- retract
- looks up & removes the cleanup action, which
- enqueues the retract action, which
- removes from outbound_handles of the WRONG facet in the WRONG actor
- calls e.retract()
- _terminate_facet
- uses outbound_handles to retract the facet's assertions
- doesn't directly touch cleanup actions, relying on retract to do that
- if one of a facet's stop actions panics, will drop the facet, leaking its assertions
- actually, even if a stop action yields `Err`, it will drop the facet and leak assertions
- yikes
- facet drop
- panics if outbound_handles is nonempty
- actor cleanup
- relies on facet tree to find assertions to retract
Revised plan:
- ✓ revise Activation/PendingEvents structures
- rename `cleanup_actions` to `outbound_assertions`
- remove `for_myself` queues and `final_actions`
- add `pre_commit_actions`, `rollback_actions` and `commit_actions`
- ✓ assert
- as before
- but on rollback, removes from `outbound_handles` (if the facet still exists) and
`outbound_assertions` (always)
- marks the new assertion as "established" on commit
- ✓ retract
- lookup in `outbound_assertions` by handle, using presence as indication it hasn't been
scheduled in this turn
- on rollback, put it back in `outbound_assertions` ONLY IF IT IS MARKED ESTABLISHED -
otherwise it is a retraction of an `assert` that has *also* been rolled back in this turn
- on commit, remove it from `outbound_handles`
- enqueue the retract action, which just calls e.retract()
- ✓ _terminate_facet
- revised quite a bit now we rely on `RunningActor::cleanup` to use `outbound_assertions`
rather than the facet tree.
- still drops Facets on panic, but this is now mostly harmless (reorders retractions a bit)
- handles `Err` from a stop action more gracefully
- slightly cleverer tracking of what needs doing based on a `TerminationDirection`
- now ONLY applies to ORDERLY cleanup of the facet tree. Disorderly cleanup ignores the
facet tree and just retracts the assertions willy-nilly.
- ✓ facet drop
- warn if outbound_handles is nonempty, but don't do anything about it
- ✓ actor cleanup
- doesn't use the facet tree at all.
- cleanly shutting down is done elsewhere
- uses the remaining entries in `outbound_assertions` (previously `cleanup_actions`) to
deal with retractions for dropped facets as well as any other facets that haven't been
cleanly shut down
- ✓ activate
- now has a panic_guard::PanicGuard RAII for conveying a crash to an actor in case the
activation is happening from a linked task or another thread (this wasn't the case in the
examples that provoked this work, though)
- simplified
- explicit commit/rollback decision
- ✓ Actor::run
- no longer uses the same path for crash-termination and success-termination
- instead, for success-termination, takes a turn that calls Activation::stop_root
- this cleans up the facet tree using _terminate_facet
- when the turn ends, it notices that the root facet is gone and shuts down the actor
- so in principle there will be nothing for actor cleanup to do
2022-02-04 13:52:34 This took days. :-(
If a facet, during X, asserts X, for all X, then X includes all
`Observe` assertions. Assertion of X should be a no-op (though
subsequent retractions of X will have no effect!) since duplicates are
ignored. However, the implementation had been ignoring whether it had
seen `Observe` assertions before, and was *always* (re)placing them
into the index, leading to runaway growth.
The repair is to only process `Observe` records on first assertion and
last retraction.
As part of this change, Dataspaces have been given names, and some
cruft from the previous implementation has been removed.
They're still there: you can use turn.state.shutdown(), which enqueues
a message for eventual actor shutdown. But it's better to use
turn.stop_root(), which terminates the actor's root facet within the
current turn, ensuring that the actor's exit_status is definitely set
by the time the turn has committed.
This is necessary to avoid a racy panic in supervision: before this
change, an asynchronous SystemMessage::Release was sent when the last
facet of an actor was stopped. Depending on load (!), any retractions
resulting from the shutdown would be delivered before the Release
arrived at the stopping actor. The supervision logic expected
exit_status to be definitely set by the time release() fired, which
wasn't always true. Now that in-turn shutdown has been implemented,
this is a reliable invariant.
A knock-on change is the need to remove
enqueue_for_myself_at_commit(), replacing it with a use of
pending.for_myself.push(). The old enqueue_for_myself_at_commit
approach could lead to lost actions as follows:
A: start linked task T, which spawns a new tokio coroutine
T: activate some facet in A and terminate A's root facet
T: at this point, A transitions to "not running"
A: spawn B, enqueuing a call to B's boot()
A: commit turn. Deliveries for others go out as usual,
but those for A will be discarded since A is "not running".
This means that the call to B's boot() goes missing.
Using pending.for_myself.push() instead assures that B's boot will
always run at the end of A's turn, without regard for whether A is in
some terminated state.
I think that this kind of race could have happened before, but
something about switching away from shutdown() seems to trigger it
somewhat reliably.
In particular:
1. The root facet is considered inert even if it has outbound
assertions. This is because the only outbound assertion it can have is
a half-link to a peer actor, which shouldn't prevent the actor from
terminating normally if the user-level "root" facet stops.
2. On stop_facet_and_continue, parent-facet continuations execute
inline rather than at commit time. This is so that a user-level "root"
facet can *replace* itself. Remains to be properly exercised/tested.