Supervise Actors

A supervisor is an actor looking after child actors and restarting them as necessary when they exit.

Set up a supervisor

We set up a supervisor A10 with the default supervision strategy :one_for_one:

julia> using Actors, .Threads

julia> import Actors: spawn

julia> A10 = supervisor()
Link{Channel{Any}}(Channel{Any}(32), 1, :supervisor)

Supervise child actors

We start six actors A1-A6 and supervise them with A10 using the default restart arguments. If they fail, they are restarted with their threadid behavior. By default they are assumed to be :transient (they get restarted only if they terminate abnormally).

julia> A = map(_->spawn(threadid), 1:6);    # spawn A1 - A6

julia> t = map(a->Actors.diag(a, :task), A) # A1 - A6 are running
6-element Vector{Task}:
 Task (runnable) @0x000000016e948560
 Task (runnable) @0x000000016e949660
 Task (runnable) @0x000000016e949880
 Task (runnable) @0x000000016e949bb0
 Task (runnable) @0x000000016e949ee0
 Task (runnable) @0x000000016e94a100

julia> foreach(a->exec(a, supervise, A10), A)

One for one: With the default supervision strategy :one_for_one the supervisor restarts only the actor that failed:

julia> send(A[4], :boom);                   # let A4 fail
┌ Warning: 2021-02-13 12:55:27 x-d-kuhub-dabab: Exit: supervised Task (failed) @0x000000016e949bb0, MethodError(Base.Threads.threadid, (:boom,), 0x0000000000007458)
└ @ Actors ~/.julia/dev/Actors/src/logging.jl:31
┌ Warning: 2021-02-13 12:55:27 x-d-kuhub-dabab: supervisor: restarting
└ @ Actors ~/.julia/dev/Actors/src/logging.jl:31

julia> t = map(a->Actors.diag(a, :task), A) # look at the tasks
6-element Vector{Task}:
 Task (runnable) @0x000000016e948560
 Task (runnable) @0x000000016e949660
 Task (runnable) @0x000000016e949880
 Task (runnable) @0x000000010e3b8230
 Task (runnable) @0x000000016e949ee0
 Task (runnable) @0x000000016e94a100

A1-A6 all have runnable tasks, but A4 has been restarted (it got a new task).

One for all: With the second strategy :one_for_all, all supervised actors/tasks get restarted if one of them fails. This allows a group of peer actors that depend on each other to be restarted together. Normally we would choose the strategy at supervisor start, but here we change the supervision strategy of the running supervisor A10 and let A4 fail again:

julia> set_strategy(A10, :one_for_all)      # change restart strategy
(Actors.Strategy(:one_for_all),)

julia> send(A[4], :boom);                   # let A4 fail again
┌ Warning: 2021-02-13 12:57:16 x-d-kuhub-dabab: Exit: supervised Task (failed) @0x000000010e3b8230, MethodError(Base.Threads.threadid, (:boom,), 0x0000000000007459)
└ @ Actors ~/.julia/dev/Actors/src/logging.jl:31
┌ Warning: 2021-02-13 12:57:16 x-d-kuhub-dabab: supervisor: restarting all
└ @ Actors ~/.julia/dev/Actors/src/logging.jl:31

julia> t = map(a->Actors.diag(a, :task), A)
6-element Vector{Task}:
 Task (runnable) @0x000000010e3b8450
 Task (runnable) @0x000000010e3b8670
 Task (runnable) @0x000000010e3b8890
 Task (runnable) @0x000000010e3b8ab0
 Task (runnable) @0x000000010e3b8cd0
 Task (runnable) @0x000000010e3b9000

All actors have been restarted (got new tasks).

Rest for one: With :rest_for_one only the failed actor and the actors registered for supervision after it are restarted. This allows a failed actor to be restarted together with only those actors that depend on it. Again we change A10's strategy and let A4 fail:

julia> set_strategy(A10, :rest_for_one)     # change strategy again
(Actors.Strategy(:rest_for_one),)

julia> send(A[4], :boom);                   # let A4 fail
┌ Warning: 2021-02-13 12:58:33 x-d-kuhub-dabab: Exit: supervised Task (failed) @0x000000010e3b8ab0, MethodError(Base.Threads.threadid, (:boom,), 0x000000000000745a)
└ @ Actors ~/.julia/dev/Actors/src/logging.jl:31
┌ Warning: 2021-02-13 12:58:33 x-d-kuhub-dabab: supervisor: restarting rest
└ @ Actors ~/.julia/dev/Actors/src/logging.jl:31

julia> t = map(a->Actors.diag(a, :task), A)
6-element Vector{Task}:
 Task (runnable) @0x000000010e3b8450
 Task (runnable) @0x000000010e3b8670
 Task (runnable) @0x000000010e3b8890
 Task (runnable) @0x000000010e3b9220
 Task (runnable) @0x000000010e3b9440
 Task (runnable) @0x000000010e3b9770

Now A4 - A6 have been restarted.

Further options

Further supervisor options limit how many times a supervisor tries to restart children within a given timeframe. If it exceeds this limit, it terminates itself and all of its children with a warning.
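
The following is only a rough sketch of how such limits might be set at supervisor start. The parameter names max_restarts and max_seconds are assumptions following the Erlang convention; check the supervisor API reference for the exact names and calling convention:

# hypothetical sketch: a supervisor that gives up after too many restarts
# (the limit parameters max_restarts/max_seconds are assumed, not verified)
sv2 = supervisor(:one_for_one; max_restarts=3, max_seconds=5)

a = spawn(threadid)          # spawn and supervise a child as before
exec(a, supervise, sv2)
# if a now fails more than 3 times within 5 seconds, sv2 terminates
# itself and all of its children with a warning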

Query failed actors

Each failure produced a warning, but we can also query the last failures from the supervisor and get more information about them:

julia> failed = Actors.diag(A10, :err)      # the three failed tasks can be queried from the supervisor
3-element Vector{Task}:
 Task (failed) @0x000000016e949bb0
 Task (failed) @0x000000010e3b8230
 Task (failed) @0x000000010e3b8ab0

julia> failed[1]                            # exceptions and stacktraces are available
Task (failed) @0x000000016e949bb0
MethodError: no method matching threadid(::Symbol)
....

Maintain actor state across restarts

By default a supervisor restarts an actor with the behavior it had before exiting. An actor thus maintains its state over a restart:

julia> sv = supervisor()
Link{Channel{Any}}(Channel{Any}(32), 1, :supervisor)

julia> incr(xr, by=0) = xr[] += by        # define an accumulator
incr (generic function with 2 methods)

julia> myactor = spawn(incr, Ref(10))     # start an actor accumulating from 10
Link{Channel{Any}}(Channel{Any}(32), 1, :default)

julia> exec(myactor, supervise, sv);      # put it under supervision

julia> foreach(x->call(myactor, x), 1:10) # add 1 + 2 + ... + 10 = 55 to it

julia> call(myactor)
65

julia> send(myactor, :boom);              # let it fail
┌ Warning: 2021-04-19 17:00:17 x-d-kolok-ib Exit: supervised Task (failed) @0x000000016b9d4010, MethodError(+, (65, :boom), 0x00000000000074aa)
└ @ Actors ~/.julia/dev/Actors/src/logging.jl:39
┌ Warning: 2021-04-19 17:00:17 x-d-kolok-ib supervisor: restarting
└ @ Actors ~/.julia/dev/Actors/src/logging.jl:39

julia> call(myactor)
65

julia> info(myactor)
Actor    default
Behavior incr
Pid      1, Thread 1
Task     @0x000000016da2ba80
Ident    x-d-kukof-ropab

myactor has maintained its state across the failure even though it got a new task.

Actor state recovery after node failures is different!

In case of a node failure, an actor cannot send its state at failure time to the supervisor. In such cases you can use termination and restart callbacks and checkpointing for recovery.
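
The following is only a rough, hypothetical sketch of the checkpointing idea. The helper functions and the way the behavior is rebuilt are illustrations, not the actual callback/checkpointing API; see the documentation on callbacks and checkpointing for the real interface. The actor writes its state to stable storage on every update, and a restart can read it back:

using Actors, Serialization
import Actors: spawn

# hypothetical helpers: persist and recover an accumulator value
save_checkpoint(file, state) = serialize(file, state)
load_checkpoint(file, default) = isfile(file) ? deserialize(file) : default

# an accumulator behavior that checkpoints after every update
function incr_ckpt(xr, by=0)
    xr[] += by
    save_checkpoint("acc.chk", xr[])  # state survives a node failure
    return xr[]
end

# after a node failure, a restart callback could rebuild the actor from the
# last checkpoint instead of from the lost in-memory state
restart_from_checkpoint() = spawn(incr_ckpt, Ref(load_checkpoint("acc.chk", 10)))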