Error Handling Overview


Computations fail because of hardware faults, software faults, and human error. How much each cause contributes is difficult to quantify and depends strongly on circumstances. We can therefore conclude that

  • failures are in principle unavoidable and that
  • we must take care of all three sources of errors [1].

Error Handling

An actor system consists of computationally separate, concurrent entities. If one actor fails, the system does not crash immediately, as a sequentially organized application would. Other actors can continue their tasks as long as they do not try to communicate with the failed one. The system is nevertheless in a problematic state, and we must prevent cascading failures.

The solution is not to defend against errors but to organize the system such that actors

  • monitor each other for failures and
  • perform corrective actions if a failure is detected.
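The monitor-and-correct idea above can be sketched in a few lines. This is a minimal illustration in Python using only threads and queues, not the Actors API: the names worker, supervise, and max_restarts are hypothetical, and the "boom" message exists only to simulate a failure.

```python
import queue
import threading


def worker(inbox, events, name):
    """Process messages until told to stop; report a crash instead of hiding it."""
    try:
        while True:
            msg = inbox.get()
            if msg == "stop":
                events.put(("exit", name, "normal"))
                return
            if msg == "boom":  # hypothetical message that simulates a failure
                raise RuntimeError("simulated failure")
            events.put(("done", name, msg))
    except Exception as exc:
        # The failure is reported to whoever monitors this worker.
        events.put(("exit", name, type(exc).__name__))


def supervise(messages, max_restarts=3):
    """Run one worker, watch its exit events, and restart it after a failure."""
    inbox, events = queue.Queue(), queue.Queue()
    for m in messages:
        inbox.put(m)

    def start():
        threading.Thread(
            target=worker, args=(inbox, events, "w"), daemon=True
        ).start()

    log, restarts = [], 0
    start()
    while True:
        kind, _name, payload = events.get()
        log.append((kind, payload))
        if kind == "exit":
            if payload == "normal" or restarts >= max_restarts:
                return log, restarts
            restarts += 1  # corrective action: replace the failed worker
            start()
```

Calling supervise([1, "boom", 2, "stop"]) lets the first worker handle 1 and fail on "boom"; the supervising loop sees the exit event, starts a replacement, and the remaining messages are processed by the new worker. The worker itself contains no defensive code beyond reporting its exit, which is the point of the approach.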

Actors provides the following mechanisms [1][2][3] to handle errors in actors and tasks:

mechanism        brief description
connections      Actors connect and propagate an Exit to each other.
monitors         Actors can monitor other actors and tasks.
supervisors      Actors can be supervised and restarted.
checkpointing    Actors can save checkpoints to checkpointing actors and restore them.
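To make the checkpointing entry concrete, here is a minimal Python sketch of saving state after each step and resuming from the last checkpoint after a failure. It is not the Actors API: CheckpointStore, run_counter, and the fail_at parameter are hypothetical names chosen for illustration, and an in-memory dictionary stands in for a checkpointing actor.

```python
import copy


class CheckpointStore:
    """In-memory stand-in for a checkpointing actor (hypothetical)."""

    def __init__(self):
        self._saved = {}

    def checkpoint(self, key, state):
        # Store a copy so later mutation of the state cannot alter the checkpoint.
        self._saved[key] = copy.deepcopy(state)

    def restore(self, key, default=None):
        return copy.deepcopy(self._saved.get(key, default))


def run_counter(store, items, fail_at=None):
    """Sum items, checkpointing after each one and resuming from the last checkpoint."""
    state = store.restore("counter", {"total": 0, "seen": 0})
    for item in items[state["seen"]:]:
        if fail_at is not None and state["seen"] == fail_at:
            raise RuntimeError("simulated crash")  # failure injected for the demo
        state["total"] += item
        state["seen"] += 1
        store.checkpoint("counter", state)
    return state["total"]
```

If a first call run_counter(store, [1, 2, 3, 4], fail_at=2) raises after two items, a second call run_counter(store, [1, 2, 3, 4]) with the same store picks up at the third item instead of starting over.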

The _ROOT Actor

If connect or monitor is called from the REPL or a user script rather than from an actor, the given link is connected to or monitored by the Actors._ROOT actor.

  • [1] Egwutuoha, I.P., Levy, D., Selic, B. et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65, 1302–1326 (2013).
  • [2] An outline of actor-based error handling is given in Joe Armstrong's dissertation: Making reliable distributed systems in the presence of software errors.
  • [3] For implementation, see also Joe Armstrong 2013. Programming Erlang, 2nd ed: Software for a Concurrent World; Pragmatic Bookshelf, chs. 13 and 23, as well as the Erlang/OTP and Elixir online documentation.