Service restarts
You can configure a service restart backoff period to control how quickly a service is restarted following the failure of an init or run hook.
Overview
The restart backoff behavior is set by three parameters:
- The minimum backoff period sets the minimum duration in seconds to wait before restarting a service.
- The maximum backoff period sets the maximum duration in seconds to wait before restarting a service.
- The restart cooldown period sets the time in seconds to wait before resetting the current backoff duration to the minimum backoff period.
Note
Enable these values by using the sup run command and passing the number of seconds to the following parameters:
- service-min-backoff-period
- service-max-backoff-period
- service-restart-cooldown-period
For example:
hab sup run --service-min-backoff-period 5 --service-max-backoff-period 20 --service-restart-cooldown-period 60 core/redis
You can also set this behavior using these parameters in the supervisor configuration file:
- service_min_backoff_period
- service_max_backoff_period
- service_restart_cooldown_period
Chef Habitat uses a decorrelated jitter algorithm to determine the backoff period. See this blog post for a more in-depth comparison of various backoff algorithms and their efficiency.
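The following is a minimal sketch of decorrelated jitter as it is commonly described, not a copy of the Supervisor's implementation. The function name `next_backoff`, its parameters, and the multiplier of 3 are assumptions made for illustration; only the idea of a randomized wait bounded by the minimum and maximum backoff periods comes from the text above.

```python
import random


def next_backoff(min_s: float, max_s: float, previous_s: float,
                 rng: random.Random) -> float:
    """Return the next wait in seconds using decorrelated jitter.

    Assumed formula (illustrative, not taken from Habitat's source):
        wait = min(max_s, uniform(min_s, previous_s * 3))
    """
    # The upper end of the random range grows with the previous wait,
    # but never drops below the minimum backoff period.
    upper = max(min_s, previous_s * 3)
    # The maximum backoff period caps whatever value was drawn.
    return min(max_s, rng.uniform(min_s, upper))


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so repeated runs print the same values
    wait = 5.0              # start from the minimum backoff period
    for attempt in range(1, 6):
        wait = next_backoff(5.0, 20.0, wait, rng)
        print(f"restart {attempt}: wait {wait:.1f}s")
```

Under this formula, every wait stays between the minimum and maximum backoff periods, and equal minimum and maximum values always yield the same wait.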
Note
Setting service-min-backoff-period and service-max-backoff-period to the same time in seconds produces a constant backoff period.
Service failure detection
Adding restart backoff behavior requires the ability to detect when a service has successfully started so the backoff period can be reset. Unfortunately, there is no clean way to differentiate between a service failure and a service taking too long to start up. A health-check hook would enable detection of successful service startups; however, if a health check is absent, there is no way to know if the service started successfully. There may also be cases where the initial health check succeeds, but the service goes down shortly afterward.
We attempt to solve this problem by using a restart cooldown period. The cooldown period is a continuous duration without a restart, after which we assume a service has started successfully. It’s important to configure this correctly to ensure the backoff period doesn’t get reset prematurely.
We recommend setting the service-restart-cooldown-period to be at least double your expected startup time to be safe. In general, a longer cooldown won’t have an adverse effect; however, a shorter one may prevent the backoff behavior completely.
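The cooldown rule can be sketched as a small decision function. This is an illustration of the behavior described above, not Supervisor code; the names `backoff_after_cooldown` and `recommended_cooldown` are hypothetical.

```python
def backoff_after_cooldown(current_backoff_s: float, min_backoff_s: float,
                           cooldown_s: float,
                           seconds_since_last_restart: float) -> float:
    """Return the backoff the next restart builds on.

    If the service has gone a full cooldown period without a restart, it is
    assumed to have started successfully and the backoff falls back to the
    minimum. Otherwise the current backoff is kept.
    """
    if seconds_since_last_restart >= cooldown_s:
        return min_backoff_s
    return current_backoff_s


def recommended_cooldown(expected_startup_s: float) -> float:
    """Apply the rule of thumb above: at least double the expected startup time."""
    return 2 * expected_startup_s


if __name__ == "__main__":
    # A service that needs about 30 seconds to start and crashes right after.
    print(backoff_after_cooldown(15, 5, cooldown_s=10, seconds_since_last_restart=30))
    print(backoff_after_cooldown(15, 5, cooldown_s=60, seconds_since_last_restart=30))
    print(recommended_cooldown(30))
```

With a 10-second cooldown the first call falls back to the minimum, so a slow-starting service never accumulates a longer backoff. With a 60-second cooldown the current backoff is kept.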
See the following examples.
Examples
Slow service with an incorrect configuration
hab sup run --service-min-backoff-period 5 --service-max-backoff-period 20 --service-restart-cooldown-period 10 ORG_NAME/SERVICE_NAME
With the above configuration, a service that fails during startup is always restarted after 5 seconds. The 10-second cooldown period elapses before the service crashes again, so the backoff period is reset to the minimum every time, potentially leading to excessive load on external APIs:
- T = 0, service starts up
- T = 30, service crashes and will be restarted after 5 seconds
- T = 35, service is restarted
- T = 45, service backoff period is reset to 5 seconds because 10 seconds have elapsed since the last restart
- T = 65, service crashes again and is restarted after 5 seconds due to the backoff period resetting at T = 45
- T = 70, service is restarted again
Slow service with a correct configuration
hab sup run --service-min-backoff-period 5 --service-max-backoff-period 20 --service-restart-cooldown-period 60 ORG_NAME/SERVICE_NAME
With the above configuration, a service that fails during startup is restarted after a randomly chosen backoff period (15 seconds in this example) instead of always after the minimum, which reduces the load on external APIs:
- T = 0, service starts up
- T = 30, service crashes and will be restarted after 5 seconds
- T = 35, service is restarted
- T = 65, service crashes again and restarts after a random duration between 5 and 20 seconds; for this example, assume 15
- T = 80, service is restarted again; note that the backoff period hasn’t been reset
- T = 140, service backoff period is reset to 5 seconds because 60 seconds have elapsed since the last restart
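Both timelines can be reproduced with a short trace. This is a simplified model under stated assumptions: the service crashes a fixed 30 seconds after every start, the first restart always uses the minimum backoff period, and the random choice is replaced by a caller-supplied `pick_wait` function so the output is repeatable. The function `trace_restarts` and its parameters are hypothetical names.

```python
from typing import Callable, List, Tuple


def trace_restarts(min_s: int, max_s: int, cooldown_s: int, uptime_s: int,
                   crashes: int,
                   pick_wait: Callable[[int, int], int]) -> List[Tuple[int, int]]:
    """Return (crash_time, wait) pairs for a service that crashes
    uptime_s seconds after every start."""
    events = []
    now = 0            # the service first starts at T = 0
    restarted = False  # no earlier restart exists before the first crash
    for _ in range(crashes):
        now += uptime_s  # the service runs, then crashes
        # If the run lasted at least the cooldown period, the backoff was
        # reset to the minimum before this crash happened.
        if restarted and uptime_s < cooldown_s:
            wait = pick_wait(min_s, max_s)  # stand-in for the random choice
        else:
            wait = min_s
        events.append((now, wait))
        now += wait      # the Supervisor waits, then restarts the service
        restarted = True
    return events


if __name__ == "__main__":
    pick_15 = lambda low, high: 15  # fixed value to mirror the example above
    for cooldown in (10, 60):
        print(f"cooldown {cooldown}s:",
              trace_restarts(5, 20, cooldown, 30, 2, pick_15))
```

With a cooldown of 10 seconds, both crashes are followed by the 5-second minimum wait. With a cooldown of 60 seconds, the second crash is followed by the longer 15-second wait.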