Introduction
attempt allows you to retry fallible commands with a delay. It accepts the command line of a child command. This command is run repeatedly until it either completes without failing or the allowed number of retries is exhausted.
There are two cases when this is often used:
- In an environment like docker-compose, where a service may need to wait on its dependencies to start up (compare to wait-for-it.sh)
- When writing shell scripts accessing remote resources which may become temporarily unavailable
Why does retrying work?
Many failures are transient
Many failures are transient, even flukes, and are unlikely to be reencountered if retried. For example, a system we integrate with may have a rare bug that affects a small number of requests arbitrarily, without regard to their content. Such bugs may go a long time before they are fixed. In such a case, immediately retrying the request would likely succeed.
Many systems self-heal
Many systems are able to provision additional capacity in response to increased load. If given time to scale, these systems will process our requests successfully.
Retry strategies
attempt accepts various arguments to configure its behavior and how it responds to failures. Together these form our retry strategy.
Retry policy
The retry policy allows us to distinguish between temporary and permanent error conditions, to determine whether we should make another attempt. We can look at a command's exit status, or we can inspect its output for relevant messages.
Failures such as network timeouts or rate limiting errors (like the HTTP status code 429) should be retried. Failures that stem from the content of our requests (like the HTTP status code 422) will not succeed on a retry, and so we should terminate.
Backoff schedule
The backoff schedule determines how long we wait between successive attempts. The simplest schedule is a fixed delay, where we wait the same amount of time between all attempts. But it's often a good idea to wait longer the more failures we encounter to avoid sending requests to an already overloaded system.
We may want an aggressive backoff schedule with short delays, so that we recover from transient failures quickly. But longer delays allow systems to self-heal.
The exponential backoff strategy allows us to balance these concerns. In exponential backoff we double the amount of time we wait after each failure. This allows us to start with an aggressive delay but to fall back to a long delay if the outage proves persistent.
Retry limit
Retry logic without a retry limit is an infinite loop. We'll need to pick some sort of limit.
There is a tradeoff between how long we keep retrying and how quickly we can act on an irrecoverable failure. The longer we wait, the more likely we are to recover from a transient failure. But if our command is never going to succeed, we want to know sooner rather than later so that we can investigate and fix the problem. We'll need to set a retry limit that balances these concerns.
Jitter
An essential element of any retry strategy is random jitter. If we do not add randomness to the delays produced by our retry strategy, we will encounter emergent cyclic behavior. Concurrent retriers will synchronize with each other. This will create huge spikes in requests per second (RPS) where all clients send requests at once.
Random jitter breaks up these emergent patterns and flattens the RPS curve. This helps systems that are having trouble or need to scale recover smoothly.
Lifecycle of an attempt
1. The child command is executed.
2. The retry policy is evaluated against the child command.
   - If a stop rule is matched, we terminate the program immediately.
   - If a retry rule is matched, and we have not reached the allowed number of retries, return to step 1.
   - Otherwise, we terminate the program.
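The lifecycle above can be sketched as a small POSIX shell function. This is an illustrative approximation, not attempt's actual implementation: the name retry_sketch and its arguments are hypothetical, the policy is reduced to "stop on success, retry on any failure", and the return value of 3 is an assumption chosen to mirror the "retries exhausted" exit code.

```sh
# Hypothetical sketch of the lifecycle; not attempt's real source code.
# Usage: retry_sketch MAX_ATTEMPTS WAIT_SECONDS COMMAND [ARGS...]
retry_sketch() {
    max_attempts=$1
    wait_seconds=$2
    shift 2
    attempt_number=1
    while :; do
        # Step 1: execute the child command and record its exit status.
        status=0
        "$@" || status=$?
        # Step 2: evaluate the policy. Success acts as a stop rule.
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        # No attempts left: terminate and report that retries are exhausted.
        if [ "$attempt_number" -ge "$max_attempts" ]; then
            return 3
        fi
        # A retry rule matched (any non-zero status): wait, then go to step 1.
        attempt_number=$((attempt_number + 1))
        sleep "$wait_seconds"
    done
}

# Example: up to 3 attempts with a 1 second fixed delay.
# retry_sketch 3 1 /bin/false
```

A real retry policy would also consult stop predicates, output predicates, and signals before deciding; the sketch only shows the shape of the loop.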
Peeking under the hood with --verbose
We can use the --verbose argument to expose these steps of the lifecycle. To show all the information available, we'll use --verbose --verbose, shortened to -vv.
Let's run attempt against /bin/true, a program that always succeeds.
> attempt -vv /bin/true
Starting new attempt...
Evaluating policy...
Command exited: exit status: 0.
Stop: Command was successful.
Terminated: Success.
>
true exited with a status of 0, indicating success. Because it was successful, it was not eligible to be retried, and so attempt exits.
Now let's try running attempt against /bin/false, a program which always fails.
> attempt -vv /bin/false
Starting new attempt...
Evaluating policy...
Command exited: exit status: 1.
Retry: Command failed.
Command has failed, retrying in 1.00 seconds...
Starting new attempt...
Evaluating policy...
Command exited: exit status: 1.
Retry: Command failed.
Command has failed, retrying in 1.00 seconds...
Starting new attempt...
Evaluating policy...
Command exited: exit status: 1.
Retry: Command failed.
Terminated: Retries exhausted.
>
false always exits with a status of 1, and so attempt will always retry it if possible. By default, attempt waits for 1 second between attempts, and will make a maximum of 3 attempts. After the third attempt, the program exits.
Example usage
Basic usage
# Basic example
attempt /bin/false
attempt --verbose /bin/false
Disambiguating child arguments
# Use `--` to disambiguate between arguments to `attempt` and arguments to
# its child command
attempt -a 10 -- foo -a bar
A sqlx example
# Rerun database migrations if the server was not ready
# Useful for `docker-compose` and similar tools (any place
# you'd use `wait-for-it.sh` & aren't restricted to bash).
attempt --retry-if-contains "server not ready" sqlx migrate
Using an exponential backoff
# Use exponential backoff
attempt exp /bin/false
attempt exponential /bin/false
# Change the multiplier from 1 second to 100 milliseconds. Instead of starting
# with a 1 second delay (then 2s, 4s, 8s, ...), use a 100 millisecond delay
# (then 200ms, 400ms, 800ms, ...).
attempt exp -x 100ms /bin/false
attempt exp --multiplier 100ms /bin/false
# Change the base from 2 to 3. This will triple the delay between every
# attempt, rather than doubling it.
attempt exp -b 3 /bin/false
attempt exp --base 3 /bin/false
Change the number of attempts
# Change the number of attempts from 3 to 10
attempt -a 10 /bin/false
attempt --attempts 10 /bin/false
Add random jitter to the wait time
# Add 1 second of random jitter to wait time
attempt -j 1s /bin/false
attempt --jitter 1s /bin/false
Setting a minimum or maximum wait time
# Set a minimum wait time of 5 seconds between attempts
attempt -m 5s /bin/false
attempt --wait-min 5s /bin/false
# Set a maximum wait time of 15 minutes between attempts
attempt -M 15m exponential /bin/false
attempt --wait-max 15m exponential /bin/false
# Combine min and max wait times
attempt -m 5s -M 15m exponential /bin/false
Setting a timeout on child command runtime
# Kill the child command if it runs longer than 30 seconds
attempt -t 30s -- sleep 60
attempt --timeout 30s -- sleep 60
Retrying on status codes
# Retry on any failing status code (non-zero)
attempt -F /bin/false
attempt --retry-failing-status /bin/false
# Retry only on specific status codes
attempt --retry-if-status 1 /bin/false
attempt --retry-if-status "1,2,3" /bin/false # Retry on 1, 2, or 3
attempt --retry-if-status "1..5" /bin/false # Retry on any status between 1 and 5
attempt --retry-if-status "1..5,10,15..20" /bin/false # Combine rules to form complex patterns
# Retry connection timeouts with `curl`
attempt --retry-if-status 28 curl https://example.com
# Stop retrying on specific status codes
attempt --stop-if-status 2 command_with_permanent_errors
Retrying on child command output
# Retry if output contains a specific string
attempt --retry-if-contains "Connection timed out" curl https://example.com
attempt --retry-if-stderr-contains "Connection timed out" curl https://example.com
# Retry if output matches a regex pattern
attempt --retry-if-matches "error \d+" error_prone_command
attempt --retry-if-stdout-matches "failed.*retry" flaky_service
# Stop retrying if output indicates permanent failure
attempt --stop-if-contains "Authentication failed" secure_command
attempt --stop-if-stderr-contains "Authentication failed" secure_command
As a long-running parent process
# Always restart the child command if it exits, and do not wait before restarting
attempt --forever --wait 0 long_running_service
Installation
cargo install attempt-cli
Backoff schedule
Fixed schedule
Fixed delay is the default schedule. It sleeps for a constant amount of time between attempts.
Attempt | Delay (seconds) |
---|---|
0 | 1 |
1 | 1 |
2 | 1 |
Specifying durations
Durations in attempt can include units, such as 5min. Durations without units are assumed to be in seconds.
The following units are supported:
- Hours (h or hr)
- Minutes (m or min)
- Seconds (s)
- Milliseconds (ms)
- Nanoseconds (ns)
Multiple units can be used together, such as 1hr 30m.
Example
# Because it is the default, specifying `fixed` is optional
attempt /bin/false
attempt fixed /bin/false
# Change the wait time from the default of 1 second to 15 seconds
attempt fixed -w 15s /bin/false
attempt fixed --wait 15s /bin/false
Arguments
-w --wait <DURATION>
The amount of time to sleep between attempts.
Exponential schedule
Wait exponentially more time between attempts, using the following formula:
<multiplier> * (<base> ^ <attempts>)
Attempt | Delay (seconds) |
---|---|
0 | 1 |
1 | 2 |
2 | 4 |
The attempt counter starts at 0, so the first wait is equal to <multiplier> (1 second by default).
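As a worked example of the formula, the following POSIX shell sketch computes the delay for the first few attempts. The function name exp_delay is hypothetical, and the sketch works in whole milliseconds because shell arithmetic is integer-only; it illustrates the arithmetic and is not attempt's implementation.

```sh
# Hypothetical helper: prints <multiplier> * (<base> ^ <attempts>).
# Usage: exp_delay MULTIPLIER_MS BASE ATTEMPTS
exp_delay() {
    delay=$1
    i=0
    # Multiply by the base once per completed attempt.
    while [ "$i" -lt "$3" ]; do
        delay=$((delay * $2))
        i=$((i + 1))
    done
    echo "$delay"
}

# Defaults: multiplier of 1 second (1000ms) and base of 2.
for n in 0 1 2 3; do
    echo "attempt $n: $(exp_delay 1000 2 "$n")ms"
done
```

With the defaults this prints 1000ms, 2000ms, 4000ms, and 8000ms, which agrees with the table above.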
Example
attempt exponential /bin/false
attempt exp /bin/false
# Change the multiplier from the default of 1s to 2s
attempt exponential -x 2s /bin/false
attempt exponential --multiplier 2s /bin/false
# Change the exponential base from the default of 2 to 5
attempt exponential -b 5 /bin/false
attempt exponential --base 5 /bin/false
Arguments
-b --base <BASE>
The base of the exponential function. The default is 2, corresponding to doubling the wait time between attempts.
-x --multiplier <MULTIPLIER>
Scale of the exponential function. The default is 1 second.
Linear schedule
Wait more time between attempts, using the following formula:
(<multiplier> * <attempts>) + <starting_wait>
Attempt | Delay (seconds) |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
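The linear formula can be checked with a short POSIX shell sketch. The function name linear_delay is hypothetical and the values are whole seconds; this only illustrates the arithmetic, under the assumption that the attempt counter starts at 0 as the table suggests.

```sh
# Hypothetical helper: prints (<multiplier> * <attempts>) + <starting_wait>.
# Usage: linear_delay MULTIPLIER STARTING_WAIT ATTEMPTS
linear_delay() {
    echo $(( $1 * $3 + $2 ))
}

# Defaults: multiplier of 1 and starting wait of 1.
for n in 0 1 2; do
    echo "attempt $n: $(linear_delay 1 1 "$n")s"
done
```

With the defaults this prints 1s, 2s, and 3s, matching the table above.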
Example
attempt linear /bin/false
# Change the multiplier from the default of 1 to 2
attempt linear -x 2 /bin/false
attempt linear --multiplier 2 /bin/false
# Change the starting wait from the default of 1 to 5
attempt linear -W 5s /bin/false
attempt linear --starting-wait 5s /bin/false
Arguments
-w --starting-wait <DURATION>
The amount of time to wait after the first attempt.
-x --multiplier <DURATION>
The additional amount of time to wait after each subsequent attempt.
Policy controls
Policy controls define the conditions under which attempt decides to retry or stop retrying a command. They use predicates that examine the command's exit status, output, or signals.
Each predicate has two variants: retry and stop. Retry predicates cause attempt to retry the command, while stop predicates cause it to terminate.
Usage
Retry predicates are used to identify temporary error conditions, such as network timeouts or rate limiting errors, which may be resolved by retrying. attempt will try to rerun the child command if a retry predicate is matched. It may be prevented from doing so if the maximum number of attempts has been reached, or if a stop predicate is matched.
Stop predicates are used to identify permanent error conditions, such as authentication errors or malformed content errors, which will never be resolved by retrying. If a stop predicate matches, attempt will immediately cease retrying the child command.
Precedence
Stop predicates always have precedence over retry predicates.
> attempt --stop-if-status 1 --retry-if-status 1 /bin/false
Command exited: exit status: 1.
Stop: Status matches.
Terminated: Command has failed, but cannot be retried.
>
The default policy
If no policy controls are specified, attempt will retry if the child command exits with a status other than 0 or is killed by a signal. It will make a maximum of 3 attempts with a fixed delay of 1 second.
This is equivalent to the following arguments:
attempt \
fixed --wait 1s \
--attempts 3 \
--retry-failing-status \
--retry-if-killed \
/bin/false
Retry control
-a --attempts
The number of times the child command is retried.
-U --unlimited-attempts
Continue retrying until the command is successful, without limiting the number of retries.
--retry-always
Always retry, unless a stop predicate is triggered.
--forever
This will continue retrying the command forever, regardless of whether it succeeds or fails. This is useful for long-running services you wish to restart if they crash.
Status predicates
Make policy decisions based on the child command's exit code. Note that commands killed by a signal do not have an exit code, so these predicates will not affect programs which timed out, ran out of memory, crashed due to a segmentation fault, or were otherwise killed by a signal.
Performance
Status predicates are very cheap and are typically stable across versions of the child command. They should be preferred whenever possible.
Status patterns
Status predicates attempt to match the child's exit code against a pattern supplied in the argument. The syntax for patterns is as follows.
- Individual codes: 1
- Inclusive ranges: 1..5
- Combinations of patterns: 1,2,3,10..15
- Whitespace is allowed: 1, 2, 3
- Note that valid status codes are in the range [0, 255]
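To make the pattern syntax concrete, here is a rough POSIX shell sketch of how a status code could be checked against such a pattern. The function name status_matches is hypothetical, the sketch does no validation of the [0, 255] range, and it is not how attempt parses patterns internally.

```sh
# Hypothetical helper: succeeds if STATUS matches PATTERN.
# Usage: status_matches STATUS PATTERN
status_matches() {
    status=$1
    # Whitespace is allowed in patterns, so strip spaces first.
    pattern=$(printf '%s' "$2" | tr -d ' ')
    # Split the pattern on commas into the positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $pattern
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *..*)
                # Inclusive range such as 1..5
                low=${part%%..*}
                high=${part##*..}
                if [ "$status" -ge "$low" ] && [ "$status" -le "$high" ]; then
                    return 0
                fi
                ;;
            *)
                # Individual code such as 10
                if [ "$status" -eq "$part" ]; then
                    return 0
                fi
                ;;
        esac
    done
    return 1
}

# Example: 17 falls inside the 15..20 range.
if status_matches 17 "1..5,10,15..20"; then
    echo "17 matches"
fi
```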
Retry predicates
-F, --retry-failing-status
Retry when the command exits with any non-zero status code. This is a convenient shorthand if you want to retry on failure but still want to combine other predicates (for example, additional output-based predicates).
--retry-if-status <STATUS_CODE>
Retry if the child process exit code matches the given status pattern.
Stop predicates
--stop-if-status <STATUS_CODE>
Stop retrying if the child process exit code matches the given status pattern.
Output predicates
Output predicates examine the text written to stdout and stderr by the child command to determine whether to retry or stop. These predicates are useful when exit codes alone don't provide enough information about the failure condition.
Performance
Output predicates are often more convenient than status predicates, and users are encouraged to use them in circumstances where developer time is the greatest concern (such as at the command line and in throw-away scripts). However, they are discouraged in scripting or CI use.
Output predicates are the most expensive. They require additional system calls to retrieve the output and additional computation to search it. They are typically unstable across different versions of the child process.
Status predicates should be preferred when possible. If output predicates are used, the more specific predicates (eg --retry-if-stdout-contains vs --retry-if-contains) should be preferred.
Regular expression syntax
The regular expression predicates use the regex crate's syntax.
Retry predicates
--retry-if-contains <STRING>
Retry if either stdout or stderr contains the specified string.
--retry-if-matches <REGEX>
Retry if either stdout or stderr matches the specified regular expression.
--retry-if-stdout-contains <STRING>
Retry if stdout contains the specified string.
--retry-if-stdout-matches <REGEX>
Retry if stdout matches the specified regular expression.
--retry-if-stderr-contains <STRING>
Retry if stderr contains the specified string.
--retry-if-stderr-matches <REGEX>
Retry if stderr matches the specified regular expression.
Stop predicates
--stop-if-contains <STRING>
Stop retrying if either stdout or stderr contains the specified string.
--stop-if-matches <REGEX>
Stop retrying if either stdout or stderr matches the specified regular expression.
--stop-if-stdout-contains <STRING>
Stop retrying if stdout contains the specified string.
--stop-if-stdout-matches <REGEX>
Stop retrying if stdout matches the specified regular expression.
--stop-if-stderr-contains <STRING>
Stop retrying if stderr contains the specified string.
--stop-if-stderr-matches <REGEX>
Stop retrying if stderr matches the specified regular expression.
Timeout & signal predicates
Signal patterns
Some signal predicates attempt to match the signal number against a pattern supplied in the argument. The syntax for patterns is as follows.
- Individual codes: 1
- Inclusive ranges: 1..5
- Combinations of patterns: 1,2,3,10..15
- Whitespace is allowed: 1, 2, 3
- Note that valid status codes are in the range [0, 255]
Retry predicates
--retry-if-timeout
Retry if the command was killed specifically due to a timeout. This requires that the --timeout option is also specified.
--retry-if-killed
Retry if the command was killed by any signal. Note this implies --retry-if-timeout, because timeouts use signals to terminate processes.
--retry-if-signal <PATTERN>
Retry if the command was killed by any signal matching the given pattern.
This is only available on Unix systems.
Stop predicates
--stop-if-timeout
Stop retrying if the command was killed specifically due to a timeout. This requires that the --timeout option is also specified.
--stop-if-killed
Stop retrying if the command was killed by any signal. Note this implies --stop-if-timeout, because timeouts use signals to terminate processes.
--stop-if-signal <PATTERN>
Stop retrying if the command was killed by any signal matching the given pattern.
This is only available on Unix systems.
Timing controls
Timing controls adjust how attempt manages time-related aspects of command retries. They determine the delays between attempts and the time limits for each attempt.
Wait control
-j --jitter <DURATION>
For a jitter value of n, adds a value in the interval [0, n] to the delay time. This is useful for preventing "thundering herd" issues. Jitter is always added last, so even if the delay was rounded by --wait-min/--wait-max, it will still be randomized.
-m --wait-min <DURATION>
Round any delay smaller than the specified minimum up to that minimum.
-M --wait-max <DURATION>
Round any delay larger than the specified maximum down to that maximum. This is useful when using the linear or exponential strategies, to ensure that you do not sleep for an unbounded amount of time.
--stagger <DURATION>
Stagger the delay by a random amount in the interval [0, n], where n is the specified duration. This is useful for desynchronizing multiple concurrent instances of attempt which start at the same time.
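The ordering of the wait controls (apply --wait-min and --wait-max first, then add jitter last) can be sketched in POSIX shell. The function names clamp_delay and add_jitter are hypothetical, delays are whole milliseconds, and awk is used only as a portable source of randomness; this is an assumption-laden illustration rather than attempt's implementation.

```sh
# Hypothetical helper: prints DELAY limited to the range [MIN, MAX].
# Usage: clamp_delay DELAY MIN MAX
clamp_delay() {
    if [ "$1" -lt "$2" ]; then
        echo "$2"
    elif [ "$1" -gt "$3" ]; then
        echo "$3"
    else
        echo "$1"
    fi
}

# Hypothetical helper: prints DELAY plus a random integer in [0, JITTER].
# Usage: add_jitter DELAY JITTER
add_jitter() {
    # srand() seeds from the clock, so calls within the same second may
    # return the same value. That is acceptable for a sketch.
    awk -v d="$1" -v j="$2" \
        'BEGIN { srand(); print d + int(rand() * (j + 1)) }'
}

# A raw delay of 100ms, a minimum of 5s, a maximum of 15 minutes,
# and up to 1s of jitter added after clamping.
raw_ms=100
final_ms=$(add_jitter "$(clamp_delay "$raw_ms" 5000 900000)" 1000)
echo "sleeping for ${final_ms}ms"
```

Because the jitter is added after clamping, the final delay in this example can exceed the clamped value by up to the jitter amount.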
Timeout control
-t --timeout <DURATION>
Kill the child command if it does not complete within the timeout. This prevents attempt from waiting indefinitely on a child that may never exit. For instance, the child could be stuck in an infinite loop.
The child is polled using an exponential backoff with a base of 2 and a multiplier of 10ms, saturating at a maximum delay of 15s.
-R --expected-runtime <DURATION>
Specify how much time the command is expected to take. The child command will be polled slowly during this time (once per minute).
This is useful to reduce load on the system. The timeout feature was designed on the assumption that most commands exit quickly, so by default the child is polled fairly aggressively. This may adversely impact performance for some use cases.
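To get a feel for the polling behavior described for --timeout (base of 2, multiplier of 10ms, saturating at 15s), the following POSIX shell sketch prints the first few polling delays. The function name poll_schedule is hypothetical, and the sketch assumes the first delay equals the multiplier; the exact schedule attempt uses may differ in detail.

```sh
# Hypothetical helper: prints the first COUNT polling delays in milliseconds.
# Usage: poll_schedule COUNT
poll_schedule() {
    delay_ms=10
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "$delay_ms"
        # Double the delay, saturating at the 15 second maximum.
        delay_ms=$((delay_ms * 2))
        if [ "$delay_ms" -gt 15000 ]; then
            delay_ms=15000
        fi
        i=$((i + 1))
    done
}

poll_schedule 12
```

Under these assumptions the delay reaches the 15s ceiling on the twelfth poll, which is why a long-running child benefits from --expected-runtime.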
--retry-if-timeout
Retry if the command was killed specifically due to a timeout. This requires that the --timeout option is also specified.
--stop-if-timeout
Stop retrying if the command was killed specifically due to a timeout. This requires that the --timeout option is also specified.
Exit codes
The following exit codes are used by attempt to indicate whether it failed and how. Scripts should use these exit codes and not the log messages exposed by --verbose; these exit codes will remain stable, but no such guarantee is made for the log messages.
Code number | Description |
---|---|
0 | Command was run successfully within the allowed number of retries. |
1 | I/O error (eg, command not found). An error message will be printed. |
2 | Invalid arguments. An error message will be printed. |
3 | The number of retries has been exhausted without the command ever succeeding. |
4 | The number of retries has not been exhausted, but the command is no longer retryable because of a "stop" predicate. |
101 | attempt has crashed. The most likely cause is using output predicates on data which is not UTF-8 encoded. |
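A script can branch on these codes with a case statement. The sketch below is a hypothetical helper (the name describe_attempt_exit and the message strings are illustrative only) that maps each documented code to a short description.

```sh
# Hypothetical helper: prints a short description of one of attempt's exit codes.
# Usage: describe_attempt_exit CODE
describe_attempt_exit() {
    case $1 in
        0)   echo "success" ;;
        1)   echo "I/O error" ;;
        2)   echo "invalid arguments" ;;
        3)   echo "retries exhausted" ;;
        4)   echo "stopped by a stop predicate" ;;
        101) echo "attempt crashed" ;;
        *)   echo "unknown exit code: $1" ;;
    esac
}

# Usage, assuming attempt is installed:
#   attempt --retry-if-status 28 curl https://example.com
#   describe_attempt_exit "$?"
describe_attempt_exit 3
```

Capturing the exit code immediately after the attempt invocation matters, because any later command overwrites it.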
Advice for Scripting
Use an exponential backoff
Exponential backoff allows you to retry aggressively at first while quickly backing off to a significant wait time. Many errors are transient, and can be worked around by retrying quickly. Other errors may take a long time to resolve, and may not resolve if we stress the system with load.
Leave the base argument as its default of 2, and use the multiplier argument to control how aggressively you retry. If you want it to be very aggressive, use a value of 0.050 (50 milliseconds), and if you want it to be very conservative, use a value of 60 (1 minute). The default of 1 second is a good balance overall, but if you are accessing public resources, consider using a value of 5 seconds or greater as a courtesy.
Set a max wait time
Set max wait to a reasonable value, like 900 (15 minutes), so that you do not wait an unbounded amount of time. This is especially important when using an exponential backoff.
Add jitter to the wait time
Random jitter will help you avoid emergent cyclic behavior. This may occur if your script is running on multiple systems concurrently, or if you are accessing a public resource and many programmers choose similar constants for their retry logic (eg, if everyone chooses round numbers, then they all will be multiples of 5, and at some point everyone's retry logic will sync up and make a request at the exact same time). A useful metaphor is metronome synchronization.
See also the thundering herd problem.
Set a timeout on the child command
Set a timeout on the child command so that you don't get stuck if there is an infinite loop, deadlock, or similar issue.
Avoid output predicates
Output predicates can create performance issues. Try to use status predicates whenever possible.
If you must use an output predicate, use the specific stdout or stderr variant. The generic variants are provided for convenience, but are not as performant.
Bibliography
This bibliography presents works for the following uses:
- To fact check claims made in this manual.
- To provide further guidance in cases this manual doesn't address.
- To supply context about attempt's design.
Tenacity
Tenacity is a Python library for retrying. It is the primary inspiration for attempt's design.
Tenacity forked from Retrying in 2016 after Retrying's original author and maintainer, Ray Holder, stopped responding to members of the community who reached out. Holder wrote Retrying in 2013.
Retrying is currently maintained by Greg Roodt in a separate fork. Roodt organized the transfer of the retrying package name in 2022. He attempted to transfer the name to Tenacity, but while he succeeded in taking over the name, the issue to transfer ownership to Tenacity was not followed up on. It remains open at the time of this writing. Roodt's fork receives periodic updates.
Retrying established the core concepts of the architecture:
- An @retry(...) decorator which retries a wrapped function
- Three categories of rules, which together form the retrying strategy
  - Retry rules, determining which circumstances result in a retry
  - Stop rules, determining which circumstances terminate retrying
  - Wait rules, determining how long we sleep after an attempt
Retrying baked its predicates into the arguments of the @retry(...) decorator, much like attempt.
Tenacity extended the architecture to support functions as arguments, unlocking arbitrary predicates, and to use context managers in addition to decorators. It added support for async contexts. Tenacity has also created a large library of utilities and extensive documentation.
Additional links
- Retrying issue 65: Friendly fork?
- Retrying issue 100: Maintenance status
- Retrying issue 97: Transfer ownership to tenacity
- Tenacity issue 356: Publish under retrying on PyPI
- pypi issue 2205: Request: retrying
References
- The Synchronization of Periodic Routing Messages by Sally Floyd and Van Jacobson, Lawrence Berkeley Laboratory
- Exponential backoff and jitter by Marc Brooker, AWS
- Transient fault handling from Azure Architecture Center
- Retry pattern from Azure Architecture Center
Case studies
- How to avoid a self-inflicted DDoS attack: CRE Life Lessons by Dave Rensin and Adrian Hilton, Google
- Preventing DB failures - Exponential retries with Jitter by Krishnakumar Sathyanarayana
- Good Retry, Bad Retry: An Incident Story by Denis Isaev, Yandex