Introduction

attempt allows you to retry fallible commands with a delay. It accepts the command line of a child command. This command is run repeatedly, until it either completes without failing or the allowed number of retries is exhausted.

There are two cases when this is often used:

  • In an environment like docker-compose, where a service may need to wait on its dependencies to start up (compare to wait-for-it.sh)
  • When writing shell scripts accessing remote resources which may become temporarily unavailable

Why does retrying work?

Many failures are transient

Many failures are transient, even flukes, and are unlikely to be encountered again on a retry. For example, a system we integrate with may have a rare bug that impacts some small number of requests arbitrarily, without regard to their content. Such bugs may go a long time before they are fixed. In such a case, immediately retrying the request would likely succeed.

Many systems self-heal

Many systems are able to provision additional capacity in response to increased load. If given time to scale, these systems will process our requests successfully.

Retry strategies

attempt accepts various arguments to configure its behavior and how it responds to failures. Together these form our retry strategy.

Retry policy

The retry policy allows us to distinguish between temporary and permanent error conditions, to determine whether we should try another attempt. We can look at a command's exit status, or we can inspect its output for relevant messages.

Failures such as network timeouts or rate limiting errors (like the HTTP status code 429) should be retried. Failures that stem from the content of our requests (like the HTTP status code 422) will not succeed on a retry, and so we should terminate.

Backoff schedule

The backoff schedule determines how long we wait between successive attempts. The simplest schedule is a fixed delay, where we wait the same amount of time between all attempts. But it's often a good idea to wait longer the more failures we encounter to avoid sending requests to an already overloaded system.

We may want an aggressive backoff schedule with short delays, so that we recover from transient failures quickly. But longer delays allow systems to self-heal.

The exponential backoff strategy allows us to balance these concerns. In exponential backoff we double the amount of time we wait after each failure. This allows us to start with an aggressive delay but to fall back to a long delay if the outage proves persistent.

Retry limit

Retry logic without a retry limit is an infinite loop. We'll need to pick some sort of limit.

There is a tradeoff between how long we keep retrying and how quickly we can act on an irrecoverable failure. The longer we wait, the more likely we are to recover from a transient failure. But if our command is never going to succeed, we want to know sooner rather than later so that we can investigate and fix the problem. We'll need to set a retry limit that balances these concerns.

Jitter

An essential element of any retry strategy is random jitter. If we do not add randomness to the delays produced by our retry strategy, we will encounter emergent cyclic behavior. Concurrent retriers will synchronize with each other. This will create huge spikes in requests per second where all clients send requests at once.

Random jitter breaks up these emergent patterns and flattens the RPS curve. This helps systems that are having trouble or need to scale recover smoothly.
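The effect of jitter on a single delay can be sketched in shell. This is only an illustration, assuming od, tr, awk and /dev/urandom are available; the helper name jittered_delay is hypothetical and is not part of attempt.

```shell
# jittered_delay BASE JITTER
# Prints BASE plus a random amount between 0 and JITTER (both in seconds).
# Hypothetical helper for illustration; not part of attempt.
jittered_delay() {
    # Two random bytes from the kernel, read as an integer in [0, 65535].
    random=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
    awk -v base="$1" -v jitter="$2" -v r="$random" \
        'BEGIN { printf "%.3f\n", base + (r / 65535) * jitter }'
}

# Without jitter every client would sleep exactly 5 seconds. With 1 second
# of jitter each client sleeps somewhere between 5 and 6 seconds.
sleep_for=$(jittered_delay 5 1)
echo "sleeping for ${sleep_for}s"
```

Because each client draws its own random value, clients that failed at the same moment no longer retry at the same moment.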

Lifecycle of an attempt

  1. The child command is executed.
  2. The retry policy is evaluated against the child command.
  3. If a stop rule is matched, we terminate the program immediately.
  4. If a retry rule is matched, and we have not reached the allowed number of retries, return to step 1.
  5. Otherwise, we terminate the program.
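The steps above can be sketched as a small shell loop. This is a simplified illustration and not attempt's implementation: the helper name run_with_retries is hypothetical, the only stop rule is success, and the only retry rule is a non-zero exit status. The return value of 3 mirrors the exit code attempt documents for exhausted retries.

```shell
# run_with_retries MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper that mirrors the lifecycle; not attempt's implementation.
run_with_retries() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt_number=1
    while :; do
        # Step 1: execute the child command and capture its exit status.
        status=0
        "$@" || status=$?

        # Steps 2-3: evaluate the policy. Success is the only stop rule here.
        if [ "$status" -eq 0 ]; then
            return 0
        fi

        # Steps 4-5: any failure is retryable while attempts remain.
        if [ "$attempt_number" -ge "$max_attempts" ]; then
            return 3    # mirrors attempt's "retries exhausted" exit code
        fi
        attempt_number=$((attempt_number + 1))
        sleep "$delay"
    done
}

run_with_retries 3 0 true && echo "succeeded"
run_with_retries 3 0 false || echo "gave up with status $?"
```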

Peeking under the hood with --verbose

We can use the --verbose argument to expose these steps of the lifecycle. To show all the information available, we'll use --verbose --verbose, shortened to -vv.

Let's run attempt against /bin/true, a program that always succeeds.

> attempt -vv /bin/true
Starting new attempt...
Evaluating policy...
Command exited: exit status: 0.
Stop: Command was successful.
Terminated: Success.
>

true exited with a status of 0, indicating success. Because it was successful, it was not eligible to be retried. And so, attempt exits.

Now let's try running attempt against /bin/false, a program which always fails.

> attempt -vv /bin/false
Starting new attempt...
Evaluating policy...
Command exited: exit status: 1.
Retry: Command failed.
Command has failed, retrying in 1.00 seconds...
Starting new attempt...
Evaluating policy...
Command exited: exit status: 1.
Retry: Command failed.
Command has failed, retrying in 1.00 seconds...
Starting new attempt...
Evaluating policy...
Command exited: exit status: 1.
Retry: Command failed.
Terminated: Retries exhausted.
>

false always exits with a status of 1, and so attempt will always retry it if possible. By default, attempt waits for 1 second between attempts and makes a maximum of 3 attempts. After the third attempt, the program exits.

Example usage

Basic usage

# Basic example
attempt /bin/false
attempt --verbose /bin/false

Disambiguating child arguments

# Use `--` to disambiguate between arguments to `attempt` and arguments to
# its child command
attempt -a 10 -- foo -a bar

A sqlx example

# Rerun database migrations if the server was not ready
# Useful for `docker-compose` and similar tools (any place
# you'd use `wait-for-it.sh` & aren't restricted to bash).
attempt --retry-if-contains "server not ready" sqlx migrate

Using an exponential backoff

# Use exponential backoff
attempt exp /bin/false
attempt exponential /bin/false

# Change the multiplier from 1 second to 100 milliseconds. Instead of starting
# with a 1 second delay (then 2s, 4s, 8s, ...), use a 100 millisecond delay
# (then 200ms, 400ms, 800ms, ...).
attempt exp -x 100ms /bin/false
attempt exp --multiplier 100ms /bin/false

# Change the base from 2 to 3. This will triple the delay between every
# attempt, rather than doubling it.
attempt exp -b 3 /bin/false
attempt exp --base 3 /bin/false

Change the number of attempts

# Change the number of attempts from 3 to 10
attempt -a 10 /bin/false
attempt --attempts 10 /bin/false

Add random jitter to the wait time

# Add 1 second of random jitter to wait time
attempt -j 1s /bin/false
attempt --jitter 1s /bin/false

Setting a minimum or maximum wait time

# Set a minimum wait time of 5 seconds between attempts
attempt -m 5s /bin/false
attempt --wait-min 5s /bin/false

# Set a maximum wait time of 15 minutes between attempts
attempt -M 15m exponential /bin/false
attempt --wait-max 15m exponential /bin/false

# Combine min and max wait times
attempt -m 5s -M 15m exponential /bin/false

Setting a timeout on child command runtime

# Kill the child command if it runs longer than 30 seconds
attempt -t 30s -- sleep 60
attempt --timeout 30s -- sleep 60

Retrying on status codes

# Retry on any failing status code (non-zero)
attempt -F /bin/false
attempt --retry-failing-status /bin/false

# Retry only on specific status codes
attempt --retry-if-status 1 /bin/false
attempt --retry-if-status "1,2,3" /bin/false # Retry on 1, 2, or 3
attempt --retry-if-status "1..5" /bin/false # Retry on any status between 1 and 5
attempt --retry-if-status "1..5,10,15..20" /bin/false # Combine rules to form complex patterns

# Retry connection timeouts with `curl`
attempt --retry-if-status 28 curl https://example.com

# Stop retrying on specific status codes
attempt --stop-if-status 2 command_with_permanent_errors

Retrying on child command output

# Retry if output contains a specific string
attempt --retry-if-contains "Connection timed out" curl https://example.com
attempt --retry-if-stderr-contains "Connection timed out" curl https://example.com

# Retry if output matches a regex pattern
attempt --retry-if-matches "error \d+" error_prone_command
attempt --retry-if-stdout-matches "failed.*retry" flaky_service

# Stop retrying if output indicates permanent failure
attempt --stop-if-contains "Authentication failed" secure_command
attempt --stop-if-stderr-contains "Authentication failed" secure_command

As a long-running parent process

# Always restart the child command if it exits, and do not wait before restarting
attempt --forever --wait 0 long_running_service

Installation

cargo install attempt-cli

Backoff schedule

Fixed schedule

Fixed delay is the default schedule. It sleeps for a constant amount of time between attempts.

Attempt  Delay (seconds)
0        1
1        1
2        1

Specifying durations

Durations in attempt can include units, such as 5min. Durations without units are assumed to be in seconds.

The following units are supported:

  • Hours (h or hr)
  • Minutes (m or min)
  • Seconds (s)
  • Milliseconds (ms)
  • Nanoseconds (ns)

Multiple units can be used together such as 1hr 30m.
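To make the format concrete, here is a rough sketch that converts a duration string into seconds. It assumes awk is available. The helper name duration_to_seconds is hypothetical, and attempt's own parser may accept or reject inputs differently.

```shell
# duration_to_seconds DURATION
# Prints the number of seconds in a duration such as "5min" or "1hr 30m".
# Hypothetical helper for illustration; attempt's own parser may differ.
duration_to_seconds() {
    printf '%s\n' "$1" | awk '
    {
        rest = $0
        gsub(/ /, "", rest)          # "1hr 30m" becomes "1hr30m"
        total = 0
        while (length(rest) > 0) {
            # Leading number, for example "1" or "0.5".
            if (!match(rest, /^[0-9.]+/)) exit 1
            value = substr(rest, 1, RLENGTH) + 0
            rest = substr(rest, RLENGTH + 1)

            # Optional unit; a bare number means seconds.
            unit = ""
            if (match(rest, /^[a-z]+/)) {
                unit = substr(rest, 1, RLENGTH)
                rest = substr(rest, RLENGTH + 1)
            }

            if      (unit == "h" || unit == "hr")  total += value * 3600
            else if (unit == "m" || unit == "min") total += value * 60
            else if (unit == "s" || unit == "")    total += value
            else if (unit == "ms")                 total += value / 1000
            else if (unit == "ns")                 total += value / 1000000000
            else exit 1
        }
        printf "%g\n", total
    }'
}

duration_to_seconds "5min"      # prints 300
duration_to_seconds "1hr 30m"   # prints 5400
duration_to_seconds "15"        # prints 15
```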

Example

# Because it is the default, specifying `fixed` is optional
attempt /bin/false
attempt fixed /bin/false

# Change the wait time from the default of 1 second to 15 seconds
attempt fixed -w 15s /bin/false
attempt fixed --wait 15s /bin/false

Arguments

-w --wait <DURATION>

The amount of time to sleep between attempts.

Exponential schedule

Wait exponentially more time between attempts, using the following formula:

<multiplier> * (<base> ^ <attempts>)

Attempt  Delay (seconds)
0        1
1        2
2        4

The attempt counter starts at 0, so the first wait is for <multiplier> seconds.
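The formula can be checked with a short shell sketch. It uses whole-number arithmetic only, and the helper name exponential_delay is hypothetical rather than part of attempt.

```shell
# exponential_delay MULTIPLIER BASE ATTEMPT
# Prints MULTIPLIER * (BASE ^ ATTEMPT) using integer arithmetic.
# Hypothetical helper for illustration; not part of attempt.
exponential_delay() {
    delay=$1
    i=0
    while [ "$i" -lt "$3" ]; do
        delay=$((delay * $2))
        i=$((i + 1))
    done
    echo "$delay"
}

# The defaults (multiplier 1, base 2) give 1, 2, 4 for attempts 0, 1, 2.
for n in 0 1 2; do
    exponential_delay 1 2 "$n"
done

# A multiplier of 100 (think milliseconds) gives 100, 200, 400, 800.
for n in 0 1 2 3; do
    exponential_delay 100 2 "$n"
done
```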

Example

attempt exponential /bin/false
attempt exp /bin/false

# Change the multiplier from the default of 1s to 2s
attempt exponential -x 2s /bin/false
attempt exponential --multiplier 2s /bin/false

# Change the exponential base from the default of 2 to 5
attempt exponential -b 5 /bin/false
attempt exponential --base 5 /bin/false

Arguments

-b --base <BASE>

The base of the exponential function. The default is 2, corresponding to doubling the wait time between attempts.

-x --multiplier <MULTIPLIER>

Scale of the exponential function. The default is 1.

Linear schedule

Wait more time between attempts, using the following formula:

(<multiplier> * <attempts>) + <starting_wait>.

Attempt  Delay (seconds)
0        1
1        2
2        3
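As a quick check of the formula, the following sketch computes the linear delay with whole-number arithmetic. The helper name linear_delay is hypothetical and not part of attempt.

```shell
# linear_delay MULTIPLIER STARTING_WAIT ATTEMPT
# Prints (MULTIPLIER * ATTEMPT) + STARTING_WAIT.
# Hypothetical helper for illustration; not part of attempt.
linear_delay() {
    echo $(( ($1 * $3) + $2 ))
}

# The defaults (multiplier 1, starting wait 1) give 1, 2, 3.
for n in 0 1 2; do
    linear_delay 1 1 "$n"
done

# A multiplier of 2 with a starting wait of 5 gives 5, 7, 9.
for n in 0 1 2; do
    linear_delay 2 5 "$n"
done
```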

Example

attempt linear /bin/false

# Change the multiplier from the default of 1 to 2
attempt linear -x 2 /bin/false
attempt linear --multiplier 2 /bin/false

# Change the starting wait from the default of 1 to 5
attempt linear -W 5s /bin/false
attempt linear --starting-wait 5s /bin/false

Arguments

-w --starting-wait <DURATION>

The number of seconds to wait after the first attempt.

-x --multiplier <DURATION>

The number of additional seconds to wait after each subsequent attempt.

Policy controls

Policy controls define the conditions under which attempt decides to retry or stop retrying a command. They use predicates that examine the command's exit status, output, or signals.

Each predicate has two variants: retry and stop. Retry predicates cause attempt to retry the command, while stop predicates cause us to terminate.

Usage

Retry predicates are used to identify temporary error conditions, such as network timeouts or rate limiting errors, which may be resolved by retrying. attempt will try to rerun the child command if a retry predicate is matched. It may be prevented from doing so if the maximum number of attempts has been reached, or if a stop predicate is matched.

Stop predicates are used to identify permanent error conditions, such as authentication errors or malformed content errors, which will never be resolved by retrying. If a stop predicate matches, attempt will immediately cease retrying the child command.

Precedence

Stop predicates always have precedence over retry predicates.

> attempt --stop-if-status 1 --retry-if-status 1 /bin/false
Command exited: exit status: 1.
Stop: Status matches.
Terminated: Command has failed, but cannot be retried.
>

The default policy

If no policy controls are specified, attempt will retry a child command that exits with a status other than 0 or that is killed by a signal. It will make a maximum of 3 attempts with a fixed delay of 1 second.

This is equivalent to the following arguments:

attempt \
    fixed --wait 1s \
    --attempts 3 \
    --retry-failing-status \
    --retry-if-killed \
    /bin/false

Retry control

-a --attempts

The number of times the child command is retried.

-U --unlimited-attempts

Continue retrying until the command is successful, without limiting the number of retries.

--retry-always

Always retry, unless a stop predicate is triggered.

--forever

This will continue retrying the command forever, regardless of whether it succeeds or fails. This is useful for long-running services you wish to restart if they crash.

Status predicates

Make policy decisions based on the child command's exit code. Note that commands killed by a signal do not have an exit code; these predicates will not affect programs which timed out, ran out of memory, crashed due to a segmentation fault, or were otherwise killed by a signal.

Performance

Status predicates are very cheap and are typically stable across versions of the child command. They should be preferred whenever possible.

Status patterns

Status predicates attempt to match the child's exit code against a pattern supplied in the argument. The syntax for patterns is as follows.

  • Individual codes: 1
  • Inclusive ranges: 1..5
  • Combinations of patterns: 1,2,3,10..15
  • Whitespace is allowed: 1, 2, 3
  • Note that valid status codes are in the range [0, 255]
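The pattern syntax above can be illustrated with a small matcher written in POSIX shell. This is a sketch of the semantics as described here, not attempt's parser; the helper name status_matches is hypothetical, and it does no validation of the [0, 255] range.

```shell
# status_matches STATUS PATTERN
# Succeeds if STATUS matches PATTERN, e.g. "1..5,10,15..20".
# Hypothetical helper for illustration; not attempt's parser.
status_matches() {
    status=$1
    # Whitespace is allowed between items, so strip it first.
    pattern=$(printf '%s' "$2" | tr -d ' ')
    old_ifs=$IFS
    IFS=,
    for item in $pattern; do
        case $item in
            *..*)
                # Inclusive range such as 1..5.
                low=${item%%..*}
                high=${item##*..}
                if [ "$status" -ge "$low" ] && [ "$status" -le "$high" ]; then
                    IFS=$old_ifs
                    return 0
                fi
                ;;
            *)
                # Individual code such as 10.
                if [ "$status" -eq "$item" ]; then
                    IFS=$old_ifs
                    return 0
                fi
                ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

status_matches 3 "1..5,10,15..20" && echo "3 matches"
status_matches 7 "1..5,10,15..20" || echo "7 does not match"
```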

Retry predicates

-F, --retry-failing-status

Retry when the command exits with any non-zero status code. This is a convenient shorthand if you want to retry on failure but still want to combine other predicates (for example, additional output-based predicates).

--retry-if-status <STATUS_CODE>

Retry if the child process exit code matches the given status pattern.

Stop predicates

--stop-if-status <STATUS_CODE>

Stop retrying if the child process exit code matches the given status pattern.

Output predicates

Output predicates examine the text written to stdout and stderr by the child command to determine whether to retry or stop. These predicates are useful when exit codes alone don't provide enough information about the failure condition.

Performance

Output predicates are often more convenient than status predicates, and users are encouraged to use them in circumstances where developer time is the greatest concern (such as on the command line and in throw-away scripts). However, they are discouraged in long-lived scripts and CI.

Output predicates are the most expensive. They require additional system calls to retrieve the output and additional computation to search it. They are typically unstable across different versions of the child process.

Status predicates should be preferred when possible. If output predicates are used, the more specific predicates (e.g., --retry-if-stdout-contains vs --retry-if-contains) should be preferred.

Regular expression syntax

The regular expression predicates use the regex crate's syntax.

Retry predicates

--retry-if-contains <STRING>

Retry if either stdout or stderr contains the specified string.

--retry-if-matches <REGEX>

Retry if either stdout or stderr matches the specified regular expression.

--retry-if-stdout-contains <STRING>

Retry if stdout contains the specified string.

--retry-if-stdout-matches <REGEX>

Retry if stdout matches the specified regular expression.

--retry-if-stderr-contains <STRING>

Retry if stderr contains the specified string.

--retry-if-stderr-matches <REGEX>

Retry if stderr matches the specified regular expression.

Stop predicates

--stop-if-contains <STRING>

Stop retrying if either stdout or stderr contains the specified string.

--stop-if-matches <REGEX>

Stop retrying if either stdout or stderr matches the specified regular expression.

--stop-if-stdout-contains <STRING>

Stop retrying if stdout contains the specified string.

--stop-if-stdout-matches <REGEX>

Stop retrying if stdout matches the specified regular expression.

--stop-if-stderr-contains <STRING>

Stop retrying if stderr contains the specified string.

--stop-if-stderr-matches <REGEX>

Stop retrying if stderr matches the specified regular expression.

Timeout & signal predicates

Signal patterns

Some signal predicates attempt to match the signal number against a pattern supplied in the argument. The syntax for patterns is as follows.

  • Individual codes: 1
  • Inclusive ranges: 1..5
  • Combinations of patterns: 1,2,3,10..15
  • Whitespace is allowed: 1, 2, 3
  • Note that valid status codes are in the range [0, 255]

Retry predicates

--retry-if-timeout

Retry if the command was killed specifically due to a timeout. This requires that the --timeout option is also specified.

--retry-if-killed

Retry if the command was killed by any signal. Note this implies --retry-if-timeout, because timeouts use signals to terminate processes.

--retry-if-signal <PATTERN>

Retry if the command was killed by any signal matching the given pattern.

This is only available on Unix systems.

Stop predicates

--stop-if-timeout

Stop retrying if the command was killed specifically due to a timeout. This requires that the --timeout option is also specified.

--stop-if-killed

Stop retrying if the command was killed by any signal. Note this implies --stop-if-timeout, because timeouts use signals to terminate processes.

--stop-if-signal <PATTERN>

Stop retrying if the command was killed by any signal matching the given pattern.

This is only available on Unix systems.

Timing controls

Timing controls adjust how attempt manages time-related aspects of command retries. They determine the delays between attempts and the time limits for each attempt.

Wait control

-j --jitter <DURATION>

For a jitter value of n, adds a value in the interval [0, n] to the delay time. This is useful for preventing "thundering herd" issues. Jitter is always added last, so even if the delay was rounded by --wait-min/--wait-max, it will still be randomized.

-m --wait-min <DURATION>

Round any delay smaller than the specified minimum up to that minimum.

-M --wait-max <DURATION>

Round any delay larger than the specified maximum down to that maximum. This is useful when using the linear or exponential strategies, to ensure that you do not sleep for an unbounded amount of time.

--stagger <DURATION>

For a stagger value of n, stagger the delay by a random amount in the interval [0, n]. This is useful for desynchronizing multiple concurrent instances of attempt which start at the same time.
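The rounding performed by --wait-min and --wait-max can be sketched as a clamp over whole seconds. The helper name clamp_delay is hypothetical and only illustrates the behavior described above. Jitter is left out; per the description of --jitter, it would be added after this step.

```shell
# clamp_delay DELAY MIN MAX  (all in whole seconds)
# Prints DELAY rounded up to MIN or down to MAX when it falls outside them.
# Hypothetical helper for illustration; not part of attempt.
clamp_delay() {
    if [ "$1" -lt "$2" ]; then
        echo "$2"
    elif [ "$1" -gt "$3" ]; then
        echo "$3"
    else
        echo "$1"
    fi
}

# Exponential delays of 1, 2, 4, ... seconds, clamped to between
# 5 seconds and 900 seconds (15 minutes).
delay=1
for n in 0 1 2 3 4 5 6 7 8 9 10 11; do
    clamp_delay "$delay" 5 900
    delay=$((delay * 2))
done
```

The early delays are raised to 5 seconds, the middle ones pass through unchanged, and anything above 900 seconds is capped.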

Timeout control

-t --timeout <DURATION>

Kill the child command if it does not complete within the timeout. This prevents attempt from waiting indefinitely on a child that may never exit. For instance, the child could be stuck in an infinite loop.

The child is polled using an exponential backoff with a base of 2 and a multiplier of 10ms, saturating at a maximum delay of 15s.
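The polling schedule described above can be tabulated with a short sketch. It assumes the poll counter starts at 0, as it does for the exponential backoff schedule; that detail, and the helper name poll_interval_ms, are assumptions for illustration rather than documented behavior.

```shell
# poll_interval_ms POLL_NUMBER
# Prints 10ms * 2^POLL_NUMBER, saturating at 15000ms (15s).
# Hypothetical helper; assumes the poll counter starts at 0.
poll_interval_ms() {
    interval=10
    i=0
    # Stop doubling once the cap is reached so the value cannot overflow.
    while [ "$i" -lt "$1" ] && [ "$interval" -lt 15000 ]; do
        interval=$((interval * 2))
        i=$((i + 1))
    done
    if [ "$interval" -gt 15000 ]; then
        interval=15000
    fi
    echo "$interval"
}

# Print the interval for the first twelve polls.
for n in 0 1 2 3 4 5 6 7 8 9 10 11; do
    poll_interval_ms "$n"
done
```

Under these assumptions the interval reaches the 15 second ceiling on the twelfth poll, which is why quick commands are noticed within milliseconds while slow ones are polled infrequently.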

-R --expected-runtime <DURATION>

Specify how much time the command is expected to take. The child command will be polled slowly during this time (once per minute).

This is useful to reduce load on the system. An assumption made in the design of the timeout feature is that most commands exit quickly, so the child is polled fairly aggressively. This may adversely impact performance for some use cases.

--retry-if-timeout

Retry if the command was killed specifically due to a timeout. This requires that the --timeout option is also specified.

--stop-if-timeout

Stop retrying if the command was killed specifically due to a timeout. This requires that the --timeout option is also specified.

Exit codes

The following exit codes are used by attempt to indicate whether it failed and how. Scripts should use these exit codes and not the log messages exposed by --verbose; these exit codes will remain stable, but no such guarantee is made for the log messages.

Code number  Description
0            Command was run successfully within the allowed number of retries.
1            I/O error (e.g., command not found). An error message will be printed.
2            Invalid arguments. An error message will be printed.
3            The number of retries has been exhausted without the command ever succeeding.
4            The number of retries has not been exhausted, but the command is no longer retryable because of a "stop" predicate.
101          attempt has crashed. The most likely cause is using output predicates on data which is not UTF-8 encoded.
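A script can branch on these codes with a case statement. The sketch below maps each documented code to a short label; the helper name describe_attempt_status and the labels themselves are hypothetical, and the commented usage assumes attempt is installed.

```shell
# describe_attempt_status CODE
# Prints a short label for an exit code returned by attempt.
# Hypothetical helper; the labels paraphrase the table above.
describe_attempt_status() {
    case $1 in
        0)   echo "success" ;;
        1)   echo "I/O error" ;;
        2)   echo "invalid arguments" ;;
        3)   echo "retries exhausted" ;;
        4)   echo "stopped by a stop predicate" ;;
        101) echo "attempt crashed" ;;
        *)   echo "unexpected status $1" ;;
    esac
}

# Typical use in a script (requires attempt to be installed):
#   status=0
#   attempt --attempts 5 some_command || status=$?
#   echo "attempt finished: $(describe_attempt_status "$status")"
describe_attempt_status 3
```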

Advice for Scripting

Use an exponential backoff

Exponential backoff allows you to retry aggressively at first while quickly backing off to a significant wait time. Many errors are transient and can be worked around by retrying quickly. Other errors may take a long time to resolve, and may not resolve if we stress the system with load.

Leave the base argument at its default of 2, and use the multiplier argument to control how aggressively you retry. If you want it to be very aggressive, use a value of 0.050 (50 milliseconds), and if you want it to be very conservative, use a value of 60 (1 minute). The default of 1 second is a good balance overall, but if you are accessing public resources, consider using a value of 5 seconds or greater as a courtesy.

Set a max wait time

Set max wait to a reasonable value, like 900 (15 minutes), so that you do not wait an unbounded amount of time. This is especially important when using an exponential backoff.

Add jitter to the wait time

Random jitter will help you avoid emergent cyclic behavior. This may occur if your script is running on multiple systems concurrently, or if you are accessing a public resource and many programmers choose similar constants for their retry logic. For example, if everyone chooses round numbers, the delays will all be multiples of 5, and at some point everyone's retry logic will sync up and make a request at the exact same time. A useful metaphor is metronome synchronization.

See also the thundering herd problem.

Set a timeout on the child command

Set a timeout on the child command so that you don't get stuck if there is an infinite loop, deadlock, or similar issue.

Avoid output predicates

Output predicates can create performance issues. Try to use status predicates whenever possible.

If you must use an output predicate, use the specific stdout or stderr variant. The generic variants are provided for convenience, but are not as performant.
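Putting this advice together, a script might wrap a network call as sketched below. It uses only flags described in this manual, but the particular combination and ordering of flags has not been verified against a real installation, so treat it as a starting point. The function name fetch_with_retries is hypothetical, and both attempt and curl are assumed to be installed.

```shell
# fetch_with_retries URL
# Hypothetical wrapper; assumes attempt and curl are installed.
fetch_with_retries() {
    # Exponential backoff with a courteous 5 second multiplier, capped at
    # 15 minutes, with 1 second of jitter and a 30 second timeout per try.
    # No output predicates are used; curl --fail makes HTTP errors surface
    # as a non-zero exit status, which the default policy retries.
    attempt \
        --wait-max 15m \
        --jitter 1s \
        --timeout 30s \
        exponential --multiplier 5s \
        -- curl --fail --silent "$1"
}

# Example call (uncomment to run; performs a real network request):
# fetch_with_retries https://example.com
```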

Bibliography

This bibliography presents works for the following uses:

  • To fact check claims made in this manual.
  • To provide further guidance in cases this manual doesn't address.
  • To supply context about attempt's design.

Tenacity

Tenacity is a Python library for retrying. It is the primary inspiration for attempt's design.

Tenacity forked from Retrying in 2016 because its original author and maintainer, Ray Holder, stopped responding to members of the community who reached out. Holder wrote Retrying in 2013.

Retrying is currently maintained by Greg Roodt in a separate fork. Roodt organized the transfer of the retrying package name in 2022. He attempted to transfer the name to Tenacity, but while he succeeded in taking over the name, the issue to transfer ownership to Tenacity was not followed up on. It remains open at the time of this writing. Roodt's fork receives periodic updates.

Retrying established the core concepts of the architecture:

  • An @retry(...) decorator which retries a wrapped function
  • Three categories of rules, which together form the retrying strategy
    • Retry rules, determining which circumstances result in a retry
    • Stop rules, determining which circumstances terminate retrying
    • Wait rules, determining how long we sleep after an attempt

Retrying baked its predicates into the arguments of the @retry(...) decorator, much like attempt.

Tenacity extended the architecture to support functions as arguments, unlocking arbitrary predicates, and to use context managers in addition to decorators. It added support for async contexts. Tenacity has also created a large library of utilities and extensive documentation.

References

Case studies