Are sleep and retries a classic bandaid or resilient design?

Rohit Talukdar
2 min read · Oct 18, 2021
https://institutesuccess.com/library/if-at-first-you-dont-succeed-try-try-try-again-william-edward-hickson-2/

Almost any production code we write involves more than one component, and these components need to interact with one another. The interaction usually follows one of these models:

  1. master-slave
  2. client-server
  3. publisher-subscriber

Let's narrow the discussion to architectures that use the first two. For the purposes of this discussion, master-slave and client-server can be treated as the same. Take a scatter-gather algorithm that distributes work and runs it in parallel, say a large number of parallel greps over a large log file: the component that hands out the work and the components that do it are in a master-slave (or client-server) relationship.
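To make the scatter-gather example concrete, here is a minimal sketch in Python. It assumes the log is already in memory as a list of lines and uses a thread pool to stand in for the slaves; the names `grep_chunk` and `scatter_gather_grep` are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def grep_chunk(lines, pattern):
    """Slave task: return the lines in this chunk that contain the pattern."""
    return [line for line in lines if pattern in line]

def scatter_gather_grep(lines, pattern, workers=4):
    """Master: split the log into chunks, grep them in parallel, merge the results."""
    chunk_size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so the merged output keeps log order
        results = list(pool.map(grep_chunk, chunks, [pattern] * len(chunks)))
    return [line for chunk in results for line in chunk]

# Illustrative data, not from a real log
log = ["INFO start", "ERROR disk full", "INFO ok", "ERROR timeout", "INFO done"]
matches = scatter_gather_grep(log, "ERROR", workers=2)
```

The point of the sketch is only the shape of the relationship: the master cannot return until every slave has reported back.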

In such cases, the master depends on the task being run by the slave.

The slave may finish quickly or take much longer, depending on how much work there is and what resources are available.

Now, the master could be designed to wait for a fixed, worst-case amount of time and return failure if this timeout is exceeded.
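A rough sketch of the fixed-timeout approach, assuming the slave's task runs in a thread and the master blocks on its result. The helper names `slow_task` and `run_with_timeout` are hypothetical, and the durations are tiny so the example runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_task(duration_s):
    """Stand-in for the slave's work: just sleeps, then reports completion."""
    time.sleep(duration_s)
    return "done"

def run_with_timeout(task, arg, timeout_s):
    """Master: wait at most timeout_s for the task; return (ok, result)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(task, arg)
    try:
        return True, future.result(timeout=timeout_s)
    except TimeoutError:
        return False, None  # worst-case time exceeded: report failure
    finally:
        # Do not block on the slave; note the thread itself keeps running
        pool.shutdown(wait=False)

fast = run_with_timeout(slow_task, 0.01, timeout_s=1.0)  # finishes in time
late = run_with_timeout(slow_task, 0.3, timeout_s=0.05)  # exceeds the timeout
```

Note that giving up on the wait does not stop the slave: in this sketch the timed-out task still runs to completion in the background.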

Or, to make the code more resilient, it could implement a construct that sleeps and retries a number of times until the task is completed.
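The sleep-and-retry construct usually looks something like the sketch below. It assumes the master can poll the slave through a callable that returns `None` until the task is finished; `sleep_and_retry` and `FakeSlave` are hypothetical names used only for this illustration.

```python
import time

def sleep_and_retry(check, retries, sleep_s):
    """Poll check() up to `retries` times, sleeping between attempts.

    Returns (result, attempts_used); result is None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        result = check()
        if result is not None:
            return result, attempt
        if attempt < retries:
            time.sleep(sleep_s)  # wait before asking again
    return None, retries

class FakeSlave:
    """Simulated slave that reports completion on the N-th poll."""
    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def poll(self):
        self.polls += 1
        return "done" if self.polls >= self.ready_after else None

slave = FakeSlave(ready_after=3)
result, attempts = sleep_and_retry(slave.poll, retries=5, sleep_s=0.01)
```

Here the slave is ready on the third poll, so the master succeeds after three attempts. Both `retries` and `sleep_s` are guesses about how long the slave will take, which is exactly what the rest of this post questions.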

Some say that this leads to resilient design. The drawback, however, is that the sleep-retry pattern can lead to:

  1. greater overhead in terms of resource consumption
  2. glossing over inefficiencies in the system, such as which module depends on what
  3. longer execution times for the overall task
  4. if the worst-case time turns out to be insufficient, the program still fails, and invariably the next bandaid is to increase the sleep / retry combo even further.
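Points 3 and 4 can be illustrated with a little arithmetic. The sketch below assumes a poller that checks at times 0, `sleep_s`, 2 × `sleep_s`, and so on, for at most `retries` checks; the function name `detected_at` and all the numbers are made up for illustration.

```python
import math

def detected_at(finish_s, sleep_s, retries):
    """Time at which the poller first sees a task that finished at finish_s.

    Returns None if the retry budget runs out before the task is done.
    """
    poll_index = math.ceil(finish_s / sleep_s)  # first poll at or after finish_s
    if poll_index >= retries:
        return None  # budget exhausted: the master reports failure
    return poll_index * sleep_s

# Task finishes at 31 s, but the master only notices at the next poll.
print(detected_at(finish_s=31, sleep_s=30, retries=5))    # 60

# Task needs 200 s; five polls only cover up to 120 s, so it fails...
print(detected_at(finish_s=200, sleep_s=30, retries=5))   # None

# ...and the usual "fix" is to bump the retries.
print(detected_at(finish_s=200, sleep_s=30, retries=10))  # 210
```

In the first case 29 seconds are spent sleeping after the work is already done. In the last case the bump works only until the task needs more than 270 seconds.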

While it's OK for escalation and serviceability folks to use it to quickly provide a workaround in production, core developers should not be using this pattern to make fixes.

Overall, I think that while the sleep-retry pattern is OK during the initial quick-and-dirty proof-of-concept stage, it is an anti-pattern, and its use in production should be carefully evaluated.

https://unsplash.com/photos/SwWjCbIIoFE

