Importance of timeouts for web pages, APIs and SQL in a microservice-based application
In a microservice-based application, we generally have independent applications (aka microservices), each developed on the single responsibility principle, doing its primary job and exposing it through an API. These API interfaces are consumed by other microservices or by a front end/client to build workflows and present them to the end user. The communication among microservices creates a mesh-like structure, and if the application is not designed correctly, any break in communication can bring down the whole application. Such behavior defeats a basic advantage of microservice-based architecture: fault tolerance, the ability to keep the application alive even if one of its components goes down.
The scenario below is used as an example to understand how timeouts at various levels help ensure that a single component failure does not have a cascading impact on the others and, in the worst case, bring down the whole application.
In one of the workflows of a microservice application, the client application makes 6 API calls, namely A, B, C, D, E and F.
API (in order of call)   A     B     C     D     E     F
Time taken (ms)          100   200   200   600   500   400
Microservice             MS1   MS2   MS1   MS1   MS3   MS3
On the throughput side, API A is called 180 times a minute and its normal average response time is 100 ms. This means that the API server stays busy for 18 seconds in every minute serving this API (180 × 100 ms).
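The busy-time arithmetic above can be sketched in a few lines of Python. The function name is made up for illustration, and the calculation assumes the server handles the calls for this API one at a time, which is the simplification the example uses.

```python
def busy_seconds_per_minute(calls_per_minute: int, avg_response_ms: float) -> float:
    """Server time (in seconds) one API consumes per minute.

    Assumes calls are served one at a time, as in the article's example.
    """
    return calls_per_minute * avg_response_ms / 1000.0


# API A: 180 calls per minute at 100 ms each.
busy = busy_seconds_per_minute(180, 100)
print(busy)  # 18.0
```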
API B is called after API A and depends on the result of API A. The rest of the APIs are independent and are called sequentially after the call to B. The overall web page load time is 2 seconds (A contributes 100 ms of it).
When a few of the microservices go down or start performing slowly, it is important to make sure that the problem does not cascade to all the microservices and start impacting the user experience. Every API must either respond in time or time out, so that the workflows depending on it are not kept waiting (especially if the API calls in the workflow are sequential).
In this scenario, if API A slows down and starts taking 500 ms, the respective API server will never be able to serve all of its calls: the inflow of 180 calls per minute now needs 90 seconds (1.5 minutes) of server time in every minute. While some of the calls (120 out of 180) will get a response, the other 2 APIs on the same server (C and D) in the workflow will be impacted and will not get any server time, so their responses become slow or hang too. As all the APIs are called in sequence and C and D are stuck with no server time available, E and F will never be called. The end result is that the complete web page is down.
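The overload numbers in this scenario can be checked with a small sketch. The function and key names are hypothetical, and it keeps the same simplifying assumption that the server works through the calls one after another.

```python
def overload_summary(calls_per_minute: int, response_ms: int) -> dict:
    """Compare the server time demanded per minute with the 60 seconds available."""
    demand_s = calls_per_minute * response_ms / 1000   # server seconds needed per minute
    capacity_calls = 60_000 // response_ms             # calls that fit into one minute
    served = min(calls_per_minute, capacity_calls)
    return {
        "demand_s": demand_s,
        "served": served,
        "unserved": calls_per_minute - served,
    }


# API A degraded from 100 ms to 500 ms at the same 180 calls per minute.
summary = overload_summary(180, 500)
print(summary)  # {'demand_s': 90.0, 'served': 120, 'unserved': 60}
```

Every minute leaves 60 calls unserved, so the backlog keeps growing and nothing else on the same server gets any time.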
Now suppose I calculate a timeout based on the throughput, and decide that in the worst case API A should not use more than 75% of the server's time (at most 45 seconds out of every minute).
With this setting and a throughput of 180 calls per minute, the API response should never take longer than 250 ms (45 seconds of total time in a minute divided by 180 calls). If I set that up as the timeout, only a lucky few calls to A will get a response, while most will start timing out after 250 ms. That results in a failure of B too (as it depends on the response of A), but the rest of the calls (C, D, E and F) will run, and the web page will load, perhaps with partial information (and a user-friendly message in the component where the responses of A and B were to be shown). If I also make sure that dependent APIs are not called when their parent is failing, the web page will actually end up with a faster response time, even though some components fail to load.
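A minimal sketch of this behavior, using Python's asyncio, is shown here. It is an illustration under stated assumptions, not the actual client: the API calls are simulated with sleeps, all names are hypothetical, the 3000 ms timeouts for B to F are placeholders (the text only derives the 250 ms value for A), and the latencies are scaled down 10x so the demo finishes quickly.

```python
import asyncio

SCALE = 0.1  # run the demo 10x faster than the latencies in the example

# Simulated latencies in ms: A has degraded to 500 ms, the rest match the table.
LATENCY_MS = {"A": 500, "B": 200, "C": 200, "D": 600, "E": 500, "F": 400}
# Timeouts in ms: 250 for A comes from the text, the others are assumed placeholders.
TIMEOUT_MS = {"A": 250, "B": 3000, "C": 3000, "D": 3000, "E": 3000, "F": 3000}
# B needs the result of A; every other call is independent.
DEPENDS_ON = {"B": "A"}


def timeout_budget_ms(calls_per_minute: int, max_utilisation: float) -> float:
    """Longest response time that keeps one API within its share of server time."""
    return 60_000 * max_utilisation / calls_per_minute


async def call_api(name: str) -> str:
    """Stand-in for a real HTTP call: just sleeps for the simulated latency."""
    await asyncio.sleep(LATENCY_MS[name] * SCALE / 1000)
    return f"{name}: payload"


async def run_workflow() -> dict:
    results = {}
    for name in "ABCDEF":  # the calls are made sequentially, as in the example
        parent = DEPENDS_ON.get(name)
        if parent is not None and results[parent] != "ok":
            results[name] = "skipped"  # do not call a dependent API if its parent failed
            continue
        try:
            await asyncio.wait_for(call_api(name), timeout=TIMEOUT_MS[name] * SCALE / 1000)
            results[name] = "ok"
        except asyncio.TimeoutError:
            results[name] = "timeout"
    return results


print(timeout_budget_ms(180, 0.75))  # 250.0
results = asyncio.run(run_workflow())
print(results)
```

In this run A should time out, B should be skipped, and C, D, E and F should complete, so the page can render with only the A and B components missing.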
Used this way, timeouts avoid the cascading impact and make sure that we at least survive instead of dying completely. The circuit breaker pattern takes this to the next level: once a threshold is reached, we stop calling the failing API completely, and we keep checking the API for success before calls to it are made again. Of course, such APIs need to be monitored and fixed to bring their response times under control, so that they can be called again.
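As a rough illustration of the circuit breaker idea, here is a minimal single-threaded sketch. The class, its parameters (`failure_threshold`, `reset_after_s`) and the injectable `clock` are all assumptions made for this example, not part of any particular library; a production system would more likely use an existing resilience library and handle concurrency.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after_s = reset_after_s          # cool-down before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None      # None means the circuit is closed

    def call(self, func: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: fall through and let one trial call check the API.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result


def slow_api() -> str:
    """Stand-in for a call to API A that keeps timing out."""
    raise TimeoutError("API A did not answer within 250 ms")


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=30.0)
for _ in range(3):
    try:
        breaker.call(slow_api)
    except TimeoutError:
        print("call failed")
    except CircuitOpenError:
        print("call skipped, showing fallback message")
```

With a threshold of 2, the first two calls fail and the third is rejected without touching the API at all; after the cool-down, a single trial call decides whether the circuit closes again.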