Does a Java thread come back to the running state soon after it receives the response it was waiting for after making a RestTemplate call?
I have a REST API service which makes a REST API call to an external system using RestTemplate. The timeout is set to 3 seconds.
To make this external call, I have created an ExecutorService fixed thread pool. Now, in normal testing everything works fine, including the timeout set for the internal call. But when I performance test my system using JMeter, on one VM with 4 cores and a fixed thread pool size of 50, I start observing the things below.
- Below 30 req/sec throughput, everything works well.
- At 45 req/sec throughput, I observed increased latencies from the internal REST API call. Though the timeout was set to 3 sec, I observe 5-8 sec latencies from the internal call.
    import java.time.Duration;
    import java.time.Instant;

    // Timing taken around the external call, as seen by the worker thread.
    Instant startTime = Instant.now();
    ResponseEntity<String> responseEntity = restTemplate
            .exchange(url, httpMethod, httpEntity, String.class);
    Instant finishTime = Instant.now();
    Log.info("Time diff is " + String.valueOf(Duration.between(startTime, finishTime).toMillis()));
- At 45 req/sec throughput, there was a delay at the point where the internal API call is handed off to a separate thread; it looks like the system was waiting for a thread to become available in the pool (see the sketch below).
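One way to confirm where the extra time goes is to measure the executor queue wait and the HTTP call separately. A minimal sketch of that idea (not from the original post; pool stands for the fixed thread pool described above, and the other names match the snippet above):

    // Illustrative only: split the time spent queued in the executor from the
    // time the HTTP call itself takes.
    Instant submitted = Instant.now();
    pool.submit(() -> {
        Instant started = Instant.now();
        Log.info("Queue wait ms: " + Duration.between(submitted, started).toMillis());

        Instant callStart = Instant.now();
        ResponseEntity<String> responseEntity = restTemplate
                .exchange(url, httpMethod, httpEntity, String.class);
        Log.info("Call ms: " + Duration.between(callStart, Instant.now()).toMillis());
    });

If "Queue wait ms" grows while "Call ms" stays near the timeout, the delay is inside your own system (thread pool or connection pool) rather than in the remote service.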
Regarding the 2nd point, I wanted to understand how this is possible. A few theories that I have are:
- As the external API call is an IO operation, the thread-pool thread executing it goes into a "blocked for IO completion" or wait state till it receives the response from the external API. And when it receives the response, the thread goes to the "ready to run" state, waiting to run, at the mercy of the scheduler. Now, as the traffic was already so high, there were a lot of other threads already in the "ready to run" state, and some of these threads, even after receiving the external API response, get CPU time very late, and that is when the finishTime gets calculated. That means that even if the restTemplate call got the response within 3 sec, executing the Log.info line took time, because of which the time diff was logged late.
- This theory I don't think is right, but I am just letting out my thoughts. Is it possible that when the thread makes that REST API IO call and goes into the wait state, then, as the demand for threads from the pool is high, the thread is pre-empted from the waiting state to cater to another request? And once the response is received from the API call, another thread gets assigned to handle that response; assigning this other thread might have taken time, causing the increased latency.
Please let me know if my 1st theory is right, or if not, what exactly happens in high-throughput cases here that might be causing latencies greater than the timeout value for the external API call.
Solution 1:[1]
Theory 1 is correct.
If we talk in simpler terms, thread preemption is done at the kernel level rather than at the JVM level. Once the JVM assigns a task to a thread, it no longer controls it. It is the kernel (depending upon how it is configured) which can preempt the running thread and put it into a waiting state if a higher-priority thread needs CPU time. So the theory which says the JVM preempts the waiting thread for the next task is not correct.
Basically, the whole point of configuring a fixed thread pool of a particular size is to put some limit on the usage of resources. Had it been the case you mentioned in theory 2, your system would soon run OOM if each thread started going into an indefinite waiting state for one reason or another.
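To see this resource-bounding behaviour in isolation, here is a small self-contained sketch (illustrative values, not the original setup): with a fixed pool of 4 threads and 10 blocking tasks, only 4 tasks run at any moment, the rest queue up, and a worker thread that is blocked on I/O is never taken away to serve another task.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class FixedPoolDemo {
        public static void main(String[] args) throws InterruptedException {
            // Small pool size, chosen only to make the queueing visible.
            ExecutorService pool = Executors.newFixedThreadPool(4);

            for (int i = 0; i < 10; i++) {
                final int id = i;
                pool.submit(() -> {
                    System.out.println("task " + id + " started on " + Thread.currentThread().getName());
                    try {
                        // Simulate a blocking remote call: the worker thread stays
                        // parked until the "response" arrives and is not handed
                        // another task in the meantime.
                        Thread.sleep(1_000);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    System.out.println("task " + id + " finished");
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }

The output shows tasks starting in batches of 4, roughly one second apart: the pool bounds concurrency instead of handing a waiting thread's work to another task.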
Solution 2:[2]
Theory 1 is correct. I'll explain it using Linux, since that is the most widely used OS and the one I'm familiar with.
As the external API call is an IO operation, the thread-pool thread executing it goes into a "blocked for IO completion" or wait state till it receives the response from the external API.
Yes. The thread is marked as TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE and is removed from the run-queue of the scheduler and added to a wait-queue, which is just a doubly linked list of nodes with tasks waiting to be notified (e.g. of I/O completion).
And when it receives the response, the thread goes to the "ready to run" state, waiting to run, at the mercy of the scheduler.
Correct. So the task is removed from the wait-queue and reinserted back into the run-queue of the scheduler. Linux makes use of the Completely Fair Scheduler (CFS), and the run-queue isn't a traditional queue but a red-black tree, where the key is the time the thread has spent running on the CPU (runtime). Tasks with little runtime end up at the left side of the tree and tasks with a large runtime end up at the right side. When a task that was blocked is reinserted, its runtime is set to the runtime of the leftmost task in the tree, to make sure that a task that was blocked for a long time will not then keep running for a long time. This also makes sure that such tasks will be next in line to get some CPU time.
The CPU running the task is periodically interrupted by a timer to check if the running task should be preempted. If that is the case, then:
- The runtime of the task is incremented by the time it has spent on the CPU.
- The task is reinserted back into the tree.
- The next task (the leftmost node) is taken from the tree and starts to run on the CPU.
So this is where we get a context switch. I won't bother you with the details of task priorities, but they also fit very nicely into the picture.
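To make the mechanism above a bit more concrete, here is a deliberately simplified toy model in Java (not the actual kernel data structures): runnable tasks are kept ordered by accumulated runtime, the task with the smallest runtime runs next, and a task waking up from a wait-queue is re-inserted with a runtime no smaller than the current minimum.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Toy model of a CFS-style run-queue, purely illustrative.
    public class ToyRunQueue {

        static final class Task {
            final String name;
            long runtime; // time already spent on the CPU

            Task(String name, long runtime) {
                this.name = name;
                this.runtime = runtime;
            }
        }

        private final PriorityQueue<Task> runQueue =
                new PriorityQueue<>(Comparator.comparingLong((Task t) -> t.runtime));

        // A task waking up (e.g. its I/O completed) is inserted with the runtime
        // of the current "leftmost" task, so it runs soon but cannot then hog the
        // CPU to catch up on all the time it spent blocked.
        public void wakeUp(String name) {
            long minRuntime = runQueue.isEmpty() ? 0 : runQueue.peek().runtime;
            runQueue.add(new Task(name, minRuntime));
        }

        // Pick the task with the least runtime, let it "run" for one timeslice,
        // then put it back with its runtime increased: the context switch point.
        public String runNext(long timesliceMillis) {
            Task next = runQueue.poll();
            if (next == null) {
                return null;
            }
            next.runtime += timesliceMillis;
            runQueue.add(next);
            return next.name;
        }
    }

The real scheduler uses a red-black tree keyed by virtual runtime and weighted by priority, but the pick-the-leftmost-task behaviour is the same idea.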
Now, as the traffic was already so high, there were a lot of other threads already in the "ready to run" state, and some of these threads, even after receiving the external API response, get CPU time very late, and that is when the finishTime gets calculated. That means that even if the restTemplate call got the response within 3 sec, executing the Log.info line took time, because of which the time diff was logged late.
In the run-queue (there is a run-queue per core), there are a bunch of threads waiting to be executed. They will be in the TASK_RUNNING state. Each task will get some runtime assigned based on:
- the priority of the task
- the number of other tasks pending
- the priority of the other tasks.
By default, the time to divide over all tasks in the run-queue is 24 ms. So if there are 4 tasks in the TASK_RUNNING state, then every task will get 6 ms on the CPU before it is preempted, unless it blocks/yields/exits early. So normally, every 24 ms, every task has an opportunity to run. I won't bother you with the minimum runtime and what happens when there are too many tasks on the run-queue.
Because it is unlikely that your tasks are CPU-intensive, the run-queue should be pretty empty: most of the tasks are on a wait-queue, not on the run-queue of the scheduler. So once a remote call completes, the scheduler should be able to run the task to completion very quickly, because there are no other tasks running.
So if the time to the remote service is constant, then with a bunch of threads (especially as few as you have), their execution time should be roughly equal to the time to make the remote call, since the scheduler is pretty idle. So IMHO it is unlikely that it is the scheduler that is causing the latencies to grow.
Perhaps there is some queueing/contention, e.g. a connection pool that limits the number of REST connections that can be made. Or perhaps the remote service is causing the increased latency because it is under load.
Solution 3:[3]
Just wanted to add this because I had the same problem and it took 2 days to find out why.
For me it was because the default connection pool size per route was 5, so when there were more than 5 requests at the same time, the later requests had to wait for connections leased by earlier requests to be released before they could be processed. So it may seem like the later requests took longer than their timeouts, but actually they were just waiting for a connection to become available.
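For reference, this is one way to raise those limits when RestTemplate is backed by Apache HttpClient 4.x (a sketch under that assumption; the exact defaults and APIs depend on the HTTP client and Spring versions actually in use, and the numbers below are illustrative):

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
    import org.springframework.web.client.RestTemplate;

    public class RestTemplateConfig {

        public static RestTemplate pooledRestTemplate() {
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(100);          // total connections across all routes
            cm.setDefaultMaxPerRoute(50); // per-route limit; should cover the worker pool size

            CloseableHttpClient httpClient = HttpClients.custom()
                    .setConnectionManager(cm)
                    .build();

            HttpComponentsClientHttpRequestFactory factory =
                    new HttpComponentsClientHttpRequestFactory(httpClient);
            factory.setConnectTimeout(3_000); // connect timeout in ms
            factory.setReadTimeout(3_000);    // socket read timeout in ms

            return new RestTemplate(factory);
        }
    }

If the per-route limit is at least as large as the number of worker threads that can call the remote service at once, requests no longer sit in the connection pool's lease queue, and that waiting time stops masquerading as remote-call latency.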
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Manish Bansal |
| Solution 2 | |
| Solution 3 | yingxuan |