FastAPI runs API calls serially instead of in parallel

I have the following code:

import time
from fastapi import FastAPI, Request

app = FastAPI()

@app.get("/ping")
async def ping(request: Request):
    print("Hello")
    time.sleep(5)
    print("bye")
    return {"ping": "pong!"}

If I run my code on localhost and access http://localhost:8501/ping in different tabs of the same browser window, I get:

Hello
bye
Hello
bye

instead of:

Hello
Hello
bye
bye

I have read about using httpx, but still, I cannot achieve true parallelization. What's the problem?



Solution 1:[1]

As per FastAPI's documentation:

When you declare a path operation function with normal def instead of async def, it is run in an external threadpool that is then awaited, instead of being called directly (as it would block the server).

Thus, def (sync) routes run in a separate thread from an external threadpool, which is then awaited - in other words, the server processes such requests concurrently - whereas async def routes run on the main (single) thread, i.e., the server processes those requests sequentially, as long as there is no await call to an I/O-bound operation inside them, such as waiting for data from the client to be sent through the network, for the contents of a file on disk to be read, for a database operation to finish, etc. - have a look here.

Asynchronous code with async and await is often summarised as using coroutines. Coroutines are collaborative (or cooperatively multitasked): "at any given time, a program with coroutines is running only one of its coroutines, and this running coroutine suspends its execution only when it explicitly requests to be suspended" (see here and here for more info on coroutines). However, this does not apply to CPU-bound operations, such as the ones described here. CPU-bound operations, even if declared in async def functions and called using await, will block the main thread. This also means that a blocking operation, such as time.sleep(), inside an async def route will block the entire server (as in your case).
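To see this behaviour in isolation (outside FastAPI), here is a minimal, self-contained asyncio sketch: two coroutines that await asyncio.sleep() overlap their waiting time, because each one hands control back to the event loop while sleeping, whereas two coroutines that call the blocking time.sleep() run one after the other, because a blocking call never gives control back to the event loop:

import asyncio
import time

async def cooperative(name: str):
    print(f"{name}: start")
    await asyncio.sleep(1)      # yields control to the event loop while waiting
    print(f"{name}: done")

async def blocking(name: str):
    print(f"{name}: start")
    time.sleep(1)               # blocks the event loop; nothing else can run meanwhile
    print(f"{name}: done")

async def main():
    t0 = time.perf_counter()
    await asyncio.gather(cooperative("A"), cooperative("B"))
    print(f"awaited sleeps took ~{time.perf_counter() - t0:.1f}s")   # ~1 s in total

    t0 = time.perf_counter()
    await asyncio.gather(blocking("C"), blocking("D"))
    print(f"blocking sleeps took ~{time.perf_counter() - t0:.1f}s")  # ~2 s in total

asyncio.run(main())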

Thus, if your function is not going to make any async calls, you could declare it with def instead, as shown below:

@app.get("/ping")
def ping(request: Request):
    #print(request.client)
    print("Hello")
    time.sleep(5)
    print("bye")
    return "pong"

Otherwise, if you are going to call async functions that you have to await, you should use async def. To demonstrate this, the example below uses the asyncio.sleep() function from the asyncio library. A similar example is given here and here as well.

import asyncio
 
@app.get("/ping")
async def ping(request: Request):
    print("Hello")
    await asyncio.sleep(5)
    print("bye")
    return "pong"

Both functions above will print the expected output - as mentioned in your question - if two requests arrive at around the same time:

Hello
Hello
bye
bye

Note: When you call your endpoint for the second (third, and so on) time, please remember to do that from a tab that is isolated from the browser's main session; otherwise, the requests will be shown as coming from the same client (you could check that using print(request.client) - the port number would appear to be the same, if both tabs were opened in the same window), and hence, the requests would be processed sequentially. You could either reload the same tab (while it is still running), open a new tab in an incognito window, or use another browser/client to send the request.
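Alternatively, instead of juggling browser tabs, you could fire two requests at (almost) the same time from a small script. Below is a minimal sketch using httpx (which you mentioned) together with asyncio.gather(); it assumes the server is reachable at http://localhost:8501/ping, as in your question - adjust the URL/port to however you actually run the server:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        # both requests are sent at (about) the same time; with the blocking
        # async def route the second response arrives only after ~10 s,
        # with the fixed versions both arrive after ~5 s
        responses = await asyncio.gather(
            client.get("http://localhost:8501/ping", timeout=30),
            client.get("http://localhost:8501/ping", timeout=30),
        )
    for r in responses:
        print(r.json())

asyncio.run(main())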

Async/await and Expensive CPU-bound Operations (Long Computation Tasks)

If you are required to use async def (as you might need to await coroutines inside your route), but you also have some synchronous long computation task that might block the server and not let other requests go through, for example:

from fastapi import UploadFile, File

@app.post("/ping")
async def ping(file: UploadFile = File(...)):
    print("Hello")
    try:
        contents = await file.read()
        res = some_long_computation_task(contents)  # this blocks other requests
    finally:
        await file.close()
    print("bye")
    return "pong"

then:

  1. Use more workers (e.g., uvicorn main:app --workers 4). Note: Each worker "has its own things, variables and memory". This means that global variables/objects, etc., won't be shared across the processes/workers. In this case, you should consider using a database storage, or Key-Value stores (Caches), as described here and here. Additionally, "if you are consuming a large amount of memory in your code, each process will consume an equivalent amount of memory".

  2. Use FastAPI's (Starlette's) run_in_threadpool() from the concurrency module (source code here and here) - as @tiangolo suggested here - which "will run the function in a separate thread to ensure that the main thread (where coroutines are run) does not get blocked" (see here). As described by @tiangolo here, "run_in_threadpool is an awaitable function, the first parameter is a normal function, the next parameters are passed to that function directly. It supports sequence arguments and keyword arguments". A complete endpoint using this approach is sketched after this list.

    from fastapi.concurrency import run_in_threadpool
    res = await run_in_threadpool(some_long_computation_task, contents)
    
  3. Alternatively, use asyncio's run_in_executor:

    loop = asyncio.get_running_loop()
    res = await loop.run_in_executor(None, lambda: some_long_computation_task(contents))
    
  4. You should also check whether you could change your route's definition to def instead. For example, if the only method in your endpoint that has to be awaited is the one reading the file contents (as you mentioned in the comments section below), FastAPI can read the bytes of a file for you (however, this should be used for small files only, as the whole contents will be stored in memory, see here), or you could even call the read() method of the SpooledTemporaryFile object directly, so that you don't have to await the read() method - and since you can now declare your route with def, each request will run in a separate thread:

    @app.post("/ping")
    def ping(file: UploadFile = File(...)):
        print("Hello")
        try:
            contents = file.file.read()
            res = some_long_computation_task(contents)
        finally:
            file.file.close()
        print("bye")
        return "pong"
    
  5. Have a look at this answer, as well as the documentation here, for more suggested solutions.
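To make option 2 above more concrete (as referenced there), here is a minimal, self-contained sketch of a complete endpoint that offloads the computation to a worker thread via run_in_threadpool(), so that the event loop keeps serving other requests while it runs. The function some_long_computation_task() is the placeholder used throughout this answer - its body below is just an example, substitute your own computation:

from fastapi import FastAPI, File, UploadFile
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

def some_long_computation_task(contents: bytes) -> int:
    # placeholder for the synchronous, long-running computation
    return sum(contents)

@app.post("/ping")
async def ping(file: UploadFile = File(...)):
    print("Hello")
    try:
        contents = await file.read()
        # run the blocking computation in a worker thread, so that the event
        # loop (and hence other requests) is not blocked while it runs
        res = await run_in_threadpool(some_long_computation_task, contents)
    finally:
        await file.close()
    print("bye")
    return {"result": res}

Keep in mind that a worker thread still holds the GIL while running pure-Python CPU-bound code; for such workloads, passing a ProcessPoolExecutor to loop.run_in_executor() (option 3) moves the work to a separate process instead.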

Solution 2:[2]

Q: "... What's the problem?"

A: The FastAPI documentation explicitly says that the framework uses in-process tasks (as inherited from Starlette).

That, by itself, means that all such tasks compete, from time to time, to acquire the Python interpreter's GIL - the Global Interpreter Lock, effectively a MUTEX that re-[SERIAL]-ises any and all Python interpreter in-process threads
to work as one-and-only-one-WORKS-while-all-others-stay-waiting...

On a fine-grained scale, you can see the result: if spawning another handler for the second arriving HTTP request (manually initiated from a second Firefox tab) actually takes longer than the sleep does, the output of the GIL-lock's interleaved round-robin (all-wait-while-one-can-work, each thread holding the lock for at most the interpreter's switch interval - about 5 [ms] by default, see sys.getswitchinterval() - before the next round of the GIL release-acquire roulette takes place) will not show you any more detail. You may, however, collect more detail (depending on the O/S type or version, see here) about what happens inside the threads, like this, inside the async-decorated code being performed:

import time
import threading
from   fastapi import FastAPI, Request

TEMPLATE = "INF[{0:_>20d}]: t_id( {1: >20d} ):: {2:}"

print( TEMPLATE.format( time.perf_counter_ns(),
                        threading.get_ident(),
                       "Python Interpreter __main__ was started ..."
                        )
       )

app = FastAPI()

@app.get("/ping")
async def ping( request: Request ):
        """                                __doc__
        [DOC-ME]
        ping( Request ):  a mock-up AS-IS function to yield
                          a CLI/GUI self-evidence of the order-of-execution
        RETURNS:          a JSON-alike decorated dict

        [TEST-ME]         ...
        """
        print( TEMPLATE.format( time.perf_counter_ns(),
                                threading.get_ident(),
                               "Hello..."
                                )
               )
        #------------------------------------------------- actual blocking work
        time.sleep( 5 )
        #------------------------------------------------- actual blocking work
        print( TEMPLATE.format( time.perf_counter_ns(),
                                threading.get_ident(),
                               "...bye"
                                )
               )
        return { "ping": "pong!" }

Last but not least, do not hesitate to read more about all the other sharks that thread-based code may suffer from ... or even cause ... behind the curtains ...

Ad Memorandum

A mixture of the GIL-lock, thread-based pools, asynchronous decorators, blocking calls and event handling is a sure recipe for uncertainty & HWY2HELL ;o)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1
[2] Solution 2 - halfer