A misadventure in async programming

Feb 26, 2026

Recently, I began the long process of open-sourcing Exquisitor, a search engine I've (co-)designed and validated across multiple interactive search and retrieval benchmarks. With this release, my goal is to eventually provide an open foundation that anyone can use to build and evaluate different search algorithms on a single, battle-tested, and well-documented system.

On the road towards a stable release, I've been taking a close look at the many bits and pieces that make up the project. Today's story is about one of those pieces and, for me personally, a cautionary tale in how there's never a single, simple fix for complex design challenges. But before we get there, let me give you a short prologue.

Prologue

Exquisitor was developed in the context of video search and exploration benchmarks and competitions. We would physically attend the venue with our systems and perform live search tasks on-site. This setting shaped Exquisitor's design philosophy: a strict separation between offline and online (or live) operations.

Offline operations, such as encoding and embedding, indexing, and feature annotation, are designed to be comprehensive. They often involve long-running preprocessing and annotation steps. In contrast, the online or live components are deliberately kept small and are often the target of optimization efforts. We spend a lot of time designing pipelines where extensive offline data transformation reduces the cost of runtime search as much as possible. In traditional information retrieval parlance, we adopt a precomputation-centric design that aggressively shifts costs from query-time to index-time.

During benchmarks, executing live services as fast as possible is what matters. They're what must run flawlessly under pressure and where we continually focus our efforts to refine and optimize the system. Any form of delay in live interactions is visible to the end user either as interaction latency (a time delay between the cause and the effect of some physical change in the system being observed) or as jank/UI freezing (a colloquial term for slow or unresponsive interfaces). Within an interactive search system like Exquisitor, such delays may manifest as a long pause between pressing the search button and seeing results, effectively breaking the user's perception of the system as an instantaneous, interactive tool. Delays can also modify user behavior by adding a fixed cost to certain decisions. In benchmarks such as the Video Browser Showdown, novice users often tried to compensate when they observed that certain interactions were slow: if a search took considerable time to finish, they would write long, verbose queries, hoping to offload their entire mental state into a single query rather than iterating with smaller queries and improving results through incremental refinement or relevance feedback1. In short, we try to minimize interaction latency at every step. Additionally, since a single backend can serve multiple users, it must support concurrent access.

Should this be asynchronous or not?

I was looking at the code for an endpoint that performs free-text search: it accepts a user query, encodes it via CLIP's text encoder, and returns its (approximate) nearest neighbours in the visual data. This is currently one of the most compute-intensive operations we perform, since it requires an inference pass through a hefty text encoder to generate the query embedding2.

Once the request hits the API endpoint and passes the necessary checks, it ultimately ends up in a function like this:

async def search(
    self,
    collection: str,
    text: str,
    n: int,
    seen: List[int],
    excluded: List[int],
    filters: Optional[ActiveFilters] = None,
) -> List[int]:
    """Execute CLIP text search."""
    try:
        # Encode text using CLIP
        text_features = self._encode_text(text)

        # Process exclusions
        excluded_set = self._build_excluded_set(collection, excluded)
        seen_set = set(seen)

        # Search with expanding radius until we have enough results
        return await self._search_with_expansion(
            collection, text_features, n, seen_set, excluded_set, filters
        )

    except Exception as e:
        raise SearchError(
            f"CLIP search failed: {e}", {"collection": collection, "text": text}
        )

At a high level, this function accepts the user query, encodes it via the CLIP model's text encoder, and then performs a search with an expanding radius to collect the desired number of items matching all filtering and exclusion criteria. The call to self._encode_text(text) is what embeds the query. That function looks like this:

def _encode_text(self, text: str) -> np.ndarray:
    """Encode text using CLIP model."""
    device = self.model_manager.device

    with (
        torch.inference_mode(),
        (
            torch.amp.autocast("cuda")
            if torch.cuda.is_available()
            # Note: MPS autocast support is still maturing; we skip it here
            # and fall back to full precision on Apple Silicon.
            else contextlib.nullcontext()
        ),
    ):
        tokenized_text = self.model_manager.clip_text_tokenizer([text]).to(device)
        text_features = self.model_manager.clip_text_model(tokenized_text)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        return text_features.detach().cpu().numpy()

Nothing too complicated. But at this moment, I looked at this function and noticed that I had ended up with a synchronous, blocking call inside an asynchronous function. As I would later learn, this is a common mistake many developers make when they first start working with async code. But in that moment, I thought: "Well, this is a blocking call. Couldn't I simply make _encode_text() asynchronous and await it in app/search.py?" It seemed like a simple fix: just slap an async in front of the function definition and an await in front of the call. So that's exactly what I did. Problem solved, right? Not quite. And to understand why, we need to take a step back and talk about Python's concurrency model.

Understanding Python's concurrency model (with stamppot)

To understand Python's concurrency model, it is helpful to remember that the reference Python interpreter, CPython, allows only a single thread to execute Python bytecode at a time. This decision exists partly to avoid race conditions and ensure thread safety, and partly to simplify CPython's own memory management (which relies on reference counting and is particularly vulnerable to data races). It is enforced through a Global Interpreter Lock (GIL), a mutex that permits only a single thread to hold control of the interpreter at once. In a multi-threaded application, each thread must wait to acquire the GIL before it can resume execution.

To make execution models more concrete, consider a simplified example: a chef preparing stamppot, a traditional Dutch dish of potatoes mashed with vegetables and typically served with sausages. Here's my favorite one: with boerenkool (kale) and rookworst (smoked sausage).

The first model is purely sequential. The chef boils the potatoes, waits for them to finish, removes them from the stove, then cleans, chops, and blanches the vegetables, then cooks the sausages, each step fully completed before the next begins. This is fundamentally inefficient, because while the potatoes boil, the vegetables could already be cleaned and chopped. Here's how it looks:

  1. Put potatoes in pot, stand there watching them boil for 20 minutes
  2. Done! Now put veggies in pot, stand there watching for 15 minutes
  3. Done! Now cook sausage in pan, stand there watching for 10 minutes
  4. Done! Mash potatoes and veggies together (2 minutes)
  5. Plate everything with the sausage (30 seconds)

We've spent 47 and a half imaginary minutes, most of them standing around waiting.

In an asynchronous model, the same chef multitasks intelligently while things cook:

  1. Put potatoes in pot to boil (20 min) ← await boil_potatoes()
  2. While potatoes are boiling: put veggies in another pot (15 min) ← await boil_veggies()
  3. While both are boiling: start sausage in pan (10 min) ← await cook_sausage()
  4. Check what's ready, drain what's done
  5. Mash potatoes and veggies together (2 min) ← actual work, can't multitask this
  6. Plate everything with the sausage (30 seconds)

Now we're down to roughly 22 and a half imaginary minutes, with everything cooking simultaneously. Crucially though, the async version isn't working harder: we still have a single chef who can only actively do one thing at a time, but who can have multiple things cooking at once. It's not more work, it's less waiting.
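The async chef can be sketched directly with asyncio. The function names and scaled timings below are mine (one imaginary cooking minute becomes 10 ms of real waiting), but the structure is exactly the recipe above:

```python
import asyncio
import time

MINUTE = 0.01  # one imaginary cooking minute = 10 ms of real waiting


async def boil_potatoes():
    await asyncio.sleep(20 * MINUTE)  # the pot does the work; the chef is free


async def boil_veggies():
    await asyncio.sleep(15 * MINUTE)


async def cook_sausage():
    await asyncio.sleep(10 * MINUTE)


async def make_stamppot():
    # Start all three at once; the single chef only waits for the slowest pot.
    await asyncio.gather(boil_potatoes(), boil_veggies(), cook_sausage())
    await asyncio.sleep(2 * MINUTE)    # mashing: hands-on work
    await asyncio.sleep(0.5 * MINUTE)  # plating


start = time.monotonic()
asyncio.run(make_stamppot())
minutes = (time.monotonic() - start) / MINUTE
print(f"{minutes:.1f} imaginary minutes")  # ~22.5, not 47.5
```

The total is dominated by the 20-minute potatoes, because the three waits overlap instead of adding up.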

Python's async is cooperative multitasking on a single thread: only one piece of code actually runs at any moment, and async just switches between tasks quickly while they wait. For I/O-bound operations, such as saving to a database, writing logs, or making network requests, it makes a lot of sense to hand off the task and await its completion from the relatively slower storage or network device. Async shines when you have many such operations that spend most of their time waiting. Making 100 API calls, for instance: a synchronous approach does them one by one, while async can fire all 100 and handle responses as they arrive.
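The 100-API-calls case looks like this in miniature. Here, fake_api_call is a made-up stand-in for a real request, with asyncio.sleep playing the part of network latency:

```python
import asyncio
import time


async def fake_api_call(i: int) -> int:
    await asyncio.sleep(0.02)  # pure waiting, as if on a network response
    return i


async def main() -> list:
    # Fire all 100 "requests" at once; responses are collected as they land.
    return await asyncio.gather(*(fake_api_call(i) for i in range(100)))


start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start
print(len(results), f"{elapsed:.2f}s")  # 100 results in roughly 0.02 s, not 100 × 0.02 s
```

All 100 waits overlap on the same thread, so the total wall time is close to the latency of a single call.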

But what happens if I put heavy CPU-bound computation behind an await statement? If a task is computing rather than waiting, it monopolizes the CPU core and nothing else can run. Putting a heavy embedding operation behind an await, for a coroutine that never yields control back, is functionally identical to calling it in a blocking manner: the event loop simply waits while the process computes the embeddings. The loop keeps asking "can I switch tasks?" but it can't, because the single thread is busy, and everything else is stalled behind it.
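Here is a minimal, self-contained sketch of that stall. time.sleep stands in for the synchronous inference pass, and a heartbeat task plays the role of every other request the loop should be serving:

```python
import asyncio
import time


async def fake_encode():
    # `async def` in name only: the body never awaits, so the event loop
    # is frozen for the full duration of this "inference pass".
    time.sleep(0.3)
    return "embedding"


async def heartbeat(ticks: list) -> None:
    # Stands in for all the other requests the loop should be serving.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main() -> int:
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)   # let the heartbeat get going
    before = len(ticks)
    await fake_encode()         # awaited, but it never yields control
    after = len(ticks)
    hb.cancel()
    return after - before


stalled = asyncio.run(main())
print(stalled)  # 0: not a single heartbeat landed during the 0.3 s "encode"
```

Despite the await, the heartbeat is completely starved while the "encode" runs, which is exactly what my one-word fix did to Exquisitor's event loop.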

Multithreading

Now imagine you hire four chefs. They all share the same kitchen: the same countertops, the same pots, the same tools and ingredients. This is great for efficiency, but there's a rule: only one chef can read the recipe book at a time. That's the GIL. Each thread runs in the same process and shares memory, so no copying costs are incurred, but the GIL means pure Python code is effectively serialized across threads.
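The flip side of the recipe-book rule is that a chef who is merely waiting doesn't need the book at all. In the sketch below, time.sleep plays the part of a C-level call that releases the GIL while it waits, so four threads can wait at the same time even though only one can run Python bytecode:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_in_c(_: int) -> None:
    # time.sleep releases the GIL while it waits, much like many C-extension
    # calls do, so the four chefs are not queueing for the recipe book here.
    time.sleep(0.2)


with ThreadPoolExecutor(max_workers=4) as pool:
    start = time.monotonic()
    list(pool.map(wait_in_c, range(4)))
    elapsed = time.monotonic() - start

print(f"{elapsed:.2f}s")  # roughly 0.2 s, not 4 × 0.2 = 0.8 s
```

Had wait_in_c been a pure-Python busy loop instead, the four threads would have taken roughly as long as running the loops back to back, because the GIL would serialize them.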

Let's try this then. Instead of running the embedding operation in the same thread, we spin up a thread pool executor, hand it the task, and await the result:

async def search(self, collection, text, n, seen, excluded, filters=None):
    """Execute CLIP text search."""
    try:
        # Offload the CPU-bound encoding to a thread pool.
        # This frees the event loop to handle other requests while we wait.
        loop = asyncio.get_running_loop()
        text_features = await loop.run_in_executor(
            self._executor, self._encode_text, text
        )

        excluded_set = self._build_excluded_set(collection, excluded)
        seen_set = set(seen)

        return await self._search_with_expansion(
            collection, text_features, n, seen_set, excluded_set, filters
        )

    except Exception as e:
        raise SearchError(
            f"CLIP search failed: {e}", {"collection": collection, "text": text}
        )

But wait a sec, Ujjwal. Didn't you just repeatedly say that only one thread can run Python at a time? Where's the GIL now?

Yes, and this is where the final piece of the puzzle falls into place. Underneath the embedding function is PyTorch's C extension code, which can voluntarily release the GIL. When a C function knows it won't be touching any Python objects for a while, it can essentially say: "I don't need the GIL; let other threads run Python code while I'm busy. I'll let you know when I'm done." By moving the embedding into a separate thread, we avoid GIL contention, because PyTorch relinquishes it almost immediately once it enters the C layer3.
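A self-contained version of the whole pattern, for the skeptical: the hypothetical blocking_encode below stands in for the PyTorch forward pass (time.sleep also releases the GIL, so the real model isn't needed to see the effect), and a heartbeat task again plays the other requests:

```python
import asyncio
import time


def blocking_encode(text: str) -> str:
    # Stand-in for the PyTorch forward pass: a call that releases the GIL
    # while the heavy lifting happens outside Python bytecode.
    time.sleep(0.3)
    return f"embedding({text})"


async def main() -> tuple:
    ticks: list = []

    async def heartbeat() -> None:
        # Plays the role of other requests the event loop keeps serving.
        while True:
            ticks.append(1)
            await asyncio.sleep(0.02)

    hb = asyncio.create_task(heartbeat())
    loop = asyncio.get_running_loop()
    # None -> the event loop's default ThreadPoolExecutor
    result = await loop.run_in_executor(None, blocking_encode, "a red car")
    hb.cancel()
    return result, len(ticks)


result, ticks = asyncio.run(main())
print(result, ticks)  # the heartbeat kept ticking throughout the 0.3 s encode
```

Contrast this with the earlier version where the awaited "encode" ran on the event loop thread and starved the heartbeat entirely: here the loop stays free to serve other tasks the whole time.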

Does this actually make things faster?

The system is not inherently faster because of this change. The run_in_executor approach does not make a single encoding request complete sooner: the computation takes exactly the same amount of time either way. What it does change is the behavior of the event loop for every other request that arrives while an encoding is in progress. We've managed to ensure that the event loop does not lock up when it receives a large number of encoding requests: other requests can make progress and connections can be managed, all while the encoding runs in the background.

Conclusion

When you encounter a blocking call in an async codebase, the instinct to slap async def on it and call it a day is understandable, but for CPU-bound work it changes nothing. The event loop still waits, users still queue up behind each other, and the system only appears asynchronous.

The correct approach depends on why the function blocks. For I/O-bound work, async/await is exactly the right tool, but for CPU-bound work where the underlying library releases the GIL, a better approach is to push it onto a separate thread via run_in_executor: it moves the work off the event loop thread so the rest of the system can breathe. The result isn't a faster individual response, but a system that minimizes jank for all users, even while doing expensive work for some of them.


1

This was suboptimal not only because they spent an excessive amount of time writing their query, but also because CLIP's text encoder would truncate the query to the first 77 tokens.

2

We take appropriate measures to offload this to the correct device (CUDA, Apple's MPS, or CPU as a fallback), but it remains, by far, the heaviest operation in our live services, making it a frequent target of optimization efforts.

3

There is some nuance to this as some preprocessing/tokenization frameworks may be Python-based and may not immediately release the GIL. However, most heavy PyTorch kernels will release the GIL.
