Running LLM inference sequentially on even modest datasets can be painfully slow due to I/O-bound bottlenecks like network latency and API response time. Using parallel processing - via thread pools or async requests - lets you issue multiple requests concurrently, dramatically reducing runtime (we saw a 20x speedup). Key points: checkpoint progress to handle failures, tune the number of workers to avoid hitting Azure OpenAI rate limits, and remember that parallelism saves time, not money - billing is still based on tokens, not wall-clock time.
Tamara Gaidar, Data Scientist, Defender for Cloud Research
Maor Nissan, Principal Data Scientist, Defender for Cloud Research
The situation I’m about to describe is one that will feel familiar to many data scientists. We were asked to run a quick comparative analysis and report the results to stakeholders. However, the dataset we received was incomplete, so we needed to enrich it before proceeding.
Since the data was primarily textual, we decided to rely on LLM-based classifiers to fill in the missing information. At first glance, the dataset didn’t seem particularly large - about 3.5K rows. But once we started running the first classifier, we quickly realized a problem: each inference call (using o3-mini with a prompt per row) took around 2.5 seconds. That meant processing the entire dataset would take roughly 146 minutes.
For an analysis that was expected to be completed within a couple of working days, this was far too slow. Moreover, we needed to run the classification multiple times to extract different fields. And as anyone who has worked with prompts knows, the first iteration is rarely perfect, meaning even more runs were required.
At that point, it became clear: we needed a different approach. This is exactly where parallel processing can make a difference - even in what seems like a routine task. Let’s take a deeper look at the problem we faced.
When working with relatively small datasets, data scientists don’t usually think about scale or parallel inference - simply because “scale” doesn’t feel like an issue. But then the obvious question arises: why are we still paying such a high cost in inference time?
Inference latency per row is driven by several factors:
- Model choice
We chose o3-mini because it is a strong reasoning model, and from experience, it performs well on classification tasks involving security-related text. We were aware that it uses internal chain-of-thought reasoning before producing a response - which is precisely why we selected it.
Switching to a faster model like gpt-4o or gpt-4.1 could reduce latency, but potentially at the cost of quality. Since output quality was critical for our use case, this wasn’t a lever we were willing to pull.
- Network round-trip time
Each row required a full request cycle: connect → send request → wait for inference → receive response. This introduces a typical latency of 200–500 ms purely from network overhead. This is a hard constraint that we couldn’t meaningfully optimize.
- Sequential token generation
LLMs generate tokens one at a time. The more tokens generated, the longer the response takes. In our case, however, outputs were short (usually 1–5 words, such as cloud service names), so this wasn’t a significant contributor - and not something we could optimize further.
- Server-side load (Azure OpenAI)
Response times also depend on system load. While this can introduce variability, it’s not something we control, so we treat it as a given.
There are, of course, additional factors - like temperature (higher values introduce more variability and sometimes longer reasoning paths) and prompt length (more tokens = more processing time). But the key idea is this: even if we accept all these constraints, we can still dramatically improve performance.
How? By applying parallel processing.
Let’s simplify things. Assume we are working on a single-core CPU. The same principles apply in multi-core environments, but starting simple helps build intuition.
So, what is parallel processing?
In simple terms, it’s the ability to perform multiple tasks at the same time instead of one after another.
Before we go further, let’s clarify a few key concepts.
- Concurrency vs. Parallelism
Concurrency is about managing multiple tasks at once. Think of it as task juggling. Tasks can start, run, and complete in overlapping time periods - but not necessarily at the exact same moment.
For example, on a single CPU:
Thread 1: [==work==]..........[==work==]..........
Thread 2: ..........[==work==]..........[==work==]
The CPU rapidly switches between tasks. They appear to run simultaneously, but at any given instant, only one is actually executing.
Parallelism, on the other hand, is truly doing multiple things at once. Tasks are executed simultaneously, which requires multiple CPU cores.
- CPU-bound vs. I/O-bound
A task is CPU-bound when computation is the bottleneck—the CPU is fully utilized (close to 100%). In such cases, true parallelism (multiple cores) is the best way to improve performance.
A task is I/O-bound when it spends most of its time waiting on external systems (e.g., network, disk, APIs). In this case, the CPU is mostly idle.
[send]..........waiting for network/disk..........[receive] done
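The difference is easy to demonstrate without any API at all: when a task only waits (simulated here with time.sleep), a thread pool lets the waits overlap. A toy sketch:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_call(_):
    """Stand-in for a network request: the thread only waits, like an API call."""
    time.sleep(0.2)

# Sequential: five waits back to back, roughly 1.0s total
start = time.perf_counter()
for i in range(5):
    fake_io_call(i)
sequential_s = time.perf_counter() - start

# Concurrent: the five waits overlap, roughly 0.2s total
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(fake_io_call, range(5)))
concurrent_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s  concurrent: {concurrent_s:.2f}s")
```

The CPU does essentially no work in either version; the wall-clock gap comes entirely from overlapping the waiting.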
Now, let’s return to our original problem: LLM-based inference over a dataset.
This is a classic I/O-bound workload. Each row required:
- formatting the prompt
- sending the request
- waiting for the model response
- receiving the output
- writing the result back
Most of this time is spent waiting - not computing. The CPU is idle for the majority of the process, while the real bottleneck is the external API call.
And this is exactly where concurrency shines.
Instead of waiting for each request to complete before sending the next one, we can issue multiple requests concurrently - keeping the system busy and dramatically reducing total runtime.
At this point, it’s helpful to visualize what is actually happening under the hood.
In a naive implementation, requests are sent one after another. Each request waits for a response before the next one is issued. This creates a strictly sequential flow:
Request 1 → wait → Response 1
Request 2 → wait → Response 2
Request 3 → wait → Response 3
However, once we introduce concurrency, multiple requests are sent almost simultaneously. From the client side, this looks like a burst of outgoing calls rather than a queue.
But here’s the interesting part: while requests are sent from a single machine, they are handled on the Azure side by a distributed system capable of processing multiple requests in parallel.
This creates a mismatch:
- Client side: can send many requests concurrently
- Server side (Azure OpenAI): processes them in parallel - but under strict limits
At this point, a new bottleneck emerges - not latency per request, but rate limits.
In particular, Azure enforces limits such as:
- Tokens per minute (TPM)
- Requests per minute (RPM)
This means that even though multiple requests are processed in parallel, the total throughput is capped.
In other words, the system shifts from being latency-bound to being throughput-bound.
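In practice this means each worker should back off and retry when it hits the limit, rather than fail outright. A minimal sketch, assuming `call_llm` is any prompt-to-string callable and that the SDK raises an exception on HTTP 429 (the exact exception type depends on the client library, so a broad except is shown):

```python
import random
import time

def call_with_backoff(call_llm, prompt, max_retries=5, sleep=time.sleep):
    """Retry a rate-limited call with exponential backoff plus jitter.

    `call_llm` is a hypothetical prompt -> str callable wrapping the API
    request; the broad `except` stands in for the SDK's rate-limit error.
    """
    for attempt in range(max_retries):
        try:
            return call_llm(prompt)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # Wait 1s, 2s, 4s, ... plus jitter so threads don't retry in lockstep
            sleep(2 ** attempt + random.random())
```

The jitter matters with many workers: without it, all threads that hit the limit together retry together and hit it again.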
(Figure: Parallelism doesn’t remove limits - it helps you hit them faster.)
Let’s now look at how we implemented this approach in Python.
We use ThreadPoolExecutor, which is well-suited for I/O-bound workloads like API calls to LLMs.
First, we define a prompt template for our classification task:
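A minimal sketch of such a template; the wording and the `description` field are illustrative, not the original prompt:

```python
# Illustrative template: map a free-text row to a cloud service name.
PROMPT_TEMPLATE = """You are a security data analyst.
Classify the following resource description into the cloud service it
belongs to. Answer with the service name only (1-5 words).

Description: {description}
"""

def build_prompt(description: str) -> str:
    """Fill the template for one DataFrame row."""
    return PROMPT_TEMPLATE.format(description=description)
```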
Next, we define a function that processes a single row. This function will be executed in parallel across multiple threads:
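A sketch of such a per-row worker, assuming a hypothetical `call_llm` callable that wraps the actual Azure OpenAI request (not shown here):

```python
def classify_row(idx, description, call_llm):
    """Classify one row and return (idx, label).

    `call_llm` is a hypothetical prompt -> str callable wrapping the
    Azure OpenAI chat-completions request. Returning the row index lets
    results be written back correctly even though threads finish out
    of order.
    """
    prompt = f"Name the cloud service (1-5 words): {description}"
    try:
        return idx, call_llm(prompt).strip()
    except Exception:
        # Surface failures as None so one bad row doesn't kill the batch
        return idx, None
```

Carrying the index through is what makes the later fan-out safe: completion order is nondeterministic, but each result still knows where it belongs.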
Making the Pipeline Resilient
Before running the parallel jobs, we add a simple but important improvement: checkpointing.
If the process fails (which can easily happen when working with APIs), we don’t want to lose progress.
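A simple sketch of such checkpointing, persisting partial results to a JSON file (the filename is hypothetical):

```python
import json
import os

CHECKPOINT_PATH = "classification_checkpoint.json"  # hypothetical filename

def load_checkpoint():
    """Return previously saved {row_index: label} results, or an empty dict."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            # JSON keys are strings; restore integer row indices
            return {int(k): v for k, v in json.load(f).items()}
    return {}

def save_checkpoint(results):
    """Persist results so a crash or rate-limit failure doesn't lose progress."""
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump(results, f)
```

On restart, rows already present in the checkpoint can simply be skipped, so a failed run resumes where it stopped instead of re-paying for tokens.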
Running in Parallel
Now comes the core part: running multiple requests concurrently.
We use a thread pool to dispatch tasks:
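A self-contained sketch of the fan-out, again assuming a hypothetical `call_llm` callable in place of the real API client:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def classify_row(idx, description, call_llm):
    """Worker for one row; `call_llm` is a hypothetical prompt -> str callable."""
    return idx, call_llm(f"Name the cloud service: {description}").strip()

def run_parallel(rows, call_llm, max_workers=100):
    """Fan one request per (idx, description) pair out across the pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(classify_row, i, d, call_llm) for i, d in rows]
        for fut in as_completed(futures):  # yields futures in completion order
            idx, label = fut.result()
            results[idx] = label
    return results
```

Because `as_completed` yields results as they arrive, this loop is also a natural place to call a checkpoint-save every N completed rows.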
Finally, we save the completed results:
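A sketch of the final write-back; in the real pipeline this would be a pandas `to_csv` on the enriched DataFrame, but the stdlib `csv` module keeps the example dependency-free:

```python
import csv

def save_results(rows, results, path="classified_rows.csv"):
    """Join each (idx, description) with its predicted label and write a CSV.

    `rows` is the original (idx, description) list; `results` is the
    {idx: label} dict collected from the thread pool.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "description", "label"])
        for idx, description in rows:
            writer.writerow([idx, description, results.get(idx, "")])
```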
Let’s analyze the performance gains we achieved.
We ran the classification task above on a sample of 50 rows to demonstrate the time savings.
The total runtime for 50 samples, with an average of 2.53 seconds per API call, was 126.4 seconds. The parallel runtime, with 100 workers (threads), was 6.2 seconds in total.
🚀 SPEEDUP: 20.4x faster with parallel processing!
⏱️ Time saved: 120.2s (95% reduction)
EXTRAPOLATION TO FULL DATASET (3412 rows):
Estimated sequential time: 143.7 minutes
Estimated parallel time: 7.1 minutes
Estimated time saved: 136.7 minutes
We recommend experimenting with the number of parallel workers/threads. It can be worth pushing until you hit the “rate-limit wall,” where adding more threads stops helping. Then dial it back.
A reasonable question is: what are the implications for the cloud bill?
The truth is: there is no impact on the bill. Whether we process the DataFrame rows sequentially or in parallel, we make the same 3,412 API calls and consume the same number of tokens - and billing is based on tokens, not time or connections. Parallel processing saves time, not money.
Conclusions:
- Per-row LLM inference is usually I/O-bound, so concurrency (not more CPU) is the fastest path to shorter end-to-end runtime.
- Use a thread pool (or async) plus checkpointing to make long-running enrichment pipelines both faster and restart-safe.
- Tune parallelism to your Azure OpenAI RPM/TPM limits—once you hit the “rate-limit wall,” extra workers won’t help and can increase failures.
- Parallel processing changes throughput and time-to-result, not token-based cost: you pay for tokens, not wall-clock minutes.