One slow downstream dependency can put an entire Spring Boot service under pressure if nothing limits how much work reaches it at the same time. Putting a hard cap on in-flight calls for a specific downstream target gives you a fixed boundary: once that limit is hit, the service can either wait briefly for capacity or reject extra requests right away. Semaphores fit that job nicely because they hold a set number of permits and let only that much traffic pass at one time. In the current Spring stack, that usually means a servlet Filter or OncePerRequestFilter on the MVC side, and a WebFilter or a reactive operator-based limit on the WebFlux side.
How Semaphore Bulkheads Work
Concurrency limits help only when the boundary is firm and easy to reason about. That is why semaphores fit this job well. They give you a fixed count of permits, they let callers enter until that count is exhausted, and they force a decision for everyone who arrives after that point. That decision is where bulkhead behavior starts to matter. Some services reject extra work right away so latency does not stretch upward. Others let callers wait briefly in case a permit frees up soon. Either choice comes from the same starting point, which is a fixed ceiling on work already in progress.
Permits Set the Hard Ceiling
A permit is simply one slot for active downstream work. If a semaphore starts with 20 permits, then no more than 20 callers can pass through that gate at the same time. Caller 21 does not get special treatment just because it arrived a fraction of a second later. It has to wait, or it has to be rejected, based on the policy wrapped around the semaphore.
That hard ceiling is what gives bulkheads their value. Without a limit, a slow dependency can keep accepting more work until the calling service is tied up with rising response times, backed up threads, or a growing pile of unfinished reactive work. With a semaphore in front of that dependency, the service stops admitting extra load past a fixed point. The downstream call may still be slow, but the amount of damage it can spread is bounded by the permit count.
Java’s Semaphore keeps the mechanics small and direct. acquire() waits until a permit is available. tryAcquire() returns right away with true or false. release() gives the permit back. That compact API is enough to build several different admission styles.
This basic class makes the idea easy to follow:
import java.util.concurrent.Semaphore;

public final class InventoryGatewayLimit {

    private final Semaphore permits = new Semaphore(12, false);

    public boolean tryEnter() {
        return permits.tryAcquire();
    }

    public void leave() {
        permits.release();
    }

    public int availablePermits() {
        return permits.availablePermits();
    }
}

Nothing in that class calls a downstream service yet, but the main rule is already in place. Only 12 callers can be inside the guarded area at one time. availablePermits() can help with visibility during testing or metrics work, but the number that matters most is the starting permit count because that is the upper limit on concurrent access.
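A short, standalone check of that rule, written against a raw Semaphore with the same 12-permit setup, makes the ceiling visible:

```java
import java.util.concurrent.Semaphore;

public class CeilingCheck {
    public static void main(String[] args) {
        // Same 12-permit setup as above, without the wrapper class.
        Semaphore permits = new Semaphore(12, false);

        // The first 12 immediate attempts all succeed.
        for (int i = 0; i < 12; i++) {
            System.out.println("caller " + (i + 1) + " admitted: " + permits.tryAcquire());
        }

        // Caller 13 is refused because the gate is full.
        System.out.println("caller 13 admitted: " + permits.tryAcquire());

        // One release opens exactly one slot for the next caller.
        permits.release();
        System.out.println("caller 14 admitted: " + permits.tryAcquire());
    }
}
```

The first 12 attempts print true, caller 13 prints false, and caller 14 prints true again after the release. Nothing about that behavior depends on timing or thread scheduling, which is part of why semaphore bulkheads are easy to test.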
Fairness changes how waiting callers are handed permits when one becomes free. Fair mode tries to honor arrival order among waiting threads. Nonfair mode allows barging, so a later caller can take a permit ahead of an older waiter. The cap itself stays the same. What changes is the handoff order when there is contention.
This shorter example makes the distinction more visual:
import java.util.concurrent.Semaphore;

public final class BillingLane {

    private final Semaphore fairPermits = new Semaphore(3, true);
    private final Semaphore nonFairPermits = new Semaphore(3, false);

    public Semaphore fairPermits() {
        return fairPermits;
    }

    public Semaphore nonFairPermits() {
        return nonFairPermits;
    }
}

Both semaphores cap active work at 3. The difference appears only when callers are waiting. Fair mode favors first-in, first-out ordering at the semaphore boundary. Nonfair mode can move work through with a bit less coordination cost. Capacity does not change there. Order does.
One rule matters a great deal when code is built around permits. Every successful acquire needs a matching release. Lose track of a permit and the gate slowly shrinks. Nothing external changed, yet the service starts acting like its limit got tighter and tighter. That is why the release step belongs in a finally block whenever a permit has been taken.
This version shows the release in the place where it belongs:
import java.util.concurrent.Semaphore;

public final class CatalogLookup {

    private final Semaphore permits = new Semaphore(8, false);

    public String fetchItem(String itemId) throws InterruptedException {
        permits.acquire();
        try {
            return callDownstream(itemId);
        } finally {
            permits.release();
        }
    }

    private String callDownstream(String itemId) throws InterruptedException {
        Thread.sleep(75);
        return "item-" + itemId;
    }
}

The finally block is doing the protective work there. If the downstream call throws, returns early, or gets interrupted after admission, the permit still goes back. That keeps the ceiling stable.
Permit count selection deserves careful thought too. A semaphore with 500 permits still counts as a limit, but that is not much of a boundary if the downstream starts struggling after 30 in-flight calls. Setting the value too low rejects traffic earlier than needed. Setting it too high gives the downstream room to pull the caller into the same slowdown you were trying to contain. The cap should reflect how much work that dependency can tolerate while latency and error rates remain acceptable.
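One rough starting point, assuming reasonably steady traffic, comes from Little's Law: in-flight work is approximately arrival rate multiplied by time per call. The helper below is a hypothetical sketch for that arithmetic, not a library API, and the numbers are purely illustrative:

```java
public final class PermitSizing {

    // Hypothetical sizing helper based on Little's Law:
    // concurrent in-flight calls ~= sustainable calls per second
    //                               x acceptable seconds per call.
    public static int permitsFor(double callsPerSecond, double acceptableLatencySeconds) {
        return Math.max(1, (int) Math.ceil(callsPerSecond * acceptableLatencySeconds));
    }

    public static void main(String[] args) {
        // A downstream that tolerates about 120 calls/s at around
        // 250 ms per call suggests a cap near 30 in-flight calls.
        System.out.println(permitsFor(120.0, 0.25));
    }
}
```

A number produced this way is a first guess, not a final answer. Measured latency under load, downstream error rates, and how the dependency degrades near saturation should all push the cap up or down from there.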
Fail Fast Versus Queueing
Admission policy starts the moment the last permit is gone. One option is fail fast: the caller calls tryAcquire(), gets false, and the service refuses the work immediately. The other option is queueing: the caller waits for a permit and enters later if capacity opens up in time.
Fail fast keeps pressure from stacking up inside the service. Requests that cannot enter do not sit around holding thread state, request data, and pending timeouts. The caller gets a quick answer, and the service holds the line on in-flight work. Latency usually stays more predictable that way because excess traffic is not quietly turned into waiting traffic.
Take this class, for example, which shows the basic fail-fast style:
import java.util.concurrent.Semaphore;

public final class ShippingBulkhead {

    private final Semaphore permits = new Semaphore(15, false);

    public boolean tryStartCall() {
        return permits.tryAcquire();
    }

    public void finishCall() {
        permits.release();
    }
}

A caller can check tryStartCall(), and if it returns false, reject the work right away. That style is easy to reason about. No hidden line forms behind the limit. Capacity either exists right now or it does not.
Queueing changes the behavior in a very real way. Instead of refusing entry immediately, the caller waits for a permit. That can help during brief bursts where one permit is about to free up and the wait will be short. Waiting still has a cost. The service is now carrying extra callers that are not doing useful downstream work yet. They are parked at the gate, hoping for admission. If that wait grows, request latency grows with it, and upper-layer timeouts can start firing before the downstream call has even begun.
Timed waiting is usually the safer form of queueing because it puts a bound on how long a caller can remain at the gate. Untimed waiting is harder to control and can turn pressure into long stalls.
For example, this gate uses a short wait window:
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public final class PartnerApiGate {

    private final Semaphore permits = new Semaphore(10, true);

    public boolean tryStartCall() throws InterruptedException {
        return permits.tryAcquire(150, TimeUnit.MILLISECONDS);
    }

    public void finishCall() {
        permits.release();
    }
}

That code says a caller can wait up to 150 milliseconds for a permit. If no permit appears during that window, admission fails. Short waits like this can smooth out brief bursts, though the tradeoff is still there. The service is spending part of the request time budget at the gate before the downstream call has even started.
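The bounded wait is easy to observe directly. This small standalone check uses a one-permit gate so the outcome is deterministic: the first caller takes the only slot, and the second caller waits out the full window before giving up:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class TimedWaitCheck {
    public static void main(String[] args) throws InterruptedException {
        // One permit keeps the example deterministic.
        Semaphore permits = new Semaphore(1, true);

        System.out.println("first caller admitted: "
                + permits.tryAcquire(150, TimeUnit.MILLISECONDS));

        long start = System.nanoTime();
        boolean admitted = permits.tryAcquire(150, TimeUnit.MILLISECONDS);
        long waitedMillis = (System.nanoTime() - start) / 1_000_000;

        // The second caller is refused, but only after waiting close to
        // the full 150 ms window for a permit that never appeared.
        System.out.println("second caller admitted: " + admitted);
        System.out.println("waited roughly the full window: " + (waitedMillis >= 100));
    }
}
```

That wait is exactly the cost being traded. The refusal still happens, just 150 milliseconds later than a plain tryAcquire() would have delivered it.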
Fairness matters more when queueing is involved. If callers are waiting, fair ordering gives older waiters a better chance to move forward in arrival order. Nonfair mode can admit newer callers ahead of them. That can raise throughput a bit, but it can also leave older waiters behind for longer. In fail-fast mode, fairness has less room to matter because most callers are not waiting at all. They either get in immediately or they do not.
Something else worth knowing is that tryAcquire() without a timeout does not respect fairness in the same way waiting acquisition does. That matters because fail-fast bulkheads commonly rely on plain tryAcquire(). So a service can declare a fair semaphore and still see immediate attempts succeed out of arrival order if they hit the gate at the right moment. That is one reason fair mode matters most in queue-heavy admission rather than strict fail-fast admission.
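The Semaphore Javadoc points to a workaround when fairness should apply even to immediate attempts: a timed tryAcquire with a zero timeout honors the fairness setting while still returning essentially right away. This sketch shows both styles side by side; the class and method names are illustrative, not an established API:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public final class FairAdmission {

    private final Semaphore permits = new Semaphore(5, true);

    // Plain tryAcquire() barges: it can grab a freed permit even while
    // older threads are still queued, despite the fair setting.
    public boolean enterBarging() {
        return permits.tryAcquire();
    }

    // A zero-timeout timed tryAcquire honors fairness: it fails
    // immediately if other threads are already waiting for a permit.
    public boolean enterHonoringFairness() throws InterruptedException {
        return permits.tryAcquire(0, TimeUnit.SECONDS);
    }

    public void leave() {
        permits.release();
    }
}
```

Under contention, the first method can admit a newcomer out of arrival order, while the second refuses the newcomer whenever waiters are queued. The cost is that the timed form can also throw InterruptedException, so callers have a little more to handle.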
Service behavior changes a great deal depending on which side you choose. Fail fast favors quick refusal and tighter latency control. Queueing favors giving a burst one more chance to pass. Neither choice changes the permit ceiling itself. The semaphore still caps active work. The difference is what the service does with work that arrives after the gate is full.
Spring Boot Implementations
Spring Boot supports two different web stacks, and bulkhead placement depends on which one is handling the request and what needs to be capped. Servlet applications process requests through request threads, so a filter is a natural gate before controller work begins. WebFlux handles requests through a reactive, non-blocking model, so request admission can still live in a web filter, while downstream fan-out inside a reactive chain is usually better capped inside the pipeline itself. Current Spring APIs line up with that split through OncePerRequestFilter for servlet applications and WebFilter for reactive applications.
Servlet Stack Bulkheads with OncePerRequestFilter
Servlet-based Spring MVC applications already have a request interception layer in front of controllers, and OncePerRequestFilter fits that boundary well. It is built to run a filter one time per request dispatch and exposes doFilterInternal(HttpServletRequest, HttpServletResponse, FilterChain) for request processing. That makes it a natural place to cap in-flight requests before controller logic starts expensive work or calls a dependency that is already under pressure.
Spring Boot 3 and 4 use the Jakarta Servlet API, so code in this part of the stack should import jakarta.servlet types rather than the older javax.servlet package names. That matters because it affects copy-paste accuracy for current applications. OncePerRequestFilter also remains part of Spring’s current org.springframework.web.filter package and still fits normal servlet filter work.
This fail-fast filter keeps the gate narrow for one request path family and rejects extra work immediately when all permits are in use:
package com.example.bulkhead;

import java.io.IOException;
import java.util.concurrent.Semaphore;

import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.http.MediaType;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public final class PaymentBulkheadFilter extends OncePerRequestFilter {

    private final Semaphore permits = new Semaphore(20, false);

    @Override
    protected boolean shouldNotFilter(HttpServletRequest request) {
        return !request.getRequestURI().startsWith("/api/payments");
    }

    @Override
    protected void doFilterInternal(
            HttpServletRequest request,
            HttpServletResponse response,
            FilterChain filterChain) throws ServletException, IOException {
        if (!permits.tryAcquire()) {
            response.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            response.setContentType(MediaType.TEXT_PLAIN_VALUE);
            response.getWriter().write("Too many payment requests in flight");
            return;
        }
        try {
            filterChain.doFilter(request, response);
        } finally {
            permits.release();
        }
    }
}

Two parts matter a lot in that filter. shouldNotFilter() narrows the cap to the request group that actually needs protection, which keeps unrelated endpoints from competing for the same permits. The finally block gives the permit back no matter how the request finishes. If that release is skipped after an exception or early return, the bulkhead slowly tightens until the application starts rejecting traffic that should have been allowed. This example assumes the guarded endpoint finishes within the initial servlet dispatch. If the endpoint enters servlet async processing, permit release has to be tied to async completion rather than the return of the initial filter chain.
Short timed waiting is possible in the servlet stack too. That gives a request a brief chance to enter if a permit is about to free up, but it also means a servlet request thread is sitting at the gate rather than doing useful work. This version shows that style:
package com.example.bulkhead;

import java.io.IOException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public final class InventoryBulkheadFilter extends OncePerRequestFilter {

    private final Semaphore permits = new Semaphore(12, true);

    @Override
    protected boolean shouldNotFilter(HttpServletRequest request) {
        return !request.getRequestURI().startsWith("/api/inventory");
    }

    @Override
    protected void doFilterInternal(
            HttpServletRequest request,
            HttpServletResponse response,
            FilterChain filterChain) throws ServletException, IOException {
        boolean acquired = false;
        try {
            acquired = permits.tryAcquire(100, TimeUnit.MILLISECONDS);
            if (!acquired) {
                // The Servlet API has no SC_ constant for 429, so the
                // value comes from Spring's HttpStatus instead.
                response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
                return;
            }
            filterChain.doFilter(request, response);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            response.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
        } finally {
            if (acquired) {
                permits.release();
            }
        }
    }
}

That version can help with very brief bursts, but it changes the tradeoff. Part of the request time budget is now spent waiting for admission, and the request thread remains occupied during that wait. For servlet applications, that is one of the biggest reasons fail-fast admission stays attractive for bulkheads tied to strained downstream work.
Picking Status Codes for Failure Responses
Status code choice shapes how clients interpret the refusal. Most bulkhead rejections end up as either 429 Too Many Requests or 503 Service Unavailable, and both can make sense depending on what you want the response to say.
429 is a good fit when the response should read as an admission limit being hit. That can work well when the service is intentionally capping request entry and the client should treat it as a load-related refusal at the edge of the application.
503 fits well when the refusal is really about temporary service capacity, such as a downstream dependency that is already saturated or a request class that the service cannot safely admit right now. That framing keeps the message tied to temporary unavailability rather than client behavior.
Utility code can keep that decision consistent across filters and handlers:
package com.example.bulkhead;

import org.springframework.http.HttpStatus;

public final class BulkheadResponses {

    private BulkheadResponses() {
    }

    public static HttpStatus paymentRefusal() {
        return HttpStatus.SERVICE_UNAVAILABLE;
    }

    public static HttpStatus searchRefusal() {
        return HttpStatus.TOO_MANY_REQUESTS;
    }
}

Consistency matters more than forcing every endpoint into the same status code. Payment requests tied to a fragile downstream provider may read best as temporary unavailability, while a search endpoint with a deliberately narrow concurrency cap may read more naturally as too many requests.
Headers deserve a brief note too. Retry-After can help if clients are supposed to retry, but it should not be attached casually. Automatic retries from large groups of callers can keep pressure high long after the first overload event. A quick refusal without aggressive retry hints is sometimes the safer choice when downstream capacity is the real problem.
Error body wording should stay short and factual. Clients usually do not need a long explanation to act on a concurrency refusal. Stable status codes, a compact message, and predictable behavior from one call to the next are what matter most.
WebFlux Request Caps with WebFilter
Reactive request handling changes the runtime model, but request admission still needs a fixed boundary when a certain route should not accept unlimited in-flight work. WebFilter is Spring’s interception contract for cross-cutting request processing on the reactive side, so it serves as the direct counterpart to servlet filtering in WebFlux. Because WebFlux runs on a non-blocking model, a request-level bulkhead should avoid blocking waits on request-processing threads. That is why fail-fast admission with tryAcquire() maps nicely to a WebFilter. The request either gets a permit immediately or it is rejected right away. No thread is parked waiting for capacity to appear.
For example, this filter caps reactive request entry for one route group:
package com.example.bulkhead;

import java.util.concurrent.Semaphore;

import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

@Component
public final class ReactivePaymentBulkheadFilter implements WebFilter {

    private final Semaphore permits = new Semaphore(20, false);

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        String path = exchange.getRequest().getPath().value();
        if (!path.startsWith("/api/payments")) {
            return chain.filter(exchange);
        }
        if (!permits.tryAcquire()) {
            exchange.getResponse().setStatusCode(HttpStatus.SERVICE_UNAVAILABLE);
            return exchange.getResponse().setComplete();
        }
        return chain.filter(exchange)
                .doFinally(signalType -> permits.release());
    }
}

doFinally handles the lifecycle work there. Reactive request processing can complete normally, fail, or be canceled. Permit release should happen for all of those terminal outcomes so the gate stays accurate.
Timed waiting with a semaphore is much less attractive in WebFlux than in servlet applications. Blocking permit acquisition does not fit naturally with a non-blocking request pipeline. Thread offloading can make it possible, but that adds moving parts and can blur the bulkhead’s intent. For request-level gating in WebFlux, immediate admission or immediate refusal usually stays much easier to reason about.
Per Downstream Call Caps Inside a Reactive Pipeline
Request-level filters cap whole requests. That works particularly well when each request leads to one downstream call or when the request itself is the right boundary to protect. Some endpoints behave differently. One incoming request can fan out into dozens or hundreds of downstream calls through WebClient. In that case, the request may be allowed in, but the downstream fan-out still needs its own cap.
WebClient is Spring’s reactive HTTP client, and Reactor gives you a direct concurrency cap through flatMap overloads that accept a concurrency value. That gives you a natural way to limit in-flight downstream calls inside a reactive chain without turning the request admission layer into a waiting line.
This service caps downstream detail lookups at 10 active calls per request:
package com.example.bulkhead;

import java.util.List;

import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@Service
public final class ItemDetailsService {

    private final WebClient webClient;

    public ItemDetailsService(WebClient.Builder builder) {
        this.webClient = builder.baseUrl("https://details.example.com").build();
    }

    public Mono<List<ItemDetails>> fetchDetails(List<String> ids) {
        return Flux.fromIterable(ids)
                .flatMap(this::loadDetails, 10)
                .collectList();
    }

    private Mono<ItemDetails> loadDetails(String id) {
        return webClient.get()
                .uri("/items/{id}", id)
                .retrieve()
                .bodyToMono(ItemDetails.class);
    }
}

That code caps downstream concurrency without rejecting the entire request at the door. The request enters, but only 10 of its downstream lookups can be active at a time. If the input list contains 200 IDs, the remaining calls wait inside the reactive sequence until one of the active slots finishes. For fan-out work, that usually fits better than a request-level semaphore because the thing being limited is not whole-request entry. It is the number of active downstream calls inside the request.
Ordering matters too. Plain flatMap merges results as inner publishers complete, so result order can differ from source order. If order matters, Reactor also provides flatMapSequential with a maximum concurrency value. That keeps a concurrency cap while preserving source ordering in the merged output.
This example shows where that becomes useful:
package com.example.bulkhead;

import java.util.List;

import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@Service
public final class ReportAssemblyService {

    private final WebClient webClient;

    public ReportAssemblyService(WebClient.Builder builder) {
        this.webClient = builder.baseUrl("https://pricing.example.com").build();
    }

    public Mono<List<PricingView>> loadPricingViews(List<String> skuIds) {
        return Flux.fromIterable(skuIds)
                .flatMapSequential(this::fetchPrice, 6)
                .collectList();
    }

    private Mono<PricingView> fetchPrice(String skuId) {
        return webClient.get()
                .uri("/prices/{skuId}", skuId)
                .retrieve()
                .bodyToMono(PricingView.class);
    }
}

flatMapSequential keeps six calls active at a time while still emitting results in the original input order. That small change matters when the caller expects output to stay aligned with the original sequence.
Fairness Tradeoffs in Real Services
Fairness affects who gets admitted next when callers are waiting, and first-come, first-served can sound attractive right away. The tradeoff is that fairness is about admission order, not extra capacity. The semaphore still allows only the configured number of active holders. What changes is how waiting callers get newly freed permits.
Queue-heavy admission is where fairness matters most. If a bulkhead allows waiting, fair mode gives older waiters a better shot at moving first. Nonfair mode can let a newer arrival take a permit ahead of them. That can reduce coordination cost a bit, but it can also leave earlier waiters around longer than expected. Fail-fast bulkheads reduce the practical weight of fairness because there is little waiting to order. Immediate tryAcquire() either gets a permit right now or it does not. That means fairness is usually a more visible choice in timed-wait servlet bulkheads than in reactive fail-fast filters or servlet fail-fast filters.
A quick look at the code makes that contrast easy to see:
Semaphore fairGate = new Semaphore(10, true);
Semaphore nonFairGate = new Semaphore(10, false);

The two gates cap active work at 10. Under contention, the first favors waiting order and the second favors faster handoff freedom. There is no universal winner there. Bulkheads attached to a route with short waits and tight latency goals are commonly left nonfair. Bulkheads where queued callers should move in arrival order can justify fair mode, particularly if timed waiting is part of the admission policy.
There is another subtle point that can surprise people the first time they rely on fair mode. Immediate tryAcquire() does not behave the same way as blocking acquisition with respect to fairness. That means a service built around fast rejection can still see immediate arrivals succeed out of turn if a permit becomes free at just the right moment. For that reason, fairness is most visible when callers are actually waiting, not when the service is built around instant admission decisions.
Conclusion
Bulkhead concurrency limits with semaphores work because they turn shared capacity into a fixed gate that every request or downstream call has to pass through before work begins. That fixed gate is what keeps pressure from spreading past the point you chose. On the servlet side, that can live at request entry in a filter, while WebFlux can apply the same idea at request entry or deeper in a reactive chain when fan-out work needs its own cap. When permit count, release handling, and admission behavior are set carefully, the service stops taking on more in-flight work than that boundary allows, which keeps overload from quietly piling up in the background.