Sampling from a stream gets tricky when items arrive one at a time, the total length is unknown, and storing every item in memory is not practical. You run into that with file lines, database cursors, log events, iterators, and network feeds. Reservoir sampling handles it by keeping only a fixed number of items while still giving every item seen so far the same chance of ending up in the final sample. That balance is what makes it useful for large input or data sources that do not have a known end.
Why Reservoir Sampling Exists
Streaming input creates a hard sampling problem. Items arrive one at a time, the total count may stay unknown, and storing every value can be too expensive. Reservoir sampling was created for exactly that setting. It keeps a small sample in memory while input keeps moving forward, and it does so without giving early or late arrivals an unfair edge. Low memory use and fair selection are the whole reason the method exists.
A Fixed Memory Sample From a Growing Input
Memory pressure is the first issue reservoir sampling handles. Think about a source that keeps handing you values, such as file lines, rows from a database cursor, log entries, or data from a network feed. You can read each item as it arrives, but keeping all of them may not be realistic. Some inputs are huge. Some keep flowing for so long that reading everything first defeats the point of sampling. In that setting, a random sample still has value, while a full copy of the input does not.
Reservoir sampling handles that by picking a sample size up front. Call it k. The method keeps a container with exactly k slots after the first fill phase. The first k items go into those slots directly. After that, every new item faces a decision. It does not automatically enter the sample, and it is not automatically rejected. Instead, it gets a chance to replace one of the stored items.
That replacement rule is what keeps memory fixed. No matter how long the stream gets, the reservoir never grows past k. If k is 100, then the sample storage stays at 100 items after the initial fill. The stream length could be 1,000, 1,000,000, or still rising, and the storage tied to the sample stays the same.
Take a short trace to make that flow easier to follow:
int k = 3;
int[] reservoir = {12, 19, 25};
// item 4 arrives with value 31
// pretend the random draw is 1 from the range 0..3
reservoir[1] = 31; // reservoir becomes [12, 31, 25]
// item 5 arrives with value 44
// pretend the random draw is 4 from the range 0..4
// no replacement happens, reservoir stays [12, 31, 25]
// item 6 arrives with value 58
// pretend the random draw is 0 from the range 0..5
reservoir[0] = 58; // reservoir becomes [58, 31, 25]
That trace captures the flow of the method. The sample does not grow as input grows. It stays fixed, and incoming items compete for a slot. Early values are not locked in place. Later values still get a chance to enter. That balance is why the method stays useful on long input instead of turning into a biased record of only what arrived first.
Storage is only part of the story, though. A method that keeps memory low would not help much if it quietly favored one part of the stream over another. Reservoir sampling was built so the sample stays fair while the stream moves forward. Fairness comes from the replacement probability, not from storing everything first and picking later.
There is also a count-only view that helps make the memory side feel more tangible:
int processed = 0;
int k = 5;
// after reading 5 items
processed = 5; // reservoir size is 5
// after reading 500 items
processed = 500; // reservoir size is still 5
// after reading 50_000 items
processed = 50_000; // reservoir size is still 5
The value in that count view is not the arithmetic itself. What stands out is the fact that processed input keeps rising while reservoir size stays flat. Reservoir sampling does not try to keep a large share of the stream. It tries to keep a fair sample of a chosen size without letting memory grow with the stream length.
Why Every Item Gets the Same Chance
Fairness is the second issue reservoir sampling handles, and this is where the logic gets more interesting. The method is built so every item seen so far has the same final chance to still be in the reservoir. That statement sounds compact, but walking through it slowly helps because the replacement rule can feel unusual at first.
Start with the smallest case, where the reservoir size is 1. The first item becomes the current sample. The second item replaces it with probability 1 / 2. The third item replaces the current sample with probability 1 / 3. The fourth item gets probability 1 / 4, and that same rule keeps going as the stream grows.
That can feel uneven at first because later items get smaller entry chances. The balancing force comes from survival. Early items get into the sample sooner, but they also face more replacement rounds. Later items get fewer entry chances, but they also face fewer chances of being removed after entry. Those two effects balance out.
Take the second item in a stream of five items. It gets selected when it arrives with probability 1 / 2. After that, it has to survive item three, item four, and item five. Its survival chances across those steps are 2 / 3, 3 / 4, and 4 / 5. Multiply those terms and the result is 1 / 5.
Now look at the fifth item. It enters with probability 1 / 5, and there are no later items left to remove it. Its final probability is also 1 / 5. The first, second, third, fourth, and fifth items all end with the same final chance.
Quick numeric checks put numbers on that idea:
double secondItemFinalChance =
(1.0 / 2.0) *
(2.0 / 3.0) *
(3.0 / 4.0) *
(4.0 / 5.0);
double fifthItemFinalChance = 1.0 / 5.0;
System.out.println(secondItemFinalChance); // 0.2
System.out.println(fifthItemFinalChance); // 0.2
Both printed values are 0.2, which is the same as 1 / 5. That is the fairness rule in action for the size 1 case.
Take a reservoir with size k next. For an item arriving at position i after the reservoir has filled, there are i possible random outcomes at that moment, and only k of them lead to entry into the reservoir. So the chance to enter is k / i.
After that item enters, later arrivals can remove it, but not all later arrivals will do that. At position i + 1, a replacement happens with chance k / (i + 1), and only one reservoir slot gets replaced. That means a stored item survives that step with probability i / (i + 1). That same survival factor keeps repeating as more items arrive.
By the time the stream reaches position n, the full probability for an item that arrived at position i becomes
(k / i) × (i / (i + 1)) × ((i + 1) / (i + 2)) × ... × ((n - 1) / n)
Most of the middle terms cancel, leaving k / n.
That cancellation is the heart of the method. A later item gets a smaller entry chance, but it also faces fewer future chances to be removed. An earlier item gets into the reservoir sooner, but it faces more future chances to be replaced. The arithmetic balances those effects so the final probability stays the same.
The first k items follow that same logic. They enter immediately because the reservoir still has open space. After the reservoir fills, they begin facing replacement risk from later arrivals. By the time the stream ends at position n, each of those first k items also ends with final probability k / n. So the early part of the stream and the late part of the stream finish on equal footing.
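The telescoping product is easy to check numerically. This sketch builds the entry chance and the survival factors term by term and compares the result against k / n; TelescopingCheck and finalChance are illustrative names, not part of any library:

```java
public final class TelescopingCheck {

    // Final in-sample probability for an item arriving at position i,
    // in a stream of n items with reservoir size k, built term by term:
    // entry chance k / i, then one survival factor j / (j + 1) for each
    // later arrival at position j + 1.
    static double finalChance(int k, int i, int n) {
        double p = (double) k / i;          // entry at position i
        for (int j = i; j < n; j++) {
            p *= (double) j / (j + 1);      // survive the arrival at j + 1
        }
        return p;
    }

    public static void main(String[] args) {
        int k = 3;
        int n = 10;
        // Every position after the fill phase ends at the same value.
        for (int i = k + 1; i <= n; i++) {
            System.out.printf("item %d: %.4f%n", i, finalChance(k, i, n));
        }
        System.out.println("target k / n = " + (double) k / n);
    }
}
```

For k = 3 and n = 10, each printed value matches k / n = 0.3 up to floating point rounding, regardless of the arrival position.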
That is why random replacement is not a rough shortcut. It is the reason the sample stays fair. Fixed-size storage by itself would not be enough. Fair storage comes from letting new arrivals challenge older entries with exactly the right probability at each step.
Reservoir Sampling in Modern Java
Java fits reservoir sampling well because the language already gives you the parts this algorithm needs. You need sequential access to incoming values, a bounded random draw after the reservoir fills, and a place to hold only the chosen sample. Current Java covers that through Iterator, generic collections, and the RandomGenerator family. An implementation can stay fairly small, but the details still deserve careful attention, particularly around counting, bounded random selection, and the generator passed into the method.
Sample of One Item
Start with the smallest form first. Reservoir sampling with size 1 keeps only one current choice while the source moves forward. The first item becomes the current value right away. After that, each new item gets a replacement chance of 1 / n, where n is the number of items seen so far. In Java, that maps nicely to a bounded random draw. If the random value comes back as zero, the new item replaces the stored item. If not, the stored item stays in place. RandomGenerator fits well here because it gives bounded methods such as nextLong(long bound), and that method returns a value from zero inclusive up to the bound exclusive.
This compact implementation keeps the full rule in one place:
import java.util.Iterator;
import java.util.Optional;
import java.util.random.RandomGenerator;
public final class SingleItemSampler {
private SingleItemSampler() {
}
public static <T> Optional<T> sampleOne(Iterator<T> source, RandomGenerator random) {
T chosen = null;
long seen = 0;
while (source.hasNext()) {
T item = source.next();
seen++;
if (random.nextLong(seen) == 0) {
chosen = item;
}
}
return seen == 0 ? Optional.empty() : Optional.of(chosen);
}
}
The seen counter is doing more than keeping a tally. It is part of the probability rule itself. On item 1, nextLong(1) can only return 0, so the first item is stored. On item 2, the method draws from 0 through 1, so replacement happens half the time. On item 3, replacement happens a third of the time. That same flow continues through the entire input. Returning Optional.empty() for an empty source also keeps the result honest instead of inventing a placeholder value that never appeared.
Take a short call site to make that flow easier to read:
import java.util.List;
import java.util.SplittableRandom;
import java.util.random.RandomGenerator;
List<String> events = List.of("login", "search", "checkout", "logout");
RandomGenerator random = new SplittableRandom(2026L);
var chosen = SingleItemSampler.sampleOne(events.iterator(), random);
System.out.println(chosen.orElse("no event"));Passing the generator in from the outside keeps the sampling method focused on the replacement rule rather than generator creation. That also gives you repeatable runs during tests when you use a fixed seed, or thread-local generation when sampling happens inside concurrent code. SplittableRandom(long seed) is handy for repeatable output, while RandomGenerator keeps the method open to several generator types without changing the algorithm.
Sample of k Items
Move from size 1 to size k, and the structure gets wider without changing the probability rule. The first k items fill the reservoir directly. After the reservoir is full, each new item draws a random number bounded by the total count of items seen so far. If the drawn number falls inside the reservoir range, replacement happens at that slot. If the number falls outside that range, the new item is skipped. Java collections fit this naturally, and ArrayList is a strong choice here because indexed replacement through set is direct.
This generic method covers that version:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.random.RandomGenerator;
public final class ReservoirSampler {
private ReservoirSampler() {
}
public static <T> List<T> sample(Iterator<T> source, int k, RandomGenerator random) {
if (k < 0) {
throw new IllegalArgumentException("k must be at least 0");
}
List<T> reservoir = new ArrayList<>(k);
long seen = 0;
while (source.hasNext() && reservoir.size() < k) {
reservoir.add(source.next());
seen++;
}
while (source.hasNext()) {
T item = source.next();
seen++;
long slot = random.nextLong(seen);
if (slot < k) {
reservoir.set((int) slot, item);
}
}
return reservoir;
}
}
Two counters are active in that method. reservoir.size() tells you how full storage is during the first fill phase. seen tracks how many items have passed through the sampler in total. That second value is the part tied to probability. If seen is 10 and k is 3, then only three random outcomes out of the ten possible values from nextLong(10) lead to entry into the reservoir. That gives the current item its correct 3 / 10 entry chance.
Choosing long for seen is helpful for long-running input. The reservoir still relies on indexed collection access, so k remains an int, but the count of processed items does not have to stop at Integer.MAX_VALUE. The cast from slot to int is safe here because the cast only happens after the code checks slot < k, and k already fits in int.
Edge behavior deserves a short note too. If the source ends before k items arrive, the method returns all available items. That matches the problem naturally because there is no way to return a sample of ten distinct items from an input that only had six.
Sampling from file lines gives this method a natural place to run:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.SplittableRandom;
Path logFile = Path.of("events.log");
try (var lines = Files.lines(logFile)) {
List<String> picked = ReservoirSampler.sample(
lines.iterator(),
5,
new SplittableRandom(9001L)
);
for (String line : picked) {
System.out.println(line);
}
}
That example stays close to what reservoir sampling is built for. The file can be read as a stream of lines, the sampler stores only five of them, and the method does not need the full line count ahead of time. Files.lines also makes the sequential nature of the source very visible, which pairs nicely with a one-pass sampling method.
Picking a Generator in Current Java
Generator choice affects the call site more than the reservoir logic itself, but it still deserves attention because current Java gives you several strong options. RandomGenerator is the shared interface, so the sampling method can accept that type and stay open to several generator classes. That keeps the algorithm from being tied to any single generator implementation.
ThreadLocalRandom is a strong fit when each thread is sampling independently. ThreadLocalRandom.current() returns the generator for the current thread, which avoids sharing a single mutable generator across threads. That is useful when sampling runs inside concurrent code and each thread has its own input flow. SplittableRandom is also a very good fit. It was built with independent random generation in mind, and it supports splitting into new generators that can continue separately. It also has a constructor that takes a seed, which is useful when you want repeatable output during tests or while checking example runs.
A short comparison at the call site makes those choices easier to see:
import java.util.List;
import java.util.SplittableRandom;
import java.util.concurrent.ThreadLocalRandom;
import java.util.random.RandomGenerator;
List<Integer> values = List.of(3, 8, 13, 21, 34, 55, 89);
RandomGenerator seeded = new SplittableRandom(13579L);
List<Integer> seededSample = ReservoirSampler.sample(values.iterator(), 3, seeded);
RandomGenerator perThread = ThreadLocalRandom.current();
List<Integer> threadSample = ReservoirSampler.sample(values.iterator(), 3, perThread);
The first call is repeatable for the same seed within the same generator class, which is useful in tests. The second call suits per-thread activity better. Neither choice changes the reservoir algorithm itself. The algorithm only needs bounded random selection through the RandomGenerator API.
Security-sensitive code can call for a different generator family. SecureRandom belongs in that category. Because SecureRandom extends Random, and Random implements RandomGenerator, it can also be passed to the same method signature when that kind of generator is needed.
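A short sketch shows a secure generator fitting the same interface type and supporting the bounded draw the sampler relies on; SecureDrawDemo is an illustrative name:

```java
import java.security.SecureRandom;
import java.util.random.RandomGenerator;

public final class SecureDrawDemo {
    public static void main(String[] args) {
        // SecureRandom extends Random, and Random implements
        // RandomGenerator, so the assignment needs no adapter.
        RandomGenerator secure = new SecureRandom();
        // The same bounded draw the sampler uses: 0 inclusive to 10 exclusive.
        long draw = secure.nextLong(10);
        System.out.println(draw >= 0 && draw < 10); // true
    }
}
```

Secure generators trade speed for stronger randomness, so this choice mostly makes sense when the sample itself carries security weight.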
Cost on Large Input
Performance is one reason reservoir sampling remains attractive in Java. The sampler reads each item exactly one time in the basic form, so time cost rises linearly with the number of processed items. Storage stays tied to the sample size k, not to the full input length. In big-O terms, that gives O(n) time for a pass over n items and O(k) space for the reservoir. If the sample size is 1, storage drops to O(1).
What counts in practice is that the method does not need random access into the original source and does not need to buffer the full stream before deciding. Iterator<T> is enough. That matches file lines, cursor-backed results, stream iterators, and any other source that is already moving forward item by item.
There is also a useful boundary to keep in mind. Reservoir sampling gives a uniform sample from the sequence that actually reaches the sampler. If earlier filtering has already changed that sequence, the sampler will faithfully sample the filtered sequence rather than the raw source. So the fairness guarantee belongs to the data that entered the algorithm, not to data that never reached it.
Very large sampling jobs have inspired skip-based variants that reduce per-item random decisions. Those versions belong to the broader family, but the plain reservoir replacement rule is still the best starting point for most Java articles and codebases because it is short, readable, and lines up well with the standard library tools already in place.
Conclusion
Reservoir sampling comes down to a small set of moving parts. You keep a fixed reservoir, fill it with the first k items, then let each later item compete for a slot based on how far into the stream it arrived. That replacement rule is what keeps memory flat while still giving every seen item the same final chance to remain in the sample. Java fits that flow well through Iterator, bounded random draws, and a small collection for storage, which is why the method holds up so well for long streams and unknown input sizes.

