Looking through text for several search terms at the same time gets expensive fast if every term starts a fresh scan. Aho-Corasick avoids that by building one finite-state matcher from the full keyword set, then moving through the text from left to right in one pass. In the original paper, the matcher is built around three functions named goto, failure, and output, with the goal of finding every hit for every keyword, including overlaps, without jumping back to the start after a miss. In Java, that usually means building a trie, linking states with failure links, then advancing through the input with one current state while reporting hits as characters arrive. String positions in Java are based on UTF-16 code units, so a matcher built on char values follows that same rule unless it is rewritten around Unicode code points.
Building the Matcher
Before we scan any text, we need a structure that can remember partial matches without falling all the way back to the root after every miss. That build phase is what gives Aho-Corasick its value. Instead of treating every search term as a separate search, we fold the full keyword set into one shared structure. Prefixes that begin with the same characters live on the same branch, which cuts repeated storage and gives us a state we can continue from later. After that shared structure is in place, failure links connect related states so partial progress is not thrown away after a mismatch.
What we build first is not the search loop itself. We start by storing the keyword set as prefixes in a trie. After that, we connect those trie states with failure links so the matcher can fall back to the best shorter prefix already known to it. That order matters: we first store the keywords, then we attach the links that let the machine recover from misses.
Trie States
Everything begins with the trie because we need a place to store the search terms in a shared form. If one term is he and another is hers, the opening h and he portion does not need to be stored twice. One branch can cover both. That is the first big idea behind Aho-Corasick. States are based on prefixes, not only on complete words. Every node stands for the text seen so far while moving from the root down to that node.
Take a small keyword set such as he, hers, his, and she. The root has no character attached to it. Moving from the root through h leads to the prefix h. Moving from that state through e leads to he. If a word ends there, the node has to remember that fact. Then, if the same branch continues into r and s, the trie can store hers without rebuilding the earlier characters. Shared prefixes are the reason the structure stays compact relative to storing every term as a separate chain with no overlap.
What helps most at this stage is thinking about a trie state as the full prefix that got us there, not just the current character. That detail becomes important later, because suffix links are based on the full prefix represented by the state. If a node stands for she, the machine has to remember that he is a suffix worth caring about. That suffix relation belongs to the next subsection, but the trie has to exist first before those links can be attached.
In Java, a node usually stores three pieces of data. We need outgoing transitions to child states, a place to store completed keywords that end at the node, and a slot for the failure link that we attach after the trie is built. We can start with a small node type:
import java.util.*;
final class Node {
final Map<Character, Node> next = new HashMap<>();
final List<String> outputs = new ArrayList<>();
Node failure;
}
next stores outgoing transitions by character. outputs records which keywords end at that node. failure stays unset at first because those links depend on the full trie already being present.
Storage choice affects speed and memory use, but the trie idea stays the same. Map<Character, Node> keeps the structure readable and handles sparse alphabets well. If we know ahead of time that the keyword set contains only lowercase English letters, an array can remove hash lookups and trim object overhead. That choice ties the node to a narrow alphabet, so a map-based version is a good fit for teaching the core mechanics.
Building the trie means taking each keyword one character at a time and either following an existing edge or creating a new one. At the last character, the node marks that the keyword ends there. That does not mean the branch stops. Longer words can continue past the same state. he can end at one node while hers keeps going through more transitions.
We can build that insertion step like this:
public final class KeywordTrie {
private final Node root = new Node();
public void addKeyword(String word) {
Objects.requireNonNull(word, "word");
if (word.isEmpty()) {
throw new IllegalArgumentException("Empty keywords are not supported.");
}
Node current = root;
for (int i = 0; i < word.length(); i++) {
char ch = word.charAt(i);
current = current.next.computeIfAbsent(ch, unused -> new Node());
}
current.outputs.add(word);
}
public Node root() {
return root;
}
}
Walking through addKeyword lets us see what the trie is really storing. If we insert he first, the trie gets a child for h, then a child for e, and the final node records he in outputs. If we insert hers after that, we reuse the existing h and e states, then add r and s below them. Prefix reuse is exactly why tries fit multi-keyword search so well.
Storage cost for the trie is usually written as O(P), where P is the total number of characters across all keywords. That summary comes from the fact that each inserted character creates at most one new state and one connecting edge. Shared prefixes can lower the actual state count below the raw character total, but O(P) is the usual upper bound. Put another way, trie size grows with the total keyword content, not with the input text we will scan later.
We should also settle the empty-keyword case up front. Most implementations reject empty strings. An empty keyword would match at every position, which changes match reporting right away and adds a special case to nearly every later step. Rejecting empty strings keeps match reporting tied to characters we have actually consumed.
Because Java String indexing is based on UTF-16 code units, a char based trie follows the same rule. Supplementary Unicode characters take two char positions rather than one code point. For ordinary ASCII and a large share of day-to-day text, that does not cause trouble. If full Unicode code points need to be treated as single units, the trie has to be built from code points rather than char.
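The distinction is easy to verify with the standard length() and codePointCount(int, int) methods on String:

```java
public class CodeUnitDemo {
    public static void main(String[] args) {
        // U+1D11E (musical G clef) is a supplementary character:
        // one Unicode code point, but two UTF-16 code units in a String.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                          // 2 char positions
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        System.out.println("he".length());                          // 2: ASCII chars are one unit each
    }
}
```

A char based trie would store that clef as two separate transitions, which is usually harmless but worth knowing before relying on reported indices.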
Sometimes it helps to inspect what we inserted before moving on to failure links. A small helper can print terminal prefixes after the trie is built:
public static void printPrefixes(Node node, String prefix) {
if (!node.outputs.isEmpty()) {
System.out.println(prefix + " -> " + node.outputs);
}
for (Map.Entry<Character, Node> entry : node.next.entrySet()) {
printPrefixes(entry.getValue(), prefix + entry.getKey());
}
}
Running that after inserting he, hers, his, and she lets us verify that shorter words and longer words can share part of the same branch without conflict. By the end of this stage, we have a trie whose states stand for prefixes from the keyword set. That gives us the storage layer we need, but it does not yet tell the matcher where to fall back after a mismatch.
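A small standalone check makes the prefix sharing concrete. The Node and addKeyword bodies below restate the code from this section so the demo compiles on its own; the TrieDemo wrapper name is a hypothetical addition for this sketch:

```java
import java.util.*;

// Compact restatement of Node and addKeyword from this section,
// repeated here so the demo compiles standalone.
public class TrieDemo {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();
    }

    static void addKeyword(Node root, String word) {
        Node current = root;
        for (int i = 0; i < word.length(); i++) {
            current = current.next.computeIfAbsent(word.charAt(i), unused -> new Node());
        }
        current.outputs.add(word);
    }

    public static void main(String[] args) {
        Node root = new Node();
        for (String word : new String[] {"he", "hers", "his", "she"}) {
            addKeyword(root, word);
        }
        // Only h and s hang off the root: four keywords, two branches.
        System.out.println(root.next.size());                          // 2
        Node he = root.next.get('h').next.get('e');
        System.out.println(he.outputs);                                // [he]: a keyword ends mid-branch
        System.out.println(he.next.containsKey('r'));                  // true: hers continues past it
    }
}
```

The node for he both reports a finished keyword and carries an outgoing edge, which is exactly the "ending does not stop the branch" behavior described above.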
Failure Links
Once the trie exists, every state knows what prefix it stands for, but it still does not know where to go after a mismatch. Without failure links, the matcher would jump back to the root too frequently and lose partial progress that could still help. Failure links fix that by connecting a state to the longest proper suffix of its prefix that is also present as a trie prefix.
Start with the state for hers. If a later step needs to fall back from that state, we do not want to throw away all four letters right away. The suffixes of hers are ers, rs, and s. Out of those, the only suffix that is also a trie prefix is s, which holds here because she begins with s. That means the failure link from hers points to the state for s. The point is to keep the best still-valid prefix available after a mismatch.
The same idea becomes more interesting with a state such as she. Proper suffixes there are he and e. If he exists as a trie prefix, then the failure link from she points to the state for he. That is why overlapping matches fall out naturally. The machine does not forget that he is already a meaningful suffix inside she. Through that suffix relation, the matcher can later report more than one keyword ending at the same text position.
Output handling belongs here too because it is attached directly to failure links. If the state for she falls back to the state for he, then a scan arriving at she has also matched he. One common choice is to copy the outputs from the failure target into the current node during the build phase. With that choice, the node for she can directly report both she and he later, without walking the failure chain every time a match is found.
Before we look at full link construction, it helps to isolate the suffix idea in a small helper:
public static String longestProperSuffixPrefix(String value, Set<String> prefixes) {
for (int i = 1; i < value.length(); i++) {
String suffix = value.substring(i);
if (prefixes.contains(suffix)) {
return suffix;
}
}
return "";
}
If prefixes contains h, he, her, hers, s, sh, and she, then calling that helper with she returns he, while calling it with hers returns s. That is the exact relation the failure link is meant to encode.
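Running the helper against that prefix set confirms the relation. The SuffixDemo wrapper restates the method so the check compiles on its own:

```java
import java.util.Set;

public class SuffixDemo {
    // Restated from this section so the demo compiles standalone.
    static String longestProperSuffixPrefix(String value, Set<String> prefixes) {
        for (int i = 1; i < value.length(); i++) {
            String suffix = value.substring(i);
            if (prefixes.contains(suffix)) {
                return suffix;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Set<String> prefixes = Set.of("h", "he", "her", "hers", "s", "sh", "she");
        System.out.println(longestProperSuffixPrefix("she", prefixes));   // he
        System.out.println(longestProperSuffixPrefix("hers", prefixes));  // s
        System.out.println(longestProperSuffixPrefix("h", prefixes));     // empty: no proper suffix exists
    }
}
```

Starting the loop at index 1 is what makes the suffix proper: the state never falls back to its own full prefix.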
Breadth-first traversal is the usual way to attach those links because shallower states have to be finished before deeper states can refer to them. Direct children of the root form the first easy case. Their failure link points back to the root because there is no shorter proper suffix to use. From there, we move outward level by level. For each child reached by character ch, we check the parent’s failure link and keep following failure links until we find a state that has an outgoing ch edge or we arrive back at the root.
We can isolate that fallback search in one helper:
static Node findFailureTarget(Node root, Node parentFailure, char ch) {
Node fallback = parentFailure;
while (fallback != root && !fallback.next.containsKey(ch)) {
fallback = fallback.failure;
}
if (fallback.next.containsKey(ch)) {
return fallback.next.get(ch);
}
return root;
}
Walking through that helper makes the rule easier to follow. Suppose the current parent state stands for sh and the child edge is e, so the child state stands for she. If the parent failure points to h, and h has an outgoing e edge, then the child failure becomes he. If no usable edge is found while we follow fallback links, the root becomes the failure target.
What matters most in that step is that the failure link depends on the full prefix represented by the state, not only on the current character. Two states ending in the same letter can have different failure targets because their full prefixes differ. That is why trie states have to be read as prefix strings, not as isolated character labels.
Output inheritance is the last piece to attach. If she fails to he, then the he keyword ends at the same final character as she. Copying outputs during link construction means the later reporting phase does not need extra suffix checks. The machine already knows which keyword endings belong at the state. We can attach inherited outputs with a helper:
static void inheritOutputs(Node node) {
if (node.failure != null && !node.failure.outputs.isEmpty()) {
node.outputs.addAll(node.failure.outputs);
}
}
That merge changes what a node can report without changing any trie edges. Prefix storage still comes from the trie itself. Suffix awareness comes from failure links. Reported keyword endings come from the node’s own terminal status along with any endings inherited from the failure target.
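With both helpers in hand, a breadth-first driver can attach every link in one pass. The sketch below restates the Node type and the two helpers so it compiles on its own; the buildFailureLinks and NodeFailureBuild names are hypothetical additions for this demo:

```java
import java.util.*;

public class NodeFailureBuild {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();
        Node failure;
    }

    // Restated from this section.
    static Node findFailureTarget(Node root, Node parentFailure, char ch) {
        Node fallback = parentFailure;
        while (fallback != root && !fallback.next.containsKey(ch)) {
            fallback = fallback.failure;
        }
        Node target = fallback.next.get(ch);
        return (target != null) ? target : root;
    }

    // Restated from this section.
    static void inheritOutputs(Node node) {
        if (node.failure != null && !node.failure.outputs.isEmpty()) {
            node.outputs.addAll(node.failure.outputs);
        }
    }

    // Breadth-first pass: shallow states get their links before deeper states need them.
    static void buildFailureLinks(Node root) {
        ArrayDeque<Node> queue = new ArrayDeque<>();
        root.failure = root;
        for (Node child : root.next.values()) {
            child.failure = root;
            queue.addLast(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.removeFirst();
            for (Map.Entry<Character, Node> edge : current.next.entrySet()) {
                Node child = edge.getValue();
                child.failure = findFailureTarget(root, current.failure, edge.getKey());
                inheritOutputs(child);
                queue.addLast(child);
            }
        }
    }

    public static void main(String[] args) {
        Node root = new Node();
        for (String word : new String[] {"he", "hers", "his", "she"}) {
            Node current = root;
            for (char ch : word.toCharArray()) {
                current = current.next.computeIfAbsent(ch, unused -> new Node());
            }
            current.outputs.add(word);
        }
        buildFailureLinks(root);
        Node she = root.next.get('s').next.get('h').next.get('e');
        Node he = root.next.get('h').next.get('e');
        System.out.println(she.failure == he);   // true: she falls back to he
        System.out.println(she.outputs);         // [she, he]: inherited from the failure target
    }
}
```

The driver only dequeues a state after its own failure link is already set, which is exactly why the level-by-level order is required.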
Build cost for creating the trie and attaching failure links is O(P), where P is the total number of characters across all keywords. The base machine also uses O(P) space for trie states, edges, and failure links. In this version, inherited outputs are copied into states during construction, so those copied entries also count toward the final cost. If R is the total number of stored output entries after inheritance, this implementation uses O(P + R) build time and O(P + R) space.
By the time this phase finishes, the keyword set has turned into more than a tree. It has become a connected state machine where every state knows how to continue on a matching character and how to fall back to the best shorter prefix after a miss. That completes the matcher-building phase and leaves the structure ready for the text scan later.
Running the Machine in Java
With the trie and failure links already built, we can turn that structure into a matcher that moves through a String from left to right without restarting for every search term. During the scan, we keep very little live state. We track the current node, read the next input character, and report whatever term endings belong to the node we reach. From there, Java code becomes a matter of storing transitions in a form we can read quickly, filling failure links in breadth-first order, then reporting matches at the moment the scan reaches a term ending. The classic Aho-Corasick machine is still centered on goto, failure, and output, while Java’s built-in substring search stays centered on literal search methods such as indexOf() and contains().
A Java Node Layout
For Java code, node storage has a direct effect on readability, allocation count, and transition lookup cost. A frequent starting point is Map<Character, State> for outgoing edges, because it handles sparse alphabets without forcing us to reserve space for characters that never appear. That keeps the state structure readable and flexible, which is a good fit for a first implementation that teaches how the machine moves. We also need to remember that String indexing in Java is based on UTF-16 code units, so a char based matcher follows that same rule unless we rebuild it around Unicode code points.
We can package runtime data into a state type that stores outgoing transitions, a failure link, and a small record of which search terms end at that state. Storing ids instead of full strings trims repeated references when the term text already lives elsewhere:
import java.util.*;
final class State {
final Map<Character, State> next = new HashMap<>();
State failure;
int[] outputIds = new int[0];
void addOutputId(int id) {
int[] grown = Arrays.copyOf(outputIds, outputIds.length + 1);
grown[grown.length - 1] = id;
outputIds = grown;
}
}
In that layout, next holds the normal transitions, failure points to the fallback state, and outputIds tells us which term endings belong to the node. During scanning, that means we do not have to walk backward through the trie to ask what ended at the current position. The state already carries that information.
Storage can change when the alphabet is tightly bounded. If we know the matcher will only read lowercase English letters, an array-backed transition table removes hash lookups and avoids Character boxing. That cuts constant costs, but it also ties the code to a narrow input set. A map-backed state is more forgiving, while an array-backed state can be faster for restricted input because moving from one state to the next becomes direct array indexing.
We can make a narrower version like this:
final class LowercaseState {
final LowercaseState[] next = new LowercaseState[26];
LowercaseState failure;
int[] outputIds = new int[0];
static int slot(char ch) {
return ch - 'a';
}
}
That form only fits when the input alphabet is tightly controlled. Broader input is handled by a map-based version with far less rewriting. Either way, the machine logic stays the same. We still try the normal transition first, then follow failure links if the needed edge is missing.
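A short sketch shows what the narrower layout buys: transitions become plain array reads. The insert helper and LowercaseDemo wrapper are hypothetical names; the LowercaseState body restates the class above:

```java
public class LowercaseDemo {
    // Restated from this section so the sketch compiles standalone.
    static final class LowercaseState {
        final LowercaseState[] next = new LowercaseState[26];
        LowercaseState failure;
        int[] outputIds = new int[0];

        static int slot(char ch) {
            return ch - 'a';
        }
    }

    // Hypothetical insertion helper: array indexing replaces map lookups.
    static void insert(LowercaseState root, String term) {
        LowercaseState current = root;
        for (int i = 0; i < term.length(); i++) {
            int slot = LowercaseState.slot(term.charAt(i));
            if (current.next[slot] == null) {
                current.next[slot] = new LowercaseState();
            }
            current = current.next[slot];
        }
    }

    public static void main(String[] args) {
        LowercaseState root = new LowercaseState();
        insert(root, "he");
        // Following a transition is one array read: no hashing, no Character boxing.
        System.out.println(root.next[LowercaseState.slot('h')] != null);  // true
        System.out.println(root.next[LowercaseState.slot('x')] == null);  // true
    }
}
```

Note that slot() silently assumes its input is between 'a' and 'z'; real code for this layout would validate or reject anything outside that range.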
Runtime layout also affects how we think about reported positions. If we scan a Java String by char, then every reported start and end index refers to UTF-16 code units. Supplementary Unicode characters therefore occupy two positions in the string. For ASCII and a large share of everyday text, that detail does not cause trouble. If full Unicode code points need to be treated as single units, we would rebuild the matcher around code points rather than char.
Building Links with Breadth-First Search
After the trie states already exist, breadth-first traversal is the usual way to fill failure links because shallow states have to be ready before deeper states can point back to them. Children of the root form the first easy case. Their failure link goes to the root. From there, we move outward layer by layer and attach links for depth two, then depth three, and so on. That ordering matters because a deeper state can only fall back to prefixes that were already processed earlier.
What we are doing during that pass stays consistent from node to node. We take a parent state, look at one outgoing character, then ask where the parent would fall back if that character could not continue from its current location. We keep following failure links until we find a state that has a usable edge for the same character or we return to the root. Breadth-first order gives us those earlier failure values in time for the next layer.
Queue-based Java code mirrors that plan:
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Map;
public static void buildFailures(State root) {
ArrayDeque<State> queue = new ArrayDeque<>();
root.failure = root;
for (State child : root.next.values()) {
child.failure = root;
queue.addLast(child);
}
while (!queue.isEmpty()) {
State current = queue.removeFirst();
for (Map.Entry<Character, State> edge : current.next.entrySet()) {
char ch = edge.getKey();
State child = edge.getValue();
State fallback = current.failure;
while (fallback != root && !fallback.next.containsKey(ch)) {
fallback = fallback.failure;
}
if (fallback.next.containsKey(ch)) {
child.failure = fallback.next.get(ch);
} else {
child.failure = root;
}
if (child.failure.outputIds.length != 0) {
int oldLen = child.outputIds.length;
child.outputIds = Arrays.copyOf(
child.outputIds,
oldLen + child.failure.outputIds.length
);
System.arraycopy(
child.failure.outputIds,
0,
child.outputIds,
oldLen,
child.failure.outputIds.length
);
}
queue.addLast(child);
}
}
}
What deserves attention in that pass is not only the failure pointer itself. Output inheritance is folded into the build step too. When a state fails to an earlier state that already marks completed search terms, those term ids belong to the current state as well. Copying those inherited outputs during breadth-first construction keeps the later scan loop much smaller, because match reporting becomes a direct read from the current state rather than a separate trip through the failure chain.
Build cost for creating the trie and attaching failure links is O(P), where P is the total number of characters across all search terms. The base machine also uses O(P) space for trie states, transitions, and failure links. This implementation copies inherited output ids into states during breadth-first construction, so those copied ids also count toward the final cost. If R is the total number of stored output ids after inheritance, this version uses O(P + R) build time and O(P + R) space.
Scanning the Text
With the machine ready, scanning turns into one left-to-right pass through the input text. We keep a current state, read the next character, follow a normal transition if it exists, and fall back through failure links if it does not. After the transition settles, every search term recorded in the state’s output list ends at the current text index. That is the full runtime loop in compact form.
Java code for that scan can stay fairly small:
import java.util.ArrayList;
import java.util.List;
record MatchHit(int start, int endExclusive, int termId) {}
public static List<MatchHit> scan(State root, String text, String[] terms) {
List<MatchHit> hits = new ArrayList<>();
State current = root;
for (int i = 0; i < text.length(); i++) {
char ch = text.charAt(i);
while (current != root && !current.next.containsKey(ch)) {
current = current.failure;
}
State nextState = current.next.get(ch);
current = (nextState != null) ? nextState : root;
for (int termId : current.outputIds) {
int start = i - terms[termId].length() + 1;
hits.add(new MatchHit(start, i + 1, termId));
}
}
return hits;
}
Reading through that loop in order helps fix the mechanics in place. We begin at the root. For each char, we try the next transition. If that edge is missing, we keep following failure until we either find a usable edge or return to the root. After we land on the next state, we emit every term id stored there. Reported positions are based on the current text index and the matched term length, so the start index is computed from the ending position we just reached.
Two runtime details deserve extra attention. Overlapping matches come out naturally because a single state can carry more than one output id. If that happens, we report all of them at the same text position. The other detail is that the scan never rewinds the input. Only the machine state moves backward through failure links. The text index keeps moving forward by one char at a time, which is why the non-output part of the scan stays linear in the length of the input.
A print helper can make emitted matches easier to inspect while we test the scan:
public static void printHits(List<MatchHit> hits, String[] terms) {
for (MatchHit hit : hits) {
System.out.println(
terms[hit.termId()] +
" [" + hit.start() + ", " + hit.endExclusive() + ")"
);
}
}
Now we can read what the matcher reports without digging through raw ids. If the text is ushers and the term set includes he, she, his, and hers, then the machine can report she and he at the same ending position, followed later by hers. That falls out of the state outputs we built earlier, not from extra substring checks after every character.
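To tie the runtime pieces together, here is a standalone sketch that builds the machine for he, she, his, and hers and scans ushers. The State, buildFailures, and scan bodies restate code from this section; the buildTrie helper and the AhoCorasickDemo wrapper are hypothetical names added for the demo:

```java
import java.util.*;

public class AhoCorasickDemo {
    static final class State {
        final Map<Character, State> next = new HashMap<>();
        State failure;
        int[] outputIds = new int[0];

        void addOutputId(int id) {
            int[] grown = Arrays.copyOf(outputIds, outputIds.length + 1);
            grown[grown.length - 1] = id;
            outputIds = grown;
        }
    }

    record MatchHit(int start, int endExclusive, int termId) {}

    // Hypothetical helper: insert each term and record its id at the terminal state.
    static State buildTrie(String[] terms) {
        State root = new State();
        for (int id = 0; id < terms.length; id++) {
            State current = root;
            for (int i = 0; i < terms[id].length(); i++) {
                current = current.next.computeIfAbsent(terms[id].charAt(i), unused -> new State());
            }
            current.addOutputId(id);
        }
        return root;
    }

    // Breadth-first failure links with output inheritance, as in this section.
    static void buildFailures(State root) {
        ArrayDeque<State> queue = new ArrayDeque<>();
        root.failure = root;
        for (State child : root.next.values()) {
            child.failure = root;
            queue.addLast(child);
        }
        while (!queue.isEmpty()) {
            State current = queue.removeFirst();
            for (Map.Entry<Character, State> edge : current.next.entrySet()) {
                char ch = edge.getKey();
                State child = edge.getValue();
                State fallback = current.failure;
                while (fallback != root && !fallback.next.containsKey(ch)) {
                    fallback = fallback.failure;
                }
                State target = fallback.next.get(ch);
                child.failure = (target != null) ? target : root;
                for (int id : child.failure.outputIds) {
                    child.addOutputId(id);
                }
                queue.addLast(child);
            }
        }
    }

    // One left-to-right pass: the text index never rewinds.
    static List<MatchHit> scan(State root, String text, String[] terms) {
        List<MatchHit> hits = new ArrayList<>();
        State current = root;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            while (current != root && !current.next.containsKey(ch)) {
                current = current.failure;
            }
            State nextState = current.next.get(ch);
            current = (nextState != null) ? nextState : root;
            for (int termId : current.outputIds) {
                hits.add(new MatchHit(i - terms[termId].length() + 1, i + 1, termId));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] terms = {"he", "she", "his", "hers"};
        State root = buildTrie(terms);
        buildFailures(root);
        for (MatchHit hit : scan(root, "ushers", terms)) {
            System.out.println(terms[hit.termId()] + " [" + hit.start() + ", " + hit.endExclusive() + ")");
        }
    }
}
```

On ushers this reports she at [1, 4), he at [2, 4), and then hers at [2, 6): two overlapping hits at one ending position, plus a later hit, all from a single pass.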
Search cost is usually considered O(T + M), where T is the input text length and M is the number of reported matches. Put plainly, scanning the text itself stays linear in text size, and reporting adds time only for the hits we actually emit. Just like the node layout discussion earlier, the reported indices are still UTF-16 code unit positions if we scan the String by char.
Choosing Built-In Search or Aho-Corasick
Not every substring search job needs a prebuilt machine. In current Java, contains(CharSequence) is a thin wrapper over indexOf, and indexOf(String) goes through internal literal-search logic inside String. That means Java already gives us a tuned path for literal substring search with no machine construction cost up front.
That changes which tool fits the job. If we are checking for a single literal substring, or only a very small set of literals, repeated calls to indexOf() can be faster in practice because we skip trie creation, failure-link construction, and output storage. contains() fits cases where we only need a yes-or-no answer for one literal substring. There is no benefit in building an Aho-Corasick machine to answer a one-off question such as whether a line contains ERROR.
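For contrast, collecting every occurrence of one literal with built-in search needs nothing more than a loop over indexOf(String, int); the allOccurrences helper name is an assumption for this sketch:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfDemo {
    // Every start index of one literal, using only built-in search.
    static List<Integer> allOccurrences(String text, String literal) {
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        while (true) {
            int at = text.indexOf(literal, from);
            if (at < 0) {
                break;
            }
            starts.add(at);
            from = at + 1;  // advance by one so overlapping hits are still found
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(allOccurrences("abababa", "aba"));      // [0, 2, 4]
        System.out.println("line with ERROR".contains("ERROR"));   // true
    }
}
```

No structure is built up front, which is exactly why this wins for one-off checks and loses once the same pass has to cover many terms at once.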
Aho-Corasick starts to pay for itself when the job changes from literal search to multi-term scanning. If we need to search for dozens or hundreds of fixed terms across large text and report every hit, building the machine up front can pay off because the later scan stays near linear in the text length plus emitted matches. That is also where overlapping matches become important. Built-in literal search is excellent for literal lookup. Aho-Corasick is built for searching a reusable term set in one pass.
Java search choices are best framed around the search need in front of you. indexOf() fits when you have one target substring, while contains() reads well when the only question is yes or no for one literal substring. For a reusable set of fixed search terms, Aho-Corasick fits full hit reporting, including overlaps, from one pass through the text. The better choice comes from that search need, not from a blanket rule that one tool always wins.
Conclusion
Aho-Corasick builds a trie from the full search set, links related states with failure links, then moves through the text with a single current state while reporting matches at the point they end. That is what lets the matcher keep advancing through the input, recover from partial misses without starting over, and report overlapping hits from the same scan. In the context of Java, the tradeoff is paying the build cost up front so repeated multi-term searches can stay efficient, while String.indexOf() and String.contains() still fit smaller literal checks better.
Java String Class Documentation
Java Character Class Documentation
Java ArrayDeque Class Documentation
Java Arrays Class Documentation
Java Objects Class Documentation

