Suffix Array Construction in Java

May 05, 2026

Searching the same text for new substrings can get expensive fast if you start from scratch every time. A suffix array solves that by sorting suffix starting positions first, then reusing that sorted order for later lookups. The structure is an array of indexes, and those indexes are stored in lexicographic order based on the suffix at each position. Manber and Myers introduced the idea in the original suffix array paper, and it still fits modern Java well because String, charAt, and Arrays.sort are part of the current JDK. One Java-specific point does matter right away. String indexes are based on UTF-16 char units, so a supplementary Unicode character takes two index positions. Because of that, a beginner-friendly suffix array built with charAt should be treated as operating on UTF-16 units rather than full Unicode code points.

What a Suffix Array Stores

A suffix array belongs to one text and records where every suffix begins. That may sound small, yet it gives a full ordering of the text’s tail segments without copying the text into a pile of separate strings. For text length n, there are n suffixes, each beginning at a valid position. The array holds those start positions in lexicographic order, so the stored numbers carry more than location data. They capture the sorted order of every possible tail of the text.

That ordering is the whole point of the structure. Raw text presents characters in their original sequence. A suffix array presents the same text through suffix order instead of left-to-right position. You still keep the original String, and the suffix array points back into it. Nothing in the main structure duplicates the text, so the array stays fairly small while still acting as an index over the original text.

Every Suffix Starts at One Index

Let’s use banana as a short text. The first suffix starts at index 0, which gives banana. The next suffix starts at index 1, which gives anana. Move right one position at a time and you get every suffix the text can produce. For a text of length 6, there are 6 suffixes. For a text of length 1000, there are 1000 suffixes. Each valid starting position maps to one suffix, and no position is skipped.

Below, a short Java loop makes that visible:

public class SuffixView {
    public static void main(String[] args) {
        String text = "banana";

        for (int i = 0; i < text.length(); i++) {
            System.out.println(i + " -> " + text.substring(i));
        }
    }
}

That output reflects original text order rather than sorted suffix order:

0 -> banana
1 -> anana
2 -> nana
3 -> ana
4 -> na
5 -> a

Suffix arrays do not keep those suffix strings as their main stored form. They keep only the starting indexes. That choice saves space and keeps the structure tied closely to the original text. For banana, the sorted suffix order comes from these suffix values. a comes first, then ana, then anana, then banana, followed by na and nana. From that ordering, the suffix array becomes 5, 3, 1, 0, 4, 2. Those entries are only indexes, yet each index points to one suffix in sorted order.

Lexicographic order follows the same rule used in ordinary string comparison. Characters are checked from left to right until something differs. If one string ends while all checked characters still match, the shorter string comes first. That is why ana comes before anana. Their first three characters match, then ana ends, so it sorts earlier.

Direct Java sorting makes that stored-index idea easier to see:

import java.util.Arrays;

public class SortedSuffixIndexes {
    public static void main(String[] args) {
        String text = "banana";
        Integer[] starts = new Integer[text.length()];

        for (int i = 0; i < text.length(); i++) {
            starts[i] = i;
        }

        Arrays.sort(starts, (left, right) ->
            text.substring(left).compareTo(text.substring(right))
        );

        System.out.println(Arrays.toString(starts));
        for (int start : starts) {
            System.out.println(start + " -> " + text.substring(start));
        }
    }
}

Read that result from top to bottom:

[5, 3, 1, 0, 4, 2]
5 -> a
3 -> ana
1 -> anana
0 -> banana
4 -> na
2 -> nana

Viewed that way, the array becomes much easier to read. Entry 0 in the suffix array does not mean the suffix starting at 0. It means the first suffix in sorted order starts at 5. Entry 1 means the next suffix in sorted order starts at 3, and so on. The array stores order, not merely position.

Space cost follows directly from that choice. The text already exists as a String. The suffix array adds one integer per suffix. That puts the main stored structure at O(n) extra space, not counting the text itself. Trying to keep a separate copied string for every suffix would drive memory use much higher because the suffixes overlap so heavily. Storing positions instead of copied strings avoids that waste.

Why Sorted Starting Positions Help

Sorted suffix positions turn substring lookup into an ordered search problem instead of a full left-to-right scan across the whole text. That change gives the structure its value. Without preprocessing, checking for a search term can mean testing start position after start position and comparing characters again and again. With a suffix array, the suffixes already appear in lexicographic order, so the search space has structure.

Stored order lets search logic ask better questions. If a search term is lexicographically smaller than the suffix in the middle of the array, any possible match has to be to the left. If it is larger, any possible match has to be to the right. The text itself has not changed, yet the stored suffix order changes how the text can be searched.

banana helps again. Its sorted suffixes are a, ana, anana, banana, na, and nana. If the search term is ana, that term belongs near the front of the sorted sequence, not near na or nana. Order gives direction immediately. Plain scanning across the original text does not give that benefit, because positions 0, 1, 2, and onward are arranged by physical location in the text rather than by lexicographic relationship.

Character comparison still happens after preprocessing, but those comparisons carry more value. During a plain scan, a failed comparison usually rules out a single starting position. During a search over sorted suffixes, a failed comparison can rule out a whole range of suffixes. That difference is why preprocessing pays off when the same text will be searched repeatedly.

Time cost reflects that change in search behavior. Let text length be n and search term length be m. Basic suffix array lookup is commonly described as about O(m log n). The log n part comes from binary search over the sorted suffix positions. The m part comes from comparing the search term against a suffix prefix character by character. Preprocessing takes extra time up front, then later lookups drop far below the cost of checking every position from scratch.

Viewed only as storage, an array of integers does not look like much. Viewed as an index over the text, that same array becomes far more useful. Each entry points back into the original String, and the sorted order gives later search logic a much smaller search range to deal with. That is why sorted starting positions help so much. The array does not alter the text itself, yet it changes how efficiently the text can be searched after preprocessing.

Building the Array in Java

Construction relies on temporary ranks instead of full suffix string comparisons. Rather than sorting the whole tail at every position character by character, the doubling method keeps a compact summary of what is already known and carries that forward on the next pass. Java fits that idea well because arrays give fast indexed access, and Arrays.sort can reorder suffix starts based on the current rank data. That repeated cycle of sorting, rank rewriting, and doubling the compared span is what turns raw text into a finished suffix array.

Starting With Single Character Ranks

Construction begins with the smallest useful unit, which is one char at each position. For banana, each suffix gets a temporary rank from its first character. Positions holding a share one rank, the position holding b gets a larger rank than a, and the positions holding n get a larger rank than both. Rank data such as [1, 0, 2, 0, 2, 0] captures that ordering. Exact numbers do not matter. Relative order is what counts. Equal characters receive equal ranks, and smaller characters receive smaller ranks.

That opening pass does not finish the suffix array. It only tells you how suffixes compare by their first character. Still, it gives later passes something reusable. After that point, a suffix no longer has to be treated as raw text during every comparison. It already has a compact label tied to its opening character, and later passes build on that label instead of starting over.

Java code makes that first rank assignment visible:

import java.util.Arrays;

public class InitialRanks {
    public static void main(String[] args) {
        String text = "banana";
        int[] rank = new int[text.length()];

        for (int i = 0; i < text.length(); i++) {
            rank[i] = text.charAt(i);
        }

        System.out.println(Arrays.toString(rank));
    }
}

Printed values there are the numeric char values from the string. A full suffix array build usually compresses them into smaller consecutive rank values after sorting, yet the first pass still comes from character order. That is the base the doubling passes reuse.

Java adds a detail that belongs here. String.charAt reads UTF-16 char units, not full Unicode code points. That means a supplementary character occupies two indexed positions in the string. For text limited to the Basic Multilingual Plane, this detail usually stays out of the way. For broader Unicode text, a suffix array built from charAt should be treated as operating on UTF-16 units. If full code point handling is needed, the text can be turned into an int[] from codePoints() before construction begins.

Doubling the Compared Length

From there, construction moves into its main loop. Pass length starts at 1, then grows to 2, then 4, then 8, and so on. Every new pass doubles the number of characters represented by the comparison. On the first sorting pass, two suffixes are compared by rank pairs that stand for the first two characters. On the next pass, those pairs stand for four characters. Later, they stand for eight, then sixteen, until the compared span reaches or passes the text length. banana keeps the arithmetic manageable. Starting from ranks [1, 0, 2, 0, 2, 0], the pass with length 1 compares suffixes by (rank[i], rank[i + 1]). That gives these pairs. Index 0 gets (1, 0). Index 1 gets (0, 2). Index 2 gets (2, 0). Index 3 gets (0, 2). Index 4 gets (2, 0). Index 5 gets (0, -1). That -1 acts as a sentinel when the second half would fall past the end of the text. Shorter suffixes then sort before longer ones when the available characters match.

This short helper can print those first-pass pairs:

public class RankPairs {
    public static void main(String[] args) {
        String text = "banana";
        int[] rank = {1, 0, 2, 0, 2, 0};
        int len = 1;

        for (int i = 0; i < text.length(); i++) {
            int second = i + len < text.length() ? rank[i + len] : -1;
            System.out.println(i + " -> (" + rank[i] + ", " + second + ")");
        }
    }
}

Those pairs are enough to sort suffix starts for that pass. Full suffix strings are not compared during the build loop. Pair comparison replaces long string comparison with fixed-size rank comparison, and that difference keeps the method fast.

Code for the comparator usually follows this form:

Arrays.sort(order, (left, right) -> {
    if (rank[left] != rank[right]) {
        return Integer.compare(rank[left], rank[right]);
    }

    int leftSecond = left + len < n ? rank[left + len] : -1;
    int rightSecond = right + len < n ? rank[right + len] : -1;

    return Integer.compare(leftSecond, rightSecond);
});

That comparator first checks the current rank at each suffix start. If those values differ, ordering is decided immediately. If they match, it checks the rank len positions ahead. By the time len has reached 4, each pair already stands for the order of an eight-character prefix. Construction keeps lifting short comparisons into longer comparisons by carrying forward the result of earlier passes.

Cost follows from that repeated sorting. Each pass sorts n suffix starts, which is about O(n log n), and the number of passes is about O(log n) because the compared span doubles each time. That puts this Java version around O(n log^2 n) time. Extra memory stays around O(n) for the rank arrays and suffix order arrays.

Rewriting the Ranks After Each Pass

Sorting suffix starts is only half of a pass. New ranks have to be written back after the sort, because the next pass depends on updated rank values. Walk through the sorted suffix starts from left to right. If the current suffix has the same pair as the previous suffix, it receives the same new rank. If the pair differs, the new rank increases.

That rewrite step turns sorted order into reusable numeric labels. Early in construction, a rank means opening character order. After the first full sort, a rank means order for a two-character prefix. Later it means order for a four-character prefix, then an eight-character prefix, and so on. The number itself is not important. Its meaning comes from how much prefix information it now stands for.

Take a possible sorted order after the first pair sort on banana, such as 5, 1, 3, 0, 2, 4. Rewriting ranks from that order can produce [2, 1, 3, 1, 3, 0]. Positions 1 and 3 still share a rank there because anana and ana are tied under the current compared span. Later passes separate that tie when the compared span grows and the shorter suffix runs out earlier.

The rewrite loop in Java is compact, yet it carries a lot of weight:

nextRank[order[0]] = 0;

for (int i = 1; i < n; i++) {
    int prev = order[i - 1];
    int curr = order[i];

    boolean sameFirst = rank[prev] == rank[curr];
    int prevSecond = prev + len < n ? rank[prev + len] : -1;
    int currSecond = curr + len < n ? rank[curr + len] : -1;

    if (sameFirst && prevSecond == currSecond) {
        nextRank[curr] = nextRank[prev];
    } else {
        nextRank[curr] = nextRank[prev] + 1;
    }
}

What that loop does is compact but important. It converts sorted pair order into a new summary of the text. Equal pairs stay grouped under one rank. Different pairs split into separate ranks. By the time the last suffix in sorted order has rank n - 1, every suffix has a unique rank and the build can stop.

That is also the point where ana and anana finally separate. Earlier passes do not have enough compared span to split them fully. When the compared length reaches far enough, the shorter suffix sorts earlier because it ends while the shared prefix still matches. Final order for banana becomes 5, 3, 1, 0, 4, 2.

Searching After Preprocessing

Finished construction gives a sorted array of suffix starts, and that sorted order is what later search uses. Search no longer checks text positions from left to right. It performs binary search over suffix starts and compares the query against the suffix at the middle position. If the query is smaller, search moves left. If it is larger, search moves right. Full match means the query is present in the text.

Character comparison still happens during that lookup. Binary search narrows the suffix range, while character comparison decides direction at each step. If text length is n and query length is m, this basic lookup is about O(m log n). The log n part comes from binary search over the suffix array. The m part comes from comparing query characters against the chosen suffix prefix.

This comparison helper can look like this:

private int compareSuffixWithQuery(String text, int start, String query) {
    int i = 0;

    while (start + i < text.length() && i < query.length()) {
        char textChar = text.charAt(start + i);
        char queryChar = query.charAt(i);

        if (textChar != queryChar) {
            return Character.compare(textChar, queryChar);
        }

        i++;
    }

    if (i == query.length()) {
        return 0;
    }

    return -1;
}

Return value 0 means the query matched fully. Negative means the suffix is lexicographically smaller than the query at the point where comparison stopped. Positive means the suffix is larger. Binary search uses that result to cut the remaining search range down quickly.

Wrapped into lookup logic, it reads like this:

public boolean contains(String text, int[] sa, String query) {
    int left = 0;
    int right = sa.length - 1;

    while (left <= right) {
        int mid = left + (right - left) / 2;
        int cmp = compareSuffixWithQuery(text, sa[mid], query);

        if (cmp == 0) {
            return true;
        }

        if (cmp < 0) {
            left = mid + 1;
        } else {
            right = mid - 1;
        }
    }

    return false;
}

Preprocessing is what makes that search worthwhile. Build time is spent only once for a given text, then later queries reuse the finished suffix order. That trades more cost up front for much faster substring lookup when the same text will be searched repeatedly.

Conclusion

Suffix array construction in Java comes down to turning suffixes into sortable rank data. Start with first-character ranks, sort suffix positions by rank pairs, rewrite those ranks after every pass, and keep doubling the compared span until the full order is resolved. After that preprocessing, the finished array gives binary search a sorted view of the text instead of forcing a left-to-right scan across every starting position. That is what gives the structure its value. The text stays the same, but the stored suffix order changes how quickly repeated substring lookups can run.

Java String Class Documentation
Java Arrays Class Documentation
Java Comparator Interface Documentation
Java Character Class Documentation
Java String.codePoints() Documentation

Share Alexander Obregon's Substack

Alexander Obregon's Substack

Discussion about this post

Ready for more?