SQL Text Sorting With Collations

Jun 25, 2026

Text sorting can feel small until two databases give different answers for the same text. Reports can place uppercase names ahead of lowercase names, searches can treat Jose and José as equal, and migrations can change the order of customer names without changing any stored rows. Collation is the rule set behind that behavior, telling the database how text gets sorted, compared, matched, grouped, and displayed. The characters stay the same, but the comparison rules change the result a query returns. That is why collation reaches past display order into queries, indexes, imports, exports, search screens, dashboards, and reports that pull text from more than one source.

Collation Rules Behind Text Results

SQL text comparisons follow rules that can come from the database default, the column definition, the query expression, or a session setting, depending on the database engine. The visible string is only part of the story because the database also needs instructions for letter order, capitalization, accent marks, punctuation, and language-specific sorting. Those comparison rules explain why the same rows can sort differently, match differently, or group differently after a migration, import, report refresh, or database change.

Sort Order

Text ordering depends on more than alphabetic placement because the database has to rank uppercase letters, lowercase letters, accented letters, unaccented letters, punctuation, and characters from different languages. We can write a plain ORDER BY, but the final order still comes from the collation attached to that expression.

SELECT display_name
FROM customers
ORDER BY display_name;

This query sorts by display_name, yet the returned order is not controlled by the column name alone. If the column follows a case-insensitive collation, Alex, alex, and ALEX can be treated as equal for the main comparison, with the database applying its own tie-breaking rules after that. If the column follows a case-sensitive or binary collation, those same values can appear in a different order.

Binary sorting compares stored code values, which can fit technical text because the result follows the encoded values closely. Human-facing lists can read strangely with binary sorting, though, because uppercase letters can sort apart from lowercase letters and accented characters may not appear near their unaccented base letters. Linguistic sorting follows language-aware rules. Customer names, city names, product titles, and article titles usually read better with a linguistic collation because the sort order is closer to how people scan text. This is why a list of names can look natural in one database and strange in a second database without any row data changing.

SQL Server collation names expose some of the rules directly. In Latin1_General_100_CI_AI, CI means case-insensitive, and AI means accent-insensitive. MySQL collation names also carry comparison behavior, such as _ci for case-insensitive, _cs for case-sensitive, _ai for accent-insensitive, _as for accent-sensitive, and _bin for binary comparisons. PostgreSQL can rely on operating system collations or ICU collations, while Oracle supports binary, linguistic, multilingual, and UCA-based collation rules.

For a technical export, we may want the same column sorted through a binary SQL Server collation for that result:

SELECT display_name
FROM dbo.Customers
ORDER BY display_name COLLATE Latin1_General_100_BIN2;

That query asks SQL Server to sort the expression through Latin1_General_100_BIN2 for this result. It does not change the column definition or rewrite stored values, which makes the query-level rule practical for a targeted sort while leaving the normal column behavior alone.

Case Sensitivity

Capitalization rules affect equality and ordering, so the same search can return different rows under different collations. With a case-sensitive collation, Alex, alex, and ALEX are different during comparison. With a case-insensitive collation, those values can compare as equal, while the stored text keeps its original capitalization.

SELECT *
FROM customers
WHERE display_name = 'alex';

This filter can return different rows across databases or columns. Under a case-sensitive rule, only the exact lowercase value matches. Under a case-insensitive rule, rows storing Alex or ALEX can match too. The database does not rewrite the value in storage, and a report can still display the original text after the comparison has been made.

Search boxes, email lookups, usernames, report filters, and duplicate checks all depend on this rule. People usually expect a name search to ignore capitalization, while technical identifiers can need stricter treatment. Codes such as ab12 and AB12 may need to stay separate because outside systems can treat those as different values.

Case sensitivity also affects grouping and uniqueness. If a column has a case-insensitive collation, values that differ only by capitalization can fall into the same comparison bucket. That can change the result of DISTINCT, GROUP BY, joins, and unique constraints. The same text values can appear separate under one collation and equal under a different collation.

For stricter comparison rules at the expression level, the query can name the collation directly:

SELECT *
FROM dbo.Customers
WHERE DisplayName COLLATE Latin1_General_100_CS_AS = N'Alex';

The CS portion makes the expression case-sensitive, and AS makes it accent-sensitive. We are not changing the stored column here. We are only changing the comparison rule for this predicate, which can help when a report or cleanup query needs stricter matching than the normal column rule.

Accent Sensitivity

Marked characters need their own comparison rules because Jose and José are related visually but not identical text. With an accent-sensitive collation, those two values are different during comparison. With an accent-insensitive collation, they can compare as equal. That choice affects names, addresses, imported files, customer search, product titles, and text that may arrive with inconsistent accent usage.

Someone may type a name without an accent because their keyboard makes it inconvenient. Imported text may drop accents during export from a source file, while a different source may preserve them. Collation decides how much that difference changes matching. For user-facing search, accent-insensitive matching can be helpful. For legal names, language-sensitive text, or exact external values, accent-sensitive matching can be the better fit.

MySQL makes the choice visible in the collation name. This column stores Unicode text with utf8mb4, while the collation controls accent and case behavior:

CREATE TABLE customers (
    display_name varchar(100)
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_0900_ai_ci
);

With utf8mb4_0900_ai_ci, the comparison is accent-insensitive and case-insensitive. Searches for Jose can match José, and searches for alex can match Alex, depending on the rest of the query and the stored values.

Storage and comparison are separate concerns. The character set controls which characters can be stored, while the collation controls how stored text is compared and sorted. Columns can store José correctly while still comparing it as equal to Jose under an accent-insensitive rule.

Accent sensitivity also changes grouped results:

SELECT display_name, COUNT(*) AS name_count
FROM customers
GROUP BY display_name
ORDER BY display_name;

Under an accent-sensitive rule, Jose and José can appear as separate groups. Under an accent-insensitive rule, they can be treated as the same group for comparison. The displayed value depends on the database engine and selected expression, but the comparison rule controls which rows belong in the same group.

The same idea applies to joins. If two tables store names with different accent habits, an accent-insensitive comparison can match rows that an accent-sensitive comparison would keep apart. That can help search-style matching, while exact data rules still need stricter treatment when the stored spelling carries meaning.

Expression Scope

Comparison rules can come from several places across the database stack. Server or instance defaults may supply the starting collation. Database defaults can supply their own rule. Text columns can declare a collation directly. Query expressions can override the normal rule with COLLATE. Some systems also allow session-level settings that affect text operations. The closest rule to the expression usually controls the comparison being evaluated. Column-level collation is more specific than a database default, and an expression-level COLLATE clause is more specific than the column rule for that comparison or sort. This is why a single query can temporarily sort or match text differently without changing the table.

SQL Server can apply a collation to a single comparison:

SELECT customer_id, display_name
FROM dbo.Customers
WHERE display_name COLLATE Latin1_General_100_CI_AI = N'jose';

That predicate compares display_name through a case-insensitive and accent-insensitive rule. It can match values such as Jose, José, or JOSE for that expression, depending on the stored values and the full collation behavior.

PostgreSQL also supports expression-level collation. The C collation gives a byte-oriented ordering style, which can help when text needs technical ordering rather than language-aware ordering:

SELECT display_name
FROM customers
ORDER BY display_name COLLATE "C";

That sort can differ from a locale-based or ICU-based collation. PostgreSQL also supports ICU collations, including nondeterministic collations that can treat different byte sequences as equal in selected cases, such as Unicode normalization differences. That kind of collation can help with Unicode-heavy data, but it should match the comparison rule the column or query truly needs.

Scope becomes important when a query combines text from different sources. Comparisons between two text columns can fail or return unexpected results if their collations conflict. Literal strings can also take their default rule from the database or session context. Adding COLLATE to the expression tells the database which rule to apply for that specific operation.

Expression-level collation is helpful for targeted searches, migration checks, data cleanup, and report fixes. Column-level collation is still the better home for normal behavior because it keeps the rule attached to the data instead of spreading comparison choices across individual queries.

Database Choices That Change Reports

Reports make collation choices visible because they turn comparison rules into output people read. Search screens, CSV exports, dashboards, and grouped summaries can all change when text rules change. The stored rows may be identical, but the report can still show a different order, a different match count, or a different set of grouped labels. That is why column definitions, report queries, and database boundaries all need the same text rules the data is meant to follow.

Column Defaults

Normal comparison behavior usually belongs on the column. If customer names should match without case sensitivity, the name column can carry that rule directly. If a product code or account token must treat capitalization as meaningful, that column can carry a stricter rule. Placing those choices in the schema makes the behavior part of the data definition instead of leaving every report query to restate it.

SQL Server lets a text column declare its collation directly:

CREATE TABLE dbo.CustomerLookup (
    CustomerId int IDENTITY PRIMARY KEY,
    DisplayName nvarchar(120)
        COLLATE Latin1_General_100_CI_AI NOT NULL,
    EmailAddress nvarchar(254)
        COLLATE Latin1_General_100_CI_AI NOT NULL,
    AccountCode varchar(40)
        COLLATE Latin1_General_100_BIN2 NOT NULL
);

We can read this table definition as three text decisions. DisplayName and EmailAddress follow a case-insensitive and accent-insensitive collation, which fits search and reporting rules where Alex, alex, Jose, and José may need forgiving comparison behavior. AccountCode follows a binary collation, giving stricter comparison rules for a technical value that may need exact text matching.

Column defaults also affect indexes. Report filters have the best chance of matching indexed text columns naturally when the query compares through the same rule already attached to the column. If a query applies a different collation at expression level, the database may need extra processing before it can compare values. The exact plan depends on the engine, query, index, and predicate, but the column collation is still part of how indexed text is ordered and searched.

The same column rule affects uniqueness. If an email column has a case-insensitive collation, a unique index treats case-only changes as the same value for comparison. That can stop Alex@example.com and alex@example.com from becoming separate customer records.

CREATE UNIQUE INDEX UX_CustomerLookup_EmailAddress
ON dbo.CustomerLookup (EmailAddress);

The unique index follows the comparison rule already attached to EmailAddress. Stored text does not get lowercased by the index. The index only follows the column’s comparison behavior when deciding if two values conflict.

Technical values need separate treatment from names and labels. Reports can still display external account codes, but the database should not compare those values like person names. If ab12 and AB12 refer to different values in a partner system, a case-insensitive collation would not fit that column. We can still display both values in reports, but the comparison rule should match the data contract behind the value.

MySQL follows the same general idea through character sets and collations on columns. The character set controls storage, while the collation controls comparison and sorting:

CREATE TABLE customer_lookup (
    customer_id bigint PRIMARY KEY,
    display_name varchar(120)
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_0900_ai_ci NOT NULL,
    account_code varchar(40)
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_0900_bin NOT NULL
);

With this definition, the name column gets case-insensitive and accent-insensitive comparison, while the account code column gets binary comparison. That split keeps report search flexible for names and strict for codes. The report does not need to treat every text column the same way just because the values are all strings.

Column defaults also help during imports. If a CSV brings in names from several sources, the destination column decides how those names compare after they arrive. The import can preserve the original characters, while filters, joins, grouping, and report sorting follow the destination column’s collation unless a query overrides it.

Cross Database Output

Report output gets harder to predict when the same dashboard pulls text from different databases, schemas, warehouses, or export files. Customer names sorted in SQL Server may not appear in the same order after the same rows move to PostgreSQL, MySQL, Oracle, a reporting warehouse, or a spreadsheet. The data can be copied correctly while the comparison rules still change.

Different systems expose collation in different ways. SQL Server commonly shows sensitivity choices in collation names, such as case-sensitive, case-insensitive, accent-sensitive, and accent-insensitive suffixes. MySQL ties text comparison to character set and collation names such as utf8mb4_0900_ai_ci or utf8mb4_0900_bin. PostgreSQL can rely on operating system or ICU collations. Oracle supports binary and linguistic comparison options. The syntax varies, but the report concern stays the same because text values need rules before sorting, filtering, joining, or grouping can produce consistent output.

Cross-database reports usually run into trouble when a value means one thing as human-facing text and a different thing as technical text. Customer names, article titles, product labels, and city names usually need language-aware comparison. Codes, hashes, IDs, tags from external services, and integration values usually need stricter comparison. Treating all text as friendly search text can merge values that should stay separate, while treating all text as exact technical text can make reports harder to search and read.

A SQL Server report query can hit collation conflicts when joining text from different databases or temporary objects. We can pick the comparison rule for that operation instead of letting the conflict control the result:

SELECT c.CustomerId, c.DisplayName, r.RegionName
FROM SalesDb.dbo.Customers AS c
JOIN ReportDb.dbo.CustomerRegions AS r
    ON c.RegionCode COLLATE Latin1_General_100_CI_AS =
       r.RegionCode COLLATE Latin1_General_100_CI_AS;

The join compares both sides through the same collation for that predicate. Neither table definition changes. The query only tells SQL Server how to compare the two expressions during the join, which can help when a report has to combine text from databases with different defaults.

Cross-database output also affects grouping. Reports that group customer names can produce different totals when one system treats Jose and José as equal while a second system keeps them separate. The same issue can appear with capitalization, punctuation, whitespace rules, or locale-specific sorting. Changed totals after a migration do not always mean rows are missing because the comparison rule can be the reason.

Report queries can name the grouping rule directly when the result needs a specific comparison rule:

SELECT
    DisplayName COLLATE Latin1_General_100_CI_AI AS ReportName,
    COUNT(*) AS CustomerCount
FROM dbo.CustomerLookup
GROUP BY DisplayName COLLATE Latin1_General_100_CI_AI
ORDER BY ReportName;

That query groups and sorts names through the same case-insensitive and accent-insensitive rule. It can help a report show a search-style rollup, but it should be used with care when exact spelling has meaning.

Exports add another layer because the database may sort rows before the file is created, and the destination tool may sort the same text again with a different locale. CSV exports can look correct when the database creates the file, then appear in a different order after a spreadsheet or BI tool applies its own sorting rules. The row values did not change, the sorting rule changed.

Reliable report output comes from deciding which text values should be human-facing and which should be exact. Names, labels, and descriptions can use language-aware rules that match the audience. Codes and identifiers can use strict rules that match upstream systems. After that decision is part of the database or report query, the output is much less likely to drift when data crosses database boundaries.

Conclusion

Collation is the comparison layer that decides how SQL sorts, matches, groups, and joins text. The stored characters can stay the same while case rules, accent rules, binary ordering, or language-aware sorting change the result a query returns. When we place the right rule on the column, and use query-level COLLATE only when a specific comparison needs it, reports become more predictable across searches, exports, imports, indexes, and database boundaries.

Share Alexander Obregon's Substack

Alexander Obregon's Substack

Discussion about this post

Ready for more?