Data slipping into a table more than once can throw off reports, skew analytics, and even break parts of your app. It usually comes from user mistakes, messy imports, or bugs in the code. Duplicates show up all the time, but spotting them isn’t hard. SQL already gives you what you need. Grouping, filtering, and counting make it clear which rows got repeated.
How Duplicates Show Up and What the Database Sees
Duplicate data doesn’t always jump out when you’re looking at a table. On the surface, every row might look valid. But as far as the database is concerned, anything not blocked by a constraint is fair game. It doesn’t ask questions. If you don’t define what makes a record unique, it’s perfectly happy storing the same values over and over.
Take a customer order table. You’ve got an `id` column, which is usually a primary key and unique by design. But let’s say you also store the customer’s ID, the product they bought, and the date of the order. If nothing stops it, that same customer can appear to place the same order multiple times, whether it was by accident or due to a problem in the import or form handling.
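A minimal sketch of what that table might look like; the table and column names here are assumptions for illustration, and the `SERIAL` syntax follows PostgreSQL:

```sql
-- Hypothetical orders table: only id carries a uniqueness rule.
CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,  -- unique by design
    customer_id INT  NOT NULL,
    product_id  INT  NOT NULL,
    order_date  DATE NOT NULL
);
```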
Here, only `id` is protected by a rule. The rest of the fields are wide open. So you could have a dozen rows with the same `customer_id`, `product_id`, and `order_date` without the database flinching.
Let’s say your logic says one customer should only be able to buy a given product once per day. That makes any rows repeating those three values feel like copies, even though the `id` column makes them technically distinct. This is the sort of thing that breaks reports or leads to customers getting double-charged.
The database doesn’t track that on its own. It doesn’t try to connect the dots unless you write a query that shows you which dots to care about.
What Makes a Row a Duplicate
A duplicate depends on your rules, not the database’s. If you think of a duplicate as a row that matches on every column, that’s one case. But in practice, most duplicates are about a set of fields that repeat in a way they shouldn’t.
Say you’re importing a CSV file and each line has the same email address and signup date. Those would look like duplicates based on your business rules. But the database wouldn’t know that unless you either create a unique constraint or write a query that watches for repeats based on those fields.
Here’s a case where only part of the row matters for spotting duplicates:
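A sketch of that check, assuming a signups table with `email` and `signup_date` columns; the names and values are placeholders:

```sql
-- Look only at the fields your rules treat as unique.
SELECT email, signup_date
FROM signups
WHERE email = 'user@example.com'
  AND signup_date = '2025-03-01';
```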
If this returns more than one row, you’ve probably got a duplicate, at least based on how you’re thinking about uniqueness.
You can lock that down with a unique constraint, but in many systems, those rules aren’t in place. Sometimes it’s because the logic is handled in the app layer. Other times it’s because you’re dealing with imported legacy data that never had checks to begin with. That’s when you need a query to surface the repeats.
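For reference, locking that rule down at the database level might look something like this sketch; the table and constraint names are assumptions:

```sql
-- Once this constraint exists, the database rejects repeats on its own.
ALTER TABLE signups
    ADD CONSTRAINT uq_signups_email_signup_date UNIQUE (email, signup_date);
```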
Using GROUP BY to Find Repeats
Let’s write a query that finds repeat orders, based on the same customer buying the same product on the same day.
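One way to write it, as a sketch against the orders table assumed earlier:

```sql
-- Group on the three fields that define a duplicate and keep
-- only the groups that appear more than once.
SELECT customer_id, product_id, order_date, COUNT(*) AS copies
FROM orders
GROUP BY customer_id, product_id, order_date
HAVING COUNT(*) > 1;
```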
The `GROUP BY` step buckets together all rows that share the same values across those three columns. Then `COUNT(*)` tallies how many there are in each group. The `HAVING` clause strips away the ones with only one record, leaving you with the repeat offenders.
Think of it like stacking index cards. Each card has the same customer, product, and date. You’re asking the database to make piles of those cards and show you only the piles with more than one. You could just as easily run this with different columns depending on your use case. If the issue is duplicate signups based on email and date, you’d group by `email` and `signup_date` instead.
Here’s another version using a different table structure, just to show it’s not tied to one shape:
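As a sketch, using the hypothetical signups table from earlier:

```sql
-- Same pattern, different columns: duplicates defined by email and date.
SELECT email, signup_date, COUNT(*) AS copies
FROM signups
GROUP BY email, signup_date
HAVING COUNT(*) > 1;
```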
You don’t need to know the exact number of duplicates ahead of time. This setup tells you there’s more than one. That’s often all you need to know.
Why the ORDER of Operations Matters
This kind of query works because of how the SQL engine moves through the different parts. It starts by filtering rows with `WHERE`, then groups with `GROUP BY`, and only after that does it apply the `HAVING` clause. That order matters. If you try to put a condition in the wrong place, it either won’t work or will give you results you didn’t expect.
Suppose you want to find repeat orders, but only from this year. You’d need to trim the rows before the grouping step starts. That means using `WHERE`, not `HAVING`.
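A sketch of that version, with the date literal as an assumption:

```sql
-- Filter out older rows first, then group what's left.
SELECT customer_id, product_id, order_date, COUNT(*) AS copies
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY customer_id, product_id, order_date
HAVING COUNT(*) > 1;
```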
This works because the `WHERE` clause runs before anything gets grouped. So by the time the engine starts looking for matches, it’s already thrown out the older data. If you tried to put that date check in `HAVING`, it generally wouldn’t work, because at that point the individual row values are gone. The database is only looking at group-level values now. You can think of `WHERE` as filtering raw rows and `HAVING` as filtering the grouped results.
Here’s a quick counterexample. It works, but the engine first groups all rows and only then removes the older dates, so it does more work than necessary:
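A sketch of what that looks like:

```sql
-- The date check only survives in HAVING because order_date
-- is also one of the GROUP BY columns.
SELECT customer_id, product_id, order_date, COUNT(*) AS copies
FROM orders
GROUP BY customer_id, product_id, order_date
HAVING COUNT(*) > 1
   AND order_date >= '2025-01-01';
```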
That won’t throw an error, but the `order_date >= '2025-01-01'` part is checking a column that’s already been grouped. It only works here because `order_date` is in the `GROUP BY`. If you try the same test with a column that isn’t in the `GROUP BY`, the database raises an error because that column no longer exists in the grouped rows. So getting the order right isn’t just about syntax. It’s about what data the engine still has in view at each step.
You’ll see a big difference in performance and clarity if you line these parts up the right way. Let `WHERE` handle row-level filters and keep `HAVING` focused on group-level counts.
Techniques for Isolating and Deleting the Extras
Finding duplicate groups is useful, but spotting them is just the first part. What most people actually want is a clean version of the data where one valid copy stays and the rest go away. That has to be done carefully. You don’t want to wipe everything or keep the wrong ones. The trick is to label each row in a way that makes picking out the extras feel more like a clean cut than a guess. That’s where window functions come in.
Window Functions That Help Identify Row Position
A window function works by running across a set of rows that are grouped in memory. It doesn’t shrink the number of rows like `GROUP BY` does. Instead, it leaves them all there and adds an extra value to each one. That added value can be a rank, a row number, or a running total. For duplicates, row numbers give the cleanest way to spot extras without losing track of what came first.
Here’s how to use `ROW_NUMBER()` to assign each row a position inside its duplicate group:
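A sketch over the same assumed orders table:

```sql
-- Number the rows inside each duplicate group, lowest id first.
SELECT id,
       customer_id,
       product_id,
       order_date,
       ROW_NUMBER() OVER (
           PARTITION BY customer_id, product_id, order_date
           ORDER BY id
       ) AS row_num
FROM orders;
```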
Each group here is built from the same three columns that were used earlier to define a duplicate. The `PARTITION BY` part tells the engine which rows belong together. The `ORDER BY` part tells it how to decide who comes first. In this case, rows with lower `id` values get the lower numbers. Every row gets a number, starting at 1 for each group.
If a customer placed the same order three times, you’d see `row_num` values of 1, 2, and 3 in that group. That first row is the one you usually want to keep. Now that the extras are labeled, you can wrap the query in a common table expression and delete everything where the row number is greater than one.
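A sketch of that cleanup step, written in PostgreSQL-style SQL:

```sql
-- Rank the rows, then delete everything that isn't first in its group.
WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, product_id, order_date
               ORDER BY id
           ) AS row_num
    FROM orders
)
DELETE FROM orders
WHERE id IN (SELECT id FROM ranked WHERE row_num > 1);
```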
This lets you delete duplicates cleanly without writing a loop or stepping through each row manually. You keep the first row in each group and remove the rest.
How the Engine Executes It Behind the Scenes
When a window function runs, the engine handles it after filtering and grouping are done. It gathers all the matching rows into memory and starts assigning values based on the window definition. That part’s important. It’s not like `GROUP BY`, where the rows get collapsed down into one. Every row stays in place, but now it’s marked with a number you can use to sort or filter.
For each group defined by `PARTITION BY`, the engine builds an internal frame. Inside that frame, it walks through the rows in the order you gave it and applies the function. With `ROW_NUMBER()`, that means it just counts them up from one.
What makes this fast is that the engine only has to make one pass through each group. It doesn’t need to jump around or keep checking other parts of the table. It builds the groups in memory, assigns the numbers, and moves on. That keeps it efficient even on large datasets. When the outer `DELETE` runs, it uses the list of `id` values from the subquery to find and remove the extra rows. This part depends on `id` being unique. If your table doesn’t have a unique column, you’d have to use `ctid` in PostgreSQL or row-specific identifiers in other systems to do the delete safely.
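For PostgreSQL specifically, that fallback might look something like this sketch; it is not portable, since `ctid` is a PostgreSQL-only row identifier:

```sql
-- Same pattern as before, but using ctid instead of a unique id column.
-- No ORDER BY inside the window: without a unique column there's no
-- principled "first" row, so any one copy per group is kept.
DELETE FROM orders
WHERE ctid IN (
    SELECT ctid
    FROM (
        SELECT ctid,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, product_id, order_date
               ) AS row_num
        FROM orders
    ) AS ranked
    WHERE row_num > 1
);
```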
The important thing here is that the engine’s job is split cleanly. First it builds the ranked list with row numbers. Then it picks which ones to keep or discard. That separation keeps the logic clear and avoids messy updates or side effects.
Why ROW_NUMBER Is Safer Than COUNT or MAX
You might think about using something like `COUNT(*)` or `MAX(id)` to spot extras. It feels like it should work, and in some cases it does. But it doesn’t give you the same control or safety. Those functions collapse the group into a single value. That makes it harder to reach back and delete the specific rows that weren’t picked.
Let’s say you try this:
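A sketch of that approach:

```sql
-- One "keeper" id per group; the rows to delete are implied, not listed.
SELECT customer_id, product_id, order_date, MAX(id) AS keep_id
FROM orders
GROUP BY customer_id, product_id, order_date;
```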
This gives you a list of one `id` per group, usually the latest row if `id` grows over time. But if you want to delete the others, you’d now have to write a second query that pulls all rows except those values. That creates more complexity and leaves room for mistakes if something doesn’t line up. Using `ROW_NUMBER()` keeps all the rows visible while still tagging them with enough detail to know which ones to remove. You can sort by whatever column makes sense in your case. If you want the earliest record kept, order by `id ASC`. If you want the most recent, use `id DESC`. That makes it flexible but still precise.
You could try `RANK()` or `DENSE_RANK()` too, but they come with extra rules around how ties are handled. For duplicate cleanup, you usually want exactly one record marked as first. That’s what `ROW_NUMBER()` guarantees. Each row gets a unique number inside its group, no ties allowed. That makes the next step, the deletion, much more predictable.
Conclusion
Catching and cleaning up duplicates is all about using the right pieces in the right order. `GROUP BY` helps you spot repeated values by counting grouped rows. `WHERE` and `HAVING` each play their part depending on what stage of the query you’re filtering. And when it’s time to remove the extras, window functions like `ROW_NUMBER()` give you a clear way to mark what stays and what goes. The engine handles each step in sequence, from filtering to grouping to numbering, so your logic stays clean and easy to follow.