That's Pretty Slick - Finding duplicates with Python

Duplicate records are a common problem in data analysis. With Python, you can find duplicates in just 3 lines of code.

Apr 14, 2023

The situation: Your team has critical reporting that uses data from a source file. Unfortunately, the data load is failing because of duplicates in the file. The owner will fix it, but they don’t know which records are duplicates.

Why there are duplicates (and why the owner can’t find them) is a great topic for another day. Right now, your team needs to get that data loaded for reporting. The error message describes the first duplicate, but there might be more. The file is 1.2GB - not a great candidate for Excel.

With Python, you can find those duplicates in 3 lines of code.
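The notebook's exact code isn't reproduced in this post, so here is a minimal sketch of one way to do it using only the standard library. The `find_duplicates` helper, the `record_id` column, and the sample data are all assumptions made for illustration; swap in your own file and key column.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data standing in for the real source file.
SAMPLE = """record_id,amount
1001,25.00
1002,40.00
1001,25.00
1003,15.50
1002,42.00
"""

def find_duplicates(source, key):
    """Return every row whose value in the `key` column appears more than once."""
    rows = list(csv.DictReader(source))                    # 1: read the records
    counts = Counter(row[key] for row in rows)             # 2: count each key value
    return [row for row in rows if counts[row[key]] > 1]   # 3: keep the repeats

duplicates = find_duplicates(io.StringIO(SAMPLE), "record_id")
for row in duplicates:
    print(row)
```

To run it against a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open("source_file.csv", newline="")` (the file name is a placeholder). This sketch holds every row in memory, which may be too much for a very large file; in that case you could count the keys in one pass and filter in a second. If you already use pandas, `DataFrame.duplicated(subset=..., keep=False)` flags the same rows.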

Demo: Watch this demo on YouTube to see how it works.

Try it yourself: Use this Jupyter Notebook and a sample file on GitHub.

Are you interested in career coaching tailored to your career path and goals? Check out my career coaching offering at The Data Edge.
