How to Clean Data in Python: Best Practices and Tips

Before you can dive into any data analysis, there is one step you cannot skip: cleaning the data. This guide to data cleaning in Python walks you through how to handle dirty datasets by finding and correcting typical issues such as missing values, duplicates, outliers, and inconsistent formatting. When you have clean data, your findings are easier to understand and your outputs are more trustworthy.

=> See more: Professional Data Cleansing Services | Improve Data Accuracy and Business Efficiency


What Is Data Cleaning In Python?

When you’re beginning a new data project, there’s a strong likelihood that the dataset you’re dealing with isn’t ready for use straight away. So the first step in any analysis is data cleaning in Python.

Data cleaning is all about correcting errors, deleting duplicates and outliers, and ensuring that everything is consistent and well formatted. If you skip this step, you are left with dirty data, and that can actually give you false conclusions.

Newcomers to data work sometimes treat cleaning as a minor chore, but it’s one of the most critical parts of the whole process. Without clean data, your models can provide incorrect answers, your graphs can be misleading, and your statistics can be way off.

The more sophisticated your data analysis becomes, the harder it is to identify bugs when something goes wrong, which makes cleaning even more crucial. If you don’t know what outputs to expect, you won’t even notice mistakes.

That’s why good data cleaning habits in Python, paired with good testing habits, are the key to accurate outputs. In this article, we’ll show how to use Python effectively for data cleaning.

=> See more: What Are Data Cleansing Services? Definition, Benefits, and How They Work


How to Clean Data in Python?

In the sections below, we’ll walk through the most important techniques for cleaning data using Python, including handling missing values, dealing with duplicates, fixing formatting issues, identifying outliers, and validating your dataset.

Managing Missing Values

When working with big data sets, there’s a virtual guarantee that some of the entries will have missing values. The blank spaces aren’t merely lost data — they can render certain Python functions inoperable, which could produce wrong results in your model or analysis.

Once you’ve identified missing values (usually displayed as NaN in Python), you need to decide what to do with them. There are two main choices: delete the rows containing missing values, or fill them in with something reasonable, an operation commonly referred to as imputation.

If your data set is not excessively large, you may wish to examine the missing value rows before deciding. You can do that in Pandas:

import pandas as pd

# Find and view rows with NaN values
rows_with_nan = df[df.isnull().any(axis=1)]
print(rows_with_nan)

Removing entries is the fastest fix, and sometimes it’s the right call. For example, if a row only contains a date and nothing else useful, deleting it won’t hurt your analysis. But be careful: deleting data means losing potential insights.

To drop rows with missing values, you can use:

df.dropna(inplace=True)

Still, deletion isn’t always ideal. A more common approach is to fill in the missing values with something that makes sense based on your data. This is known as imputation, and it helps keep your dataset intact.

The Pandas library makes it easy to fill in missing values with a column’s mean, which helps maintain the overall distribution:

df.fillna(df.mean(numeric_only=True), inplace=True)  # mean of numeric columns only

For more control, you can use the SimpleImputer from scikit-learn:

from sklearn.impute import SimpleImputer
import pandas as pd

imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

If your analysis calls for something more sophisticated, techniques such as K-Nearest Neighbors (KNN) or regression imputation can give you smarter value estimates. The best technique depends on your dataset, the problem you’re trying to solve, and how much time and computational budget you have available for cleaning.
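
As a rough sketch, here is what KNN imputation could look like with scikit-learn’s KNNImputer (the n_neighbors value is an illustrative choice, and the code assumes df contains only numeric columns):

from sklearn.impute import KNNImputer
import pandas as pd

# Assumes df holds only numeric columns; n_neighbors=5 is an illustrative choice
knn_imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)

Each missing value is replaced with the average of that column’s values from the most similar rows, which usually preserves more structure than a single global mean.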


Detecting and Treating Outliers

Outliers can be tricky. Sometimes an outlier is a genuinely important part of your data, like the stock market’s reaction to the COVID-19 pandemic or the 2008 recession. Other times, outliers are simply mistakes or unexplained events that add nothing to your analysis.

Deciding whether an outlier matters really comes down to knowing your data and what you’re trying to accomplish with your analysis. Your earlier data exploration should give these values some context.

You may frequently identify outliers visually from plots, but statistical procedures can also be used to locate them more systematically.

One method is the use of the Z-score, which measures how many standard deviations away from the mean a value is.

import numpy as np
import pandas as pd

# Generate sample data
np.random.seed(0)
data = np.random.randint(0, 11, 1000)

# Add outliers
data[0] = 100
data[1] = -100

# Calculate Z-scores
z_scores = (data - np.mean(data)) / np.std(data)

# Identify outliers (e.g., |Z| > 3)
threshold = 3
outliers = np.where(np.abs(z_scores) > threshold)[0]

print("Outliers identified using Z-score:")
print(data[outliers])

Another method is the Interquartile Range (IQR), which focuses on the middle 50% of your data. Any values below Q1 – 1.5*IQR or above Q3 + 1.5*IQR are considered outliers.

q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Calculate bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Identify outliers
outliers = np.where((data < lower_bound) | (data > upper_bound))[0]

print("Outliers identified using IQR:")
print(data[outliers])

Once you’ve found outliers, what should you do?

  • Correct errors: If the outlier is a data entry mistake, fix it if possible.
  • Remove it: If the value is irrelevant or clearly wrong, you might delete it.
  • Cap the value: Also known as winsorizing, this method sets upper and lower limits and replaces extreme values with those thresholds, as shown below:

data = {
    'A': [100, 90, 85, 88, 110, 115, 120, 130, 140],
    'B': [1, 2, 3, 4, 5, 6, 7, 8, 9]
}
df = pd.DataFrame(data)

# Capping using 5th and 95th percentiles
lower = df.quantile(0.05)
upper = df.quantile(0.95)

# Apply capping
capped_df = df.clip(lower=lower, upper=upper, axis=1)

print("Capped DataFrame:")
print(capped_df)

Be careful when you transform your data; you don’t want to create more problems than you solve. Before you decide to transform your data, there are a few things to keep in mind:

  • Know your data’s distribution: Some transformations only make sense for specific distributions.
  • Choose the correct method: Pick a transformation that suits your data. For instance, logarithmic transformations work well for skewed data (see the sketch after this list).
  • Watch out for zeros and negatives: Transformations such as logarithms and square roots are undefined for negative values (and the log is undefined at zero), so you may need to add a small constant or use a variant like log1p.
  • Check the results: After transforming your data, take time to review it. Make sure it still meets the requirements of your analysis.
  • Think about clarity: Transformed data can be harder to explain. Be sure that your team or stakeholders can still understand what the results mean after the transformation.
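
For example, here is a minimal sketch of a log transformation on skewed, non-negative data (np.log1p computes log(1 + x), which sidesteps the zero problem mentioned above):

import numpy as np
import pandas as pd

# Skewed, non-negative sample values (illustrative)
skewed = pd.Series([0, 1, 2, 5, 10, 50, 200, 1000])

# log1p computes log(1 + x), so zero values are handled safely
transformed = np.log1p(skewed)
print(transformed)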

Handling Duplicate Entries

Duplicates, as we’ve seen, can ruin your analysis. Fortunately, you can easily find and delete them in Python using the pandas library.

First, you can use the .duplicated() function and simply search for duplicate rows in your DataFrame:

import pandas as pd

# Find duplicate rows

duplicate_rows = df[df.duplicated()]

This returns all rows that are exact duplicates. If you don’t have too many, it’s a good idea to inspect them manually to understand what’s going on.

If the duplicates are exact copies, you can safely remove them with:

# Drop duplicates

cleaned_df = df.drop_duplicates()

This is the simplest and most common approach.

In some situations, duplicates might not be mistakes. Instead, they might represent repeated entries for the same person, product, or event. In this case, you might merge the duplicates by combining their values with functions like sum or mean.

Here’s an example using aggregation:

# Sample data
data = {
    'customer_id': [102, 102, 101, 103, 102],
    'product_id': ['A', 'B', 'A', 'C', 'B'],
    'quantity_sold': [5, 3, 2, 1, 4]
}
df = pd.DataFrame(data)

# Merge duplicates by summing the quantity sold
merged_df = df.groupby(['customer_id', 'product_id']).agg({'quantity_sold': 'sum'}).reset_index()

This groups the data by customer and product, then adds up how many of each item they bought.

Handling Outliers

Outliers are data points that don’t follow the general trend, and they can skew your results considerably. Sometimes they are valuable (unusual market behavior during COVID-19), but often they are mistakes caused by typos or odd, irrelevant events.

Outlier detection is an important part of the data cleaning stage in Python. You can detect extreme values using statistical techniques such as the Z-score or the interquartile range (IQR). For example, values with a Z-score greater than 3 or less than -3 tend to be outliers.

import numpy as np

# Sample data with an outlier
data = np.array([10, 12, 11, 13, 15, 100, 14, 13, 12])

# Calculate Z-scores
z_scores = (data - np.mean(data)) / np.std(data)

# Filter out outliers with |Z| > 2
outliers = data[np.abs(z_scores) > 2]
print("Outliers:", outliers)

Once you’ve isolated an outlier, you have various options:

  • Correct it if it’s a mistake you can identify.
  • Omit it from your dataset if it’s irrelevant or clearly wrong (a short filtering sketch follows this list).
  • Cap it by replacing it with an upper or lower threshold (winsorizing).
  • Rescale your data (e.g., with a log or square root transform) to reduce the outlier’s influence.
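
As a quick illustration of the second option, this minimal sketch keeps only the values from the Z-score example above whose scores fall inside the threshold:

# Keep only values whose |Z| is within the threshold
cleaned_data = data[np.abs(z_scores) <= 2]
print("Data without outliers:", cleaned_data)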

Data Standardization and Formatting

When data is drawn from more than one source or typed in manually, there tend to be inconsistencies in format. Standardizing your data so that everything follows the same conventions makes it much easier to work with and analyze later on. In Python data cleaning, some of the most important formatting tasks are:

  • Standardizing date formats, in which all values are converted into datetime objects,
  • Standardizing text, e.g., converting all strings to lowercase and stripping unnecessary whitespace
  • Data type conversions, e.g., converting digit strings to integers or floats,
  • Unit conversion, e.g., converting all weights to kilograms or all money to dollars

Here’s a quick example of standardizing dates and numeric strings with Pandas:

import pandas as pd

# Sample data
df = pd.DataFrame({
    'Date': ['2024/01/05', '05-01-2024', 'Jan 5, 2024'],
    'Price': ['1,000', '2000', '3,500']
})

# Convert dates to a consistent datetime format
# (format='mixed' lets pandas 2.x parse each entry individually)
df['Date'] = pd.to_datetime(df['Date'], format='mixed')

# Remove commas and convert to numeric
df['Price'] = df['Price'].str.replace(',', '').astype(float)

print(df)
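
Text columns can be standardized in the same spirit. Here is a small sketch (the City column is a made-up example) that lowercases values and strips surrounding whitespace:

# Sample text data with inconsistent case and whitespace (illustrative)
df_text = pd.DataFrame({'City': ['  New York', 'new york ', 'NEW YORK']})

# Lowercase everything and strip surrounding whitespace
df_text['City'] = df_text['City'].str.lower().str.strip()
print(df_text)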

Data Aggregation and Grouping

Most real data sets will have repeated records that refer to the same object. Rather than removing them, it is often better to combine them through aggregation. Python’s Pandas library makes this easy with its groupby() and agg() functions. For instance:

merged_df = df.groupby(['customer_id']).agg({'purchase_amount': 'sum'})

This groups the records for each customer and adds up their purchases. You can also aggregate with averages, counts, and so on, depending on what you’re trying to extract. Proper aggregation is a significant part of Python data cleaning, particularly when handling business metrics, customer activity, or revenue.
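
For instance, a sketch of computing several statistics at once (still assuming the hypothetical purchase_amount column) could look like this:

# Sum, average, and count of purchases per customer
summary_df = df.groupby('customer_id').agg(
    total_spent=('purchase_amount', 'sum'),
    average_purchase=('purchase_amount', 'mean'),
    purchase_count=('purchase_amount', 'count')
).reset_index()
print(summary_df)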


Data Validation

The last step in any cleaning process is to validate your data so that it’s accurate, consistent, and ready for analysis. Validation confirms that your dataset follows the logical rules you expect it to. The most important validation checks are:

  • Validating data types to ensure all columns have been formatted correctly (e.g., no text in a numeric column),
  • Validating that required fields are not blank or filled with null values,
  • Range of values checking, e.g., no one is 500 years old,
  • Foreign key checking to ensure that table relationships (e.g., each order has a valid customer number) are proper.

Validation during data cleaning in Python helps you catch errors sooner and prevents you from drawing incorrect conclusions. You can use Pandas methods like .info(), .isnull(), and your own conditional logic to automate much of this. Here is some sample code:

# Check data types
print(df.dtypes)

# Check for missing values
print(df.isnull().sum())

# Check ranges
assert df['purchase_amount'].min() >= 0, "Negative purchase amount found!"
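
For the foreign key check mentioned above, a minimal sketch (the orders and customers DataFrames are hypothetical) could be:

# Hypothetical tables: every order should reference an existing customer
orders = pd.DataFrame({'order_id': [1, 2, 3], 'customer_id': [101, 102, 999]})
customers = pd.DataFrame({'customer_id': [101, 102, 103]})

# Rows in orders whose customer_id has no match in customers
invalid_orders = orders[~orders['customer_id'].isin(customers['customer_id'])]
print(invalid_orders)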

Data Cleaning in Python: Best Practices, Tips, and Examples

When performing any machine learning or data science work, one of the most important first steps is data cleaning. Data cleaning guarantees that your output is reliable and accurate. A few data cleansing best practices to follow when performing data cleaning in Python are described below.

Store a Separate Copy of Raw Data

Save a copy of your raw data before you alter it. This gives you a safe backup to return to if something goes wrong during cleaning. It’s good practice to append the suffix “-RAW” to the filename, so there is no doubt that the file has not been modified. Keeping raw data in a separate directory from cleaned versions also prevents you from accidentally deleting or overwriting it.
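
In practice, this can be as simple as reading the raw file once and doing all cleaning on a copy (the file name here is hypothetical):

import pandas as pd

# Load the untouched raw file and leave it as-is
raw_df = pd.read_csv('sales-RAW.csv')

# Do all cleaning work on a separate copy
df = raw_df.copy()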


Document Your Data Cleaning Process

As you go through every step of cleaning, add good comments in your code. This helps you (and other programmers) to easily understand why you have made certain modifications. It is especially useful when you revisit your project after a while or share your project with your colleagues.

When performing data cleaning in Python, it’s common to apply filters, remove duplicates, fix formatting, or handle missing values. Explaining each step in a short comment makes your work easier to follow and debug.
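
A short, hypothetical example of what well-commented cleaning steps might look like:

# Drop rows missing a customer ID: they cannot be linked to any account
df = df.dropna(subset=['customer_id'])

# Remove exact duplicate rows introduced by repeated exports
df = df.drop_duplicates()

# Standardize country names to lowercase for consistent grouping
df['country'] = df['country'].str.lower()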

Watch for Unintended Changes

Make sure your cleaning does not create new issues. For instance, removing too many outliers or imputing missing values incorrectly can distort your data. It’s a good habit to verify your data after every significant cleaning step to ensure that patterns haven’t changed in unexpected ways.


Maintain a Data Cleaning Log

If you have a manual or automated cleaning process, keep a separate log. The log should have the date, what you did, and any unexpected problems you encountered. Having this log in your project directory or as part of your data cleaning pipeline using Python can be extremely useful for future tracking of changes or debugging problems.
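
One lightweight way to keep such a log is to append a timestamped line after each cleaning step. A minimal sketch (the log file name is an assumption) might be:

from datetime import datetime

def log_step(message, path='cleaning_log.txt'):
    # Append a timestamped entry describing what was done
    with open(path, 'a') as f:
        f.write(f"{datetime.now().isoformat()} - {message}\n")

log_step("Dropped duplicate rows from the customer table")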

Create Reusable Python Functions for Cleaning

Much of the time, you will be performing the same cleaning operations across projects or datasets. If so, wrap them in functions so you can apply them in a snap. For instance, if your organization uses special abbreviations or has recurring formatting quirks, define a function that corrects them automatically. That saves time and keeps your data cleaning consistent.
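
For example, a small reusable helper (the specific steps are only an illustration) might standardize column names, trim whitespace, and drop duplicates in one call:

import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few standard cleaning steps and return a new DataFrame."""
    cleaned = df.copy()
    # Normalize column names: trimmed, lowercase, underscores instead of spaces
    cleaned.columns = cleaned.columns.str.strip().str.lower().str.replace(' ', '_')
    # Strip surrounding whitespace from all string columns
    for col in cleaned.select_dtypes(include='object').columns:
        cleaned[col] = cleaned[col].str.strip()
    # Drop exact duplicate rows
    return cleaned.drop_duplicates()

cleaned_df = basic_clean(df)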

=> See more: Top CRM Data Cleaning Solutions to Improve Customer Data Accuracy


Conclusion 

Data cleaning in Python is not just a routine task—it’s a critical step that sets the stage for accurate analysis and reliable machine learning results. Clean data helps ensure your insights are based on facts, not flaws.

Unclean data can easily lead to misleading conclusions or poor decisions. Whether you’re analyzing a small dataset or managing enterprise-scale information, skipping the cleaning process can cost you time, resources, and trust. That’s why investing in solid data cleaning practices from the start is always a smart move.

Need help handling large or complex datasets? DIGI-TEXX provides professional data services, including expert support with data cleaning in Python, so you can focus on insights instead of errors.
