When you begin a new data project, the data you collect is rarely ready for analysis right away. That is why it is critical to clean the data before every new project. Cleaning your data is the process of removing errors, outliers, and inconsistencies and ensuring that all of your data is in the correct format for the research at hand. Data that contains many errors or has not gone through this data-cleaning process is referred to as messy data. This is a crucial step. In this article, we will learn what messy data is and its types, why the Pandas library is such a powerful tool for cleaning data in Python, and how we can use its tools to get our data ready for analysis.
Table of Contents
- What is data cleaning in data science?
- What is messy data?
- The powerhouse of data cleaning: Pandas
- Handling missing data
- Feature engineering: Shaping data for analysis
- Cleaning data in Python: Best practices and tips
- Conclusion
What is data cleaning in data science?
Whether you are writing a data analysis report, building a machine learning model, or performing any statistical analysis, data preprocessing is one of the most important steps in data science and analytics: without good data, you can never have a good model or proper insights.
Python is one of the most popular languages for working with data because it has many useful tools, such as Pandas and NumPy. This in-depth guide goes over the most important data cleaning (a.k.a. data cleansing) methods you can use with Python to turn your datasets from wild animals into well-behaved study partners.
What is messy data?
Many different forms of errors and anomalies can cause messy data. Let’s discuss some of the more prevalent categories and why they are troublesome.
Missing values
Incomplete rows in datasets are very common and can result in several missing data points in multiple fields. These missing values can significantly impact the analysis, leading to biased results.
Furthermore, NaN values or empty cells in a DataFrame can disrupt Python code, causing frustration during model development. Handling missing values is a crucial part of data preprocessing, and Pandas provides several functions and methods to detect, handle, and manage these missing values.
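A minimal sketch of that detection step, using a small hypothetical DataFrame (the column names here are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical example frame with a few gaps
df = pd.DataFrame({'age': [25, np.nan, 30], 'city': ['NY', 'LA', None]})

# Count missing values per column
missing_per_column = df.isnull().sum()

# Rows containing at least one missing value
incomplete_rows = df[df.isnull().any(axis=1)]
print(missing_per_column)
print(incomplete_rows)
```

`isnull()` treats both `NaN` and `None` as missing, so the per-column counts and the row filter catch both kinds of gap.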
Duplicates
Duplicates are rows in a Pandas DataFrame that have the same values across one or more columns. Handling duplicates is essential to ensuring data quality and accuracy in analysis. Furthermore, duplicate records can lead machine learning models to overfit the training data.
The Pandas library gives us a few ways to detect duplicates. For example, df[df.duplicated()] returns only the duplicate rows, and df.duplicated().sum() counts how many there are.
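Both expressions together, on a small hypothetical frame where one row repeats:

```python
import pandas as pd

# Hypothetical frame where one row is repeated
df = pd.DataFrame({'id': [1, 2, 2, 3], 'value': ['a', 'b', 'b', 'c']})

dupes = df[df.duplicated()]      # only the repeated rows
n_dupes = df.duplicated().sum()  # count of duplicate rows
print(dupes)
```

By default `duplicated()` marks every occurrence after the first, so the first copy of each row is kept out of `dupes`.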
Error values
Datasets may contain values that are simply wrong or that cause problems in analysis. A column may also mix data types: for example, a column that is supposed to hold decimal values might contain special characters or non-numeric strings. Another classic scenario is a date column with inconsistent formatting that breaks downstream analysis.
Furthermore, there can be logical errors in a dataset. For example, in a real estate dataset, you might see a 1500-square-foot apartment with ten bathrooms. Now, that doesn’t make any sense, does it? Python provides several tools and methods to detect and handle these error values. For instance, using regular expressions with the str.contains method can detect patterns in string data. Another common way is to check values in a particular range.
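For instance, the str.contains approach can flag entries in a supposedly numeric column that do not look like plain decimal numbers (the column name and values below are hypothetical):

```python
import pandas as pd

# Hypothetical price column that should be numeric
df = pd.DataFrame({'price': ['19.99', '24.50', 'N/A', '$15']})

# Flag entries that are not plain decimal numbers
bad = df[~df['price'].str.contains(r'^\d+(?:\.\d+)?$', regex=True)]
print(bad)
```

The `~` inverts the match, so `bad` holds only the rows that fail the numeric pattern and need attention.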
You can use the snippet below.
df[(df['column'] < min_value) | (df['column'] > max_value)]
Finally, you can set custom conditions to handle logical errors. For example, a start date should never come after the end date.
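A sketch of such a custom check, assuming hypothetical start and end columns:

```python
import pandas as pd

# Hypothetical bookings; the second row violates start <= end
df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01', '2024-03-10']),
    'end': pd.to_datetime(['2024-02-01', '2024-03-01']),
})

# Rows where the start date falls after the end date
invalid = df[df['start'] > df['end']]
print(invalid)
```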
Outliers
They are data points very different from the rest of the data. They can be caused by mistakes in measuring or entering the data or be real differences. They can mess up statistical studies, so finding and dealing with them is important.
Z-scores and the interquartile range (IQR) are two common ways to find outliers. A Z-score measures how many standard deviations a point is from the mean; points with Z-scores above 3 or below -3 are usually flagged as outliers. With the IQR method, points below Q1 - 1.5 * IQR (the lower limit) or above Q3 + 1.5 * IQR (the upper limit) are flagged as outliers. In Python, we can calculate both. The first approach is the Z-score, given in the snippet below.
from scipy import stats
df['z_score'] = stats.zscore(df['column_name'])
outliers = df[(df['z_score'] > 3) | (df['z_score'] < -3)]
The second statistical approach uses the interquartile range (IQR). Calculate the IQR of the distribution and classify any values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as potential outliers.
In Python, you can find outliers using the following script.
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR))]
Aside from statistical analysis, you can find outliers using visualization methods like boxplots and scatter plots. A box plot visually represents the distribution of data and highlights outliers as points outside the “whiskers.”
In Python, you can visualize it using a boxplot with the following snippet.
df['column_name'].plot.box()
Additionally, you can visualize outliers using a scatter plot.
df.plot.scatter(x='column1', y='column2')
The powerhouse of data cleaning: Pandas
Pandas is one of the most important libraries in Python for data science, especially for cleaning up data. It features two main data structures: Series, a labeled one-dimensional array, and DataFrame, a labeled two-dimensional table with rows and columns. These structures make working with tabular data straightforward, so it is easy to change and analyze.
Pandas’ intuitive indexing and selection capabilities allow users to target rows, columns, or specific parts of their data using names, positions, or Boolean indexing, enabling focused data cleaning. Additionally, Pandas integrates seamlessly with other popular Python tools like Scikit-learn for machine learning and NumPy for numerical computing, enhancing the scope for further research and modeling after data cleaning.
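The three selection styles mentioned above, side by side on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with labeled rows
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [10, 55, 30]},
                  index=['r1', 'r2', 'r3'])

by_label = df.loc['r2', 'score']   # label-based selection
by_position = df.iloc[0, 1]        # position-based selection
high = df[df['score'] > 20]        # Boolean indexing
```

Boolean indexing in particular is the workhorse of data cleaning, since most of the filters in this article are built from it.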
The library boasts an extensive arsenal for data cleaning, addressing a range of issues such as type conversion, where data can be easily transformed from one type to another, like converting text data into numbers. It also provides powerful string manipulation methods like strip() to remove leading and trailing whitespaces and replace() to alter unwanted characters, with support for regular expressions to clean up text data.
Pandas allows for consistent formatting and splitting of date and time data for date and time wrangling. Moreover, the duplicated() and drop_duplicates() methods help identify and eliminate duplicate rows, ensuring data integrity.
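The type conversion, string cleanup, and date handling described above can be sketched together; the column names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical raw frame: numeric text with stray spaces, date strings
df = pd.DataFrame({
    'price': [' 19.99 ', '24.50', '7.00'],
    'when': ['2024-01-05', '2024-02-10', '2024-03-15'],
})

# strip() removes leading/trailing whitespace; astype() converts the type
df['price'] = df['price'].str.strip().astype(float)

# to_datetime() turns the strings into a proper datetime column
df['when'] = pd.to_datetime(df['when'])
```

Once `when` is a datetime column, components like `df['when'].dt.month` become available for splitting date and time data.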
Handling missing data
Pandas, a popular Python library, provides the functions and resources to clean a dataset. Let’s discuss some of the ways we can clean our data.
Missing values
There are a few ways to deal with missing values. The most intuitive way is to drop all the values, but we must be careful. It’s okay if only a couple of hundred values are missing in a DataFrame with thousands of data points, but we have to ensure we are not removing most of the data. You can also fill in the missing values by taking the mean of other values or imputing them somehow. The code snippet below will help you understand it.
import pandas as pd
df = pd.read_csv('input.csv')
# To check missing value
print(df.isnull())
# Drop rows with missing values
df = df.dropna()
# Or fill them with the column mean instead
df = df.fillna(df.mean(numeric_only=True))
Duplicates
You can use the drop_duplicates() function to remove duplicate rows in a Pandas DataFrame. For example:
df = df.drop_duplicates()
Error values
A brute-force approach is to replace known error values directly:
df['column'] = df['column'].replace({'error_value': 'correct_value'})
Another brute force method is to remove error values. You can achieve this by custom filtering to remove rows with error values. If there is a type mismatch, correct the data types after detecting errors by using type coercion as needed.
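A minimal sketch of that type coercion, assuming a hypothetical square-footage column polluted by a text entry:

```python
import pandas as pd

# Hypothetical column that should be numeric but has a bad entry
df = pd.DataFrame({'sqft': ['1500', '2000', 'unknown']})

# errors='coerce' turns unparseable values into NaN instead of raising
df['sqft'] = pd.to_numeric(df['sqft'], errors='coerce')

# The resulting NaNs can then be dropped (or imputed) like any missing value
clean = df.dropna(subset=['sqft'])
```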
Outliers
The easiest way to remove outliers in Python is to get the IQR and drop all the values out of range. To do so, execute the snippet below.
import pandas as pd
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
# Calculate the interquartile range (IQR)
IQR = Q3 - Q1
# Define a lower and upper bound for outlier detection
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter out rows where the values are outside the lower and upper bounds
cleaned_df = df[(df['column'] >= lower_bound) & (df['column'] <= upper_bound)]
Aside from this, you can cap the values by setting extreme values to a specified percentile value.
import numpy as np
cap_value = df['column_name'].quantile(0.95)
df['column_name'] = np.where(df['column_name'] > cap_value, cap_value, df['column_name'])
It can also be done by applying transformations such as log or square root to reduce the impact of outliers.
import numpy as np
df['transformed'] = np.log(df['column_name'])
Another way is to replace outliers with a statistical measure like the mean, median, or a specific value. The snippet below reuses the z_score column computed earlier.
df.loc[(df['z_score'] > 3) | (df['z_score'] < -3), 'column_name'] = df['column_name'].median()
Feature engineering: Shaping data for analysis
Often, cleaning the data makes way for feature engineering, a crucial step that transforms your data into a format suitable for analysis, particularly for machine learning models. Feature engineering involves modifying and creating new features to enhance the dataset.
This process is vital for several reasons. First, it improves model performance by providing the model with more informative features, enabling it to detect patterns and make accurate predictions. Second, feature engineering enhances the readability of your data, allowing you to uncover hidden relationships and gain deeper insights into the underlying factors at play. Generating more useful features facilitates better trend learning for your model.
Moreover, improved readability means you can identify and interpret significant patterns and correlations within your data, leading to a more comprehensive understanding of the subject matter. Thus, feature engineering is an indispensable step in the data preparation pipeline, bridging the gap between raw data and actionable insights and ensuring that your machine learning models and analytical efforts yield the most accurate and meaningful results possible. Let’s discuss some key feature engineering techniques you can leverage with Python.
Creating new features
Combine existing features with arithmetic operations or logical expressions to create new features that may be more predictive. For example, if you have a dataset listing every purchase each customer made, you could add a new feature showing each customer's total spending by summing the Price column per customer.
import pandas as pd
# Sample data on customer purchases
data = {'CustomerID': [100, 100, 101, 102, 102], 'Product': ['A', 'B', 'A', 'C', 'D'], 'Price': [20, 15, 30, 50, 25]}
df = pd.DataFrame(data)
# Create a new feature 'TotalSpent' by summing 'Price' for each customer
df['TotalSpent'] = df.groupby('CustomerID')['Price'].transform('sum')
print(df)
Encoding categorical variables
Many machine learning models can't work directly with categorical data such as text labels and colors. One-hot encoding and other encoding methods turn categorical variables into numbers that models can understand. With one-hot encoding, a new binary feature is created for each category: a value of 1 means the row belongs to that category, and a value of 0 means it does not.
# Sample data with a categorical feature 'Color'
data = {'Product': ['Shirt', 'Dress', 'Pants', 'Hat'], 'Color': ['Red', 'Blue', 'Red', 'Black']}
df = pd.DataFrame(data)
# One-hot encode the 'Color' column using pd.get_dummies
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)
Feature scaling
Different features can have different scales, which can affect the performance of some machine learning models. Feature scaling methods, such as normalization (which changes features to a range of values, usually 0 to 1) or standardization (which changes features to a mean of 0 and a standard deviation of 1), can fix this problem and ensure that all features contribute equally when the model is being trained.
from sklearn.preprocessing import StandardScaler
# Sample data with features on different scales
data = {'Age': [25, 30, 18, 42], 'Income': [50000, 78000, 32000, 95000]}
df = pd.DataFrame(data)
# Standardize features using StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
print(df_scaled)
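The normalization variant mentioned above (rescaling each feature to the 0-1 range) can also be done with plain Pandas arithmetic, without scikit-learn; this is a sketch on the same sample data:

```python
import pandas as pd

# Same hypothetical features on different scales
df = pd.DataFrame({'Age': [25, 30, 18, 42],
                   'Income': [50000, 78000, 32000, 95000]})

# Min-max normalization: each column mapped to [0, 1]
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
```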
Using these feature engineering methods, you can turn your unstructured data into a well-organized and useful picture. This will allow your machine learning models to make accurate predictions and find useful insights in your data.
Cleaning data in Python: Best practices and tips
Cleaning up the data is an important part of any project that uses data analysis or machine learning. As you streamline your data cleaning process, here are some tips to keep in mind. Always keep raw data separate from cleaned and processed data: store the original files apart from the cleaned ones. This guarantees a starting point and makes it easy to return to the original data if needed.
Before making any changes, I always copy the raw data file and append "-RAW" to the original file's name to tell it apart. During any transformation, you should also use df.copy() to work on a copy of the DataFrame rather than mutating the original. Include notes in your data cleaning code that explain each cleaning step's goal and any assumptions made. Watch out for unintended effects: make sure your cleaning doesn't significantly change the data's distribution or introduce biases you didn't intend. Re-exploring the data after each cleaning pass helps you catch such problems early.
If your cleaning process takes a long time or is automated, you need to keep logs. Write down the details of each step in a separate document and keep it where it is easy to find, like in the same project folder as the data or code. If your pipeline is automated and updated regularly, you might want to make this log an automatic part of the data-cleaning process. This way, you can check in and make sure everything is going well.
Finally, write reusable functions by gathering common data cleaning jobs into reusable functions. This way, you can use the same cleaning methods on multiple datasets. This is very helpful if you need to map company-specific terms.
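A minimal sketch of such a reusable function; the name basic_clean and its arguments are hypothetical, and it simply bundles steps shown earlier in the article (copying, dropping duplicates, coercing numeric columns):

```python
import pandas as pd

def basic_clean(df, numeric_cols):
    """Hypothetical reusable cleaner: copy, drop duplicates, coerce numerics."""
    out = df.copy()                 # never mutate the caller's raw data
    out = out.drop_duplicates()
    for col in numeric_cols:
        # Bad entries become NaN for later inspection or imputation
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out

raw = pd.DataFrame({'x': ['1', '1', 'bad']})
cleaned = basic_clean(raw, ['x'])
```

Because the function takes the column list as a parameter, the same routine can be reused across datasets with different schemas.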
Taming the chaos: The final word on data cleaning with Python
Data cleaning, the unsung hero of data science, is the first and most important step in turning raw data into a solid base for analysis. By eliminating problems like missing values, outliers, inconsistencies, and duplicates, you free your data to reveal hidden patterns and support smart decisions.
Thanks to its large collection of libraries, Python gives you the tools to face these problems head-on. Pandas is the workhorse for cleaning data because it has easy-to-understand data structures, strong selection tools, and a huge number of cleaning functions. For certain needs, specialized tools like DataCleaner and OpenRefine offer extra features and simpler workflows.