# Import the libraries
import pandas as pd
import numpy as np
“Data is the new oil!”, they say, painting it as the big-shot driver of the economy. But when you hear that, you’re probably picturing slick, refined gasoline, right?
Wrong picture.
In the real world, raw data is like crude oil – unrefined, messy, and full of crud. If not properly cleaned, guess what? You risk clogging the very engine of decision-making you set out to fuel.
When it comes to data analysis, the adage “garbage in, garbage out” couldn’t be more relevant!
Consider this: bad data costs the US a staggering 3 trillion USD annually! Without thorough cleaning, data becomes not just useless but potentially detrimental, leading to misguided insights and costly business blunders. It’s no surprise, then, that data cleaning is often the first, non-negotiable step in any data science or machine learning endeavor.
So in this tutorial, you’ll learn the basics of scrubbing your data clean of inaccuracies and inconsistencies. You’ll explore techniques that cut through the crap, using tools in Python that do the heavy lifting for you. Let’s begin!
Tutorial prerequisites
I’m assuming you already have some programming knowledge. Some familiarity with Python is necessary, so if you know the basics, you’ll find this tutorial relatively simple. If you don’t, I highly recommend that you check out this free course on Introduction to Python.
Why Python? Because it is fast emerging as the preferred language for data science. It is fairly easy to pick up and learn, and the Python ecosystem has tools and libraries for building virtually anything, from web servers to packages for statistics, data cleaning, and machine learning. Python also has one of the most active communities on the internet, with sites like Stack Overflow.
Launch the code
Clone this repository, and follow the instructions to install and run the code. But if that’s too cumbersome -
Just click the button below, which will automatically open an interactive executable environment, right within your browser! Much better, right?
The scenario
Imagine that the data science and machine learning team at FutureBank want to understand the key factors influencing a customer’s decision to respond positively to term deposit subscriptions. By analyzing this data, the bank wants to refine its marketing strategies, tailor its offerings, and ultimately enhance customer satisfaction and retention.
However, before the team can dive into algorithms and analytics, they need the data to be squeaky clean. That’s where you come in! Your meticulous data cleaning will set the foundation for the team’s success, empowering the team to craft a marketing masterpiece that resonates with its customers.
Loading our data toolkit
For this tutorial, you’ll be relying on pandas and numpy, two of the most popular libraries that let you manipulate data in a variety of formats.
- pandas: pandas is the Swiss Army knife in this toolkit, helping you with filtering, grouping, and aggregation. It’s especially powerful for reading and writing data in various formats, handling missing data, and reshaping or pivoting datasets.
- numpy: Short for Numerical Python, numpy is fundamental for scientific computing in Python. Its ability to perform complex computations quickly and efficiently makes it invaluable for analyzing large datasets. It will help you sort through any numerical mess.
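If you haven’t used these two together before, here’s a tiny, self-contained sketch (with made-up numbers, not taken from the bank dataset) of the kind of grouping and number-crunching they enable:

```python
import pandas as pd
import numpy as np

# A toy DataFrame with made-up values, just to illustrate the API
toy = pd.DataFrame({
    "marital": ["married", "single", "married", "single"],
    "balance": [2143, 29, 2, 1506],
})

# pandas: group and aggregate in one line
print(toy.groupby("marital")["balance"].mean())

# numpy: fast element-wise math on the underlying array
print(np.log1p(toy["balance"].to_numpy()))
```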
Exploring our data
The team has managed to gather some information and provided it to you in the form of a csv file. Here’s what it contains:
Column Name | Description |
---|---|
customerid | A unique identifier for each customer. |
age | The age of the customer. |
salary | Annual income of the customer. |
balance | Current account balance. |
marital | Marital status of the customer. |
jobedu | A combined field of the customer’s job title and level of education. |
targeted | Indicates whether marketing efforts were targeted at the customer based on data analytics. |
default | Whether the customer has defaulted on a loan. |
housing | If the customer has a housing loan. |
loan | If the customer has a personal loan. |
contact | The method of communication used in the campaign. |
day | The day of the month the customer was last contacted. |
month | The month the customer was last contacted. |
duration | Duration of the last contact, in seconds. |
campaign | Number of contacts performed during this campaign for this customer. |
pdays | Number of days since the customer was last contacted from a previous campaign. |
previous | Number of contacts performed before this campaign for this customer. |
poutcome | Outcome of the previous marketing campaign. |
response | If the customer subscribed to a term deposit. |
Now, you’ll use the pandas
library to load this file into a DataFrame, and read the first 5 rows:
#read the dataset and check the first five rows
raw_data = pd.read_csv("./dataset/bank_marketing_updated_v1.csv")
raw_data.head(n=5)
/tmp/ipykernel_2510602/1749228689.py:2: DtypeWarning: Columns (0,1,2,3,11,14,15,16) have mixed types. Specify dtype option on import or set low_memory=False.
raw_data = pd.read_csv("./dataset/bank_marketing_updated_v1.csv")
banking marketing | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | customer id and age. | NaN | Customer salary and balance. | NaN | Customer marital status and job with education... | NaN | particular customer before targeted or not | NaN | Loan types: loans or housing loans | NaN | Contact type | NaN | month of contact | duration of call | NaN | NaN | NaN | outcome of previous contact | response of customer after call happned |
1 | customerid | age | salary | balance | marital | jobedu | targeted | default | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | response |
2 | 1 | 58 | 100000 | 2143 | married | management,tertiary | yes | no | yes | no | unknown | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no |
3 | 2 | 44 | 60000 | 29 | single | technician,secondary | yes | no | yes | no | unknown | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no |
4 | 3 | 33 | 120000 | 2 | married | entrepreneur,secondary | yes | no | yes | yes | unknown | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no |
#print the dimensions of the dataset
raw_data.shape
(45213, 19)
The data cleaning process
Broadly speaking, the data cleaning process can be divided into 5 steps:
Fix Rows and Columns → Fix Missing Values → Standardize Values → Fix Invalid Values → Filter Data
- Fix Rows and Columns: Start by adjusting or removing any unnecessary or incorrectly formatted rows and columns.
- Fix Missing Values: Next, identify and handle missing data, either by imputing values or deciding how to deal with these gaps.
- Standardize Values: Ensure that all data follows a consistent format, which could involve normalizing numerical values or standardizing text entries.
- Fix Invalid Values: Correct or remove any data that is incorrect or does not fit the expected pattern.
- Filter Data: Finally, filter the data to focus on the relevant subset for your analysis or requirements.
We’ll begin with the first step:
Step 1: Fix rows and columns
For fixing rows, here’s a checklist:
- Delete Summary Rows: Remove rows that summarize data, like ‘Total’ or ‘Subtotal’ rows, which can skew analysis.
- Delete Incorrect Rows: Remove rows that are not part of the actual data, such as repeated header rows or footer information.
- Delete Extra Rows: Get rid of rows that are irrelevant to your analysis, such as blank rows, page numbers, or formatting indicators.
Notice how, when we loaded our data, the header looked a bit weird? The actual data we need starts from the 3rd row; the first 2 rows are extra rows with unnamed columns, so we’ll need to skip them. Here’s how we can do it:
#skip the first 2 rows and read the data again
raw_data = pd.read_csv("./dataset/bank_marketing_updated_v1.csv", skiprows=2)
raw_data.head(n=5)
customerid | age | salary | balance | marital | jobedu | targeted | default | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 58.0 | 100000 | 2143 | married | management,tertiary | yes | no | yes | no | unknown | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no |
1 | 2 | 44.0 | 60000 | 29 | single | technician,secondary | yes | no | yes | no | unknown | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no |
2 | 3 | 33.0 | 120000 | 2 | married | entrepreneur,secondary | yes | no | yes | yes | unknown | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no |
3 | 4 | 47.0 | 20000 | 1506 | married | blue-collar,unknown | no | no | yes | no | unknown | 5 | may, 2017 | 92 sec | 1 | -1 | 0 | unknown | no |
4 | 5 | 33.0 | 0 | 1 | single | unknown,unknown | no | no | no | no | unknown | 5 | may, 2017 | 198 sec | 1 | -1 | 0 | unknown | no |
# We can also print the column names this way:
raw_data.columns
Index(['customerid', 'age', 'salary', 'balance', 'marital', 'jobedu',
'targeted', 'default', 'housing', 'loan', 'contact', 'day', 'month',
'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'response'],
dtype='object')
Next, notice that the jobedu
column is a mashup of job and education details. For a proper analysis, it would be beneficial to separate this into two distinct columns: one for job and another for education. Analyzing job and education individually could play a crucial role in identifying customer segments more likely to respond positively to term deposits.
# Isolate job details into a newly created 'job' column from the "jobedu" column
raw_data['job'] = raw_data.jobedu.apply(lambda x: x.split(",")[0])
raw_data.head()
customerid | age | salary | balance | marital | jobedu | targeted | default | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | response | job | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 58.0 | 100000 | 2143 | married | management,tertiary | yes | no | yes | no | unknown | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no | management |
1 | 2 | 44.0 | 60000 | 29 | single | technician,secondary | yes | no | yes | no | unknown | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no | technician |
2 | 3 | 33.0 | 120000 | 2 | married | entrepreneur,secondary | yes | no | yes | yes | unknown | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no | entrepreneur |
3 | 4 | 47.0 | 20000 | 1506 | married | blue-collar,unknown | no | no | yes | no | unknown | 5 | may, 2017 | 92 sec | 1 | -1 | 0 | unknown | no | blue-collar |
4 | 5 | 33.0 | 0 | 1 | single | unknown,unknown | no | no | no | no | unknown | 5 | may, 2017 | 198 sec | 1 | -1 | 0 | unknown | no | unknown |
Understanding the code
- The apply() method is used to apply a function along an axis of the DataFrame. In this case, the function is a lambda function, which is an anonymous function defined in-line.
- lambda x: x.split(",")[0]: This lambda function takes an input x (which represents each element in the jobedu column) and splits the string into a list at each comma. [0] accesses the first element of the resulting list (which contains the job info).
# Isolate education details into a newly created 'education' column from the "jobedu" column.
raw_data['education'] = raw_data.jobedu.apply(lambda x: x.split(",")[1])
raw_data.head()
customerid | age | salary | balance | marital | jobedu | targeted | default | housing | loan | ... | day | month | duration | campaign | pdays | previous | poutcome | response | job | education | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 58.0 | 100000 | 2143 | married | management,tertiary | yes | no | yes | no | ... | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no | management | tertiary |
1 | 2 | 44.0 | 60000 | 29 | single | technician,secondary | yes | no | yes | no | ... | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no | technician | secondary |
2 | 3 | 33.0 | 120000 | 2 | married | entrepreneur,secondary | yes | no | yes | yes | ... | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no | entrepreneur | secondary |
3 | 4 | 47.0 | 20000 | 1506 | married | blue-collar,unknown | no | no | yes | no | ... | 5 | may, 2017 | 92 sec | 1 | -1 | 0 | unknown | no | blue-collar | unknown |
4 | 5 | 33.0 | 0 | 1 | single | unknown,unknown | no | no | no | no | ... | 5 | may, 2017 | 198 sec | 1 | -1 | 0 | unknown | no | unknown | unknown |
5 rows × 21 columns
#drop the "jobedu" column from the dataframe.
'jobedu',axis= 1, inplace= True)
raw_data.drop( raw_data.head()
customerid | age | salary | balance | marital | targeted | default | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | response | job | education | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 58.0 | 100000 | 2143 | married | yes | no | yes | no | unknown | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no | management | tertiary |
1 | 2 | 44.0 | 60000 | 29 | single | yes | no | yes | no | unknown | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no | technician | secondary |
2 | 3 | 33.0 | 120000 | 2 | married | yes | no | yes | yes | unknown | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no | entrepreneur | secondary |
3 | 4 | 47.0 | 20000 | 1506 | married | no | no | yes | no | unknown | 5 | may, 2017 | 92 sec | 1 | -1 | 0 | unknown | no | blue-collar | unknown |
4 | 5 | 33.0 | 0 | 1 | single | no | no | no | no | unknown | 5 | may, 2017 | 198 sec | 1 | -1 | 0 | unknown | no | unknown | unknown |
inplace=True is useful for making changes to the original DataFrame without having to create a new one.
Step 2: Fix missing values
Missing values are like puzzle pieces that got lost – they can leave you with an incomplete picture. But don’t worry, there are four strategic ways to bring everything back into focus:
- Mark the Absentees: Scour through your dataset and flag the elusive ones. Missing values are often disguised as NA or as placeholder values like XXX or 999 (see the short sketch after this list for flagging these at load time).
- Trim the excess: When a few data points have wandered off leaving barely a trace, it’s safe to let them go. Rows can be deleted if the number of missing values is insignificant, as this won’t affect the overall analysis results. Columns are treated a bit differently: you can drop a column if a significant share of its values is missing.
- Fill in the blanks with business insight: If you have a good understanding of your domain, you can make informed guesses that make sense.
- Fill missing values using statistics: You can also make informed guesses by observing patterns in your data, and filling in the missing values with a close approximation. This technique is also called imputation.
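For the first strategy, if you already know which placeholder strings a file uses for missing data, pandas can flag them as NaN at load time. Here’s a minimal sketch; the placeholder list is hypothetical and should be adjusted to whatever your dataset actually uses:

```python
# Hypothetical placeholders; adjust to the ones your dataset actually uses
placeholders = ["NA", "XXX", "999"]

# read_csv converts any of these strings to NaN while loading the file
df = pd.read_csv("./dataset/bank_marketing_updated_v1.csv",
                 skiprows=2, na_values=placeholders)
```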
Let’s see which columns in the dataset contain null values:
raw_data.isnull().sum()
customerid 0
age 20
salary 0
balance 0
marital 0
targeted 0
default 0
housing 0
loan 0
contact 0
day 0
month 50
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
response 30
job 0
education 0
dtype: int64
- isnull returns a new DataFrame of booleans, where a cell is True if the corresponding cell in raw_data is a missing value (NaN, None, or NaT), and False otherwise.
- In pandas, when you call .sum() on a DataFrame of booleans, it treats True as 1 and False as 0, so each column’s sum is its count of missing values.
So it appears that age, month, and response all have missing values. Let’s handle the age column first. What percentage of the age column is actually missing?
For this, we’ll need to compute
\[ \frac{\text{missing values in the age column}}{\text{total number of rows}} \times 100 \]
#calculate the percentage of missing values in the age column.
missing_ages_count = raw_data.age.isnull().sum()
missing_ages_count * 100 / raw_data.shape[0]
0.04423702196368141
Alternatively, you can do this too:
raw_data.age.isnull().mean() * 100
As we can see, they are quite small in number. The easiest way to deal with them is to just delete them.
#drop the records with age missing in raw_data
raw_data = raw_data[~raw_data.age.isnull()].copy()

print('Number of null values in age after dropping: ', raw_data.age.isnull().sum())
print('Total number of rows:', raw_data.shape[0])
Number of null values in age after dropping: 0
Total number of rows: 45191
What does this code do?
- raw_data.age.isnull(): This part of the code returns a boolean Series where each value is True if the corresponding value in the age column of raw_data is missing, and False otherwise.
- The ~ operator in front negates the boolean Series. In other words, it converts True to False and vice versa. Thus, this operation effectively selects all rows where the age column does not have missing values.
- As a result, raw_data[~raw_data.age.isnull()] creates a new DataFrame that includes only those rows of raw_data where the age value is not null.
Finally, we use the copy() method to create an independent copy of this DataFrame and assign it back to raw_data.
Next, we’ll look at the month column. What share of its records is missing?
#calculate the percentage of missing values in the month column.
raw_data.month.isnull().mean() * 100
0.11064149941360005
Depending on what we want, we could try different approaches.
- We could, once again, simply drop the rows with missing values.
- Or, we could fill them (imputation).
Let’s try the imputation strategy this time! But how do we know what value to impute?
Let’s analyze the values in the month column and how frequently they appear in the dataset. Here are a few techniques:
- We could replace the missing values with the value that occurs most frequently (mode).
- For columns containing quantitative values, we might also have the option to replace the missing values by the average.
raw_data.month.value_counts(normalize=True)
may, 2017 0.304380
jul, 2017 0.152522
aug, 2017 0.138123
jun, 2017 0.118141
nov, 2017 0.087880
apr, 2017 0.064908
feb, 2017 0.058616
jan, 2017 0.031058
oct, 2017 0.016327
sep, 2017 0.012760
mar, 2017 0.010545
dec, 2017 0.004741
Name: month, dtype: float64
# Find the value that occurs most frequently (mode)
month_mode = raw_data.month.mode()[0]
print(month_mode)
may, 2017
In this case, the value may, 2017 appears most frequently (about 30% of the time). This makes it a good value to impute.
Let’s replace all missing values in the month column with this value.
# fill the missing values with the mode value of month in raw_data.
raw_data.month.fillna(month_mode, inplace=True)

print('Number of null values in month after filling: ', raw_data.month.isnull().sum())
print(raw_data.month.value_counts(normalize=True))
Number of null values in month after filling:  0
may, 2017 0.305149
jul, 2017 0.152353
aug, 2017 0.137970
jun, 2017 0.118010
nov, 2017 0.087783
apr, 2017 0.064836
feb, 2017 0.058551
jan, 2017 0.031024
oct, 2017 0.016309
sep, 2017 0.012746
mar, 2017 0.010533
dec, 2017 0.004735
Name: month, dtype: float64
- fillna is a method in pandas used to fill missing values (denoted as NaN) in a DataFrame or Series.
- The inplace=True argument specifies that this operation should modify the data directly, instead of creating a new object with the missing values filled. After this line of code is executed, the month column will no longer contain any missing values; they will have been replaced with the value in month_mode.
Now, we have one more column to fix missing values in: response
.
#calculate the percentage of missing values in the response column.
raw_data.response.isnull().mean() * 100
0.06638489964816004
Only about 0.07% of the values in this column are null. We can simply drop those rows!
#drop the records with missing values for response.
raw_data = raw_data[~raw_data.response.isnull()].copy()

print('Number of null values in response after dropping: ', raw_data.response.isnull().sum())
print('Total number of rows:', raw_data.shape[0])
Number of null values in response after dropping: 0
Total number of rows: 45161
Is that all?
Not quite!
Data can often be deceptive. Don’t just assume that missing values will always be present as null
or NaN
!
Take a look at the pdays
column, for example -
#describe the pdays column
raw_data.pdays.describe()
count 45161.000000
mean 40.182015
std 100.079372
min -1.000000
25% -1.000000
50% -1.000000
75% -1.000000
max 871.000000
Name: pdays, dtype: float64
(raw_data.pdays < 0).mean() * 100
81.74088261995969
Whoa! It looks like nearly 82% of the values in this column are -1! It seems the data-gathering team entered -1 to indicate null values!
Here’s what we could do:
We could simply ignore the missing values in our calculations. For this, we’ll need to replace all the negative values with NaN, so that any number crunching we do later on will ignore these missing values.
#replace the -1 placeholder values in pdays with NaN, then describe the column again.
raw_data.loc[raw_data.pdays < 0, "pdays"] = np.nan
raw_data.pdays.describe()
count 8246.000000
mean 224.542202
std 115.210792
min 1.000000
25% 133.000000
50% 195.000000
75% 327.000000
max 871.000000
Name: pdays, dtype: float64
What does this code do?
- .loc[] is a label-based selection method used to select rows and columns from a DataFrame.
- raw_data.pdays < 0 is a condition that checks which rows in the pdays column of raw_data have values less than 0. It creates a boolean Series where each value is True if the corresponding value in pdays is less than 0, and False otherwise.
- By placing this condition within raw_data.loc[], along with the column name "pdays", the code selects all pdays values that are less than 0.
- Finally, np.nan replaces them with NaN, effectively treating them as missing or undefined values.
Perfect! Now you can move on to the next step -
Step 3: Standardizing values
The first question to ask is - are there extreme values in our dataset?
Detecting outliers
Finding outliers in a dataset is crucial, because very small or very large values can negatively affect our data analysis and skew our findings.
How outliers skew data analysis
Imagine you’re analyzing the salary data of employees in a tech company. The company has a diverse pool of talent, ranging from fresh graduates to seasoned experts, with salaries mostly ranging between $50,000 and $150,000 annually. Out of 500 employees, 498 have salaries within this range.
But out of 500 folks, imagine that there are these two big shots – high-flyers with a compensation that’s in a league of its own, over $2 million each, thanks to a stack of stock options and bonuses. What happens?
In this scenario, the two executives are outliers. They’re like the unicorns in a field of horses. Not exactly what you’d call ‘typical’. They are not representative of the typical employee’s earnings and can significantly skew the average salary calculation. So if you were to calculate the average salary including these outliers, you might conclude that the average employee at this company earns around $90,000 a year. But let’s face it - this number is misleading! It’s heavily influenced by the two extreme values.
This is also why per capita income isn’t always considered a true measure of development.
Okay, so how do we handle them?
There’s a whole arsenal of tactics at your disposal to tackle these outliers. You could, for example, kick them out of your dataset entirely. Or, you might play it cool and just cap their influence, keeping them in check. Another move? Play the imputation game, where you replace their values with other values that are more ‘typical’, with some savvy guesswork. Or, if you’re feeling strategic, go for binning and categorize them.
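To make the capping idea concrete, here’s a small sketch that clips a numeric column to its 1st and 99th percentiles. It’s purely illustrative; we don’t actually cap salary in this tutorial:

```python
# Illustrative capping (winsorizing): clamp salary to its 1st and 99th percentiles
lower, upper = raw_data['salary'].quantile([0.01, 0.99])
capped_salary = raw_data['salary'].clip(lower=lower, upper=upper)
```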
Let’s dive deeper into each of these methods and see what works best:
When it comes to hunting down outliers, Python arms us with some seriously powerful visualization tools. They’re like our digital magnifying glasses, that help us zoom in on those sneaky data points that try to throw a wrench in our analysis.
For this task, you’ll be utilizing Plotly. It’s a great tool for interactive data visualization in Python, enabling us to clearly identify and analyze outliers through its interactive plots.
#import plotly
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'
Let’s plot a histogram - one of the seven basic tools of quality.
# Create a histogram of the age variable using Plotly Express
fig = px.histogram(raw_data, x='age')

# Show the plot
fig.show()
A histogram is a valuable tool in outlier detection for several reasons:
- Visual Representation of Distribution: Histograms provide a visual representation of data distribution. They display the frequency of data points within certain ranges (bins), making it easier to see the overall spread of the data.
- Identifying Skewness and Peaks: Outliers can cause a distribution to be skewed. In a histogram, this might appear as a long tail extending to the right (positive skew) or left (negative skew). A distribution with a significant skew might suggest the presence of outliers.
- Spotting Unusual Patterns: Outliers can manifest as isolated bars on a histogram, separated from the majority of the data. For example, if most data points are clustered within a certain range, but a few bars appear far from this cluster, those bars could represent outliers.
Next, we’ll make use of another plot - the box plot.
# Create a boxplot of the age variable using Plotly Express
fig = px.box(raw_data, y='age')

# Show the plot
fig.show()
In this graph, data points that fall outside of the whiskers are often considered potential outliers. In this plot, they are indicated with dots.
What’s the box plot about?
The box plot is another excellent tool for detecting outliers due to its specific way of representing data distribution. The box in a box plot represents the interquartile range (IQR), which is the range between the first and third quartiles (Q3 - Q1). The IQR is crucial in identifying outliers, as it measures the spread of the middle 50% of the data.
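If you’d like to back up the picture with numbers, the same whisker boundaries can be computed directly from the quartiles. A quick sketch for the age column:

```python
# Compute the IQR fences for age, mirroring what the box plot draws
q1 = raw_data['age'].quantile(0.25)
q3 = raw_data['age'].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rows whose age falls outside the whiskers are the potential outliers
age_outliers = raw_data[(raw_data['age'] < lower_fence) | (raw_data['age'] > upper_fence)]
print(len(age_outliers))
```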
Alright, let’s get this straight: our data’s showing some ‘outliers’ in the age variable, especially in the 70-90 bracket. But let’s not jump the gun here. People being in their 70s or 80s? Totally normal, happens all the time! This isn’t a teen party; it’s real life, and in real life, people grow old.
Just because they’re few and far between doesn’t mean they are some statistical fluke. So it would be wise to keep them. They might even be key players in this whole term deposit account game.
Next, let’s look at the salary
variable:
raw_data.salary.describe()
count 45161.000000
mean 57004.849317
std 32087.698810
min 0.000000
25% 20000.000000
50% 60000.000000
75% 70000.000000
max 120000.000000
Name: salary, dtype: float64
#plot the boxplot of the salary variable.
fig = px.box(raw_data, y='salary')

# Show the plot
fig.show()
Next, examine the balance
variable:
#describe the balance variable of raw_data.
raw_data.balance.describe()
count 45161.000000
mean 1362.850690
std 3045.939589
min -8019.000000
25% 72.000000
50% 448.000000
75% 1428.000000
max 102127.000000
Name: balance, dtype: float64
fig = px.box(raw_data, y='balance')

# Show the plot
fig.show()
Your Turn to Dive In
We’ve walked through the nitty-gritty of understanding and handling outliers, and now, it’s your turn to take the reins. Consider this the part where you roll up your sleeves and get your hands dirty with real data. As an exercise, here’s what you can do:
- Identify Potential Outliers: Use your newfound skills to spot potential outliers in other variables of the dataset. Maybe take a closer look at variables like ‘salary’, ‘balance’, or ‘duration’. Are there any unexpected peaks or values that seem off the charts?
- Analyze Their Impact: Once you’ve identified these outliers, analyze their impact. How do they skew the data? Do they tell a story that’s worth digging into, or are they just statistical noise?
- Decide Your Approach: Based on your analysis, decide how you want to handle these outliers. Will you adjust them, remove them, or leave them as they are? Remember, there’s no one-size-fits-all answer here; it’s about what makes sense in the context of your specific data story.
- Reflect and Document: As you work through this exercise, make sure to document your observations and decisions. What did you learn? How did the presence (or absence) of outliers affect your overall analysis?
Step 4: Fixing invalid values
In the world of data analysis, consistency is king. Why is this important? Because data often comes from various sources, each singing to its own tune – different units, scales, formats, you name it. Our goal here is to orchestrate these diverse elements into a unified language that our analytical tools can understand and interpret accurately.
Not fixing invalid values in your data can lead to data quality issues, making the data less reliable and harder to work with!
So here’s what you need to look out for:
Fix Units
Ensure all observations under a single variable are expressed in a consistent unit. Here are some examples:
- If you have weight data in both pounds (lbs) and kilograms (kg), choose one (preferably the one most commonly used in your dataset’s context) and convert all data to that unit.
- If some data points are recorded on a different scale (e.g., a test score out of 50 vs. 100), convert them to a common scale, like a percentage.
- This uniformity prevents confusion and errors in analysis that can arise from unit discrepancies.
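As a quick sketch of both points, using hypothetical columns that aren’t part of our bank data:

```python
# Hypothetical weights recorded in pounds, converted to a single unit (kg)
weights = pd.DataFrame({"weight_lbs": [150, 180, 200]})
weights["weight_kg"] = weights["weight_lbs"] * 0.453592

# Hypothetical test scores on different scales, rescaled to percentages
scores_out_of_50 = pd.Series([40, 35, 48])
scores_pct = scores_out_of_50 / 50 * 100
```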
Fix Precision
Maintain a consistent level of decimal precision for better readability and uniformity. Round off numerical values to a standard number of decimal places that makes sense for your analysis, e.g., changing 4.5312341
kg to 4.53
kg. It enhances data readability and prevents the overemphasis of minor differences that are not meaningful for the analysis.
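In pandas, that’s usually a one-liner with round(); a tiny sketch with made-up measurements:

```python
# Made-up measurements rounded to a consistent 2 decimal places
measurements = pd.Series([4.5312341, 9.323252])
print(measurements.round(2))  # 4.53, 9.32
```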
Remove extra characters
Cleanse string variables of unnecessary characters. Remove any irrelevant characters such as prefixes, suffixes, and extra spaces. For instance, trim spaces from a name string like " John Doe "
.
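The pandas string accessor handles this well; a small sketch with invented name strings:

```python
# Invented examples: trim the ends and collapse repeated internal spaces
names = pd.Series(["  John Doe ", "Jane   Smith"])
cleaned = names.str.strip().str.replace(r"\s+", " ", regex=True)
```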
Fix inconsistent casing
Unify the casing style of string variables. Convert all text to a consistent case format (uppercase, lowercase, title case, etc.), based on what’s most appropriate for your data. This is particularly useful for categorical data and text analysis, ensuring that the same categories or keywords are not treated as different due to case differences.
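A sketch of what that looks like in practice, using an invented categorical column:

```python
# Invented categories that differ only by case collapse into one after lowercasing
categories = pd.Series(["Married", "MARRIED", "married"])
print(categories.str.lower().value_counts())
```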
Standardize Format
Ensure consistency in the format of structured text data like dates and names. For example, changing date formats from 23/10/16
to 2016/10/23
, or standardizing name formats. This uniformity is crucial for sorting, filtering, and correctly interpreting structured text data.
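For dates, pandas can parse one format and re-emit another; a brief sketch with invented date strings (assuming a day/month/year input format):

```python
# Parse day/month/year strings and re-emit them in a consistent year/month/day format
dates = pd.Series(["23/10/16", "05/01/17"])
standardized = pd.to_datetime(dates, format="%d/%m/%y").dt.strftime("%Y/%m/%d")
```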
Let’s look at an example from our dataset:
raw_data.duration.head(10)
0 261 sec
1 151 sec
2 76 sec
3 92 sec
4 198 sec
5 139 sec
6 217 sec
7 380 sec
8 50 sec
9 55 sec
Name: duration, dtype: object
As you’ll notice, the values in this column are appended with a ‘sec’ suffix. This could pose a challenge when performing numerical calculations.
A practical solution would be to standardize these values into minutes while simultaneously removing the sec suffix. This way, we’re left with purely numerical values, more conducive to accurate and efficient data analysis.
raw_data.duration = raw_data.duration.apply(lambda x: float(x.split()[0])/60 if x.find("sec") > 0 else float(x.split()[0]))

raw_data.duration.describe()
count 45161.000000
mean 4.302774
std 4.293129
min 0.000000
25% 1.716667
50% 3.000000
75% 5.316667
max 81.966667
Name: duration, dtype: float64
What does this code do?
.apply(lambda x: ...)
: The apply method is used to apply a function along an axis of the DataFrame. Here, it’s being applied to each element (x) in the duration column- The function splits each string in the duration columnn if the substring
'sec'
exists, and takes the first element of the resulting list. This is useful for extracting the numeric part from a string like"120 sec"
float(...)
: Converts the extracted string to a floating-point number.
Wrapping up
We’ve navigated through the crucial steps of data cleaning and standardization, tackling challenges like handling missing values, detecting outliers, and ensuring our data speaks in a consistent and clear language. These steps form the backbone of any robust data analysis process, turning raw data into a reliable foundation for informed decision-making.
But remember, this is just the tip of the iceberg. The realm of data analysis is vast and filled with many more techniques, methods, and best practices. What we’ve explored are the highlights, the essential skills that every data enthusiast should master.
To further aid your journey in data analysis and to ensure you have quick access to essential information, here’s a handy reference below. This table is a distilled guide to data cleaning, encompassing the key steps and methods we’ve discussed. Keep it as a go-to resource whenever you embark on a data-cleaning exercise. It’s designed to serve as a quick reminder of the various techniques and approaches that are crucial in transforming raw data into a clean, analysis-ready format.
Data quality issue | Examples | How to resolve |
---|---|---|
Fix rows and columns | | |
Incorrect rows | Header rows, footer rows | Delete |
Summary rows | Total, subtotal rows | Delete |
Extra rows | Column numbers, indicators, blank rows | Delete |
Missing column names | Column names left blank, or as NA, XX etc. | Add the column names |
Inconsistent column names | X1, X2, C4, which give no information about the column | Add column names that give some information about the data |
Unnecessary columns | Unidentified columns, irrelevant columns, blank columns | Delete |
Columns containing multiple data values | E.g. address columns containing city, state, country | Split columns into components |
No unique identifier | E.g. multiple cities with the same name in a column | Combine columns to create unique identifiers, e.g. combine City with State |
Misaligned columns | Shifted columns | Align these columns |
Missing Values | | |
Disguised missing values | Blank strings " ", "NA", "XX", "999" | Set these values as missing values |
Significant number of missing values in a row/column | | Delete the rows/columns |
Partial missing values | Missing time zone, century etc. | Fill the missing values with the correct value |
Standardise Numbers | | |
Non-standard units | Weights in both lbs and kgs, speeds in miles/hr and km/hr | Standardise the observations so all of them have the same consistent units |
Values with varying scales | A column containing marks in subjects, with some subject marks out of 50 and others out of 100 | Make the scale common, e.g. a percentage scale |
Over-precision | 4.5312341 kgs, 9.323252 meters | Standardise precision for better presentation of data, e.g. 4.5312341 kgs could be presented as 4.53 kgs |
Remove outliers | Abnormally high and low values | Correct if entered by mistake, else remove |
Standardise Text | | |
Extra characters | Common prefixes/suffixes, leading/trailing/multiple spaces | Remove the extra characters |
Different cases of the same words | Uppercase, lowercase, Title Case, Sentence case, etc. | Standardise the case/bring to a common case |
Non-standard formats | 23/10/16 vs. 2016/10/23, "Reacher, Jack" vs. "Jack Reacher" | Correct the format/standardise the format for better readability and analysis |
Fix Invalid Values | | |
Encoding issues | CP1252 instead of UTF-8 | Encode unicode properly |
Incorrect data types | Number stored as a string: "12,300" | Convert to the correct data type |
Correct values not in list | Non-existent country, PIN code | Delete the invalid values, treat as missing |
Wrong structure | Phone number with over 10 digits | |
Correct values beyond range | Temperature less than -273 °C (0 K) | |
Validate internal rules | Gross sales > Net sales; Date of delivery > Date of ordering; If Title is "Mr" then Gender is "M" | |
Filter Data | | |
Duplicate data | Identical rows, rows where some columns are identical | Deduplicate data/remove duplicated data |
Extra/unnecessary rows | Rows that are not required in the analysis, e.g. if only observations before or after a particular date are needed, other rows become unnecessary | Filter rows to keep only the relevant data |
Columns not relevant to analysis | Columns that are not needed for the analysis, e.g. personal detail columns such as address or phone number | Filter columns and pick only the ones relevant to your analysis |
Dispersed data | Parts of the data required for analysis stored in different files or datasets | Bring the data together, group by the required keys, aggregate the rest |