# Import the libraries
import pandas as pd
import numpy as np
“Data is the new oil!”, they say, painting it as the big-shot driver of the economy. But when you hear that, you’re probably picturing slick, refined gasoline, right?
Wrong picture.
In the real world, raw data is like crude oil – unrefined, messy, and full of crud. If not properly cleaned, guess what? You risk clogging the very engine of decision-making you set out to fuel.
When it comes to data analysis, the adage “garbage in, garbage out” couldn’t be more relevant!
Consider this: bad data costs the US a staggering 3 trillion USD annually! Without thorough cleaning, data becomes not just useless but potentially detrimental, leading to misguided insights and costly business blunders. It’s no surprise, then, that data cleaning is often the first, non-negotiable step in any data science or machine learning endeavor.
So in this tutorial, you’ll learn the basics of scrubbing your data clean of inaccuracies and inconsistencies. You’ll explore techniques that cut through the crap, using tools in Python that do the heavy lifting for you. Let’s begin!
Tutorial prerequisites
I’m assuming you already have some programming knowledge. Some familiarity with Python is necessary, so if you know the basics, you’ll find this tutorial relatively simple. If you don’t, I highly recommend that you check out this free course on Introduction to Python.
Why Python? Because it is fast emerging as the preferred language for data science. It is fairly easy to pick up and learn, and the Python ecosystem has tools and libraries for building virtually anything, from web servers to packages for statistics, data cleaning, and machine learning. Python also has one of the most active communities on the internet, with sites like Stack Overflow.
Launch the code
Clone this repository, and follow the instructions to install and run the code. But if that’s too cumbersome -
Just click the button below, which will automatically open an interactive executable environment, right within your browser! Much better, right?
The scenario
Imagine that the data science and machine learning team at FutureBank want to understand the key factors influencing a customer’s decision to respond positively to term deposit subscriptions. By analyzing this data, the bank wants to refine its marketing strategies, tailor its offerings, and ultimately enhance customer satisfaction and retention.
However, before the team can dive into algorithms and analytics, they need the data to be squeaky clean. That’s where you come in! Your meticulous data cleaning will set the foundation for the team’s success, empowering the team to craft a marketing masterpiece that resonates with its customers.
Loading our data toolkit
For this tutorial, you’ll be relying on pandas and numpy, two of the most popular libraries that let you manipulate data in a variety of formats.
- pandas: pandas is the Swiss Army knife in this toolkit, helping you with filtering, grouping, and aggregation. It’s especially powerful for reading and writing data in various formats, handling missing data, and reshaping or pivoting datasets.
- numpy: Short for Numerical Python, numpy is fundamental for scientific computing in Python. Its ability to perform complex computations quickly and efficiently makes it invaluable for analyzing large datasets. It will help you sort through any numerical mess.
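If you haven’t used these two together before, here’s a tiny, self-contained sketch (with made-up numbers, not taken from the bank dataset) of the kind of grouping and number-crunching they enable:

```python
import pandas as pd
import numpy as np

# A toy DataFrame with made-up values, just to illustrate the API
toy = pd.DataFrame({
    "marital": ["married", "single", "married", "single"],
    "balance": [2143, 29, 2, 1506],
})

# pandas: group and aggregate in one line
print(toy.groupby("marital")["balance"].mean())

# numpy: fast element-wise math on the underlying array
print(np.log1p(toy["balance"].to_numpy()))
```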
Exploring our data
The team has managed to gather some information and provided it to you in the form of a csv file. Here’s what it contains:
Column Name | Description |
---|---|
customerid | A unique identifier for each customer. |
age | The age of the customer. |
salary | Annual income of the customer. |
balance | Current account balance. |
marital | Marital status of the customer. |
jobedu | A combined field of the customer’s job title and level of education. |
targeted | Indicates whether marketing efforts were targeted at the customer based on data analytics. |
default | Whether the customer has defaulted on a loan. |
housing | If the customer has a housing loan. |
loan | If the customer has a personal loan. |
contact | The method of communication used in the campaign. |
day | The day of the month the customer was last contacted. |
month | The month the customer was last contacted. |
duration | Duration of the last contact, in seconds. |
campaign | Number of contacts performed during this campaign for this customer. |
pdays | Number of days since the customer was last contacted from a previous campaign. |
previous | Number of contacts performed before this campaign for this customer. |
poutcome | Outcome of the previous marketing campaign. |
response | If the customer subscribed to a term deposit. |
Now, you’ll use the pandas
library to load this file into a DataFrame, and read the first 5 rows:
#read the dataset and check the first five rows
raw_data = pd.read_csv("./dataset/bank_marketing_updated_v1.csv")
raw_data.head(n=5)
/tmp/ipykernel_2510602/1749228689.py:2: DtypeWarning: Columns (0,1,2,3,11,14,15,16) have mixed types. Specify dtype option on import or set low_memory=False.
raw_data = pd.read_csv("./dataset/bank_marketing_updated_v1.csv")
banking marketing | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | customer id and age. | NaN | Customer salary and balance. | NaN | Customer marital status and job with education... | NaN | particular customer before targeted or not | NaN | Loan types: loans or housing loans | NaN | Contact type | NaN | month of contact | duration of call | NaN | NaN | NaN | outcome of previous contact | response of customer after call happned |
1 | customerid | age | salary | balance | marital | jobedu | targeted | default | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | response |
2 | 1 | 58 | 100000 | 2143 | married | management,tertiary | yes | no | yes | no | unknown | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no |
3 | 2 | 44 | 60000 | 29 | single | technician,secondary | yes | no | yes | no | unknown | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no |
4 | 3 | 33 | 120000 | 2 | married | entrepreneur,secondary | yes | no | yes | yes | unknown | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no |
#print the dimensions of the dataset
raw_data.shape
(45213, 19)
The data cleaning process
Broadly speaking, the data cleaning process can be divided into 5 steps:
Fix Rows and Columns → Fix Missing Values → Standardize Values → Fix Invalid Values → Filter Data
- Fix Rows and Columns: Start by adjusting or removing any unnecessary or incorrectly formatted rows and columns.
- Fix Missing Values: Next, identify and handle missing data, either by imputing values or deciding how to deal with these gaps.
- Standardize Values: Ensure that all data follows a consistent format, which could involve normalizing numerical values or standardizing text entries.
- Fix Invalid Values: Correct or remove any data that is incorrect or does not fit the expected pattern.
- Filter Data: Finally, filter the data to focus on the relevant subset for your analysis or requirements.
We’ll begin with the first step:
Step 1: Fix rows and columns
For fixing rows, here’s a checklist:
- Delete Summary Rows: Remove rows that summarize data, like ‘Total’ or ‘Subtotal’ rows, which can skew analysis.
- Delete Incorrect Rows: Remove rows that are not part of the actual data, such as repeated header rows or footer information.
- Delete Extra Rows: Get rid of rows that are irrelevant to your analysis, such as blank rows, page numbers, or formatting indicators.
Notice how, when we loaded our data, the header looked a bit weird? The actual data we need starts from the 3rd row; the first 2 rows are extra rows with unnamed columns, so we’ll need to skip them. Here’s how we can do it:
#skip the first 2 rows and read the data again
raw_data = pd.read_csv("./dataset/bank_marketing_updated_v1.csv", skiprows=2)
raw_data.head(n=5)
customerid | age | salary | balance | marital | jobedu | targeted | default | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 58.0 | 100000 | 2143 | married | management,tertiary | yes | no | yes | no | unknown | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no |
1 | 2 | 44.0 | 60000 | 29 | single | technician,secondary | yes | no | yes | no | unknown | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no |
2 | 3 | 33.0 | 120000 | 2 | married | entrepreneur,secondary | yes | no | yes | yes | unknown | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no |
3 | 4 | 47.0 | 20000 | 1506 | married | blue-collar,unknown | no | no | yes | no | unknown | 5 | may, 2017 | 92 sec | 1 | -1 | 0 | unknown | no |
4 | 5 | 33.0 | 0 | 1 | single | unknown,unknown | no | no | no | no | unknown | 5 | may, 2017 | 198 sec | 1 | -1 | 0 | unknown | no |
# We can also print the column names this way:
raw_data.columns
Index(['customerid', 'age', 'salary', 'balance', 'marital', 'jobedu',
'targeted', 'default', 'housing', 'loan', 'contact', 'day', 'month',
'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'response'],
dtype='object')
Next, notice that the jobedu
column is a mashup of job and education details. For a proper analysis, it would be beneficial to separate this into two distinct columns: one for job and another for education. Analyzing job and education individually could play a crucial role in identifying customer segments more likely to respond positively to term deposits.
# Isolate job details into a newly created 'job' column from the "jobedu" column
raw_data['job'] = raw_data.jobedu.apply(lambda x: x.split(",")[0])
raw_data.head()
customerid | age | salary | balance | marital | jobedu | targeted | default | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | response | job | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 58.0 | 100000 | 2143 | married | management,tertiary | yes | no | yes | no | unknown | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no | management |
1 | 2 | 44.0 | 60000 | 29 | single | technician,secondary | yes | no | yes | no | unknown | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no | technician |
2 | 3 | 33.0 | 120000 | 2 | married | entrepreneur,secondary | yes | no | yes | yes | unknown | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no | entrepreneur |
3 | 4 | 47.0 | 20000 | 1506 | married | blue-collar,unknown | no | no | yes | no | unknown | 5 | may, 2017 | 92 sec | 1 | -1 | 0 | unknown | no | blue-collar |
4 | 5 | 33.0 | 0 | 1 | single | unknown,unknown | no | no | no | no | unknown | 5 | may, 2017 | 198 sec | 1 | -1 | 0 | unknown | no | unknown |
Understanding the code
- The apply() method is used to apply a function along an axis of the DataFrame. In this case, the function is a lambda function, which is an anonymous function defined in-line.
- lambda x: x.split(",")[0]: This lambda function takes an input x (which represents each element in the jobedu column) and splits the string into a list at each comma. [0] accesses the first element of the resulting list (which contains the job info).
# Isolate education details into a newly created 'education' column from the "jobedu" column.
raw_data['education'] = raw_data.jobedu.apply(lambda x: x.split(",")[1])
raw_data.head()
customerid | age | salary | balance | marital | jobedu | targeted | default | housing | loan | ... | day | month | duration | campaign | pdays | previous | poutcome | response | job | education | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 58.0 | 100000 | 2143 | married | management,tertiary | yes | no | yes | no | ... | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no | management | tertiary |
1 | 2 | 44.0 | 60000 | 29 | single | technician,secondary | yes | no | yes | no | ... | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no | technician | secondary |
2 | 3 | 33.0 | 120000 | 2 | married | entrepreneur,secondary | yes | no | yes | yes | ... | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no | entrepreneur | secondary |
3 | 4 | 47.0 | 20000 | 1506 | married | blue-collar,unknown | no | no | yes | no | ... | 5 | may, 2017 | 92 sec | 1 | -1 | 0 | unknown | no | blue-collar | unknown |
4 | 5 | 33.0 | 0 | 1 | single | unknown,unknown | no | no | no | no | ... | 5 | may, 2017 | 198 sec | 1 | -1 | 0 | unknown | no | unknown | unknown |
5 rows × 21 columns
#drop the "jobedu" column from the dataframe.
'jobedu',axis= 1, inplace= True)
raw_data.drop( raw_data.head()
customerid | age | salary | balance | marital | targeted | default | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | response | job | education | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 58.0 | 100000 | 2143 | married | yes | no | yes | no | unknown | 5 | may, 2017 | 261 sec | 1 | -1 | 0 | unknown | no | management | tertiary |
1 | 2 | 44.0 | 60000 | 29 | single | yes | no | yes | no | unknown | 5 | may, 2017 | 151 sec | 1 | -1 | 0 | unknown | no | technician | secondary |
2 | 3 | 33.0 | 120000 | 2 | married | yes | no | yes | yes | unknown | 5 | may, 2017 | 76 sec | 1 | -1 | 0 | unknown | no | entrepreneur | secondary |
3 | 4 | 47.0 | 20000 | 1506 | married | no | no | yes | no | unknown | 5 | may, 2017 | 92 sec | 1 | -1 | 0 | unknown | no | blue-collar | unknown |
4 | 5 | 33.0 | 0 | 1 | single | no | no | no | no | unknown | 5 | may, 2017 | 198 sec | 1 | -1 | 0 | unknown | no | unknown | unknown |
inplace=True is useful for making changes to the original DataFrame without having to create a new one.
Step 2: Fix missing values
Missing values are like puzzle pieces that got lost – they can leave you with an incomplete picture. But don’t worry, there are four strategic ways to bring everything back into focus:
- Mark the Absentees: Scour through your dataset and flag the elusive ones. Missing values are often disguised as NA or as placeholder values like XXX or 999 (see the short sketch after this list for flagging these at load time).
- Trim the excess: When a few data points have wandered off leaving barely a trace, it’s safe to let them go. Rows can be deleted if the number of missing values is insignificant, as this won’t affect the overall analysis results. Columns are treated a bit differently: you can drop a column if a significant share of its values is missing.
- Fill in the blanks with business insight: If you have a good understanding of your domain, you can make informed guesses that make sense.
- Fill missing values using statistics: You can also make informed guesses by observing patterns in your data, and filling in the missing values with a close approximation. This technique is also called imputation.
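For the first strategy, if you already know which placeholder strings a file uses for missing data, pandas can flag them as NaN at load time. Here’s a minimal sketch; the placeholder list is hypothetical and should be adjusted to whatever your dataset actually uses:

```python
# Hypothetical placeholders; adjust to the ones your dataset actually uses
placeholders = ["NA", "XXX", "999"]

# read_csv converts any of these strings to NaN while loading the file
df = pd.read_csv("./dataset/bank_marketing_updated_v1.csv",
                 skiprows=2, na_values=placeholders)
```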
Let’s see which columns in the dataset contain null values:
raw_data.isnull().sum()
customerid 0
age 20
salary 0
balance 0
marital 0
targeted 0
default 0
housing 0
loan 0
contact 0
day 0
month 50
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
response 30
job 0
education 0
dtype: int64
- isnull returns a new DataFrame of booleans, where a cell is True if the corresponding cell in raw_data is a missing value (NaN, None, or NaT), and False otherwise.
- In pandas, when you call .sum() on a DataFrame of booleans, it treats True as 1 and False as 0, so each column’s sum is its count of missing values.
So it appears that age, month, and response all have missing values. Let’s handle the age column first. What percentage of the age column is actually missing?
For this, we’ll need to compute
\[ \frac{\text{missing values in the age column}}{\text{total number of rows}} \times 100 \]
#calculate the percentage of missing values in the age column.
missing_ages_count = raw_data.age.isnull().sum()
missing_ages_count * 100 / raw_data.shape[0]
0.04423702196368141
Alternatively, you can do this too:
raw_data.age.isnull().mean() * 100
As we can see, they are quite small in number. The easiest way to deal with them is to just delete them.
#drop the records with age missing in raw_data
raw_data = raw_data[~raw_data.age.isnull()].copy()

print('Number of null values in age after dropping: ', raw_data.age.isnull().sum())
print('Total number of rows:', raw_data.shape[0])
Number of null values in age after dropping: 0
Total number of rows: 45191
What does this code do?
- raw_data.age.isnull(): This part of the code returns a boolean Series where each value is True if the corresponding value in the age column of raw_data is missing, and False otherwise.
- The ~ operator in front negates the boolean Series. In other words, it converts True to False and vice versa. Thus, this operation effectively selects all rows where the age column does not have missing values.
- As a result, raw_data[~raw_data.age.isnull()] creates a new DataFrame that includes only those rows of raw_data where the age value is not null.
Finally, we use the copy() method to create an independent copy of this DataFrame and assign it back to raw_data.
Next, we’ll look at the month column. What share of its records is missing?
#calculate the percentage of missing values in the month column.
raw_data.month.isnull().mean() * 100
0.11064149941360005
Depending on what we want, we could try different approaches.
- We could, once again, simply drop the rows with missing values.
- Or, we could fill them (imputation).
Let’s try the imputation strategy this time! But how do we know what value to impute?
Let’s analyze the values in the month column and how frequently they appear in the dataset. Here are a few techniques:
- We could replace the missing values with the value that occurs most frequently (mode).
- For columns containing quantitative values, we might also have the option to replace the missing values by the average.
raw_data.month.value_counts(normalize=True)
may, 2017 0.304380
jul, 2017 0.152522
aug, 2017 0.138123
jun, 2017 0.118141
nov, 2017 0.087880
apr, 2017 0.064908
feb, 2017 0.058616
jan, 2017 0.031058
oct, 2017 0.016327
sep, 2017 0.012760
mar, 2017 0.010545
dec, 2017 0.004741
Name: month, dtype: float64
# Find the value that occurs most frequently (mode)
month_mode = raw_data.month.mode()[0]
print(month_mode)
may, 2017
In this case, the value may, 2017 appears most frequently (about 30% of the time). This makes it a good value to impute.
Let’s replace all missing values in the month column with this value.
# fill the missing values with the mode value of month in raw_data.
raw_data.month.fillna(month_mode, inplace=True)

print('Number of null values in month after filling: ', raw_data.month.isnull().sum())
print(raw_data.month.value_counts(normalize=True))
Number of null values in month after filling:  0
may, 2017 0.305149
jul, 2017 0.152353
aug, 2017 0.137970
jun, 2017 0.118010
nov, 2017 0.087783
apr, 2017 0.064836
feb, 2017 0.058551
jan, 2017 0.031024
oct, 2017 0.016309
sep, 2017 0.012746
mar, 2017 0.010533
dec, 2017 0.004735
Name: month, dtype: float64
- fillna is a method in pandas used to fill missing values (denoted as NaN) in a DataFrame or Series.
- The inplace=True argument specifies that this operation should modify the data directly, instead of creating a new object with the missing values filled. After this line of code is executed, the month column will no longer contain any missing values; they will have been replaced with the value in month_mode.
Now, we have one more column to fix missing values in: response
.
#calculate the percentage of missing values in the response column.
raw_data.response.isnull().mean() * 100
0.06638489964816004
Only about 0.07% of the values in this column are null. We can simply drop those rows!
#drop the records with missing values for response.
raw_data = raw_data[~raw_data.response.isnull()].copy()

print('Number of null values in response after dropping: ', raw_data.response.isnull().sum())
print('Total number of rows:', raw_data.shape[0])
Number of null values in response after dropping: 0
Total number of rows: 45161
Is that all?
Not quite!
Data can often be deceptive. Don’t just assume that missing values will always be present as null
or NaN
!
Take a look at the pdays
column, for example -
#describe the pdays column
raw_data.pdays.describe()
count 45161.000000
mean 40.182015
std 100.079372
min -1.000000
25% -1.000000
50% -1.000000
75% -1.000000
max 871.000000
Name: pdays, dtype: float64
(raw_data.pdays < 0).mean() * 100
81.74088261995969
Whoa! It looks like nearly 82% of the values in this column are -1! It seems the data-gathering team entered -1 to indicate null values!
Here’s what we could do:
We could simply ignore the missing values in our calculations. For this, we’ll need to replace all the negative values with NaN, so that any number crunching we do later on will ignore these missing values.
#replace the -1 placeholder values in pdays with NaN, then describe the column again.
raw_data.loc[raw_data.pdays < 0, "pdays"] = np.nan
raw_data.pdays.describe()
count 8246.000000
mean 224.542202
std 115.210792
min 1.000000
25% 133.000000
50% 195.000000
75% 327.000000
max 871.000000
Name: pdays, dtype: float64
What does this code do?
- .loc[] is a label-based selection method used to select rows and columns from a DataFrame.
- raw_data.pdays < 0 is a condition that checks which rows in the pdays column of raw_data have values less than 0. It creates a boolean Series where each value is True if the corresponding value in pdays is less than 0, and False otherwise.
- By placing this condition within raw_data.loc[], along with the column name "pdays", the code selects all pdays values that are less than 0.
- Finally, np.nan replaces them with NaN, effectively treating them as missing or undefined values.
Perfect! Now you can move on to the next step -
Step 3: Standardizing values
The first question to ask is - are there extreme values in our dataset?
Detecting outliers
Finding outliers in a dataset is crucial, because very small or very large values can negatively affect our data analysis and skew our findings.
How outliers skew data analysis
Imagine you’re analyzing the salary data of employees in a tech company. The company has a diverse pool of talent, ranging from fresh graduates to seasoned experts, with salaries mostly ranging between $50,000 and $150,000 annually. Out of 500 employees, 498 have salaries within this range.
But out of 500 folks, imagine that there are these two big shots – high-flyers with a compensation that’s in a league of its own, over $2 million each, thanks to a stack of stock options and bonuses. What happens?
In this scenario, the two executives are outliers. They’re like the unicorns in a field of horses. Not exactly what you’d call ‘typical’. They are not representative of the typical employee’s earnings and can significantly skew the average salary calculation. So if you were to calculate the average salary including these outliers, you might conclude that the average employee at this company earns around $90,000 a year. But let’s face it - this number is misleading! It’s heavily influenced by the two extreme values.
This is also why per capita income isn’t always considered a true measure of development.
Okay, so how do we handle them?
There’s a whole arsenal of tactics at your disposal to tackle these outliers. You could, for example, kick them out of your dataset entirely. Or, you might play it cool and just cap their influence, keeping them in check. Another move? Play the imputation game, where you replace their values with other values that are more ‘typical’, with some savvy guesswork. Or, if you’re feeling strategic, go for binning and categorize them.
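To make the capping idea concrete, here’s a small sketch that clips a numeric column to its 1st and 99th percentiles. It’s purely illustrative; we don’t actually cap salary in this tutorial:

```python
# Illustrative capping (winsorizing): clamp salary to its 1st and 99th percentiles
lower, upper = raw_data['salary'].quantile([0.01, 0.99])
capped_salary = raw_data['salary'].clip(lower=lower, upper=upper)
```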
Let’s dive deeper into each of these methods and see what works best:
When it comes to hunting down outliers, Python arms us with some seriously powerful visualization tools. They’re like our digital magnifying glasses, that help us zoom in on those sneaky data points that try to throw a wrench in our analysis.
For this task, you’ll be utilizing Plotly. It’s a great tool for interactive data visualization in Python, enabling us to clearly identify and analyze outliers through its interactive plots.
#import plotly
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'
Let’s plot a histogram - one of the seven basic tools of quality.
# Create a histogram of the age variable using Plotly Express
fig = px.histogram(raw_data, x='age')

# Show the plot
fig.show()
A histogram is a valuable tool in outlier detection for several reasons:
- Visual Representation of Distribution: Histograms provide a visual representation of data distribution. They display the frequency of data points within certain ranges (bins), making it easier to see the overall spread of the data.
- Identifying Skewness and Peaks: Outliers can cause a distribution to be skewed. In a histogram, this might appear as a long tail extending to the right (positive skew) or left (negative skew). A distribution with a significant skew might suggest the presence of outliers.
- Spotting Unusual Patterns: Outliers can manifest as isolated bars on a histogram, separated from the majority of the data. For example, if most data points are clustered within a certain range, but a few bars appear far from this cluster, those bars could represent outliers.
Next, we’ll make use of another plot - the box plot.
# Create a boxplot of the age variable using Plotly Express
fig = px.box(raw_data, y='age')

# Show the plot
fig.show()
In this graph, data points that fall outside of the whiskers are often considered potential outliers. In this plot, they are indicated with dots.
What’s the box plot about?
The box plot is another excellent tool for detecting outliers due to its specific way of representing data distribution. The box in a box plot represents the interquartile range (IQR), which is the range between the first and third quartiles (Q3 - Q1). The IQR is crucial in identifying outliers, as it measures the spread of the middle 50% of the data.
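If you’d like to back up the picture with numbers, the same whisker boundaries can be computed directly from the quartiles. A quick sketch for the age column:

```python
# Compute the IQR fences for age, mirroring what the box plot draws
q1 = raw_data['age'].quantile(0.25)
q3 = raw_data['age'].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rows whose age falls outside the whiskers are the potential outliers
age_outliers = raw_data[(raw_data['age'] < lower_fence) | (raw_data['age'] > upper_fence)]
print(len(age_outliers))
```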
Alright, let’s get this straight: our data’s showing some ‘outliers’ in the age variable, especially in the 70-90 bracket. But let’s not jump the gun here. People being in their 70s or 80s? Totally normal, happens all the time! This isn’t a teen party; it’s real life, and in real life, people grow old.
Just because they’re few and far between doesn’t mean they are some statistical fluke. So it would be wise to keep them. They might even be key players in this whole term deposit account game.
Next, let’s look at the salary
variable:
raw_data.salary.describe()
count 45161.000000
mean 57004.849317
std 32087.698810
min 0.000000
25% 20000.000000
50% 60000.000000
75% 70000.000000
max 120000.000000
Name: salary, dtype: float64
#plot the boxplot of the salary variable.
fig = px.box(raw_data, y='salary')

# Show the plot
fig.show()
Next, examine the balance
variable:
#describe the balance variable of raw_data.
raw_data.balance.describe()
count 45161.000000
mean 1362.850690
std 3045.939589
min -8019.000000
25% 72.000000
50% 448.000000
75% 1428.000000
max 102127.000000
Name: balance, dtype: float64
fig = px.box(raw_data, y='balance')

# Show the plot
fig.show()
Your Turn to Dive In
We’ve walked through the nitty-gritty of understanding and handling outliers, and now, it’s your turn to take the reins. Consider this the part where you roll up your sleeves and get your hands dirty with real data. As an exercise, here’s what you can do:
- Identify Potential Outliers: Use your newfound skills to spot potential outliers in other variables of the dataset. Maybe take a closer look at variables like ‘salary’, ‘balance’, or ‘duration’. Are there any unexpected peaks or values that seem off the charts?
- Analyze Their Impact: Once you’ve identified these outliers, analyze their impact. How do they skew the data? Do they tell a story that’s worth digging into, or are they just statistical noise?
- Decide Your Approach: Based on your analysis, decide how you want to handle these outliers. Will you adjust them, remove them, or leave them as they are? Remember, there’s no one-size-fits-all answer here; it’s about what makes sense in the context of your specific data story.
- Reflect and Document: As you work through this exercise, make sure to document your observations and decisions. What did you learn? How did the presence (or absence) of outliers affect your overall analysis?
Step 4: Fixing invalid values
In the world of data analysis, consistency is king. Why is this important? Because data often comes from various sources, each singing to its own tune – different units, scales, formats, you name it. Our goal here is to orchestrate these diverse elements into a unified language that our analytical tools can understand and interpret accurately.
Not fixing invalid values in your data can lead to data quality issues, making the data less reliable and harder to work with!
So here’s what you need to look out for:
Fix Units
Ensure all observations under a single variable are expressed in a consistent unit. Here are some examples:
- If you have weight data in both pounds (lbs) and kilograms (kg), choose one (preferably the one most commonly used in your dataset’s context) and convert all data to that unit.
- If some data points are recorded on a different scale (e.g., a test score out of 50 vs. 100), convert them to a common scale, like a percentage.
- This uniformity prevents confusion and errors in analysis that can arise from unit discrepancies.
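As a quick sketch of both points, using hypothetical columns that aren’t part of our bank data:

```python
# Hypothetical weights recorded in pounds, converted to a single unit (kg)
weights = pd.DataFrame({"weight_lbs": [150, 180, 200]})
weights["weight_kg"] = weights["weight_lbs"] * 0.453592

# Hypothetical test scores on different scales, rescaled to percentages
scores_out_of_50 = pd.Series([40, 35, 48])
scores_pct = scores_out_of_50 / 50 * 100
```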
Fix Precision
Maintain a consistent level of decimal precision for better readability and uniformity. Round off numerical values to a standard number of decimal places that makes sense for your analysis, e.g., changing 4.5312341
kg to 4.53
kg. It enhances data readability and prevents the overemphasis of minor differences that are not meaningful for the analysis.
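In pandas, that’s usually a one-liner with round(); a tiny sketch with made-up measurements:

```python
# Made-up measurements rounded to a consistent 2 decimal places
measurements = pd.Series([4.5312341, 9.323252])
print(measurements.round(2))  # 4.53, 9.32
```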
Remove extra characters
Cleanse string variables of unnecessary characters. Remove any irrelevant characters such as prefixes, suffixes, and extra spaces. For instance, trim spaces from a name string like " John Doe "
.
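The pandas string accessor handles this well; a small sketch with invented name strings:

```python
# Invented examples: trim the ends and collapse repeated internal spaces
names = pd.Series(["  John Doe ", "Jane   Smith"])
cleaned = names.str.strip().str.replace(r"\s+", " ", regex=True)
```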
Fix inconsistent casing
Unify the casing style of string variables. Convert all text to a consistent case format (uppercase, lowercase, title case, etc.), based on what’s most appropriate for your data. This is particularly useful for categorical data and text analysis, ensuring that the same categories or keywords are not treated as different due to case differences.
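A sketch of what that looks like in practice, using an invented categorical column:

```python
# Invented categories that differ only by case collapse into one after lowercasing
categories = pd.Series(["Married", "MARRIED", "married"])
print(categories.str.lower().value_counts())
```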
Standardize Format
Ensure consistency in the format of structured text data like dates and names. For example, changing date formats from 23/10/16
to 2016/10/23
, or standardizing name formats. This uniformity is crucial for sorting, filtering, and correctly interpreting structured text data.
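For dates, pandas can parse one format and re-emit another; a brief sketch with invented date strings (assuming a day/month/year input format):

```python
# Parse day/month/year strings and re-emit them in a consistent year/month/day format
dates = pd.Series(["23/10/16", "05/01/17"])
standardized = pd.to_datetime(dates, format="%d/%m/%y").dt.strftime("%Y/%m/%d")
```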
Let’s look at an example from our dataset:
raw_data.duration.head(10)
0 261 sec
1 151 sec
2 76 sec
3 92 sec
4 198 sec
5 139 sec
6 217 sec
7 380 sec
8 50 sec
9 55 sec
Name: duration, dtype: object
As you’ll notice, the values in this column are appended with a ‘sec’ suffix. This could pose a challenge when performing numerical calculations.
A practical solution would be to standardize these values into minutes while simultaneously removing the sec suffix. This way, we’re left with purely numerical values, more conducive to accurate and efficient data analysis.
raw_data.duration = raw_data.duration.apply(lambda x: float(x.split()[0])/60 if x.find("sec") > 0 else float(x.split()[0]))

raw_data.duration.describe()
count 45161.000000
mean 4.302774
std 4.293129
min 0.000000
25% 1.716667
50% 3.000000
75% 5.316667
max 81.966667
Name: duration, dtype: float64
What does this code do?
.apply(lambda x: ...)
: The apply method is used to apply a function along an axis of the DataFrame. Here, it’s being applied to each element (x) in the duration column- The function splits each string in the duration columnn if the substring
'sec'
exists, and takes the first element of the resulting list. This is useful for extracting the numeric part from a string like"120 sec"
float(...)
: Converts the extracted string to a floating-point number.
Wrapping up
We’ve navigated through the crucial steps of data cleaning and standardization, tackling challenges like handling missing values, detecting outliers, and ensuring our data speaks in a consistent and clear language. These steps form the backbone of any robust data analysis process, turning raw data into a reliable foundation for informed decision-making.
But remember, this is just the tip of the iceberg. The realm of data analysis is vast and filled with many more techniques, methods, and best practices. What we’ve explored are the highlights, the essential skills that every data enthusiast should master.
To further aid your journey in data analysis and to ensure you have quick access to essential information, here’s a handy reference below. This table is a distilled guide to data cleaning, encompassing the key steps and methods we’ve discussed. Keep it as a go-to resource whenever you embark on a data-cleaning exercise. It’s designed to serve as a quick reminder of the various techniques and approaches that are crucial in transforming raw data into a clean, analysis-ready format.
Data quality issue | Examples | How to resolve |
---|---|---|
Fix rows and columns | | |
Incorrect rows | Header rows, footer rows | Delete |
Summary rows | Total, subtotal rows | Delete |
Extra rows | Column numbers, indicators, blank rows | Delete |
Missing column names | Column names left blank, or as NA, XX etc. | Add the column names |
Inconsistent column names | X1, X2, C4, which give no information about the column | Add column names that give some information about the data |
Unnecessary columns | Unidentified columns, irrelevant columns, blank columns | Delete |
Columns containing multiple data values | E.g. address columns containing city, state, country | Split columns into components |
No unique identifier | E.g. multiple cities with the same name in a column | Combine columns to create unique identifiers, e.g. combine City with State |
Misaligned columns | Shifted columns | Align these columns |
Missing Values | | |
Disguised missing values | Blank strings " ", "NA", "XX", "999" | Set these values as missing values |
Significant number of missing values in a row/column | | Delete the rows/columns |
Partial missing values | Missing time zone, century etc. | Fill the missing values with the correct value |
Standardise Numbers | | |
Non-standard units | Weights in both lbs and kgs, speeds in miles/hr and km/hr | Standardise the observations so all of them have the same consistent units |
Values with varying scales | A column containing marks in subjects, with some subject marks out of 50 and others out of 100 | Make the scale common, e.g. a percentage scale |
Over-precision | 4.5312341 kgs, 9.323252 meters | Standardise precision for better presentation of data, e.g. 4.5312341 kgs could be presented as 4.53 kgs |
Remove outliers | Abnormally high and low values | Correct if entered by mistake, else remove |
Standardise Text | | |
Extra characters | Common prefixes/suffixes, leading/trailing/multiple spaces | Remove the extra characters |
Different cases of the same words | Uppercase, lowercase, Title Case, Sentence case, etc. | Standardise the case/bring to a common case |
Non-standard formats | 23/10/16 vs. 2016/10/23, "Reacher, Jack" vs. "Jack Reacher" | Correct the format/standardise the format for better readability and analysis |
Fix Invalid Values | | |
Encoding issues | CP1252 instead of UTF-8 | Encode unicode properly |
Incorrect data types | Number stored as a string: "12,300" | Convert to the correct data type |
Correct values not in list | Non-existent country, PIN code | Delete the invalid values, treat as missing |
Wrong structure | Phone number with over 10 digits | |
Correct values beyond range | Temperature less than -273 °C (0 K) | |
Validate internal rules | Gross sales > Net sales; Date of delivery > Date of ordering; If Title is "Mr" then Gender is "M" | |
Filter Data | | |
Duplicate data | Identical rows, rows where some columns are identical | Deduplicate data/remove duplicated data |
Extra/unnecessary rows | Rows that are not required in the analysis, e.g. if only observations before or after a particular date are needed, other rows become unnecessary | Filter rows to keep only the relevant data |
Columns not relevant to analysis | Columns that are not needed for the analysis, e.g. personal detail columns such as address or phone number | Filter columns and pick only the ones relevant to your analysis |
Dispersed data | Parts of the data required for analysis stored in different files or datasets | Bring the data together, group by the required keys, aggregate the rest |