Improving Data Quality
So, you’ve got a dataset. Cool. But let’s be real—your data is probably a hot mess. Missing values, mixed data types, and categorical columns that look like they were named by a drunk intern. Don’t worry. I’m here to clean up your mess, Gotham-style. Let’s dive into the shadows and make your data shine.
Step 1: Read Your Dataset into a Pandas DataFrame:
First things first. You’ve got a CSV file, and you need to load it into a Pandas DataFrame. Think of this as your Batcave—a place where all your tools and gadgets are stored.
import pandas as pd
df_transport = pd.read_csv('your_dataset.csv')
Now, your data is in a DataFrame. But don’t get too comfortable. This is just the beginning.
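If you want to try the load step without a file on disk, here is a minimal sketch. The sample rows are made up for illustration (they only borrow the column names used later in this post), and `io.StringIO` stands in for `'your_dataset.csv'`.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for 'your_dataset.csv'.
csv_text = """Date,Zip Code,Model Year,Fuel,Make,Light_Duty,Vehicles
10/1/2018,90000,2006,Gasoline,OTHER/UNK,Yes,1
10/1/2018,90001,<2006,Diesel,Type_A,No,
"""

# read_csv accepts any file-like object, not just a path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.head())  # first few rows, a quick sanity check
```

A quick `head()` right after loading is a cheap way to confirm the delimiter and header row were read the way you expected.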
Step 2: Check the Data Types:
Your DataFrame might look clean, but it’s probably hiding secrets. Pandas tries to guess the data types, but it’s not perfect. Let’s expose the truth.
df_transport.info()
This will show you the data types of each column. Numbers? Strings? Dates? Pandas will spill the beans. But remember, not everything is as it seems.
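Here is a small sketch of how one stray value can disguise a column. The toy DataFrame is invented for illustration; the point is that a single "<2006" keeps the whole column as text, so Pandas will not treat it as a number.

```python
import pandas as pd

toy = pd.DataFrame({
    "Model Year": ["2006", "<2006", "2012"],  # one odd value keeps the column as text
    "Vehicles": [1, 4, 2],
})

# dtypes gives the per-column types without the extra detail of info().
print(toy.dtypes)

# Programmatic checks, handy when there are too many columns to eyeball.
year_is_numeric = pd.api.types.is_numeric_dtype(toy["Model Year"])
vehicles_is_numeric = pd.api.types.is_numeric_dtype(toy["Vehicles"])
print(year_is_numeric, vehicles_is_numeric)
```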
Step 3: Get the Stats:
For numerical columns, you need to know the lay of the land. Use .describe() to get the summary statistics.
df_transport.describe()
This will give you the mean, standard deviation, min, max, and more. Think of it as your Batcomputer analyzing the situation.
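A quick sketch on made-up data: by default `.describe()` only reports on numeric columns, and passing `include="all"` adds text columns with their count, number of unique values, most frequent value, and its frequency.

```python
import pandas as pd

toy = pd.DataFrame({
    "Vehicles": [1, 2, 3, 10],
    "Fuel": ["Gasoline", "Diesel", "Gasoline", "Electric"],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
stats = toy.describe()
print(stats)

# Every column: text columns get count, unique, top, and freq.
everything = toy.describe(include="all")
print(everything)
```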
Step 4: Check for Missing Values:
Missing values are like the Joker—they wreak havoc wherever they go. You need to find them and deal with them.
df_transport.isnull().sum()
This will show you how many missing values are in each column. If you see a lot of NaNs, don’t panic. We’ll handle them.
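Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on invented data: since `isnull()` returns booleans, `.sum()` counts the gaps and `.mean()` gives the fraction missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Fuel": ["Gasoline", None, "Diesel", None],
    "Vehicles": [1.0, 2.0, np.nan, 4.0],
})

missing = toy.isnull().sum()   # number of missing values per column
share = toy.isnull().mean()    # fraction of rows missing per column

print(missing)
print(share)
```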
Step 5: Resolve Missing Values:
Most algorithms hate missing values. But you’re not most people. You’re Batman. You don’t just drop rows like a coward. You fix them.
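The usual rule of thumb: fill numeric columns with the mean and categorical columns with the mode (the most frequent value). The one-liner below takes the blunt approach and fills every column, numeric or not, with its most frequent value.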
df_transport = df_transport.apply(lambda x: x.fillna(x.value_counts().index[0]))
Boom. Missing values? Handled.
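If you want the mean-for-numeric, mode-for-categorical split spelled out, here is one possible sketch. `fill_missing` is a hypothetical helper written for this post, not a Pandas function, and the toy data is invented.

```python
import numpy as np
import pandas as pd

def fill_missing(df):
    """Hypothetical helper: mean for numeric columns, mode for everything else."""
    filled = df.copy()
    for col in filled.columns:
        if filled[col].isnull().all():
            continue  # nothing to base a fill value on
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            # mode() can return several values on a tie; take the first.
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled

toy = pd.DataFrame({
    "Fuel": ["Gasoline", None, "Gasoline", "Diesel"],
    "Vehicles": [1.0, np.nan, 2.0, 3.0],
})
clean = fill_missing(toy)
print(clean)
```

Whether the mean is the right fill for a numeric column depends on the data. A heavily skewed column may be better served by the median, so treat this as a starting point.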
Step 6: Convert the Date Column:
Dates are tricky. They load as plain strings, so Pandas can't do date math on them yet. You need to convert them to a datetime format.
df_transport['Date'] = pd.to_datetime(df_transport['Date'], format='%m/%d/%Y')
Now, let’s break it down into year, month, and day. Because why not?
df_transport['year'] = df_transport['Date'].dt.year
df_transport['month'] = df_transport['Date'].dt.month
df_transport['day'] = df_transport['Date'].dt.day
Your data just got a whole lot more powerful.
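Real date columns tend to hide a few malformed entries. A small sketch on made-up values: `errors="coerce"` is a standard `pd.to_datetime` option that turns anything unparseable into `NaT` instead of raising, so you can count the bad rows afterwards.

```python
import pandas as pd

raw = pd.Series(["10/1/2018", "11/15/2018", "not a date"])

# Unparseable entries become NaT instead of stopping the conversion.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

bad_rows = parsed.isna().sum()   # how many values failed to parse
months = parsed.dt.month         # NaT rows show up as NaN here
days = parsed.dt.day

print(parsed)
print(bad_rows)
```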
Step 7: Rename Columns and Clean Up:
Your column names are a mess. Some are uppercase, some are lowercase, and some have spaces. Let’s fix that.
df_transport.rename(columns={
    'Date': 'date',
    'Zip Code': 'zipcode',
    'Model Year': 'modelyear',
    'Fuel': 'fuel',
    'Make': 'make',
    'Light_Duty': 'lightduty',
    'Vehicles': 'vehicles'
}, inplace=True)
And while we’re at it, let’s remove any rows where the model year is "<2006". Because who cares about ancient history?
df_transport = df_transport[df_transport['modelyear'] != '<2006']
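With many columns, typing out every rename gets old fast. As an alternative sketch on invented data, the names can be normalized in one pass with the string methods on `columns`, followed by the same "<2006" filter.

```python
import pandas as pd

toy = pd.DataFrame({
    "Zip Code": [90000, 90001, 90002],
    "Model Year": ["2006", "<2006", "2012"],
    "Light_Duty": ["Yes", "No", "Yes"],
})

# Lowercase every name and strip spaces and underscores in one pass.
toy.columns = toy.columns.str.lower().str.replace(r"[ _]", "", regex=True)

# Keep only the rows with a concrete model year.
# .copy() avoids later "setting on a slice" warnings when adding columns.
toy = toy[toy["modelyear"] != "<2006"].copy()

print(toy)
```

The explicit dictionary is still the safer choice when you want full control over each name. The pattern-based version trades that control for brevity.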
Step 8: Handle Categorical Columns:
Categorical columns are like riddles. They need to be decoded. Let’s start with the lightduty column, which has "Yes/No" values. We’ll convert them to 1s and 0s.
df_transport['lightduty'] = df_transport['lightduty'].apply(lambda x: 0 if x == 'No' else 1)
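Lambda functions are your best friend here. They’re quick, dirty, and get the job done. Just keep in mind that this one turns anything that isn’t exactly 'No' into a 1, typos and leftovers included.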
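For comparison, here is a sketch of the same conversion using `Series.map` with an explicit dictionary. The sample values, including the stray "Unknown", are invented. Values that are not in the dictionary come back as NaN, so surprises stay visible instead of quietly becoming 1.

```python
import pandas as pd

toy = pd.Series(["Yes", "No", "Yes", "Unknown"])

# The lambda from the post: anything that is not 'No' becomes 1.
lambda_result = toy.apply(lambda x: 0 if x == "No" else 1)

# Explicit mapping: values missing from the dictionary become NaN.
map_result = toy.map({"Yes": 1, "No": 0})

print(lambda_result.tolist())
print(map_result.tolist())
```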
Step 9: One-Hot Encoding:
Machine learning models don’t understand text. They need numbers. So, we’ll convert categorical columns into binary vectors using one-hot encoding.
data_dummy = pd.get_dummies(df_transport[['zipcode', 'modelyear', 'fuel', 'make']], drop_first=True)
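This creates a new DataFrame of binary columns, one per category. Because of drop_first=True, the first category of each original column gets no column of its own and shows up as all zeros in its group, which avoids a redundant column. It’s like turning your data into a swarm of bats, each one representing a single category.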
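Two details are worth seeing on a toy example (the data here is invented). First, `pd.get_dummies` leaves numeric columns alone unless you name them in `columns`, so a zip code stored as an integer would pass through unencoded. Second, `drop_first=True` removes the first category of each encoded column.

```python
import pandas as pd

toy = pd.DataFrame({
    "zipcode": [90000, 90001, 90000, 90001],  # stored as integers here
    "fuel": ["Diesel", "Electric", "Gasoline", "Diesel"],
})

# Without `columns`, the numeric zipcode column is passed through as-is.
default = pd.get_dummies(toy, drop_first=True)

# Naming the columns forces both to be treated as categories.
explicit = pd.get_dummies(toy, columns=["zipcode", "fuel"], drop_first=True)

print(list(default.columns))
print(list(explicit.columns))
```

In the explicit version, a row with zip code 90000 and fuel "Diesel" is all zeros: both values are the dropped first categories.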
Enjoy the blog, boys.