habib's blog

Improving Data Quality

So, you’ve got a dataset. Cool. But let’s be real—your data is probably a hot mess. Missing values, mixed data types, and categorical columns that look like they were named by a drunk intern. Don’t worry. I’m here to clean up your mess, Gotham-style. Let’s dive into the shadows and make your data shine.

Step 1: Read Your Dataset into a Pandas DataFrame:

First things first. You’ve got a CSV file, and you need to load it into a Pandas DataFrame. Think of this as your Batcave—a place where all your tools and gadgets are stored.

import pandas as pd  
df_transport = pd.read_csv('your_dataset.csv')  

Now, your data is in a DataFrame. But don’t get too comfortable. This is just the beginning.
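No CSV handy? Here's a minimal sketch that builds a toy one in memory so you can follow along. The column names are borrowed from the rename step later in this post; the rows, and the name df_sample, are invented purely for illustration.

```python
import io
import pandas as pd

# Made-up sample rows; only the column names come from this post.
csv_text = """Date,Zip Code,Model Year,Fuel,Make,Light_Duty,Vehicles
10/01/2018,90000,2006,Gasoline,MakeA,Yes,1
10/01/2018,,2014,Gasoline,,Yes,1
10/01/2018,90000,<2006,Diesel,MakeB,No,55
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk.
df_sample = pd.read_csv(io.StringIO(csv_text))
print(df_sample.head())
```

Swap the StringIO object for a real file path once you have your own data.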

Step 2: Check the Data Types:

Your DataFrame might look clean, but it’s probably hiding secrets. Pandas tries to guess the data types, but it’s not perfect. Let’s expose the truth.

df_transport.info()  

This will show you each column's data type along with its count of non-null values. Numbers? Strings? Dates? Pandas will spill the beans. But remember, not everything is as it seems.
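To see why the guesses can mislead, here's a small sketch on invented data: one stray text value such as "<2006" is enough to stop Pandas from treating a column as numeric. The names toy and as_numbers are just for illustration.

```python
import pandas as pd

# Invented values for illustration.
toy = pd.DataFrame({
    "Model Year": ["2006", "2014", "<2006"],  # one non-numeric entry
    "Vehicles": [1, 1, 55],
})

# "Model Year" is reported as object (or a string dtype on newer pandas),
# while "Vehicles" is an integer column.
print(toy.dtypes)

# pd.to_numeric with errors="coerce" turns unparseable entries into NaN,
# which makes the offending rows easy to spot.
as_numbers = pd.to_numeric(toy["Model Year"], errors="coerce")
print(toy[as_numbers.isna()])
```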

Step 3: Get the Stats:

For numerical columns, you need to know the lay of the land. Use .describe() to get the summary statistics.

df_transport.describe()  

This will give you the count, mean, standard deviation, min, max, and quartiles. Think of it as your Batcomputer analyzing the situation.
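By default .describe() only covers numeric columns. A minimal sketch, on invented data, of how include="all" pulls the text columns in too:

```python
import pandas as pd

# Invented values for illustration.
toy = pd.DataFrame({
    "Fuel": ["Gasoline", "Gasoline", "Diesel"],
    "Vehicles": [1, 1, 55],
})

numeric_stats = toy.describe()           # only the numeric "Vehicles" column
all_stats = toy.describe(include="all")  # adds unique/top/freq rows for "Fuel"
print(all_stats)
```

The extra rows (unique, top, freq) are a quick way to spot a category that dominates a column.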

Step 4: Check for Missing Values:

Missing values are like the Joker—they wreak havoc wherever they go. You need to find them and deal with them.

df_transport.isnull().sum()  

This will show you how many missing values are in each column. If you see a lot of NaNs, don’t panic. We’ll handle them.
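Raw counts are hard to judge without knowing how many rows you have. Here's a small sketch, on invented data, that puts the count next to the fraction missing:

```python
import numpy as np
import pandas as pd

# Invented values for illustration.
toy = pd.DataFrame({
    "Zip Code": [90000, np.nan, 90001, 90002],
    "Make": ["MakeA", None, None, "MakeB"],
})

missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()  # mean of True/False = fraction missing

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```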

Step 5: Resolve Missing Values:

Most algorithms hate missing values. But you’re not most people. You’re Batman. You don’t just drop rows like a coward. You fix them.

A common recipe is the mean for numeric columns and the mode (the most frequent value) for categorical ones. The one-liner below keeps it simple and uses the mode for every column:

df_transport = df_transport.apply(lambda x: x.fillna(x.value_counts().index[0]))  

Boom. Missing values? Handled.
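The one-liner above fills every column with its most frequent value. If you want the mean for numeric columns and the mode only for the rest, here's a minimal sketch. fill_missing is a hypothetical helper name, and the toy data is invented.

```python
import numpy as np
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the mean and all other columns with the mode."""
    filled = df.copy()
    for col in filled.columns:
        if filled[col].isnull().all():
            continue  # nothing to learn a fill value from
        if pd.api.types.is_numeric_dtype(filled[col]):
            fill_value = filled[col].mean()
        else:
            fill_value = filled[col].mode().iloc[0]
        filled[col] = filled[col].fillna(fill_value)
    return filled

# Invented values for illustration.
toy = pd.DataFrame({
    "Vehicles": [1.0, np.nan, 5.0],
    "Fuel": ["Gasoline", None, "Gasoline"],
})
result = fill_missing(toy)
print(result)
```

One caveat: for ID-like numbers such as ZIP codes, a mean doesn't mean much, so the mode is probably the safer pick there.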

Step 6: Convert the Date Column:

Dates are tricky. They look like dates, but to Pandas they're just strings until you convert them to a datetime format.

df_transport['Date'] = pd.to_datetime(df_transport['Date'], format='%m/%d/%Y')

Now, let’s break it down into year, month, and day. Because why not?

df_transport['year'] = df_transport['Date'].dt.year  
df_transport['month'] = df_transport['Date'].dt.month  
df_transport['day'] = df_transport['Date'].dt.day  

Your data just got a whole lot more powerful.
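If some entries don't match the expected format, pd.to_datetime raises an error by default. A minimal sketch, on invented values, of using errors="coerce" to flag them instead:

```python
import pandas as pd

# Invented values for illustration; the last one is deliberately broken.
raw = pd.Series(["10/01/2018", "12/25/2018", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

print(parsed)
print(parsed.isna().sum(), "value(s) could not be parsed")
```

Check the NaT count before moving on, so bad dates don't quietly turn into missing values.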

Step 7: Rename Columns and Clean Up:

Your column names are a mess. Some are uppercase, some are lowercase, and some have spaces. Let’s fix that.

df_transport.rename(columns={  
    'Date': 'date',  
    'Zip Code': 'zipcode',  
    'Model Year': 'modelyear',  
    'Fuel': 'fuel',  
    'Make': 'make',  
    'Light_Duty': 'lightduty',  
    'Vehicles': 'vehicles'  
}, inplace=True) 
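Typing out every name gets old fast on wide datasets. Here's a sketch of doing the same cleanup with string methods, assuming the only things to fix are case, spaces, and underscores:

```python
import pandas as pd

# Empty frame with a few of the original column names, just to show the renaming.
toy = pd.DataFrame(columns=["Date", "Zip Code", "Model Year", "Light_Duty"])

# Lowercase every name, then strip spaces and underscores.
toy.columns = (
    toy.columns
    .str.lower()
    .str.replace(r"[ _]", "", regex=True)
)
print(list(toy.columns))
```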

And while we’re at it, let’s remove any rows where the model year is "<2006". Because who cares about ancient history?

df_transport = df_transport[df_transport['modelyear'] != '<2006']  
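Before throwing rows away, it's worth knowing how many you're about to lose. A small sketch on invented values; mask and kept are illustrative names:

```python
import pandas as pd

# Invented values for illustration.
toy = pd.DataFrame({"modelyear": ["2006", "<2006", "2014", "<2006"]})

mask = toy["modelyear"] != "<2006"  # True for the rows we keep
print((~mask).sum(), "row(s) will be dropped")

# reset_index(drop=True) renumbers the surviving rows from 0.
kept = toy[mask].reset_index(drop=True)
print(kept)
```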

Step 8: Handle Categorical Columns:

Categorical columns are like riddles. They need to be decoded. Let’s start with the lightduty column, which has "Yes/No" values. We’ll convert them to 1s and 0s.

df_transport['lightduty'] = df_transport['lightduty'].apply(lambda x: 0 if x == 'No' else 1)  

Lambda functions are your best friend here. They’re quick, dirty, and get the job done.
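One catch with the lambda above: anything that isn't exactly 'No' becomes 1, including typos and missing values. Here's a sketch of a stricter alternative using Series.map, on invented values:

```python
import numpy as np
import pandas as pd

# Invented values: one lowercase typo and one missing entry.
toy = pd.DataFrame({"lightduty": ["Yes", "No", "yes", np.nan]})

# Series.map returns NaN for anything not in the dictionary,
# so typos and missing values stay visible instead of silently becoming 1.
encoded = toy["lightduty"].map({"Yes": 1, "No": 0})
print(encoded)
print(encoded.isna().sum(), "value(s) were neither 'Yes' nor 'No'")
```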

Step 9: One-Hot Encoding:

Most machine learning models don't understand text. They need numbers. So, we'll convert categorical columns into binary vectors using one-hot encoding.

data_dummy = pd.get_dummies(df_transport[['zipcode', 'modelyear', 'fuel', 'make']], drop_first=True)  

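This creates a new DataFrame where each category gets its own binary column, minus the first category of each original column: drop_first=True leaves it out because it's implied when all the others are 0. It's like turning your data into a swarm of bats, each one representing a single category.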
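One thing to watch: by default get_dummies only encodes columns with an object, string, or category dtype, so a zipcode column that was read in as integers passes through untouched. A minimal sketch on invented data, casting it to text first:

```python
import pandas as pd

# Invented values for illustration.
toy = pd.DataFrame({
    "zipcode": [90000, 90001, 90000],  # read as integers
    "fuel": ["Gasoline", "Diesel", "Gasoline"],
})

# Cast the numeric zipcode to str so it is treated as a category.
toy["zipcode"] = toy["zipcode"].astype(str)

dummies = pd.get_dummies(toy[["zipcode", "fuel"]], drop_first=True)
print(list(dummies.columns))
```

With drop_first=True, the first category of each column (here 90000 and Diesel) gets no column of its own.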

Enjoy the blog, boys.