Eight Data Preprocessing Steps to Build Efficient Models

2024.10.24

Hello everyone! Today we look at how data preprocessing can improve the performance of machine learning models. Preprocessing is a critical part of any machine learning project because it directly affects how well a model trains and how accurately it predicts. This article walks through eight preprocessing steps, each with a practical code example to help you understand and apply it.

1. Data Loading and Preliminary Inspection

First, we load the data and take a preliminary look at it. This step matters because knowing the column types, sample values, and row count guides every later processing decision.

import pandas as pd

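# Load the data
data = pd.read_csv('data.csv')

# Look at the first few rows
print(data.head())

# Check basic information (info() prints its report itself and returns None,
# so it should not be wrapped in print)
data.info()

Output:
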
   Age  Salary  Purchased
0   19     70K         0
1   25     80K         0
2   26     55K         1
3   27     75K         1
4   30     85K         0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        400 non-null    int64  
 1   Salary     400 non-null    object 
 2   Purchased  400 non-null    int64  
dtypes: int64(2), object(1)
memory usage: 9.6+ KB

Explanation:

  • The Age and Purchased columns have the correct data types.
  • The Salary column has dtype object, which means it is stored as strings (here, values such as 70K) and must be converted before it can be used as a numeric feature.
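
A quick way to confirm which entries keep a column from being numeric is to attempt a numeric conversion and look at what fails. The sketch below assumes a small hypothetical frame standing in for data.csv (the values are taken from the first rows shown above); `sample`, `parsed`, and `unparsed` are illustrative names, not part of the original code.

```python
import pandas as pd

# Hypothetical sample standing in for data.csv (illustration only).
sample = pd.DataFrame({
    "Age": [19, 25, 26],
    "Salary": ["70K", "80K", "55K"],
    "Purchased": [0, 0, 1],
})

# Try to parse Salary as numbers; entries that cannot be parsed become NaN.
parsed = pd.to_numeric(sample["Salary"], errors="coerce")

# Rows where parsing failed reveal which raw strings need cleaning.
unparsed = sample.loc[parsed.isna(), "Salary"]
print(unparsed.tolist())
```

Here every entry fails to parse because of the K suffix, which tells us the whole column needs a conversion step rather than a handful of manual fixes.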

2. Data Cleaning

Data cleaning mainly involves removing duplicate records and handling missing values. These operations safeguard data quality, which in turn improves model performance.

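# Remove duplicate records
data.drop_duplicates(inplace=True)

# Check for missing values
print(data.isnull().sum())

# If there are missing values, fill them with the column mean.
# Assigning the result back avoids chained in-place modification,
# which newer pandas versions warn about.
data['Age'] = data['Age'].fillna(data['Age'].mean())

Output:
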
Age            0
Salary         0
Purchased      0
dtype: int64

Explanation: In this example, the data has no missing values. If there are missing values, we can fill them using the mean or other methods.
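
As one example of the "other methods" mentioned above, the median is often preferred when a column has outliers. The following is a minimal sketch on a hypothetical five-row frame (the values are made up for illustration); it compares mean and median filling and shows scikit-learn's `SimpleImputer` as a reusable alternative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one missing Age and one outlier (illustration only).
sample = pd.DataFrame({"Age": [19.0, 25.0, np.nan, 27.0, 90.0]})

# The mean is pulled upward by the outlier (90); the median is not.
mean_filled = sample["Age"].fillna(sample["Age"].mean())
median_filled = sample["Age"].fillna(sample["Age"].median())

# SimpleImputer does the same job and can be re-applied to new data later.
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(sample[["Age"]])

print(mean_filled.iloc[2], median_filled.iloc[2], imputed[2, 0])
```

Which strategy is appropriate depends on the data; this sketch only illustrates the mechanics.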

3. Data Type Conversion

Sometimes, we need to convert the data type of some columns to numeric or categorical types. For example, convert the Salary column to numeric type.

# Convert Salary to a numeric type: strip the literal 'K', cast to float, scale to units
data['Salary'] = data['Salary'].str.replace('K', '', regex=False).astype(float) * 1000

Explanation:

  • Use str.replace to remove the K character in Salary.
  • Use astype(float) to convert a string to a floating point number.
  • Multiply by 1000 to turn a value expressed in thousands into the actual amount (for example, 70K becomes 70000.0).
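
The one-line conversion above raises an error if any entry is not of the form `<number>K`. For messier data, a more tolerant version might look like the sketch below. The helper name `salary_to_number` and the sample strings are hypothetical, and the rule "entries without a K suffix are already plain amounts" is an assumption made for this illustration.

```python
import pandas as pd

def salary_to_number(series: pd.Series) -> pd.Series:
    """Convert strings such as '70K' to 70000.0; unparseable entries become NaN."""
    text = series.astype(str).str.strip().str.upper()
    has_k = text.str.endswith("K")
    digits = text.str.replace("K", "", regex=False)
    values = pd.to_numeric(digits, errors="coerce")
    # Multiply by 1000 only where a K suffix was present.
    return values.where(~has_k, values * 1000)

# Hypothetical messy input (illustration only).
raw = pd.Series(["70K", " 80k ", "55000", "unknown"])
converted = salary_to_number(raw)
print(converted.tolist())
```

Entries that cannot be parsed end up as NaN, so they can be handled with the same missing-value techniques used in the cleaning step.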

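4. Data Normalization (Min-Max Scaling)

Min-max normalization is a common preprocessing technique that rescales features with different ranges onto the same range, typically [0, 1]. This can speed up training and improve accuracy for models that are sensitive to feature scale, such as distance-based models and models trained with gradient descent.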

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data[['Age', 'Salary']] = scaler.fit_transform(data[['Age', 'Salary']])

Explanation:

  • MinMaxScaler can scale the data to the range [0, 1].
  • Use the fit_transform method to learn each column's minimum and maximum and rescale the Age and Salary columns in one call.
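
To make the transformation concrete, min-max scaling computes (x - min) / (max - min) for each column. The sketch below checks that formula against MinMaxScaler, using the five Age values from the preview rows shown earlier; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The five Age values from the data preview, as a single-column 2D array.
ages = np.array([[19.0], [25.0], [26.0], [27.0], [30.0]])

# Manual min-max scaling: (x - min) / (max - min).
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same result from scikit-learn.
scaled = MinMaxScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
```

The smallest value maps to 0 and the largest to 1, which also means a single extreme outlier can compress all other values into a narrow band.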

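5. Data Standardization (Z-Score Scaling)

Standardization converts each feature to zero mean and unit variance, which is particularly important for some algorithms (such as support vector machines). In practice you would usually choose either min-max scaling or standardization for a given feature: applying StandardScaler to data that has already been min-max scaled gives the same result as applying StandardScaler alone.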

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[['Age', 'Salary']] = scaler.fit_transform(data[['Age', 'Salary']])

Explanation:

  • StandardScaler transforms the data to have zero mean and unit variance.
  • Use the fit_transform method to standardize the Age and Salary columns.
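
The underlying formula is z = (x - mean) / std, where StandardScaler uses the population standard deviation (ddof=0). The sketch below verifies this on the five Age values from the data preview; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five Age values from the data preview, as a single-column 2D array.
ages = np.array([[19.0], [25.0], [26.0], [27.0], [30.0]])

# Manual z-score: np.std defaults to ddof=0, matching StandardScaler.
manual = (ages - ages.mean()) / ages.std()

# The same result from scikit-learn.
scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
```

Note that pandas' `Series.std()` defaults to ddof=1, so a manual check written with pandas would differ slightly unless ddof=0 is passed.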

6. Feature Selection

Feature selection picks the most relevant features from the original data, which reduces the model's input dimensionality and can improve its performance. Common approaches include correlation-based selection and model-based selection.

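# Import the plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the pairwise correlations between columns
correlation_matrix = data.corr()

# Draw the heat map
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

# Select the features to keep based on these correlations (see the explanation below)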

The heat map shows the correlation between the features:

              Age   Salary  Purchased
Age        1.0000   0.1000    -0.1000
Salary     0.1000   1.0000     0.5000
Purchased -0.1000   0.5000     1.0000

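Explanation:

  • Age and Salary have low correlation with each other (0.1), so they are not redundant.
  • Salary has a moderate positive correlation with Purchased (0.5), while Age has a weak negative correlation with it (-0.1).
  • With only two candidate features, we can keep both Age and Salary as the final features.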
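
On datasets with more candidate features, the same correlation matrix can be used to rank features by the strength of their relationship with the target. The sketch below is one possible way to do that; the sample frame is made up for illustration, and the 0.3 cutoff is an arbitrary assumption rather than a recommended value.

```python
import pandas as pd

# Hypothetical data (illustration only): Purchased tracks Salary, not Age.
sample = pd.DataFrame({
    "Age":       [19, 45, 26, 40, 30, 35, 25, 27],
    "Salary":    [70, 80, 55, 75, 85, 90, 60, 95],
    "Purchased": [0, 0, 0, 0, 1, 1, 0, 1],
})

# Correlation of each feature with the target column.
target_corr = sample.corr(numeric_only=True)["Purchased"].drop("Purchased")

# Rank by absolute correlation and keep features above a chosen threshold.
ranked = target_corr.abs().sort_values(ascending=False)
threshold = 0.3  # arbitrary cutoff, assumed for this sketch
selected = ranked[ranked >= threshold].index.tolist()

print(ranked)
print(selected)
```

Linear correlation only captures linear relationships, so a feature with a low score here may still be useful to a non-linear model.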

7. Categorical Feature Encoding

Categorical features (such as gender or region) must be converted to numeric values before the model can process them. Common encoding methods include One-Hot Encoding and Label Encoding.

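# Assume the dataset has a categorical feature 'Gender'
# (placeholder values, repeated so the list matches the number of rows)
data['Gender'] = (['Male', 'Female'] * len(data))[:len(data)]

# Label Encoding: map each category to an integer (classes are sorted alphabetically)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
gender_labels = label_encoder.fit_transform(data['Gender'])  # Female -> 0, Male -> 1

# One-Hot Encoding: one binary column per category
from sklearn.preprocessing import OneHotEncoder
# sparse_output replaces the older 'sparse' argument (scikit-learn 1.2 and later);
# fixing the category order keeps the columns aligned with the names used below
one_hot_encoder = OneHotEncoder(categories=[['Male', 'Female']], sparse_output=False, dtype=int)
gender_encoded = one_hot_encoder.fit_transform(data[['Gender']])
# Reuse data.index so rows stay aligned even after drop_duplicates
gender_df = pd.DataFrame(gender_encoded, columns=['Gender_Male', 'Gender_Female'], index=data.index)
data = pd.concat([data, gender_df], axis=1)
data.drop('Gender', axis=1, inplace=True)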

Output:

The encoded data (first five rows, illustrative values; the Gender column is replaced by two binary columns):

   Age  Salary  Purchased  Gender_Male  Gender_Female
0  0.0    70.0         0            1              0
1  0.2    80.0         0            0              1
2  0.4    55.0         1            1              0
3  0.6    75.0         1            0              1
4  0.8    85.0         0            1              0

Explanation:

  • Label Encoding encodes Gender as a single integer. LabelEncoder assigns integers in sorted order of the class names, so Female is 0 and Male is 1.
  • One-Hot Encoding converts Gender into multiple binary features, such as Gender_Male and Gender_Female.
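
For quick exploratory work, pandas offers a shorter route to one-hot encoding. The sketch below uses `pd.get_dummies` on a hypothetical three-row frame; it is an alternative for illustration, not the method used in the code above.

```python
import pandas as pd

# Hypothetical frame with a single categorical column (illustration only).
sample = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

# One binary column per category; new columns are named <column>_<category>.
encoded = pd.get_dummies(sample, columns=["Gender"], dtype=int)

print(encoded)
```

A design note: `get_dummies` derives its columns from whatever categories are present in the frame it is given, so a fitted OneHotEncoder is usually the safer choice when the same encoding must later be applied to new data.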

8. Dataset Partitioning

Dataset partitioning usually divides the data into training and testing sets, and sometimes also includes a validation set. This helps evaluate the generalization ability of the model.

from sklearn.model_selection import train_test_split

# Separate the features and the target, then split the dataset
X = data[['Age', 'Salary']]
y = data['Purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Explanation:

  • X contains the feature columns Age and Salary.
  • y contains the target column Purchased.
  • Use train_test_split to split the data into training and test sets, where the test set accounts for 20% of the total data.
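
One caveat worth noting: when a scaler is fit on the full dataset before splitting, statistics from the test rows influence the training features. A common way to avoid this is to split first and fit the scaler on the training set only. The sketch below illustrates that order on synthetic data; the generated Age and Salary values and the labeling rule are assumptions made purely for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Age and Salary features (illustration only).
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.integers(18, 60, 200),
    rng.integers(20_000, 150_000, 200),
]).astype(float)
y = (X[:, 1] > 80_000).astype(int)  # made-up labeling rule

# Split first; stratify keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The test features are transformed with the training set's mean and standard deviation, so their own mean will generally not be exactly zero, which is expected.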

Summary

This article covered eight data preprocessing steps: data loading and preliminary inspection, data cleaning, data type conversion, normalization (min-max scaling), standardization (z-score scaling), feature selection, categorical feature encoding, and dataset partitioning. Together, these steps improve data quality and, in turn, the performance of machine learning models. I hope they prove useful in your own projects.