

AI4I
Public / LearnAI Course
SUP-4 Classification
-
Hi to anyone who can help,
I am on the section about imbalanced data, and I’ve encountered a ValueError when trying to fit my model using RandomOverSampler().
I got the following error:
ValueError: could not convert string to float: 'Female'
Below is the script. Thanks for your help.
# Basic Library Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_curve, auc
from imblearn.over_sampling import RandomOverSampler
# TASK: Read in the CSV file saved from Exercise 1
file_path = "telco_churn.csv"  # Filename
# The 1st column of the CSV file should be the customer ID, which is loaded into the DataFrame's index
input_data = pd.read_csv(file_path, index_col=0)
# Validate that data is as expected
input_data.head()
# Size of data – TASK: Validate that it’s (7032, 20)
input_data.shape
# Outcome variable
# Instead of keeping the values, we will encode as 1s and 0s using the map function
output_var_name = 'ChurnLabel'
output_var = input_data[output_var_name]
output_var = output_var.map({'Yes': 1, 'No': 0})
# Note that the map function can be run only once. You will get an error if you try to run this cell again as Yes/No are no longer valid values in this feature.
# Count the number of rows for each outcome value
print("Row count for each outcome")
print(output_var.value_counts(normalize=True))
# Remove the outcome variable from the main dataframe
input_data.drop(output_var_name, axis=1, inplace=True)
# Next, we want to define 3 lists for each of the data types found in our data i.e. Numerical, Categorical (more than 2 values), Binary (2 values only)
# Numerical features
num_features = [key for key in dict(input_data.dtypes) if dict(input_data.dtypes)[key] in ['int64', 'float64']]
print(num_features) # TASK: Confirm the columns based on Exercise 1
# TASK: Define the 4 categorical features as a list of strings. These are the non-numerical features that do not have Yes/No values
cat_features = ["Gender", "InternetService", "Contract", "PaymentMethod"]  # Categorical feature names
# TASK: Define the binary features. Complete the steps denoted in this cell.
# 1. Get the list of non-numerical features (both categorical and binary). Hint: Add ‘not’ to the code from num_features
bin_features = [key for key in dict(input_data.dtypes) if dict(input_data.dtypes)[key] not in ['int64', 'float64']]
# Copy then modify the code from num_features
print(f"Binary features before remove: {bin_features}")
# 2. Remove the categorical feature names from this list
for col in cat_features:
    # Hint: There is a list method to remove an element
    bin_features.remove(col)
print(f"List of binary features: {bin_features}")  # TASK: Confirm the resulting list
# Encoding the binary features. Similar to the outcome variable, we will need to convert the values of these features from Yes/No to 1/0.
# Note: As an alternative, this could have been done when building the pipeline.
# TASK: Complete the code
for col in bin_features:
    input_data[col] = input_data[col].map({'Yes': 1, 'No': 0, 'No internet service': 0})
# Define preprocessing pipeline. Reminder that the binary features have already been encoded and thus only passed through
# TASK: Match the list of features to the correct encoding operation.
# Remember to add the library imports for ColumnTransformer, StandardScaler, OneHotEncoder to the imports above
preprocess = ColumnTransformer(
    transformers=[
        ('standardscaler', StandardScaler(), num_features),
        ('onehotencoder', OneHotEncoder(), cat_features)
    ],
    remainder='passthrough'
)
#Section replaced with imbalance
preprocessed_data = preprocess.fit_transform(input_data)
# Train/Test Split
# TASK: Split the data into 70:30 train/test. Use the random_state=42
x_train, x_test, y_train, y_test = train_test_split(input_data, output_var, test_size=0.3, random_state=42)
print("Before Oversampling, count of label '1':{}".format(sum(y_train==1)))
print("Before Oversampling, count of label '0':{} \n".format(sum(y_train==0)))
over_sample=RandomOverSampler(random_state=0)
x_train_res, y_train_res = over_sample.fit_resample(x_train,y_train.ravel())
print("After Oversampling, count of label '1':{}".format(sum(y_train_res==1)))
print("After Oversampling, count of label '0':{}\n".format(sum(y_train_res==0)))
model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train_res, y_train_res)
-
Hi,
Maybe your input to train_test_split should be preprocessed_data instead of input_data? Since input_data has not been preprocessed, the values in the 'Gender' column are still strings.
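To illustrate the suggestion, here is a minimal sketch on a small made-up DataFrame. It is not the actual telco_churn.csv data: toy_data, toy_labels and their values are hypothetical, and sparse_threshold=0 is added only so that the ColumnTransformer always returns a dense array in this tiny example. The point is that the output of preprocess.fit_transform, not the raw DataFrame, is what goes into train_test_split, so everything downstream is numeric.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real CSV (values are made up)
toy_data = pd.DataFrame({
    "Tenure": [1, 24, 36, 5, 60, 12, 48, 3, 18, 72],
    "Gender": ["Female", "Male", "Female", "Male", "Female",
               "Male", "Female", "Male", "Female", "Male"],
    "Partner": [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],  # binary feature, already mapped to 1/0
})
toy_labels = pd.Series([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

preprocess = ColumnTransformer(
    transformers=[
        ("standardscaler", StandardScaler(), ["Tenure"]),
        ("onehotencoder", OneHotEncoder(), ["Gender"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)

# The string column 'Gender' is one-hot encoded here
preprocessed_data = preprocess.fit_transform(toy_data)

# Split the preprocessed array, not the raw DataFrame
x_train, x_test, y_train, y_test = train_test_split(
    preprocessed_data, toy_labels, test_size=0.3, random_state=42
)

# The classifier now only sees numeric values
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train, y_train)
```

In the original script, the same change would be to pass preprocessed_data as the first argument of train_test_split; the resulting x_train and y_train can then go through over_sample.fit_resample as before.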