SUP-4 Classification

  • jansoncy

    Member
    February 16, 2022 at 3:14 pm

    Hi to anyone who can help,

    I am on the section about imbalanced data, and I've encountered a ValueError when trying to fit my model after applying RandomOverSampler().


    I got the following ValueError:
    ValueError: could not convert string to float: 'Female'

    And below is the script. Thanks for your help.

    # Basic Library Imports
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sn
    %matplotlib inline

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_curve, auc
    from imblearn.over_sampling import RandomOverSampler

    # TASK: Read in the CSV file saved from Exercise 1
    file_path = "telco_churn.csv"  # Filename

    # The 1st column of the CSV file should be the customer ID, which is loaded into the DataFrame's index
    input_data = pd.read_csv(file_path, index_col=0)

    # Validate that data is as expected
    input_data.head()

    # Size of data - TASK: Validate that it's (7032, 20)
    input_data.shape

    # Outcome variable
    # Instead of keeping the values, we will encode as 1s and 0s using the map function
    output_var_name = 'ChurnLabel'
    output_var = input_data[output_var_name]
    output_var = output_var.map({'Yes': 1, 'No': 0})
    # Note that the map function can be run only once. You will get an error if you try to run this cell again as Yes/No are no longer valid values in this feature.

    # Count the number of rows for each outcome value
    print("Row count for each outcome")
    print(output_var.value_counts(normalize=True))

    # Remove the outcome variable from the main dataframe
    input_data.drop(output_var_name, axis=1, inplace=True)

    # Next, we want to define 3 lists for each of the data types found in our data, i.e. Numerical, Categorical (more than 2 values), Binary (2 values only)

    # Numerical features
    num_features = [key for key in dict(input_data.dtypes) if dict(input_data.dtypes)[key] in ['int64', 'float64']]
    print(num_features)  # TASK: Confirm the columns based on Exercise 1

    # TASK: Define the 4 categorical features as a list of strings. These are the non-numerical features that do not have Yes/No values
    cat_features = ["Gender", "InternetService", "Contract", "PaymentMethod"]  # Categorical feature names

    # TASK: Define the binary features. Complete the steps denoted in this cell.
    # 1. Get the list of non-numerical features (both categorical and binary). Hint: Add 'not' to the code from num_features
    bin_features = [key for key in dict(input_data.dtypes) if dict(input_data.dtypes)[key] not in ['int64', 'float64']]
    # Copy then modify the code from num_features
    print(f"Binary features before remove: {bin_features}")

    # 2. Remove the categorical feature names from this list
    for col in cat_features:
        # Hint: There is a list method to remove an element
        bin_features.remove(col)

    print(f"List of binary features: {bin_features}")  # TASK: Confirm the resulting list

    # Encoding the binary features. Similar to the outcome variable, we will need to convert the values of these features from Yes/No to 1/0.
    # Note: As an alternative, this could have been done when building the pipeline.
    # TASK: Complete the code
    for col in bin_features:
        input_data[col] = input_data[col].map({'Yes': 1, 'No': 0, 'No internet service': 0})

    # Define preprocessing pipeline. Reminder that the binary features have already been encoded and thus are only passed through
    # TASK: Match the list of features to the correct encoding operation.
    # Remember to add the library imports for ColumnTransformer, StandardScaler, OneHotEncoder to the imports above
    preprocess = ColumnTransformer(
        transformers=[
            ('standardscaler', StandardScaler(), num_features),
            ('onehotencoder', OneHotEncoder(), cat_features)
        ],
        remainder='passthrough'
    )

    # Section replaced with imbalance
    preprocessed_data = preprocess.fit_transform(input_data)

    # Train/Test Split
    # TASK: Split the data into 70:30 train/test. Use the random_state=42
    x_train, x_test, y_train, y_test = train_test_split(input_data, output_var, test_size=0.3, random_state=42)

    print("Before Oversampling, count of label '1':{}".format(sum(y_train == 1)))
    print("Before Oversampling, count of label '0':{} \n".format(sum(y_train == 0)))

    over_sample = RandomOverSampler(random_state=0)
    x_train_res, y_train_res = over_sample.fit_resample(x_train, y_train.ravel())

    print("After Oversampling, count of label '1':{}".format(sum(y_train_res == 1)))
    print("After Oversampling, count of label '0':{}\n".format(sum(y_train_res == 0)))

    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(x_train_res, y_train_res)

  • raymondbcng76

    Member
    February 17, 2022 at 1:26 pm

    Hi,

    Maybe your input to train_test_split should be preprocessed_data instead of input_data?

    Since input_data has not been preprocessed, the values in the 'Gender' column are still strings, which is why the fit fails with that ValueError.
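
    Something like this might work (just a rough sketch of the change, assuming your import and preprocessing cells above have already been run; I haven't tested it against your notebook):

    # Split the *preprocessed* features so 'Gender' and the other categoricals are already one-hot encoded numbers
    x_train, x_test, y_train, y_test = train_test_split(preprocessed_data, output_var, test_size=0.3, random_state=42)

    # Oversample the minority class on the numeric training data
    over_sample = RandomOverSampler(random_state=0)
    x_train_res, y_train_res = over_sample.fit_resample(x_train, y_train)

    # Fit KNN on the resampled, fully numeric training set
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(x_train_res, y_train_res)

    Note that y_train is already a pandas Series here, so the .ravel() call shouldn't be needed.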
