Client Segments Data Simulation & Cost Sensitive Multi-classification

This notebook is loosely based on a financial customer segmentation published by BAI: https://www.latinia.com/IF/Documentos/The_New_Banking_Consumer_5_Core_Segments.pdf

Marketing research produces this kind of segmentation by researching the market, collecting and mining a lot of data, and then feeding it into the segmentation analysis: clustering, expert knowledge, and a wide range of quantitative and qualitative variables (predictors). This part we cannot emulate here; instead we accept the segmentation as given.

The general goal of market segmentation is to find groups of customers that differ in important ways related to product interest, market participation, or response to marketing efforts. By understanding the differences between groups, a marketer can take better strategic decisions about opportunities, product specification and positioning, and engage in more effective advertising.

Finance segmentation

Banks and many other types of financial institutions classify their clients and try to understand their behavioural structure, including whether they will pay their debts at all. If a bank has a better understanding of the differences and nuances between these classes, it can rethink its marketing strategy, resulting in a more targeted, cost-effective approach and a greater return on investment.

Here is an example of banking customer segmentation by BAI Research Group.

The customer base is divided into five segments:

Marginalised Middles

This is the largest group, containing 36% of the customers.

Corresponding marketing strategy: provide steady and clear marketing messages, such as fee disclosures.

Disengaged Skeptics

This is the second oldest of the five groups.

Corresponding marketing strategy: the door is open for other institutions to entice this unsatisfied group, so be that other institution for them.

Satisfied Traditionalists

This is the oldest group, with above-average income.

Corresponding marketing strategy: as this is a tempting segment with high deposit revenue and the highest investment balances, offer them a wide product range.

Struggling Techies

This is the youngest group, with the lowest income.

Corresponding marketing strategy: the receptivity of this segment makes it a worthwhile target; it is the easiest segment in which to gain new customers.

Sophisticated Opportunists

This group has an average age and the highest income.

Corresponding marketing strategy: treat this segment well, provide them with the right tools, and innovate products that allow them to manage their own finances.

Source: https://www.segmentify.com/blog/customer-consumer-user-segmentation-examples#Marginalised_Middles

Client Segments Data Simulation

In this notebook we interpret the given client segmentation model and generate our own data based on our assumptions, a modest variety of features, and limited knowledge about the differences between these segments; it is not meant to be a full-fledged, broad data generation effort. For doing the actual segmentation as a step in marketing research, more variables would be needed to delimit each segment better, and even then there will always be clients fitting only marginally better to one cluster than another. With this exercise we simulate the whole process with our own generated data and check that the code works before any potentially private real data is exposed to exploratory analysis or machine learning further down the pipeline.

We can manipulate the synthetic data, re-run analyses, and examine how the results change. This data simulation highlights a strength of Python: it is easy to simulate data with various Python functions. We will also see that our limited dataset is reasonably well suited for assigning unsegmented clients to one of the given segments.

Let’s get started!

import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 70)

Client Segment Data Definitions

segment_variables = [
                     'age', 
                     'gender',
                     'annual_income', 
                     'willingness_to_leave', 
                     'annual_branch_visits', 
                     'service_satisfaction_level', # scale 1-5
                     'use_debit_services', 
                     'use_online_mobile', 
                     'deposit_revenue',

                     #'knowledgeable'
                     
                    ]
segment_variables_distribution = dict(zip(segment_variables,
                                          [
                                            'normal',
                                           'binomial',
                                           'normal', 
                                           'binomial',
                                           'poisson',
                                           'custom_discrete',
                                           'binomial',
                                           'binomial',
                                           #'binomial',
                                           'normal',

                                           #'binomial'
                                          ]
                                           ))

segment_variables_distribution['age']
'normal'
segment_variables_distribution
{'age': 'normal',
 'gender': 'binomial',
 'annual_income': 'normal',
 'willingness_to_leave': 'binomial',
 'annual_branch_visits': 'poisson',
 'service_satisfaction_level': 'custom_discrete',
 'use_debit_services': 'binomial',
 'use_online_mobile': 'binomial',
 'deposit_revenue': 'normal'}
segment_means = {'marginalised_middles':        [43, 0.45, 55000, 0.2, 6, None, 0.8, 0.7, 1000],
                 'disengaged_skeptics':         [54, 0.6, 38000, 0.5, 14, None, 0.2, 0.2, 340],
                 'satisfied_traditionalists':   [64, 0.55, 59000, 0.05, 30, None, 0.9, 0.15, 1200],
                 'struggling_techies':          [25, 0.45, 25000, 0.4, 2, None, 0.7, 0.9, 200],
                 'sophisticated_opportunists':  [45.5, 0.4, 70000, 0.2, 11, None, 0.9, 0.7, 2400]}
# standard deviations for each segment where applicable

segment_stddev = {'marginalised_middles':        [7, None, 10000, None, None, None, None, None,300],
                  'disengaged_skeptics':         [12, None, 7000, None, None, None, None, None, 120],
                  'satisfied_traditionalists':   [8, None, 11000, None, None, None, None, None,400],
                  'struggling_techies':          [4, None, 6000, None, None, None, None, None,80],
                  'sophisticated_opportunists':  [8, None, 16000, None, None, None, None, None,1000]}

# discrete variable probabilities
segment_pdist  = {'marginalised_middles':        [None, None, None, None, None,(0.05, 0.20, 0.35, 0.25, 0.15),None, None, None],
                  'disengaged_skeptics':         [None, None, None, None, None,(0.25, 0.25, 0.30, 0.15, 0.05),None, None, None],
                  'satisfied_traditionalists':   [None, None, None, None, None,(0.05, 0.10, 0.25, 0.30, 0.30),None, None, None],
                  'struggling_techies':          [None, None, None, None, None,(0.15, 0.25, 0.35, 0.20, 0.05),None, None, None],
                  'sophisticated_opportunists':  [None, None, None, None, None,(0.05, 0.15, 0.30, 0.25, 0.25),None, None, None]
                 }
segment_names = ['marginalised_middles', 'disengaged_skeptics', 'satisfied_traditionalists', 'struggling_techies',
                 'sophisticated_opportunists']
segment_sizes = dict(zip(segment_names,[180, 60, 160, 60, 40]))

segment_statistics = {}
for name in segment_names:
  segment_statistics[name] = {'size': segment_sizes[name]}
  for i, variable in enumerate(segment_variables):
    segment_statistics[name][variable] = {
        'mean': segment_means[name][i],
        'stddev': segment_stddev[name][i],
        'pdist': segment_pdist[name][i]
    }

Final segment data generation


#np.random.seed(seed=0)
from numpy.random import default_rng
from numpy.random import Generator, PCG64
from scipy import stats
rng = default_rng(0)
#rg = Generator(PCG64())
segment_constructor = {}

# Iterate over segments to create data for each
for name in segment_names:
  segment_data_subset = {}
  print('segment: {0}'.format(name))
  # Within each segment, iterate over the variables and generate data
  for variable in segment_variables:
    #print('\tvariable: {0}'.format(variable))
    if segment_variables_distribution[variable] == 'normal':
     
      #print(segment_statistics[name][variable]['mean'], segment_statistics[name][variable]['stddev'],segment_statistics[name]['size'] )
        # Draw random normals
      segment_data_subset[variable] = np.maximum(0, np.round(rng.normal(
          
          loc=segment_statistics[name][variable]['mean'],
          scale=segment_statistics[name][variable]['stddev'],
          size=segment_statistics[name]['size']
          
      ),2))
    elif segment_variables_distribution[variable] == 'poisson':
      # Draw counts
      segment_data_subset[variable] = rng.poisson(
          lam=segment_statistics[name][variable]['mean'],
          size=segment_statistics[name]['size']
      )
    elif segment_variables_distribution[variable] == 'binomial':
      # Draw binomials
      segment_data_subset[variable] = rng.binomial(
          n=1,
          p=segment_statistics[name][variable]['mean'],
          size=segment_statistics[name]['size']
      )
    elif segment_variables_distribution[variable] == 'custom_discrete':
      # Draw custom discrete ordinal variable
      xk = 1 + np.arange(5) # ordinal scale 1-5
      pk = segment_statistics[name][variable]['pdist']
      custm = stats.rv_discrete(name='custm', values=(xk, pk))
      segment_data_subset[variable] = custm.rvs(
          size=segment_statistics[name]['size']
      )
    else:
      # Data type unknown
      print('Bad segment data type: {0}'.format(
          segment_variables_distribution[variable])
           )
      raise StopIteration
  segment_data_subset['Segment'] = np.repeat(
      name,
      repeats=segment_statistics[name]['size']
  )
  #print(segment_data_subset)
    
  segment_constructor[name] = pd.DataFrame(segment_data_subset)

segment_data = pd.concat(segment_constructor.values())
#segment_data.reindex(range(0,500,1))
segment_data.index = range(0,len(segment_data),1)
segment: marginalised_middles
segment: disengaged_skeptics
segment: satisfied_traditionalists
segment: struggling_techies
segment: sophisticated_opportunists
segment_data.index
RangeIndex(start=0, stop=500, step=1)
np.random.seed(seed=2554)
name = 'sophisticated_opportunists'
variable = 'age'
print(segment_statistics[name][variable]['mean'])
print(segment_statistics[name][variable]['stddev'])
np.random.normal(
    loc=segment_statistics[name][variable]['mean'],
    scale=segment_statistics[name][variable]['stddev'],
    size=10
)
45.5
8





array([51.99132515, 36.35474139, 53.75517472, 47.15209108, 48.3509104 ,
       47.01605492, 35.91508828, 67.32429057, 45.67072279, 36.21226966])
np.random.seed(seed=1)
variable = 'annual_branch_visits'
print(segment_statistics[name][variable]['mean'])
print(segment_statistics[name][variable]['stddev'])
np.random.poisson(
    lam=segment_statistics[name][variable]['mean'],
    size=10
)
11
None





array([10,  7,  7, 10, 10,  8, 10,  7,  9,  5])
np.random.seed(seed=2554)
variable = 'use_online_mobile'
print(segment_statistics[name][variable]['mean'])
print(segment_statistics[name][variable]['stddev'])
np.random.binomial(
    n=1,
    p=segment_statistics[name][variable]['mean'],
    size=10
)
0.7
None





array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1])
np.repeat(name, repeats=10)
array(['sophisticated_opportunists', 'sophisticated_opportunists',
       'sophisticated_opportunists', 'sophisticated_opportunists',
       'sophisticated_opportunists', 'sophisticated_opportunists',
       'sophisticated_opportunists', 'sophisticated_opportunists',
       'sophisticated_opportunists', 'sophisticated_opportunists'],
      dtype='<U26')
# codings

segment_data['gender'] = segment_data['gender'].apply(
    lambda x: 'male' if x else 'female'
)
segment_data['use_online_mobile'] = segment_data['use_online_mobile'].apply(
    lambda x: True if x else False
)
segment_data['use_debit_services'] = segment_data['use_debit_services'].apply(
    lambda x: True if x else False
)
segment_data['willingness_to_leave'] = segment_data['willingness_to_leave'].apply(
    lambda x: True if x else False
)

# number adjustments
segment_data['service_satisfaction_level'] = segment_data['service_satisfaction_level'].apply(
    lambda x: np.minimum(x, 5)
)

segment_data.describe(include='all').T
count unique top freq mean std min 25% 50% 75% max
age 500 NaN NaN NaN 49.1268 15.0625 14.31 38.9975 47.865 61.175 84.47
gender 500 2 male 258 NaN NaN NaN NaN NaN NaN NaN
annual_income 500 NaN NaN NaN 52407.5 16297.6 10933 42061.9 53450.3 63444.1 101433
willingness_to_leave 500 2 False 403 NaN NaN NaN NaN NaN NaN NaN
annual_branch_visits 500 NaN NaN NaN 14.568 11.8464 0 5 10 26 46
service_satisfaction_level 500 NaN NaN NaN 3.266 1.10711 1 3 3 4 5
use_debit_services 500 2 True 390 NaN NaN NaN NaN NaN NaN NaN
use_online_mobile 500 2 True 262 NaN NaN NaN NaN NaN NaN NaN
deposit_revenue 500 NaN NaN NaN 1008.83 710.295 0 471.582 939.565 1310.11 5545.03
Segment 500 5 marginalised_middles 180 NaN NaN NaN NaN NaN NaN NaN

Fortunately we can skip the data cleaning step, as we have generated a complete data set with almost ideal variables.

We may adjust a few columns where required.

Exploratory data analysis

We have to understand the data we have just generated, both the predictors and the Segment labels.

age gender annual_income willingness_to_leave annual_branch_visits service_satisfaction_level use_debit_services use_online_mobile deposit_revenue Segment
0 43.88 male 54502.44 True 9 2 True True 518.14 marginalised_middles
1 42.08 male 51325.97 False 5 3 True False 1232.66 marginalised_middles
2 47.48 female 57187.86 False 9 5 True True 841.21 marginalised_middles
3 43.73 female 63448.89 False 9 4 True True 724.39 marginalised_middles
4 39.25 male 64933.36 False 6 3 True False 668.41 marginalised_middles
# overview variable distributions by segment

segment_data.groupby('Segment').describe().T
Segment disengaged_skeptics marginalised_middles satisfied_traditionalists sophisticated_opportunists struggling_techies
age count 60.000000 180.000000 160.000000 40.000000 60.000000
mean 51.970500 43.218778 64.607438 45.684000 25.020667
std 11.292721 6.784148 9.112192 8.194710 4.466628
min 26.240000 26.210000 41.360000 31.110000 14.310000
25% 44.340000 38.532500 59.020000 39.285000 22.070000
50% 52.685000 43.555000 65.175000 44.930000 25.465000
75% 58.415000 47.922500 70.787500 51.187500 28.087500
max 75.350000 57.020000 84.470000 62.450000 35.590000
annual_income count 60.000000 180.000000 160.000000 40.000000 60.000000
mean 39231.491833 55453.384333 59366.816563 71282.635500 25304.253667
std 7327.794135 10107.554102 11499.046189 15910.780974 5859.651345
min 23278.540000 16005.780000 36366.400000 34845.970000 10933.040000
25% 34189.782500 48449.300000 51417.595000 59929.825000 22691.042500
50% 39342.710000 54961.765000 59105.985000 69765.445000 24905.780000
75% 44564.330000 62889.307500 66734.842500 81407.995000 29252.742500
max 55407.350000 76188.030000 88451.780000 101432.840000 36739.460000
annual_branch_visits count 60.000000 180.000000 160.000000 40.000000 60.000000
mean 14.600000 5.766667 30.100000 11.075000 1.850000
std 3.340557 2.521638 5.586597 3.173508 1.549467
min 6.000000 1.000000 14.000000 5.000000 0.000000
25% 12.750000 4.000000 27.000000 8.750000 1.000000
50% 14.000000 6.000000 30.000000 12.000000 2.000000
75% 17.000000 7.000000 34.000000 13.000000 2.000000
max 25.000000 14.000000 46.000000 17.000000 8.000000
service_satisfaction_level count 60.000000 180.000000 160.000000 40.000000 60.000000
mean 2.533333 3.188889 3.650000 3.500000 3.050000
std 0.999435 1.066446 1.071060 0.847319 1.141290
min 1.000000 1.000000 1.000000 2.000000 1.000000
25% 2.000000 2.000000 3.000000 3.000000 2.000000
50% 2.500000 3.000000 4.000000 4.000000 3.000000
75% 3.000000 4.000000 4.000000 4.000000 4.000000
max 5.000000 5.000000 5.000000 5.000000 5.000000
deposit_revenue count 60.000000 180.000000 160.000000 40.000000 60.000000
mean 351.461667 981.358500 1202.073187 2546.885250 207.887000
std 117.945737 285.727065 395.677513 1084.883418 79.367803
min 100.540000 203.520000 158.100000 0.000000 36.660000
25% 269.722500 794.452500 925.917500 2089.477500 155.627500
50% 357.580000 957.185000 1178.785000 2513.155000 226.915000
75% 427.220000 1190.232500 1501.735000 2949.432500 255.882500
max 661.050000 1713.780000 2405.490000 5545.030000 362.050000
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("dark")
plt.figure(figsize=(10,7))

sns.boxplot(y='Segment', x='annual_branch_visits', hue='use_debit_services', data=segment_data, 
             orient='h')
<AxesSubplot:xlabel='annual_branch_visits', ylabel='Segment'>

png

plt.figure(figsize=(10,7))
sns.boxplot(y='Segment', x='annual_branch_visits', hue='use_online_mobile', data=segment_data, 
             orient='h')
<AxesSubplot:xlabel='annual_branch_visits', ylabel='Segment'>

png

sns.catplot(y='Segment', x='deposit_revenue', hue='use_online_mobile', col="gender",
                data=segment_data, kind="box", height=9, 
                aspect=.7);

png

plt.figure(figsize=(8,6))
sns.scatterplot(data=segment_data, x="age", y="annual_income", hue=segment_data.Segment)
<AxesSubplot:xlabel='age', ylabel='annual_income'>

png

plt.figure(figsize=(8,6))
sns.scatterplot(data=segment_data, x="annual_branch_visits", y="deposit_revenue", hue=segment_data.Segment)
<AxesSubplot:xlabel='annual_branch_visits', ylabel='deposit_revenue'>

png

plt.figure(figsize=(8,6))
sns.scatterplot(data=segment_data, x="annual_branch_visits", y="annual_income", hue=segment_data.Segment)
<AxesSubplot:xlabel='annual_branch_visits', ylabel='annual_income'>

png

plt.figure(figsize=(8,6))
sns.scatterplot(data=segment_data, x=segment_data.service_satisfaction_level + np.random.normal(scale=0.08,
size=sum(segment_sizes.values())) , y="age", hue=segment_data.Segment)
plt.legend(bbox_to_anchor=(1,1), loc="upper left")

<matplotlib.legend.Legend at 0x18208f96cc0>

png

# create the main dataframe
#from sklearn import preprocessing
df = segment_data.copy()
df.describe(include='all').T
count unique top freq mean std min 25% 50% 75% max
age 500 NaN NaN NaN 49.1268 15.0625 14.31 38.9975 47.865 61.175 84.47
gender 500 2 male 258 NaN NaN NaN NaN NaN NaN NaN
annual_income 500 NaN NaN NaN 52407.5 16297.6 10933 42061.9 53450.3 63444.1 101433
willingness_to_leave 500 2 False 403 NaN NaN NaN NaN NaN NaN NaN
annual_branch_visits 500 NaN NaN NaN 14.568 11.8464 0 5 10 26 46
service_satisfaction_level 500 NaN NaN NaN 3.242 1.1654 1 3 3 4 5
use_debit_services 500 2 True 390 NaN NaN NaN NaN NaN NaN NaN
use_online_mobile 500 2 True 262 NaN NaN NaN NaN NaN NaN NaN
deposit_revenue 500 NaN NaN NaN 1008.83 710.295 0 471.582 939.565 1310.11 5545.03
Segment 500 5 marginalised_middles 180 NaN NaN NaN NaN NaN NaN NaN
df.groupby('Segment').describe(include='all')
age gender annual_income willingness_to_leave annual_branch_visits service_satisfaction_level use_debit_services use_online_mobile deposit_revenue
count unique top freq mean std min 25% 50% 75% max count unique top freq mean std min 25% 50% 75% max count unique top freq mean std min 25% 50% 75% max count unique top freq mean std min 25% 50% 75% max count unique top freq mean std min 25% 50% 75% max count unique top freq mean std min 25% 50% 75% max count unique top freq mean std min 25% 50% 75% max count unique top freq mean std min 25% 50% 75% max count unique top freq mean std min 25% 50% 75% max
Segment
disengaged_skeptics 60.0 NaN NaN NaN 51.970500 11.292721 26.24 44.3400 52.685 58.4150 75.35 60 2 male 39 NaN NaN NaN NaN NaN NaN NaN 60.0 NaN NaN NaN 39231.491833 7327.794135 23278.54 34189.7825 39342.710 44564.3300 55407.35 60 2 False 35 NaN NaN NaN NaN NaN NaN NaN 60.0 NaN NaN NaN 14.600000 3.340557 6.0 12.75 14.0 17.0 25.0 60.0 NaN NaN NaN 2.400000 1.092346 1.0 1.75 2.0 3.0 5.0 60 2 False 44 NaN NaN NaN NaN NaN NaN NaN 60 2 False 52 NaN NaN NaN NaN NaN NaN NaN 60.0 NaN NaN NaN 351.461667 117.945737 100.54 269.7225 357.580 427.2200 661.05
marginalised_middles 180.0 NaN NaN NaN 43.218778 6.784148 26.21 38.5325 43.555 47.9225 57.02 180 2 male 93 NaN NaN NaN NaN NaN NaN NaN 180.0 NaN NaN NaN 55453.384333 10107.554102 16005.78 48449.3000 54961.765 62889.3075 76188.03 180 2 False 149 NaN NaN NaN NaN NaN NaN NaN 180.0 NaN NaN NaN 5.766667 2.521638 1.0 4.00 6.0 7.0 14.0 180.0 NaN NaN NaN 3.222222 1.059965 1.0 3.00 3.0 4.0 5.0 180 2 True 153 NaN NaN NaN NaN NaN NaN NaN 180 2 True 133 NaN NaN NaN NaN NaN NaN NaN 180.0 NaN NaN NaN 981.358500 285.727065 203.52 794.4525 957.185 1190.2325 1713.78
satisfied_traditionalists 160.0 NaN NaN NaN 64.607438 9.112192 41.36 59.0200 65.175 70.7875 84.47 160 2 male 87 NaN NaN NaN NaN NaN NaN NaN 160.0 NaN NaN NaN 59366.816563 11499.046189 36366.40 51417.5950 59105.985 66734.8425 88451.78 160 2 False 155 NaN NaN NaN NaN NaN NaN NaN 160.0 NaN NaN NaN 30.100000 5.586597 14.0 27.00 30.0 34.0 46.0 160.0 NaN NaN NaN 3.625000 1.174948 1.0 3.00 4.0 5.0 5.0 160 2 True 145 NaN NaN NaN NaN NaN NaN NaN 160 2 False 126 NaN NaN NaN NaN NaN NaN NaN 160.0 NaN NaN NaN 1202.073187 395.677513 158.10 925.9175 1178.785 1501.7350 2405.49
sophisticated_opportunists 40.0 NaN NaN NaN 45.684000 8.194710 31.11 39.2850 44.930 51.1875 62.45 40 2 female 21 NaN NaN NaN NaN NaN NaN NaN 40.0 NaN NaN NaN 71282.635500 15910.780974 34845.97 59929.8250 69765.445 81407.9950 101432.84 40 2 False 32 NaN NaN NaN NaN NaN NaN NaN 40.0 NaN NaN NaN 11.075000 3.173508 5.0 8.75 12.0 13.0 17.0 40.0 NaN NaN NaN 3.750000 1.126601 1.0 3.00 4.0 5.0 5.0 40 2 True 35 NaN NaN NaN NaN NaN NaN NaN 40 2 True 31 NaN NaN NaN NaN NaN NaN NaN 40.0 NaN NaN NaN 2546.885250 1084.883418 0.00 2089.4775 2513.155 2949.4325 5545.03
struggling_techies 60.0 NaN NaN NaN 25.020667 4.466628 14.31 22.0700 25.465 28.0875 35.59 60 2 female 40 NaN NaN NaN NaN NaN NaN NaN 60.0 NaN NaN NaN 25304.253667 5859.651345 10933.04 22691.0425 24905.780 29252.7425 36739.46 60 2 False 32 NaN NaN NaN NaN NaN NaN NaN 60.0 NaN NaN NaN 1.850000 1.549467 0.0 1.00 2.0 2.0 8.0 60.0 NaN NaN NaN 2.783333 0.903696 1.0 2.00 3.0 3.0 5.0 60 2 True 41 NaN NaN NaN NaN NaN NaN NaN 60 2 True 56 NaN NaN NaN NaN NaN NaN NaN 60.0 NaN NaN NaN 207.887000 79.367803 36.66 155.6275 226.915 255.8825 362.05
print(segment_data.dtypes)
df.dtypes.value_counts()
age                           float64
gender                         object
annual_income                 float64
willingness_to_leave             bool
annual_branch_visits            int64
service_satisfaction_level      int64
use_debit_services               bool
use_online_mobile                bool
deposit_revenue               float64
Segment                        object
dtype: object





float64    3
bool       3
int64      2
object     2
dtype: int64

This dataset consists of mixed variable types: numeric columns (float64 and int64) as well as categorical (object) and boolean columns.
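
As a small optional sketch, the column groups can be pulled apart by dtype, which is handy before plotting or encoding:

# split columns by dtype (pandas select_dtypes)
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
bool_cols    = df.select_dtypes(include=['bool']).columns.tolist()
object_cols  = df.select_dtypes(include=['object']).columns.tolist()
print(numeric_cols)
print(bool_cols)
print(object_cols)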

# frequencies plotted in Histogramms
import warnings
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    
f, axes = plt.subplots(3,4, figsize=(20, 20))
for ax, feature in zip(axes.flat, df.columns):
    sns.histplot(df[feature], ax=ax)
<string>:6: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
<string>:6: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
<string>:6: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
<string>:6: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
<string>:6: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
<string>:6: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.

png

A standard step in any exploratory data analysis is to consult the correlations of the various features. Because correlations only capture linear relationships, we do not spend too much time on this and turn to the Predictive Power Score instead.
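
As a minimal illustration of that limitation, a deterministic but non-linear relationship can yield a Pearson correlation of roughly zero (a small sketch with made-up data):

import numpy as np

x = np.linspace(-1, 1, 101)
y = x ** 2                       # y is fully determined by x, just not linearly
print(np.corrcoef(x, y)[0, 1])   # approximately 0.0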

# column pairwise correlations
plt.figure(figsize=(8,8))
sns.heatmap(df.corr('pearson'),annot = True, fmt='.2g', square=True)
<AxesSubplot:>

png

Predictive Power Score

Let’s use the Predictive Power Score (PPS) on this occasion to quickly uncover relations not only between the predictors but also with the Segment labels. The PPS is directional, hence asymmetric; it can handle any data type and can detect linear as well as non-linear relationships between two columns.

From the first row we can see at first sight that the most influential predictors for the Segment are annual_branch_visits, age and deposit_revenue.

### ppscore package ###
import ppscore as pps
def heatmap(df):
    plt.rcParams['figure.figsize'] = (11,7)
    df = df[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
    ax = sns.heatmap(df, vmin=0, vmax=1, linewidths=0.5, annot=True)
    ax.set_title("PPS matrix")
    ax.set_xlabel("feature")
    ax.set_ylabel("target")
    return ax
def corr_heatmap(df):
    ax = sns.heatmap(df, vmin=-1, vmax=1, linewidths=0.5, annot=True)
    ax.set_title("Correlation matrix")
    return ax
### ppscore package ###

ppsmatrix = pps.matrix(df)
heatmap(ppsmatrix)
<AxesSubplot:title={'center':'PPS matrix'}, xlabel='feature', ylabel='target'>

png

The PPS matrix is not symmetric like Pearson correlations; it can be read from feature to target or the other way round. For instance, the matrix tells us that annual_branch_visits predicts the likelihood of online/mobile use quite moderately with 0.54, but much less so the reverse, with only 0.15.
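
To check a single directed pair explicitly, and to pull out the strongest predictors of the Segment label from the matrix computed above, something like the following sketch can be used (pps.score and the x/y/ppscore columns come from the ppscore package; the exact numbers depend on the simulated data):

# directed scores: x -> y is generally not the same as y -> x
print(pps.score(df, x='annual_branch_visits', y='use_online_mobile')['ppscore'])
print(pps.score(df, x='use_online_mobile', y='annual_branch_visits')['ppscore'])

# strongest predictors of the Segment label, taken from the matrix above
print(ppsmatrix.query("y == 'Segment' and x != 'Segment'")
               .sort_values('ppscore', ascending=False)[['x', 'ppscore']])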

import seaborn as sns
sns.set(style="ticks", color_codes=True)
sns.set_style("dark")

plt.figure(figsize=(15,15))
customgrid = sns.pairplot(df, hue="Segment")
customgrid.map_upper(plt.scatter, linewidths=1, s=5, alpha=0.5)
plt.show()
<Figure size 1080x1080 with 0 Axes>

png

res = sns.relplot(data=df, x='annual_income', y='age', hue="Segment", alpha=0.6, col="use_online_mobile")

png

We could use Seaborn relational plots to observe multiple independent variables per Segment.

However, we still cannot look at all independent variables at the same time, so we turn to radar plots (spider plots), which are great for complete profiling as long as the number of variables is manageable or can be reduced to a handful of essential ones.

# https://gist.github.com/kylerbrown/29ce940165b22b8f25f4

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # improves plot aesthetics
from textwrap import wrap
import matplotlib.patches as mpatches

def _invert(x, limits):
    """inverts a value x on a scale from
    limits[0] to limits[1]"""
    return limits[1] - (x - limits[0])

def _scale_data(data, ranges):
    """scales data[1:] to ranges[0],
    inverts if the scale is reversed"""
    for d, (y1, y2) in zip(data[1:], ranges[1:]):
        #print( d, (y1, y2))
        assert (y1 <= d <= y2) or (y2 <= d <= y1)
    x1, x2 = ranges[0]
    d = data[0]
    if x1 > x2:
        d = _invert(d, (x1, x2))
        x1, x2 = x2, x1
    sdata = [d]
    for d, (y1, y2) in zip(data[1:], ranges[1:]):
        if y1 > y2:
            d = _invert(d, (y1, y2))
            y1, y2 = y2, y1
        sdata.append((d-y1) / (y2-y1) 
                     * (x2 - x1) + x1)
    return sdata

class ComplexRadar():
    def __init__(self, fig, variables, ranges,
                 n_ordinate_levels=6):
        angles = np.arange(0, 360, 360./len(variables))

        axes = [fig.add_axes([0.1,0.1,0.9,0.9],polar=True,
                label = "axes{}".format(i)) 
                for i in range(len(variables))]
        l, text = axes[0].set_thetagrids(angles, 
                                         labels=variables)
        
        [txt.set_rotation(angle-90) for txt, angle 
             in zip(text, angles)]
        
        for ax in axes[1:]:
            ax.patch.set_visible(False)
            ax.grid("off")
            ax.xaxis.set_visible(False)
        for i, ax in enumerate(axes):
            grid = np.linspace(*ranges[i], 
                               num=n_ordinate_levels)
            gridlabel = ["{}".format(round(x,1)) 
                         for x in grid]
            if ranges[i][0] > ranges[i][1]:
                grid = grid[::-1] # hack to invert grid
                          # gridlabels aren't reversed
            gridlabel[0] = "" # clean up origin
            ax.set_rgrids(grid, labels=gridlabel,
                         angle=angles[i])
            #ax.spines["polar"].set_visible(False)
            ax.set_ylim(*ranges[i])
        # variables for plotting
        self.angle = np.deg2rad(np.r_[angles, angles[0]])
        self.ranges = ranges
        self.ax = axes[0]
    def plot(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        r = self.ax.plot(self.angle, np.r_[sdata, sdata[0]], *args, **kw)
        return r
    def fill(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        self.ax.fill(self.angle, np.r_[sdata, sdata[0]], *args, **kw)
    def legend(self, *args, **kw):
        self.ax.legend(*args, **kw)

# prepare data to be plotted in the segment profiles        
criteria = ('age', 'gender', 'annual_income', 'willingness_to_leave',
       'annual_branch_visits', 'service_satisfaction_level',
       'use_debit_services', 
            'use_online_mobile', 'deposit_revenue')

segmeans = df.copy()
segmeans['gender'] = segmeans['gender'].apply(lambda x: 1 if x == 'male' else 0)

segmeans['willingness_to_leave'] = segmeans['willingness_to_leave'].apply(lambda x: 1 if x == True else 0)
segmeans['use_debit_services'] = segmeans['use_debit_services'].apply(lambda x: 1 if x == True else 0)
segmeans['use_online_mobile'] = segmeans['use_online_mobile'].apply(lambda x: 1 if x == True else 0)

segmeans = segmeans.groupby('Segment').mean()

segs = df.Segment.unique()

#test = segdata.get('marginalised_middles')

data=[]
segdata = dict()

for seg in segs:
    #for crit in criteria:
    data.append(segmeans.loc[seg])
    
    segdata.update({seg: data})
    data = []

ranges = []
for col in segmeans.columns:
    #print(df[col].dtype)
    distance = 0.2 # increase in case of assert errors
    if segmeans[col].dtype != 'object' and segmeans[col].dtype != 'bool':
        lower = min(segmeans[col])- min(segmeans[col])* distance
        upper = max(segmeans[col])+ max(segmeans[col])* distance
        if upper > 10:
            upper = round(upper,0)
            lower = round(lower,0)
        else:
            upper = round(upper,1)
            lower = round(lower,1)
        ranges.append((lower,upper))

# plotting



plt.rc('xtick', labelsize=8) 
plt.rc('ytick', labelsize=8) 
i=0
for key, value in segdata.items():
    #print(key, value)
    #plt.title(key)
    radar = ComplexRadar(plt.figure(figsize=(6, 6), dpi = 100), criteria, ranges)
    radar.plot(tuple(value[0]))
    radar.fill(tuple(value[0]), alpha=0.2)
    #plt.tight_layout(pad=2)
    plt.title('Segment ' + str(key), loc='left')
     
plt.show()  

png

png

png

png

png

import warnings
warnings.filterwarnings('ignore')
# ignore negligible user warnings for legend params

comparisons = [['disengaged_skeptics', 'marginalised_middles'],
             ['satisfied_traditionalists', 'sophisticated_opportunists'],
             ['struggling_techies', 'disengaged_skeptics']]

for comparison in comparisons:
    segdatasliced = {key: segdata[key] for key in segdata.keys() & comparison}
    
    radar = ComplexRadar(plt.figure(figsize=(6, 6), dpi = 100), criteria, ranges)
    #init
    title = ''
    ha=[]
    la=[]
    for key, value in segdatasliced.items():
        r = radar.plot(tuple(value[0]))
        radar.fill(tuple(value[0]), alpha=0.2)
        
        patch = mpatches.Patch(color=r[0].get_color(), label=key)
        ha.append(patch)
        la.append(key)
        #radar.legend(, loc=1)
        #plt.tight_layout(pad=2)
        
        if len(title) > 1:
            title = 'Segment Means Profile Comparison: ' + title + ' vs. ' + key
        else:
            title = key
            
    radar.legend(handles=ha, labels=la, loc=1, bbox_to_anchor=(0.1, 0.1))
    plt.title('\n'.join(wrap(title,40)), loc='left',   x=-0.3, y=0.94, pad=14)
    plt.show()

png

png

png

Segment Means Profile comparisons

With radar plots, the segments' average feature profiles can be compared visually, especially for those features which are particularly pronounced towards the population maxima and minima.

The differences become quite apparent and could hardly be wider in certain combinations: whether it is the strong use of debit services by the ‘middles’ or the high willingness to leave of the ‘skeptics’. What both have in common is an equally low average service satisfaction rating.

Cost Sensitive Multi-classification

Assuming we have a representative, independent set of customer records that can be labeled by customer service and domain experts, we should be able to create a classification model that assigns every other customer record to the correct customer segment (a multi-class classification task). For this notebook we simply use the dataset generated above.

Logistic regression model

For such a classification task I like to choose a logistic regression model by default. With logistic regression, however, there are a few assumptions that have to be met before it can be applied.

Assumptions to be checked

  1. Appropriate outcome type - our case will be multinomial regression, where the dependent variable y has more than two classes of outcomes.
  2. Logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.
  3. Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.
  4. Logistic regression assumes linearity of independent variables and log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds.
  5. Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10). (See the quick calculation right after this list.)
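
As a quick aside, the rule of thumb from point 5 can be checked with a back-of-the-envelope calculation; a small sketch using the segment sizes defined earlier (the assumptions overview table at the end of this section leaves this check open):

# rule of thumb: minimum N >= 10 * k / p
k = 9              # independent variables in our dataset
p = 40 / 500       # share of the least frequent outcome (sophisticated_opportunists)
print(10 * k / p)  # 1125.0 -- by this rough rule our 500 simulated records are on the small side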

The first two assumptions can be checked off quickly. Firstly, our case is multinomial regression, where the dependent variable y has more than two classes of outcomes; this is an appropriate outcome type, so we check off #1. Secondly, all observations should be independent of each other, which is the case here too, since we generated the data ourselves, so we mark assumption 2 as met as well.

Checking assumption 3 - little or no multicollinearity of predictors

We will check all features for elevated multicollinearity using the variance inflation factor (VIF). We use the function of the same name from the statsmodels package and prepare the data.

# copy as df
segm = df.copy()

segm['gender'] = segm['gender'].apply(
    lambda x: 1 if x == 'male' else 0
)
segm['use_online_mobile'] = segm['use_online_mobile'].apply(
    lambda x: 1 if x else 0
)
segm['use_debit_services'] = segm['use_debit_services'].apply(
    lambda x: 1 if x else 0
)
segm['willingness_to_leave'] = segm['willingness_to_leave'].apply(
    lambda x: 1 if x else 0
)

# copy as numpy array
segm_array = df.copy().values

X = segm_array[:, :segm.shape[1] - 1]
y = segm_array[:, segm.shape[1] - 1] 

# Variance Inflation Factor Check for Multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df_vif = add_constant(segm.iloc[:,:-1])
pd.Series([variance_inflation_factor(df_vif.values, i) for i in range(df_vif.shape[1])], index=df_vif.columns)
const                         36.130813
age                            2.516610
gender                         1.023785
annual_income                  1.643256
willingness_to_leave           1.158922
annual_branch_visits           2.628614
service_satisfaction_level     1.152449
use_debit_services             1.168716
use_online_mobile              1.431586
deposit_revenue                1.490094
dtype: float64

The test shows slightly elevated factors for the variables ‘age’ and ‘annual_branch_visits’.

Levels around 1 are low; the thresholds for raising concern range from 2.5 up to 10 and are therefore always somewhat subjective.

To be certain, we use another method to detect high multicollinearity: inspecting the eigenvalues of the correlation matrix.

A very low eigenvalue shows that the data are collinear, and the corresponding eigenvector can show which variables are collinear. Put another way: if there is no collinearity in the data, you would expect that none of the eigenvalues are close to zero.

features = segm.iloc[:,:-1]

# correlation matrix
corr = np.corrcoef(features, rowvar=0)

# eigen values & eigen vectors
w, v = np.linalg.eig(corr)

# eigen vector of a predictor
# pd.Series(v[:,3], index=segm.iloc[:,:-1].columns)

pd.Series(w, index=features.columns)
age                           2.797363
gender                        1.526171
annual_income                 0.233061
willingness_to_leave          0.432331
annual_branch_visits          0.518614
service_satisfaction_level    0.704103
use_debit_services            1.032196
use_online_mobile             0.848536
deposit_revenue               0.907625
dtype: float64

The result shows that none of the eigenvalues round to zero at one digit after the decimal point, which is basically enough to rule out a problem with multicollinearity. We can tick off assumption 3 as checked as well.

Let’s move on to assumption 4, which requires linear relationships between the independent variables and the model’s log odds.

# Needed to run the logistic regression
import statsmodels.formula.api as smf

# For plotting/checking assumptions
import seaborn as sns

df_check = segm.copy()

df_check['gender'] = df['gender'].astype('category')
df_check['use_online_mobile'] = df['use_online_mobile'].astype('category')
df_check['use_debit_services'] = df['use_debit_services'].astype('category')
df_check['willingness_to_leave'] = df['willingness_to_leave'].astype('category')


mapdict = dict(zip( df_check.Segment.unique(), range(0,5)))
df_check['Segment'] = df_check.Segment.map(mapdict)

#df_check['Segment'] = df['Segment'].astype('category')
df_check = pd.get_dummies(df_check)
df_check.to_csv('df_check.csv', index=False)

#dummy_segs = df_check.iloc[:,-5:].columns
predictors = df_check.iloc[:,:9].columns
## sklearn
from sklearn.linear_model import LogisticRegression

# Logistic Regression
testmodel = LogisticRegression(multi_class='multinomial')
testmodel.fit(df_check.iloc[:,:-1], df_check['Segment'])

# only for the purpose of this assumption test we predict with training data
probabilities = testmodel.predict_proba(df_check.iloc[:,:-1])

logodds = pd.DataFrame(np.log(probabilities/(1-probabilities)))
logodds.head()
[0 1 2 3 4]
0 1 2 3 4
0 -0.332695 -0.967721 -1.193680 -2.675036 -4.593497
1 -0.214611 -4.825835 -0.328147 -13.571020 -1.929562
2 0.123727 -2.417282 -0.575284 -6.976273 -3.611851
3 0.176263 -1.583293 -1.015212 -4.618127 -4.596909
4 0.152500 -1.298140 -1.288492 -3.688953 -4.958581
class_list = list(df.Segment.unique())
#yy = 10/pow(df_check.service_satisfaction_level + 3, 5)
#yy = df_check.service_satisfaction_level
yy = df_check.age
xx = logodds.iloc[:,4] # logodds for one class
from statsmodels.nonparametric.smoothers_lowess import lowess as  sm_lowess

plt.figure(figsize=(14,14))
plt.title('predictor age')

cols = 2
rows = np.ceil(len(df.Segment.unique()) / cols)

for i in range(0,len(df.Segment.unique())):
    ax = plt.subplot(rows, cols, i + 1) 
    sm_y, sm_x = sm_lowess(yy, logodds.iloc[:,i],  frac=3./5., it=5, return_sorted = True).T
    plt.plot(sm_x, sm_y, color='tomato', linewidth=3)
    plt.plot(yy,logodds.iloc[:,i], 'k.')
    ax.set_title(class_list[i])
    ax.set_xlabel("age")
    ax.set_ylabel("log odds")
    
plt.show()

png

By plotting the independent variable ‘age’ (standing in here for the other variables as well) against the model’s log odds, we can clearly see non-linear relationships for all classes. To document this further, we will proceed with a Box-Tidwell test and try transformed versions of this predictor to see whether a transformation achieves acceptable linearity.

# all continuous variables to transform
df_boxtidwell = df_check.copy()
df_boxtidwell = df_boxtidwell.drop(df_boxtidwell[df_boxtidwell.annual_branch_visits == 0].index)
df_boxtidwell = df_boxtidwell.drop(df_boxtidwell[df_boxtidwell.deposit_revenue == 0].index)
df_boxtidwell = df_boxtidwell.drop(df_boxtidwell[df_boxtidwell.annual_income == 0].index)
df_boxtidwell = df_boxtidwell.drop(df_boxtidwell[df_boxtidwell.service_satisfaction_level == 0].index)

# Define continuous variables
z=5
continuous_var = df_boxtidwell.iloc[:, 0:z].columns

# Add logit transform interaction terms (natural log) for continuous variables e.g. Age * Log(Age)
for var in continuous_var:
    df_boxtidwell[f'Transf_{var}'] = df_boxtidwell[var].apply(lambda x: x * np.log(x) ) #np.log = natural log

df_boxtidwell.head()
#continuous_var
cols = continuous_var.append(df_boxtidwell.iloc[:, -z:].columns)
cols
Index(['age', 'annual_income', 'annual_branch_visits',
       'service_satisfaction_level', 'deposit_revenue', 'Transf_age',
       'Transf_annual_income', 'Transf_annual_branch_visits',
       'Transf_service_satisfaction_level',
       'Transf_deposit_revenue'],
      dtype='object')
import statsmodels.api as sm

# Redefine independent variables to include interaction terms
#X_lt = df_boxtidwell[cols[-5:]]
X_lt = df_boxtidwell[cols]
#_lt = df_boxtidwell['deposit_revenue']
y_lt = df_boxtidwell['Segment']

# Add constant
X_lt = sm.add_constant(X_lt, prepend=False)
  
# Build model and fit the data (using statsmodel's Logit)
logit_result = sm.GLM(y_lt, X_lt, family=sm.families.Binomial()).fit()

# Display summary results
print(logit_result.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                Segment   No. Observations:                  488
Model:                            GLM   Df Residuals:                      477
Model Family:                Binomial   Df Model:                           10
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                    nan
Date:                Fri, 10 Dec 2021   Deviance:                       34679.
Time:                        15:42:50   Pearson chi2:                 3.50e+18
No. Iterations:                     7                                         
Covariance Type:            nonrobust                                         
=====================================================================================================
                                        coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
age                               -8.491e+15   6.72e+06  -1.26e+09      0.000   -8.49e+15   -8.49e+15
annual_income                     -2.671e+13   1.19e+04  -2.24e+09      0.000   -2.67e+13   -2.67e+13
annual_branch_visits              -2.924e+14   3.76e+06  -7.78e+07      0.000   -2.92e+14   -2.92e+14
service_satisfaction_level         3.532e+15   2.36e+07    1.5e+08      0.000    3.53e+15    3.53e+15
deposit_revenue                   -1.562e+14   8.41e+04  -1.86e+09      0.000   -1.56e+14   -1.56e+14
Transf_age                         1.735e+15   1.37e+06   1.27e+09      0.000    1.74e+15    1.74e+15
Transf_annual_income               2.248e+12   1000.974   2.25e+09      0.000    2.25e+12    2.25e+12
Transf_annual_branch_visits        2.267e+14   9.75e+05   2.33e+08      0.000    2.27e+14    2.27e+14
Transf_service_satisfaction_level -1.415e+15   1.13e+07  -1.25e+08      0.000   -1.41e+15   -1.41e+15
Transf_deposit_revenue             2.056e+13   1.01e+04   2.05e+09      0.000    2.06e+13    2.06e+13
const                              2.078e+17   7.07e+07   2.94e+09      0.000    2.08e+17    2.08e+17
=====================================================================================================

The GLM logit_result shows that basically all variables are flagged for non-linearity: the p-values are significant for both the regular and the transformed feature variables. The same holds for all other standard transformations. We therefore have to accept that this data is ill-suited for a logistic regression model and move on, instead of sinking more time into this.

Overview Logistic Regression assumptions check

| # | Assumption | Check |
|---|------------|:-----:|
| 1 | Appropriate outcome type - our case will be multinomial regression, where the dependent variable y has more than two classes of outcomes. | ☑ |
| 2 | Logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data. | ☑ |
| 3 | Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other. | ☑ |
| 4 | Logistic regression assumes linearity of independent variables and log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds. | ☒ |
| 5 | Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10). | |

Alternative multi-classification with Support Vector Machines extended by cost sensitivity

As an alternative to logistic regression we could pick the popular random forest algorithm, but for a change we go for a classic support vector machine (SVM) model. SVM does not come with a comparable list of assumptions to check, which is very soothing at this point ;)

Instead of algorithm selection and model tuning, we want to shift the focus towards integrating the costs of misclassification into the model building process, especially with asymmetric multi-class misclassification costs, where some misclassifications are more costly than others. This is where we actually have an optimization problem and a lever to achieve cost savings, which will contribute to the overall segmentation goal of increasing efficiency and effectiveness at the same time.

Different costs for different types of errors

The overall objective in classification is to minimize classification errors. With cost-sensitive classification, this objective shifts from minimizing the number of classification errors to minimizing the total misclassification cost, where each type of error between the multiple classes is associated with its own cost.

We define an error-cost matrix, in which the entry at cell (row $a$, column $b$) specifies how costly it is to predict class $b$ when the true class is $a$.
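
The total cost we will later evaluate on the test set is then simply the sum of the matrix entries selected by each (true class, predicted class) pair, which is what the ex-post cost calculations further below compute:

$$\text{TotalCost} = \sum_{i=1}^{N} C\left[y_i^{\text{true}},\, \hat{y}_i\right]$$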

import pandas as pd, numpy as np

# asymmetric, non-reciprocal costs of classification errors
# rows: true class, columns: predicted class

true_vs_predicted_class_cost = pd.DataFrame(index=segment_names, columns=segment_names)

true_vs_predicted_class_cost.marginalised_middles       = [0,10,10,10,10]
true_vs_predicted_class_cost.disengaged_skeptics        = [10,0, 10, 10, 10]
true_vs_predicted_class_cost.satisfied_traditionalists  = [10,10, 0, 10, 10]
true_vs_predicted_class_cost.struggling_techies         = [10,10, 10, 0, 10]
true_vs_predicted_class_cost.sophisticated_opportunists = [200,34, 40, 25, 0]

true_vs_predicted_class_cost = true_vs_predicted_class_cost.T # transpose the column-wise-built DataFrames
true_vs_predicted_class_cost 
marginalised_middles disengaged_skeptics satisfied_traditionalists struggling_techies sophisticated_opportunists
marginalised_middles 0 10 10 10 10
disengaged_skeptics 10 0 10 10 10
satisfied_traditionalists 10 10 0 10 10
struggling_techies 10 10 10 0 10
sophisticated_opportunists 200 34 40 25 0
# for computation we map the class labels to numeric classes

# map class labels numerically
d = dict([(y,x) for x,y in enumerate((segment_names))])

# convert to numeric indices
true_vs_predicted_class_cost.columns = d.values()
true_vs_predicted_class_cost.index = d.values()

true_vs_predicted_class_cost

import random
true_vs_predicted_class_cost
0 1 2 3 4
0 0 10 10 10 10
1 10 0 10 10 10
2 10 10 0 10 10
3 10 10 10 0 10
4 200 34 40 25 0
# map observations class string labels to number

# convert to np array
cost_matrix = true_vs_predicted_class_cost.values

try:
    y = [d[s] for s in y]
except:
    print('already done') 

# new cost array for package 'costsensitive'
C = np.array([cost_matrix[i] for i in y])
np.random.seed(1)
#C = (np.random.random_sample((500, 5))/100 + 1) * C

#C = np.linalg.inv(C)

# save cost matrix to csv file
# np.savetxt('costmatrix.csv', C, delimiter=',')

C
already done





array([[  0,  10,  10,  10,  10],
       [  0,  10,  10,  10,  10],
       [  0,  10,  10,  10,  10],
       ...,
       [200,  34,  40,  25,   0],
       [200,  34,  40,  25,   0],
       [200,  34,  40,  25,   0]], dtype=int64)

We use a machine learning model based on Support Vector Machines to classify the five customer segments from the client predictors, with the goal of generalizing well to new, real clients.

In a first step we fit a conventional SVM on this imbalanced dataset and use an appropriate evaluation metric: the weighted F1 score.
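
For reference, the weighted F1 score (what scikit-learn computes with average='weighted') averages the per-class F1 scores weighted by the class support $n_k$:

$$F1_{\text{weighted}} = \sum_{k} \frac{n_k}{N}\, F1_k, \qquad F1_k = \frac{2\, P_k R_k}{P_k + R_k}$$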

After that we will model the different costs of actually misclassifying clients into incorrect segments. This is important because not every error is equally costly. The metric to evaluate will then be the cost savings relative to the baseline cost of the conventional SVM model from step 1.

Baseline SVM model

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

#segm = segm.values
X = segm.iloc[:, :segm.shape[1] - 1].astype('float64')
y = segm.iloc[:, segm.shape[1] - 1] 

"""# numerical labels
try:
    y = [d[s] for s in y]
except:
    print('already done') """

# numerical labels
nummap = dict(zip(segment_names,range(0,5)))
y = y.map(nummap)
X_train, X_test, C_train, C_test, y_train, y_test = train_test_split(X, C, y, test_size=.5, random_state=0,
                                                                     stratify=y
                                                                    )
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
[[-0.06506306 -1.03252879 -0.535742   -0.52489066 -0.89210993  0.67795498
   0.51868635  0.88640526 -0.0872679 ]
 [-0.73825472  0.968496   -0.75466817 -0.52489066 -0.72250728 -1.89005632
   0.51868635  0.88640526 -0.22493628]]
[[ 0 10 10 10 10]
 [ 0 10 10 10 10]]
71    0
26    0
Name: Segment, dtype: int64
[[ 2.27122869 -1.03252879  0.27819165 -0.52489066  1.56712848  0.67795498
   0.51868635 -1.12815215  0.68012859]
 [-0.94041437 -1.03252879  1.54138938  1.90515869 -0.46810331  0.67795498
   0.51868635 -1.12815215  2.54650874]]
<class 'pandas.core.series.Series'>
(250, 9)
(250, 5)
(250,)

Fitting the classifiers and tracking test set results:

from sklearn import svm
from sklearn.metrics import confusion_matrix

### Keeping track of the results for later
name_algorithm = list()
f1_scores = list()
costs = list()
collect_conf_mat = list()

#### Baseline : Classic SVM with no weights

svm_model = svm.SVC(random_state=1)
svm_model.fit(X_train, y_train)

preds_svm = svm_model.predict(X_test)
from sklearn.metrics import precision_recall_fscore_support, f1_score

conf_mat = confusion_matrix(y_test, preds_svm)

f1 = f1_score(y_test, preds_svm, zero_division=1, average='weighted')
f1
0.9668887073655656
# logging results
name_algorithm.append("Classic Support Vector Machine SVC Method")
f1_scores.append(f1)
# ex post calculation of costs with the baseline model

costs.append( C_test[np.arange(C_test.shape[0]), preds_svm].sum() )
collect_conf_mat.append(confusion_matrix(y_test, preds_svm))
from sklearn.metrics import plot_confusion_matrix

disp = plot_confusion_matrix(svm_model, X_test, y_test,
                                 display_labels=segment_names,
                                 cmap=plt.cm.Blues)
disp.ax_.set_xticklabels(segment_names, size=12, rotation=30, ha='right')
plt.grid(False)
plt.show()

png

pd.DataFrame(precision_recall_fscore_support(y_test, preds_svm), columns=segment_names, index=['precision', 'recall', 'fscore', 'support'])
marginalised_middles disengaged_skeptics satisfied_traditionalists struggling_techies sophisticated_opportunists
precision 0.945652 0.937500 0.987654 1.0 1.000000
recall 0.966667 1.000000 1.000000 1.0 0.750000
fscore 0.956044 0.967742 0.993789 1.0 0.857143
support 90.000000 30.000000 80.000000 30.0 20.000000

Fitting cost-sensitive classifiers

The algorithms applied here are reduction methods that turn our cost-sensitive classification problem into a series of importance-weighted binary classification sub-problems (‘multiple oracle calls’); see the documentation for details: https://costsensitive.readthedocs.io/en/latest/

Github: https://github.com/david-cortes/costsensitive/

One theoretical algorithm implemented in that package originates here: Beygelzimer, A., Dani, V., Hayes, T., Langford, J., & Zadrozny, B. (2005, August). Error limiting reductions between classification tasks. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.2161&rep=rep1&type=pdf
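
To give a rough feel for the reduction idea, here is a simplified, hypothetical sketch of a cost-weighted one-vs-rest scheme (the function names toy_weighted_one_vs_rest and toy_predict are ours; this is not the costsensitive package's exact implementation):

import numpy as np
from sklearn import svm

def toy_weighted_one_vs_rest(X, C, base=svm.SVC):
    """One importance-weighted binary sub-problem per class: the positive label
    marks samples for which class k is the cheapest prediction, and the sample
    weight reflects how much is at stake for that sample."""
    classifiers = []
    for k in range(C.shape[1]):
        y_k = (C.argmin(axis=1) == k).astype(int)   # is k the least costly choice?
        others = np.delete(C, k, axis=1)
        w_k = np.abs(others.min(axis=1) - C[:, k])  # cost gap to the best alternative
        w_k = np.maximum(w_k, 1e-6)                 # avoid all-zero weights
        # assumes every class is the cheapest option for at least one sample
        # (true for our cost matrix, whose diagonal is zero)
        classifiers.append(base().fit(X, y_k, sample_weight=w_k))
    return classifiers

def toy_predict(classifiers, X):
    # pick the class whose binary "k is cheapest" classifier is most confident
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return scores.argmax(axis=1)

In our setup each class's own true samples always have cost 0, so every binary sub-problem contains positive examples; the package's implementations handle the general case more carefully.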

from costsensitive import WeightedAllPairs, WeightedOneVsRest, RegressionOneVsRest, FilterTree, CostProportionateClassifier
from sklearn import svm

### 1: WAP = Weighted All-Pairs as described in "Error limiting reductions between classification tasks"

costsensitive_WAP = WeightedAllPairs(svm.SVC(random_state=1), weigh_by_cost_diff=False)
costsensitive_WAP.fit(X_train, C_train)
preds_WAP = costsensitive_WAP.predict(X_test, method = 'most-wins')
f1_WAP = f1_score(y_test, preds_WAP, zero_division=1, average='weighted')
# log
collect_conf_mat.append(confusion_matrix(y_test, preds_WAP))
name_algorithm.append("Weighted All-Pairs SVM 'Error limiting reductions between classification tasks'")
f1_scores.append(f1_WAP)
costs.append(C_test[np.arange(C_test.shape[0]), preds_WAP].sum())


#### 2: WAP2 = Weighted All-Pairs - cost difference weights enabled

costsensitive_WAP2 = WeightedAllPairs(svm.SVC(random_state=1), weigh_by_cost_diff=True)
costsensitive_WAP2.fit(X_train, C_train)
preds_WAP2 = costsensitive_WAP2.predict(X_test)
f1_WAP2 = f1_score(y_test, preds_WAP2, zero_division=1, average='weighted')
# log
collect_conf_mat.append(confusion_matrix(y_test, preds_WAP2))
name_algorithm.append("Weighted All-Pairs SVM - cost difference weights enabled")
f1_scores.append(f1_WAP2)
costs.append( C_test[np.arange(C_test.shape[0]), preds_WAP2].sum() )


#### 3: WOR = Weighted One-Vs-Rest heuristic with simple cost-weighting schema

costsensitive_WOR = WeightedOneVsRest(svm.SVC(random_state=1), weight_simple_diff=True)
costsensitive_WOR.fit(X_train, C_train)
preds_WOR = costsensitive_WOR.predict(X_test)
f1_WOR = f1_score(y_test, preds_WOR, zero_division=1, average='weighted')
# log
collect_conf_mat.append(confusion_matrix(y_test, preds_WOR))
name_algorithm.append("Weighted One-Vs-Rest SVM, simple cost-weighting schema")
f1_scores.append(f1_WOR)
costs.append( C_test[np.arange(C_test.shape[0]), preds_WOR].sum() )

Optimization Results Comparison

numplots = len(collect_conf_mat)
#print(numplots)
fig, ax = plt.subplots(numplots,1, sharey=True, figsize=(4,17))
#ax[0].get_shared_x_axes().join(ax[1])
for c in range(0,numplots):
    #ax[0].get_shared_x_axes().join(ax[c])
    sns.heatmap(collect_conf_mat[c], yticklabels=segment_names, annot=True, ax=ax[c], cmap=plt.cm.Blues, cbar=False)
    ax[c].set_title(name_algorithm[c],loc='center', fontsize=13)
    #plt.title(drop_combos[c], loc='center', fontsize=5)
    ax[numplots-1].set_xlabel("Predicted")
    ax[c].set_ylabel("True")
    ax[numplots-1].set_xticklabels(segment_names, fontsize=12, rotation=35, ha='right')
    
#fig.tight_layout(h_pad=2)
plt.subplots_adjust(hspace=0.3)
fig.suptitle('Confusion Matrix Comparison', fontsize = 22)
plt.subplots_adjust(top=0.9)
plt.show()

png

import pandas as pd

results = pd.DataFrame({
    'Method' : name_algorithm,
    'F1' : f1_scores,
    'Total_Cost' : costs
})
results=results[['Method', 'Total_Cost', 'F1']]
results.set_index('Method')
Total_Cost F1
Method
Classic Support Vector Machine SVC Method 1030 0.966889
Weighted All-Pairs SVM 'Error limiting reductions between classification tasks' 470 0.963998
Weighted All-Pairs SVM - cost difference weights enabled 510 0.948454
Weighted One-Vs-Rest SVM, simple cost-weighting schema 640 0.971733
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline

rank = results.F1.sort_values().index
palette = sns.color_palette('Blues', len(rank))


fig = plt.gcf()
fig.set_size_inches(11, 8)
sns.set(font_scale=1.3)
ax = sns.barplot(x = "F1", y = "Method", data = results, palette=np.array(palette)[rank])
ax.set(xlabel='log F1 (higher is better)')
# data bar annotations
for p in ax.patches:
    ax.annotate("%.4f" % p.get_width(), xy=(p.get_width(), p.get_y()+p.get_height()/2),
            xytext=(5, 0), textcoords='offset points', ha="left", va="center")
    
ax.set_xscale("log")
plt.title('Cost-Sensitive Classification - F1 Score metric', fontsize=20)
plt.show()

png

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')
#%matplotlib inline

rank = results['Total_Cost'].sort_values(ascending=False).index
palette = sns.color_palette('Blues', len(rank))

fig = plt.gcf()
fig.set_size_inches(11, 8)
sns.set(font_scale=1.3)
ax = sns.barplot(x = "Total_Cost", y = "Method", data = results, palette=np.array(palette)[rank])
ax.set(xlabel='Total Cost (lower is better)')
# data bar annotations
for p in ax.patches:
    ax.annotate("%.0f" % p.get_width(), xy=(p.get_width(), p.get_y()+p.get_height()/2),
            xytext=(5, 0), textcoords='offset points', ha="left", va="center")
    
plt.title('Cost-Sensitive Classification - Total Cost metric', fontsize=20)
plt.show()

png

results
Method Total_Cost F1
0 Classic Support Vector Machine SVC Method 1030 0.966889
1 Weighted All-Pairs SVM 'Error limiting reducti... 470 0.963998
2 Weighted All-Pairs SVM - cost difference weigh... 510 0.948454
3 Weighted One-Vs-Rest SVM, simple cost-weightin... 640 0.971733
results = results.assign(CostSavings = lambda x: (x['Total_Cost'] - results.loc[0,'Total_Cost']) * -1)
results = results.assign(F1_Deviation = lambda x: (x['F1'] - results.loc[0,'F1']))
results
Method Total_Cost F1 CostSavings F1_Deviation
0 Classic Support Vector Machine SVC Method 1030 0.966889 0 0.000000
1 Weighted All-Pairs SVM 'Error limiting reducti... 470 0.963998 560 -0.002891
2 Weighted All-Pairs SVM - cost difference weigh... 510 0.948454 520 -0.018435
3 Weighted One-Vs-Rest SVM, simple cost-weightin... 640 0.971733 390 0.004845
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')
#%matplotlib inline

rank = results.iloc[1:,:]['CostSavings'].sort_values(ascending=True).index
rank2 = results.iloc[1:,:]['F1_Deviation'].sort_values(ascending=True).index
palette = sns.color_palette('Blues', len(rank))
palette2 = sns.color_palette('Reds', len(rank))

fig = plt.gcf()
fig.set_size_inches(11, 8)
sns.set(font_scale=1.3)
ax = sns.barplot(x = "CostSavings", y = "Method", data = results.iloc[1:,:], palette=np.array(palette)[rank-1])

ax.set(xlabel='Cost Savings (higher is better)')
ax2 = ax.twiny()
def change_width(ax, new_value) :
    locs = ax.get_yticks()
    for i,patch in enumerate(ax.patches):
        current_width = patch.get_width()
        diff = current_width - new_value
        # we change the bar width
        patch.set_height(new_value)
        # we recenter the bar
        patch.set_y(locs[i] - (new_value * .5))


ax2 = sns.barplot(x = "F1_Deviation", y = "Method", data = results.iloc[1:,:], palette=np.array(palette2)[rank2-1], ax=ax2)
ax2.set(xlabel='F1_Deviation (higher is better)')
change_width(ax2, .25)
   
plt.title('Comparison of Cost Sensitive Methods to the Baseline Model\n', fontsize=20)
plt.legend(['F1 Score Deviation', 'Cost Saving'], labelcolor=['crimson','royalblue'], loc='lower left', edgecolor='white')
plt.axvline(0, 0, color='crimson')
plt.show()

png

We can summarize that cost savings can be significant out of the box and may even come with a higher F1 score. With further tuning effort, however, this becomes much more of a trade-off decision, and a question of how far you want to favour marginal cost savings over a lower F1 score or accuracy.