Client Segments Data Simulation & Cost Sensitive Multi-classification
This notebook is loosely based on a financial customer segmentation published by BAI: https://www.latinia.com/IF/Documentos/The_New_Banking_Consumer_5_Core_Segments.pdf
Marketing research produces this kind of segmentation by studying the market and collecting and mining large amounts of data, which then feed into the segmentation analysis through clustering, expert knowledge and many different quantitative and qualitative variables (predictors). We cannot emulate that process here and instead accept the segmentation as given.
The general goal of market segmentation is to find groups of customers that differ in important ways related to product interest, market participation, or response to marketing efforts. By understanding the differences between groups, a marketer can make better strategic decisions about opportunities, product specification and positioning, and engage in more effective advertising.
Finance segmentation
Banks and many other types of financial institutions classify their clients and try to understand their behavioural structure, including whether they will pay their debts at all. If the bank has an improved understanding of the differences and nuances between these classes, it can rethink its marketing strategy, resulting in a more targeted, cost-effective approach and a greater return on investment.
Here is an example of banking customer segmentation by BAI Research Group.
Customer base is divided into 5 segments:
- Marginalised Middles
- Disengaged Skeptics
- Satisfied Traditionalists
- Struggling Techies
- Sophisticated Opportunists
Marginalised Middles
It is the largest group, containing 36% of the customers. This group is:
- Younger than the average age of 45.5
- Earning a larger annual income than average
- The least satisfied and most confused
- The least frequent visitor of the branches
- The most likely to pay someone else to handle their finances
Corresponding marketing strategy: Provide steady and clear marketing messages, such as fee disclosures.
Disengaged Skeptics
It is the second oldest of the five groups. This group:
- Earns below average
- Is the least satisfied with customer services
- Is less likely to utilise available services like internet or mobile banking
- Has a high concentration of accounts at other institutions
Corresponding marketing strategy: The door is open for other institutions to entice this unsatisfied group; be that other institution for them.
Satisfied Traditionalists
It is the oldest group, with above-average income. This segment:
- Is the least likely to use online, mobile, and debit services due to age and habits
- Has the second highest deposit revenue per household and expects a variety of product offerings
Corresponding marketing strategy: As this is a tempting segment with high deposit revenue and the highest investment balances, offer them a wide product range.
Struggling Techies
It is the youngest group, with the lowest income. This segment:
- Is active in managing its finances
- Is comfortable making decisions
- Is very receptive to other financial institutions with tempting offers
- Is the most likely to utilise online, mobile, and debit platforms
Corresponding marketing strategy: The receptivity of this segment makes it a worthwhile target; it is the easiest segment in which to gain new customers.
Sophisticated Opportunists
It has the average age and the highest income. This segment:
- Is the most satisfied with its primary financial institution
- Is very knowledgeable about the banking system and finance
- Has the highest return on mobile and debit services
- Has the highest deposit revenue per household
Corresponding marketing strategy: Treat this segment well; provide them with the right tools and innovative products to allow them to manage their own finances.
Source: https://www.segmentify.com/blog/customer-consumer-user-segmentation-examples#Marginalised_Middles
Client Segments Data Simulation
In this notebook we interpret the given client segmentation model and generate our own data based on our assumptions, a modest variety of features, and limited knowledge about the differences between these segments; it is not supposed to be a full-fledged, broad data generation effort. To do the actual segmentation as a step in marketing research, more variables would be needed to delimit each segment better, and even then there will always be clients that fit only marginally better to one cluster than another. With this exercise we simulate the whole process with our own generated data and check that the code works before any potentially private real data is exposed to exploratory analysis or machine learning further down the pipeline.
We can manipulate the synthetic data, re-run analyses, and examine how the results change. This data simulation highlights a strength of Python: it is easy to simulate data with various Python functions. And we will see that our limited dataset is reasonably well suited to assigning unsegmented clients to one of the given segments.
Let’s get started!
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 70)
Client Segment Data Definitions
segment_variables = [
'age',
'gender',
'annual_income',
'willingness_to_leave',
'annual_branch_visits',
'service_satisfaction_level', # scale 1-5
'use_debit_services',
'use_online_mobile',
'deposit_revenue',
#'knowledgeable'
]
segment_variables_distribution = dict(zip(segment_variables,
[
'normal',
'binomial',
'normal',
'binomial',
'poisson',
'custom_discrete',
'binomial',
'binomial',
#'binomial',
'normal',
#'binomial'
]
))
segment_variables_distribution['age']
'normal'
segment_variables_distribution
{'age': 'normal',
'gender': 'binomial',
'annual_income': 'normal',
'willingness_to_leave': 'binomial',
'annual_branch_visits': 'poisson',
'service_satisfaction_level': 'custom_discrete',
'use_debit_services': 'binomial',
'use_online_mobile': 'binomial',
'deposit_revenue': 'normal'}
segment_means = {'marginalised_middles': [43, 0.45, 55000, 0.2, 6, None, 0.8, 0.7, 1000],
'disengaged_skeptics': [54, 0.6, 38000, 0.5, 14, None, 0.2, 0.2, 340],
'satisfied_traditionalists': [64, 0.55, 59000, 0.05, 30, None, 0.9, 0.15, 1200],
'struggling_techies': [25, 0.45, 25000, 0.4, 2, None, 0.7, 0.9, 200],
'sophisticated_opportunists': [45.5, 0.4, 70000, 0.2, 11, None, 0.9, 0.7, 2400]}
# standard deviations for each segment where applicable
segment_stddev = {'marginalised_middles': [7, None, 10000, None, None, None, None, None,300],
'disengaged_skeptics': [12, None, 7000, None, None, None, None, None, 120],
'satisfied_traditionalists': [8, None, 11000, None, None, None, None, None,400],
'struggling_techies': [4, None, 6000, None, None, None, None, None,80],
'sophisticated_opportunists': [8, None, 16000, None, None, None, None, None,1000]}
# discrete variable probabilities
segment_pdist = {'marginalised_middles': [None, None, None, None, None,(0.05, 0.20, 0.35, 0.25, 0.15),None, None, None],
'disengaged_skeptics': [None, None, None, None, None,(0.25, 0.25, 0.30, 0.15, 0.05),None, None, None],
'satisfied_traditionalists': [None, None, None, None, None,(0.05, 0.10, 0.25, 0.30, 0.30),None, None, None],
'struggling_techies': [None, None, None, None, None,(0.15, 0.25, 0.35, 0.20, 0.05),None, None, None],
'sophisticated_opportunists': [None, None, None, None, None,(0.05, 0.15, 0.30, 0.25, 0.25),None, None, None]
}
segment_names = ['marginalised_middles', 'disengaged_skeptics', 'satisfied_traditionalists', 'struggling_techies',
'sophisticated_opportunists']
segment_sizes = dict(zip(segment_names,[180, 60, 160, 60, 40]))
segment_statistics = {}
for name in segment_names:
segment_statistics[name] = {'size': segment_sizes[name]}
for i, variable in enumerate(segment_variables):
segment_statistics[name][variable] = {
'mean': segment_means[name][i],
'stddev': segment_stddev[name][i],
'pdist': segment_pdist[name][i]
}
Final segment data generation
#np.random.seed(seed=0)
from numpy.random import default_rng
from numpy.random import Generator, PCG64
from scipy import stats
rng = default_rng(0)
#rg = Generator(PCG64())
segment_constructor = {}
# Iterate over segments to create data for each
for name in segment_names:
segment_data_subset = {}
print('segment: {0}'.format(name))
# Within each segment, iterate over the variables and generate data
for variable in segment_variables:
#print('\tvariable: {0}'.format(variable))
if segment_variables_distribution[variable] == 'normal':
#print(segment_statistics[name][variable]['mean'], segment_statistics[name][variable]['stddev'],segment_statistics[name]['size'] )
# Draw random normals
segment_data_subset[variable] = np.maximum(0, np.round(rng.normal(
loc=segment_statistics[name][variable]['mean'],
scale=segment_statistics[name][variable]['stddev'],
size=segment_statistics[name]['size']
),2))
elif segment_variables_distribution[variable] == 'poisson':
# Draw counts
segment_data_subset[variable] = rng.poisson(
lam=segment_statistics[name][variable]['mean'],
size=segment_statistics[name]['size']
)
elif segment_variables_distribution[variable] == 'binomial':
# Draw binomials
segment_data_subset[variable] = rng.binomial(
n=1,
p=segment_statistics[name][variable]['mean'],
size=segment_statistics[name]['size']
)
elif segment_variables_distribution[variable] == 'custom_discrete':
# Draw custom discrete ordinal variable
xk = 1 + np.arange(5) # ordinal scale 1-5
pk = segment_statistics[name][variable]['pdist']
custm = stats.rv_discrete(name='custm', values=(xk, pk))
segment_data_subset[variable] = custm.rvs(
size=segment_statistics[name]['size']
)
else:
# Data type unknown
print('Bad segment data type: {0}'.format(
segment_variables_distribution[variable])
)
raise StopIteration
segment_data_subset['Segment'] = np.repeat(
name,
repeats=segment_statistics[name]['size']
)
#print(segment_data_subset)
segment_constructor[name] = pd.DataFrame(segment_data_subset)
segment_data = pd.concat(segment_constructor.values())
#segment_data.reindex(range(0,500,1))
segment_data.index = range(0,len(segment_data),1)
segment: marginalised_middles
segment: disengaged_skeptics
segment: satisfied_traditionalists
segment: struggling_techies
segment: sophisticated_opportunists
segment_data.index
RangeIndex(start=0, stop=500, step=1)
np.random.seed(seed=2554)
name = 'sophisticated_opportunists'
variable = 'age'
print(segment_statistics[name][variable]['mean'])
print(segment_statistics[name][variable]['stddev'])
np.random.normal(
loc=segment_statistics[name][variable]['mean'],
scale=segment_statistics[name][variable]['stddev'],
size=10
)
45.5
8
array([51.99132515, 36.35474139, 53.75517472, 47.15209108, 48.3509104 ,
47.01605492, 35.91508828, 67.32429057, 45.67072279, 36.21226966])
np.random.seed(seed=1)
variable = 'annual_branch_visits'
print(segment_statistics[name][variable]['mean'])
print(segment_statistics[name][variable]['stddev'])
np.random.poisson(
lam=segment_statistics[name][variable]['mean'],
size=10
)
11
None
array([10, 7, 7, 10, 10, 8, 10, 7, 9, 5])
np.random.seed(seed=2554)
variable = 'use_online_mobile'
print(segment_statistics[name][variable]['mean'])
print(segment_statistics[name][variable]['stddev'])
np.random.binomial(
n=1,
p=segment_statistics[name][variable]['mean'],
size=10
)
0.7
None
array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1])
np.repeat(name, repeats=10)
array(['sophisticated_opportunists', 'sophisticated_opportunists',
'sophisticated_opportunists', 'sophisticated_opportunists',
'sophisticated_opportunists', 'sophisticated_opportunists',
'sophisticated_opportunists', 'sophisticated_opportunists',
'sophisticated_opportunists', 'sophisticated_opportunists'],
dtype='<U26')
# codings
segment_data['gender'] = segment_data['gender'].apply(
lambda x: 'male' if x else 'female'
)
segment_data['use_online_mobile'] = segment_data['use_online_mobile'].apply(
lambda x: True if x else False
)
segment_data['use_debit_services'] = segment_data['use_debit_services'].apply(
lambda x: True if x else False
)
segment_data['willingness_to_leave'] = segment_data['willingness_to_leave'].apply(
lambda x: True if x else False
)
# number adjustments
segment_data['service_satisfaction_level'] = segment_data['service_satisfaction_level'].apply(
lambda x: np.minimum(x, 5)
)
segment_data.describe(include='all').T
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
age | 500 | NaN | NaN | NaN | 49.1268 | 15.0625 | 14.31 | 38.9975 | 47.865 | 61.175 | 84.47 |
gender | 500 | 2 | male | 258 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
annual_income | 500 | NaN | NaN | NaN | 52407.5 | 16297.6 | 10933 | 42061.9 | 53450.3 | 63444.1 | 101433 |
willingness_to_leave | 500 | 2 | False | 403 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
annual_branch_visits | 500 | NaN | NaN | NaN | 14.568 | 11.8464 | 0 | 5 | 10 | 26 | 46 |
service_satisfaction_level | 500 | NaN | NaN | NaN | 3.266 | 1.10711 | 1 | 3 | 3 | 4 | 5 |
use_debit_services | 500 | 2 | True | 390 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
use_online_mobile | 500 | 2 | True | 262 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
deposit_revenue | 500 | NaN | NaN | NaN | 1008.83 | 710.295 | 0 | 471.582 | 939.565 | 1310.11 | 5545.03 |
Segment | 500 | 5 | marginalised_middles | 180 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Fortunately we can skip over the data cleaning part as we have generated a complete data set and almost ideal variables.
We may adjust a few columns where required.
Exploratory data analysis
We now have to understand the data we have just generated, both the predictors and the Segment labels.
age | gender | annual_income | willingness_to_leave | annual_branch_visits | service_satisfaction_level | use_debit_services | use_online_mobile | deposit_revenue | Segment | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 43.88 | male | 54502.44 | True | 9 | 2 | True | True | 518.14 | marginalised_middles |
1 | 42.08 | male | 51325.97 | False | 5 | 3 | True | False | 1232.66 | marginalised_middles |
2 | 47.48 | female | 57187.86 | False | 9 | 5 | True | True | 841.21 | marginalised_middles |
3 | 43.73 | female | 63448.89 | False | 9 | 4 | True | True | 724.39 | marginalised_middles |
4 | 39.25 | male | 64933.36 | False | 6 | 3 | True | False | 668.41 | marginalised_middles |
# overview variable distributions by segment
segment_data.groupby('Segment').describe().T
Segment | disengaged_skeptics | marginalised_middles | satisfied_traditionalists | sophisticated_opportunists | struggling_techies | |
---|---|---|---|---|---|---|
age | count | 60.000000 | 180.000000 | 160.000000 | 40.000000 | 60.000000 |
mean | 51.970500 | 43.218778 | 64.607438 | 45.684000 | 25.020667 | |
std | 11.292721 | 6.784148 | 9.112192 | 8.194710 | 4.466628 | |
min | 26.240000 | 26.210000 | 41.360000 | 31.110000 | 14.310000 | |
25% | 44.340000 | 38.532500 | 59.020000 | 39.285000 | 22.070000 | |
50% | 52.685000 | 43.555000 | 65.175000 | 44.930000 | 25.465000 | |
75% | 58.415000 | 47.922500 | 70.787500 | 51.187500 | 28.087500 | |
max | 75.350000 | 57.020000 | 84.470000 | 62.450000 | 35.590000 | |
annual_income | count | 60.000000 | 180.000000 | 160.000000 | 40.000000 | 60.000000 |
mean | 39231.491833 | 55453.384333 | 59366.816563 | 71282.635500 | 25304.253667 | |
std | 7327.794135 | 10107.554102 | 11499.046189 | 15910.780974 | 5859.651345 | |
min | 23278.540000 | 16005.780000 | 36366.400000 | 34845.970000 | 10933.040000 | |
25% | 34189.782500 | 48449.300000 | 51417.595000 | 59929.825000 | 22691.042500 | |
50% | 39342.710000 | 54961.765000 | 59105.985000 | 69765.445000 | 24905.780000 | |
75% | 44564.330000 | 62889.307500 | 66734.842500 | 81407.995000 | 29252.742500 | |
max | 55407.350000 | 76188.030000 | 88451.780000 | 101432.840000 | 36739.460000 | |
annual_branch_visits | count | 60.000000 | 180.000000 | 160.000000 | 40.000000 | 60.000000 |
mean | 14.600000 | 5.766667 | 30.100000 | 11.075000 | 1.850000 | |
std | 3.340557 | 2.521638 | 5.586597 | 3.173508 | 1.549467 | |
min | 6.000000 | 1.000000 | 14.000000 | 5.000000 | 0.000000 | |
25% | 12.750000 | 4.000000 | 27.000000 | 8.750000 | 1.000000 | |
50% | 14.000000 | 6.000000 | 30.000000 | 12.000000 | 2.000000 | |
75% | 17.000000 | 7.000000 | 34.000000 | 13.000000 | 2.000000 | |
max | 25.000000 | 14.000000 | 46.000000 | 17.000000 | 8.000000 | |
service_satisfaction_level | count | 60.000000 | 180.000000 | 160.000000 | 40.000000 | 60.000000 |
mean | 2.533333 | 3.188889 | 3.650000 | 3.500000 | 3.050000 | |
std | 0.999435 | 1.066446 | 1.071060 | 0.847319 | 1.141290 | |
min | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | |
25% | 2.000000 | 2.000000 | 3.000000 | 3.000000 | 2.000000 | |
50% | 2.500000 | 3.000000 | 4.000000 | 4.000000 | 3.000000 | |
75% | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | |
max | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | |
deposit_revenue | count | 60.000000 | 180.000000 | 160.000000 | 40.000000 | 60.000000 |
mean | 351.461667 | 981.358500 | 1202.073187 | 2546.885250 | 207.887000 | |
std | 117.945737 | 285.727065 | 395.677513 | 1084.883418 | 79.367803 | |
min | 100.540000 | 203.520000 | 158.100000 | 0.000000 | 36.660000 | |
25% | 269.722500 | 794.452500 | 925.917500 | 2089.477500 | 155.627500 | |
50% | 357.580000 | 957.185000 | 1178.785000 | 2513.155000 | 226.915000 | |
75% | 427.220000 | 1190.232500 | 1501.735000 | 2949.432500 | 255.882500 | |
max | 661.050000 | 1713.780000 | 2405.490000 | 5545.030000 | 362.050000 |
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("dark")
plt.figure(figsize=(10,7))
sns.boxplot(y='Segment', x='annual_branch_visits', hue='use_debit_services', data=segment_data,
orient='h')
<AxesSubplot:xlabel='annual_branch_visits', ylabel='Segment'>
plt.figure(figsize=(10,7))
sns.boxplot(y='Segment', x='annual_branch_visits', hue='use_online_mobile', data=segment_data,
orient='h')
<AxesSubplot:xlabel='annual_branch_visits', ylabel='Segment'>
sns.catplot(y='Segment', x='deposit_revenue', hue='use_online_mobile', col="gender",
data=segment_data, kind="box", height=9,
aspect=.7);
plt.figure(figsize=(8,6))
sns.scatterplot(data=segment_data, x="age", y="annual_income", hue=segment_data.Segment)
<AxesSubplot:xlabel='age', ylabel='annual_income'>
plt.figure(figsize=(8,6))
sns.scatterplot(data=segment_data, x="annual_branch_visits", y="deposit_revenue", hue=segment_data.Segment)
<AxesSubplot:xlabel='annual_branch_visits', ylabel='deposit_revenue'>
plt.figure(figsize=(8,6))
sns.scatterplot(data=segment_data, x="annual_branch_visits", y="annual_income", hue=segment_data.Segment)
<AxesSubplot:xlabel='annual_branch_visits', ylabel='annual_income'>
plt.figure(figsize=(8,6))
sns.scatterplot(data=segment_data, x=segment_data.service_satisfaction_level + np.random.normal(scale=0.08,
size=sum(segment_sizes.values())) , y="age", hue=segment_data.Segment)
plt.legend(bbox_to_anchor=(1,1), loc="upper left")
<matplotlib.legend.Legend at 0x18208f96cc0>
# create the main dataframe
#from sklearn import preprocessing
df = segment_data.copy()
df.describe(include='all').T
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
age | 500 | NaN | NaN | NaN | 49.1268 | 15.0625 | 14.31 | 38.9975 | 47.865 | 61.175 | 84.47 |
gender | 500 | 2 | male | 258 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
annual_income | 500 | NaN | NaN | NaN | 52407.5 | 16297.6 | 10933 | 42061.9 | 53450.3 | 63444.1 | 101433 |
willingness_to_leave | 500 | 2 | False | 403 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
annual_branch_visits | 500 | NaN | NaN | NaN | 14.568 | 11.8464 | 0 | 5 | 10 | 26 | 46 |
service_satisfaction_level | 500 | NaN | NaN | NaN | 3.242 | 1.1654 | 1 | 3 | 3 | 4 | 5 |
use_debit_services | 500 | 2 | True | 390 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
use_online_mobile | 500 | 2 | True | 262 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
deposit_revenue | 500 | NaN | NaN | NaN | 1008.83 | 710.295 | 0 | 471.582 | 939.565 | 1310.11 | 5545.03 |
Segment | 500 | 5 | marginalised_middles | 180 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df.groupby('Segment').describe(include='all')
age | gender | annual_income | willingness_to_leave | annual_branch_visits | service_satisfaction_level | use_debit_services | use_online_mobile | deposit_revenue | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
Segment | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
disengaged_skeptics | 60.0 | NaN | NaN | NaN | 51.970500 | 11.292721 | 26.24 | 44.3400 | 52.685 | 58.4150 | 75.35 | 60 | 2 | male | 39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 60.0 | NaN | NaN | NaN | 39231.491833 | 7327.794135 | 23278.54 | 34189.7825 | 39342.710 | 44564.3300 | 55407.35 | 60 | 2 | False | 35 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 60.0 | NaN | NaN | NaN | 14.600000 | 3.340557 | 6.0 | 12.75 | 14.0 | 17.0 | 25.0 | 60.0 | NaN | NaN | NaN | 2.400000 | 1.092346 | 1.0 | 1.75 | 2.0 | 3.0 | 5.0 | 60 | 2 | False | 44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 60 | 2 | False | 52 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 60.0 | NaN | NaN | NaN | 351.461667 | 117.945737 | 100.54 | 269.7225 | 357.580 | 427.2200 | 661.05 |
marginalised_middles | 180.0 | NaN | NaN | NaN | 43.218778 | 6.784148 | 26.21 | 38.5325 | 43.555 | 47.9225 | 57.02 | 180 | 2 | male | 93 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 180.0 | NaN | NaN | NaN | 55453.384333 | 10107.554102 | 16005.78 | 48449.3000 | 54961.765 | 62889.3075 | 76188.03 | 180 | 2 | False | 149 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 180.0 | NaN | NaN | NaN | 5.766667 | 2.521638 | 1.0 | 4.00 | 6.0 | 7.0 | 14.0 | 180.0 | NaN | NaN | NaN | 3.222222 | 1.059965 | 1.0 | 3.00 | 3.0 | 4.0 | 5.0 | 180 | 2 | True | 153 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 180 | 2 | True | 133 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 180.0 | NaN | NaN | NaN | 981.358500 | 285.727065 | 203.52 | 794.4525 | 957.185 | 1190.2325 | 1713.78 |
satisfied_traditionalists | 160.0 | NaN | NaN | NaN | 64.607438 | 9.112192 | 41.36 | 59.0200 | 65.175 | 70.7875 | 84.47 | 160 | 2 | male | 87 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 160.0 | NaN | NaN | NaN | 59366.816563 | 11499.046189 | 36366.40 | 51417.5950 | 59105.985 | 66734.8425 | 88451.78 | 160 | 2 | False | 155 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 160.0 | NaN | NaN | NaN | 30.100000 | 5.586597 | 14.0 | 27.00 | 30.0 | 34.0 | 46.0 | 160.0 | NaN | NaN | NaN | 3.625000 | 1.174948 | 1.0 | 3.00 | 4.0 | 5.0 | 5.0 | 160 | 2 | True | 145 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 160 | 2 | False | 126 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 160.0 | NaN | NaN | NaN | 1202.073187 | 395.677513 | 158.10 | 925.9175 | 1178.785 | 1501.7350 | 2405.49 |
sophisticated_opportunists | 40.0 | NaN | NaN | NaN | 45.684000 | 8.194710 | 31.11 | 39.2850 | 44.930 | 51.1875 | 62.45 | 40 | 2 | female | 21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.0 | NaN | NaN | NaN | 71282.635500 | 15910.780974 | 34845.97 | 59929.8250 | 69765.445 | 81407.9950 | 101432.84 | 40 | 2 | False | 32 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.0 | NaN | NaN | NaN | 11.075000 | 3.173508 | 5.0 | 8.75 | 12.0 | 13.0 | 17.0 | 40.0 | NaN | NaN | NaN | 3.750000 | 1.126601 | 1.0 | 3.00 | 4.0 | 5.0 | 5.0 | 40 | 2 | True | 35 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40 | 2 | True | 31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.0 | NaN | NaN | NaN | 2546.885250 | 1084.883418 | 0.00 | 2089.4775 | 2513.155 | 2949.4325 | 5545.03 |
struggling_techies | 60.0 | NaN | NaN | NaN | 25.020667 | 4.466628 | 14.31 | 22.0700 | 25.465 | 28.0875 | 35.59 | 60 | 2 | female | 40 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 60.0 | NaN | NaN | NaN | 25304.253667 | 5859.651345 | 10933.04 | 22691.0425 | 24905.780 | 29252.7425 | 36739.46 | 60 | 2 | False | 32 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 60.0 | NaN | NaN | NaN | 1.850000 | 1.549467 | 0.0 | 1.00 | 2.0 | 2.0 | 8.0 | 60.0 | NaN | NaN | NaN | 2.783333 | 0.903696 | 1.0 | 2.00 | 3.0 | 3.0 | 5.0 | 60 | 2 | True | 41 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 60 | 2 | True | 56 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 60.0 | NaN | NaN | NaN | 207.887000 | 79.367803 | 36.66 | 155.6275 | 226.915 | 255.8825 | 362.05 |
print(segment_data.dtypes)
df.dtypes.value_counts()
age float64
gender object
annual_income float64
willingness_to_leave bool
annual_branch_visits int64
service_satisfaction_level int64
use_debit_services bool
use_online_mobile bool
deposit_revenue float64
Segment object
dtype: object
float64 3
bool 3
int64 2
object 2
dtype: int64
This dataset consists of mixed variable types: numeric (float64 and int64) as well as categorical (object) and boolean data.
# frequencies plotted in histograms
import warnings
with warnings.catch_warnings():
warnings.simplefilter('ignore')
f, axes = plt.subplots(3,4, figsize=(20, 20))
for ax, feature in zip(axes.flat, df.columns):
sns.histplot(df[feature], ax=ax)
<string>:6: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
A standard step in any exploratory data analysis is to consult the correlations between the various features. Because Pearson correlation captures only linear relationships, we do not spend too much time on this and instead turn to the Predictive Power Score.
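As a quick, self-contained illustration (independent of our segment data, with made-up variables x_demo and y_demo) of why relying on Pearson correlation alone can mislead: a perfectly deterministic but non-monotonic relationship yields a correlation of roughly zero.
import numpy as np
x_demo = np.linspace(-3, 3, 201)
y_demo = x_demo ** 2                      # y is fully determined by x
print(np.corrcoef(x_demo, y_demo)[0, 1])  # close to 0, the linear measure misses the relation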
# column pairwise correlations
plt.figure(figsize=(8,8))
sns.heatmap(df.corr('pearson'),annot = True, fmt='.2g', square=True)
<AxesSubplot:>
Predictive Power Score
Let’s use the Predictive Power Score (PPS) here to quickly uncover relations not only between the predictors but also with the Segment labels. The PPS is directional, hence asymmetric; it can handle any data type and can detect linear or non-linear relationships between two columns.
From the first row we can see at a glance that the most influential predictors for the Segment are annual_branch_visits, age and deposit_revenue.
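Before computing the full matrix below, a single feature/target pair can be scored directly; a minimal sketch, assuming the ppscore package (imported below as pps) behaves as documented and using column names from our generated data:
import ppscore as pps
# pps.score returns a dict; its 'ppscore' entry lies between 0 (no predictive power) and 1 (perfect prediction)
single_pair = pps.score(df, "annual_branch_visits", "Segment")
print(single_pair["ppscore"])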
### ppscore package ###
import ppscore as pps
def heatmap(df):
plt.rcParams['figure.figsize'] = (11,7)
df = df[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
ax = sns.heatmap(df, vmin=0, vmax=1, linewidths=0.5, annot=True)
ax.set_title("PPS matrix")
ax.set_xlabel("feature")
ax.set_ylabel("target")
return ax
def corr_heatmap(df):
ax = sns.heatmap(df, vmin=-1, vmax=1, linewidths=0.5, annot=True)
ax.set_title("Correlation matrix")
return ax
### ppscore package ###
ppsmatrix = pps.matrix(df)
heatmap(ppsmatrix)
<AxesSubplot:title={'center':'PPS matrix'}, xlabel='feature', ylabel='target'>
Unlike Pearson correlation, the PPS matrix is not symmetric and must be read from feature to target. For instance, we can learn from the matrix that annual_branch_visits predicts the likelihood of online/mobile use moderately well with a score of 0.54, but much less so the other way round, with only 0.15.
import seaborn as sns
sns.set(style="ticks", color_codes=True)
sns.set_style("dark")
plt.figure(figsize=(15,15))
customgrid = sns.pairplot(df, hue="Segment")
customgrid.map_upper(plt.scatter, linewidths=1, s=5, alpha=0.5)
plt.show()
<Figure size 1080x1080 with 0 Axes>
res = sns.relplot(data=df, x='annual_income', y='age', hue="Segment", alpha=0.6, col="use_online_mobile")
We can use Seaborn relational plots to observe multiple independent variables per Segment.
However, we still cannot show all independent variables at the same time, so we turn to radar plots (spider plots), which are great for complete profiling as long as the number of variables is manageable or can be reduced to a handful of essential ones.
# https://gist.github.com/kylerbrown/29ce940165b22b8f25f4
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # improves plot aesthetics
from textwrap import wrap
import matplotlib.patches as mpatches
def _invert(x, limits):
"""inverts a value x on a scale from
limits[0] to limits[1]"""
return limits[1] - (x - limits[0])
def _scale_data(data, ranges):
"""scales data[1:] to ranges[0],
inverts if the scale is reversed"""
for d, (y1, y2) in zip(data[1:], ranges[1:]):
#print( d, (y1, y2))
assert (y1 <= d <= y2) or (y2 <= d <= y1)
x1, x2 = ranges[0]
d = data[0]
if x1 > x2:
d = _invert(d, (x1, x2))
x1, x2 = x2, x1
sdata = [d]
for d, (y1, y2) in zip(data[1:], ranges[1:]):
if y1 > y2:
d = _invert(d, (y1, y2))
y1, y2 = y2, y1
sdata.append((d-y1) / (y2-y1)
* (x2 - x1) + x1)
return sdata
class ComplexRadar():
def __init__(self, fig, variables, ranges,
n_ordinate_levels=6):
angles = np.arange(0, 360, 360./len(variables))
axes = [fig.add_axes([0.1,0.1,0.9,0.9],polar=True,
label = "axes{}".format(i))
for i in range(len(variables))]
l, text = axes[0].set_thetagrids(angles,
labels=variables)
[txt.set_rotation(angle-90) for txt, angle
in zip(text, angles)]
for ax in axes[1:]:
ax.patch.set_visible(False)
ax.grid("off")
ax.xaxis.set_visible(False)
for i, ax in enumerate(axes):
grid = np.linspace(*ranges[i],
num=n_ordinate_levels)
gridlabel = ["{}".format(round(x,1))
for x in grid]
if ranges[i][0] > ranges[i][1]:
grid = grid[::-1] # hack to invert grid
# gridlabels aren't reversed
gridlabel[0] = "" # clean up origin
ax.set_rgrids(grid, labels=gridlabel,
angle=angles[i])
#ax.spines["polar"].set_visible(False)
ax.set_ylim(*ranges[i])
# variables for plotting
self.angle = np.deg2rad(np.r_[angles, angles[0]])
self.ranges = ranges
self.ax = axes[0]
def plot(self, data, *args, **kw):
sdata = _scale_data(data, self.ranges)
r = self.ax.plot(self.angle, np.r_[sdata, sdata[0]], *args, **kw)
return r
def fill(self, data, *args, **kw):
sdata = _scale_data(data, self.ranges)
self.ax.fill(self.angle, np.r_[sdata, sdata[0]], *args, **kw)
def legend(self, *args, **kw):
self.ax.legend(*args, **kw)
# prepare data to be plotted in the segment profiles
criteria = ('age', 'gender', 'annual_income', 'willingness_to_leave',
'annual_branch_visits', 'service_satisfaction_level',
'use_debit_services',
'use_online_mobile', 'deposit_revenue')
segmeans = df.copy()
segmeans['gender'] = segmeans['gender'].apply(lambda x: 1 if x == 'male' else 0)
segmeans['willingness_to_leave'] = segmeans['willingness_to_leave'].apply(lambda x: 1 if x == True else 0)
segmeans['use_debit_services'] = segmeans['use_debit_services'].apply(lambda x: 1 if x == True else 0)
segmeans['use_online_mobile'] = segmeans['use_online_mobile'].apply(lambda x: 1 if x == True else 0)
segmeans = segmeans.groupby('Segment').mean()
segs = df.Segment.unique()
#test = segdata.get('marginalised_middles')
data=[]
segdata = dict()
for seg in segs:
#for crit in criteria:
data.append(segmeans.loc[seg])
segdata.update({seg: data})
data = []
ranges = []
for col in segmeans.columns:
#print(df[col].dtype)
distance = 0.2 # increase in case of assert errors
if segmeans[col].dtype != 'object' and segmeans[col].dtype != 'bool':
lower = min(segmeans[col])- min(segmeans[col])* distance
upper = max(segmeans[col])+ max(segmeans[col])* distance
if upper > 10:
upper = round(upper,0)
lower = round(lower,0)
else:
upper = round(upper,1)
lower = round(lower,1)
ranges.append((lower,upper))
# plotting
plt.rc('xtick', labelsize=8)
plt.rc('ytick', labelsize=8)
i=0
for key, value in segdata.items():
#print(key, value)
#plt.title(key)
radar = ComplexRadar(plt.figure(figsize=(6, 6), dpi = 100), criteria, ranges)
radar.plot(tuple(value[0]))
radar.fill(tuple(value[0]), alpha=0.2)
#plt.tight_layout(pad=2)
plt.title('Segment ' + str(key), loc='left')
plt.show()
import warnings
warnings.filterwarnings('ignore')
#ignore negligible user warning for legend params
comparisons = [['disengaged_skeptics', 'marginalised_middles'],
['satisfied_traditionalists', 'sophisticated_opportunists'],
['struggling_techies', 'disengaged_skeptics']]
for comparison in comparisons:
segdatasliced = {key: segdata[key] for key in segdata.keys() & comparison}
radar = ComplexRadar(plt.figure(figsize=(6, 6), dpi = 100), criteria, ranges)
#init
title = ''
ha=[]
la=[]
for key, value in segdatasliced.items():
r = radar.plot(tuple(value[0]))
radar.fill(tuple(value[0]), alpha=0.2)
patch = mpatches.Patch(color=r[0].get_color(), label=key)
ha.append(patch)
la.append(key)
#radar.legend(, loc=1)
#plt.tight_layout(pad=2)
if len(title) > 1:
title = 'Segment Means Profile Comparison: ' + title + ' vs. ' + key
else:
title = key
radar.legend(handles=ha, labels=la, loc=1, bbox_to_anchor=(0.1, 0.1))
plt.title('\n'.join(wrap(title,40)), loc='left', x=-0.3, y=0.94, pad=14)
plt.show()
Segment Means Profile comparisons
With radar plots the average feature profiles of the segments can be compared visually, especially for those features that are particularly pronounced towards the population maxima and minima.
The differences become quite apparent and could hardly be wider in certain combinations: whether it is the strong use of debit services by the ‘middles’ or the high willingness to leave of the ‘skeptics’. What both have in common is an equally low average service satisfaction rating.
Cost Sensitive Multi-classification
Assuming we have a representative, independent set of customer records that can be labeled by customer service and domain experts, we may be able to create a classification model that assigns every other customer record to the correct customer segment (a multi-class classification task). For this notebook we just use the dataset generated above.
Logistic regression model
For such a classification I like to choose a logistic regression model by default. With logistic regression there are a few assumptions to be met before it can be applied.
Assumptions to be checked
- Appropriate outcome type - our case will be multinomial regression, where the dependent variable y has more than two outcome classes.
- Logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.
- Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.
- Logistic regression assumes linearity of independent variables and log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds.
- Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10).
The first two assumptions can be checked off quickly. Firstly, our case is a multinomial regression where the dependent variable y has more than two outcome classes; this is an appropriate outcome type, so we check off #1. Secondly, all observations should be independent of each other, which is the case here too, as we generated the data ourselves, so we mark assumption #2 as met as well.
Checking assumption 3 - little or no multicollinearity of predictors
We check all features for elevated multicollinearity using the Variance Inflation Factor (VIF). We use the variance_inflation_factor function from the statsmodels package and prepare the data.
# copy as df
segm = df.copy()
segm['gender'] = segm['gender'].apply(
lambda x: 1 if x == 'male' else 0
)
segm['use_online_mobile'] = segm['use_online_mobile'].apply(
lambda x: 1 if x else 0
)
segm['use_debit_services'] = segm['use_debit_services'].apply(
lambda x: 1 if x else 0
)
segm['willingness_to_leave'] = segm['willingness_to_leave'].apply(
lambda x: 1 if x else 0
)
# copy as numpy array
segm_array = df.copy().values
X = segm_array[:, :segm.shape[1] - 1]
y = segm_array[:, segm.shape[1] - 1]
# Variance Inflation Factor Check for Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
df_vif = add_constant(segm.iloc[:,:-1])
pd.Series([variance_inflation_factor(df_vif.values, i) for i in range(df_vif.shape[1])], index=df_vif.columns)
const 36.130813
age 2.516610
gender 1.023785
annual_income 1.643256
willingness_to_leave 1.158922
annual_branch_visits 2.628614
service_satisfaction_level 1.152449
use_debit_services 1.168716
use_online_mobile 1.431586
deposit_revenue 1.490094
dtype: float64
The test shows slightly elevated factors for the variables ‘age’ and ‘annual_branch_visits’.
Levels around 1 are low; the thresholds at which to raise concerns range from 2.5 up to 10 and are therefore always somewhat subjective.
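For reference, the VIF of a predictor $j$ is computed from the coefficient of determination $R_j^2$ obtained by regressing that predictor on all the other predictors: $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$. A VIF of 2.5, for example, corresponds to $R_j^2 = 0.6$, i.e. 60% of that predictor's variance being explained by the other predictors.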
To be certain, we use another method to detect high multicollinearity: inspecting the eigenvalues of the correlation matrix.
A very low eigenvalue shows that the data are collinear, and the corresponding eigenvector can show which variables are involved. Put another way: if there is no collinearity in the data, you would expect none of the eigenvalues to be close to zero.
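A quick way to see this mechanism is a small synthetic example (the variables a, b and c below are made up for illustration only): two almost perfectly collinear columns produce one eigenvalue of the correlation matrix that is close to zero.
import numpy as np
rng_demo = np.random.default_rng(42)
a = rng_demo.normal(size=300)
b = 2 * a + rng_demo.normal(scale=0.01, size=300)   # nearly a linear copy of a
c = rng_demo.normal(size=300)                       # unrelated column
corr_demo = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)
print(np.linalg.eigvals(corr_demo))                 # one eigenvalue near 2, one near 1, one near 0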
features = segm.iloc[:,:-1]
# correlation matrix
corr = np.corrcoef(features, rowvar=0)
# eigen values & eigen vectors
w, v = np.linalg.eig(corr)
# eigen vector of a predictor
# pd.Series(v[:,3], index=segm.iloc[:,:-1].columns)
pd.Series(w, index=features.columns)
age 2.797363
gender 1.526171
annual_income 0.233061
willingness_to_leave 0.432331
annual_branch_visits 0.518614
service_satisfaction_level 0.704103
use_debit_services 1.032196
use_online_mobile 0.848536
deposit_revenue 0.907625
dtype: float64
The result shows that none of the eigenvalues round to zero at one digit after the decimal point, so that is basically enough to rule out a problem with multicollinearity. We can tick assumption 3 as checked as well.
Let’s move quickly to assumption 4, which requires linear relationships between the independent variables and the model’s log odds.
# Needed to run the logistic regression
import statsmodels.formula.api as smf
# For plotting/checking assumptions
import seaborn as sns
df_check = segm.copy()
df_check['gender'] = df['gender'].astype('category')
df_check['use_online_mobile'] = df['use_online_mobile'].astype('category')
df_check['use_debit_services'] = df['use_debit_services'].astype('category')
df_check['willingness_to_leave'] = df['willingness_to_leave'].astype('category')
mapdict = dict(zip( df_check.Segment.unique(), range(0,5)))
df_check['Segment'] = df_check.Segment.map(mapdict)
#df_check['Segment'] = df['Segment'].astype('category')
df_check = pd.get_dummies(df_check)
df_check.to_csv('df_check.csv', index=False)
#dummy_segs = df_check.iloc[:,-5:].columns
predictors = df_check.iloc[:,:9].columns
## sklearn
from sklearn.linear_model import LogisticRegression
# Logistic Regression
testmodel = LogisticRegression(multi_class='multinomial')
testmodel.fit(df_check.iloc[:,:-1], df_check['Segment'])
# only for the purpose of this assumption test we predict with training data
probabilities = testmodel.predict_proba(df_check.iloc[:,:-1])
logodds = pd.DataFrame(np.log(probabilities/(1-probabilities)))
logodds.head()
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -0.332695 | -0.967721 | -1.193680 | -2.675036 | -4.593497 |
1 | -0.214611 | -4.825835 | -0.328147 | -13.571020 | -1.929562 |
2 | 0.123727 | -2.417282 | -0.575284 | -6.976273 | -3.611851 |
3 | 0.176263 | -1.583293 | -1.015212 | -4.618127 | -4.596909 |
4 | 0.152500 | -1.298140 | -1.288492 | -3.688953 | -4.958581 |
class_list = list(df.Segment.unique())
#yy = 10/pow(df_check.service_satisfaction_level + 3, 5)
#yy = df_check.service_satisfaction_level
yy = df_check.age
xx = logodds.iloc[:,4] # logodds for one class
from statsmodels.nonparametric.smoothers_lowess import lowess as sm_lowess
plt.figure(figsize=(14,14))
plt.title('predictor age')
cols = 2
rows = np.ceil(len(df.Segment.unique()) / cols)
for i in range(0,len(df.Segment.unique())):
ax = plt.subplot(rows, cols, i + 1)
sm_y, sm_x = sm_lowess(yy, logodds.iloc[:,i], frac=3./5., it=5, return_sorted = True).T
plt.plot(sm_x, sm_y, color='tomato', linewidth=3)
plt.plot(yy,logodds.iloc[:,i], 'k.')
ax.set_title(class_list[i])
ax.set_xlabel("age")
ax.set_ylabel("log odds")
plt.show()
By plotting the independent variable ‘age’ (standing in here for the other variables as well) against the model’s log odds, we can clearly see non-linear relationships for all classes. To document this further we proceed with a Box-Tidwell test and try transformed versions of this predictor to see whether we can achieve acceptable linearity.
# all continuous variables to transform
df_boxtidwell = df_check.copy()
df_boxtidwell = df_boxtidwell.drop(df_boxtidwell[df_boxtidwell.annual_branch_visits == 0].index)
df_boxtidwell = df_boxtidwell.drop(df_boxtidwell[df_boxtidwell.deposit_revenue == 0].index)
df_boxtidwell = df_boxtidwell.drop(df_boxtidwell[df_boxtidwell.annual_income == 0].index)
df_boxtidwell = df_boxtidwell.drop(df_boxtidwell[df_boxtidwell.service_satisfaction_level == 0].index)
# Define continuous variables
z=5
continuous_var = df_boxtidwell.iloc[:, 0:z].columns
# Add logit transform interaction terms (natural log) for continuous variables e.g. Age * Log(Age)
for var in continuous_var:
df_boxtidwell[f'Transf_{var}'] = df_boxtidwell[var].apply(lambda x: x * np.log(x) ) #np.log = natural log
df_boxtidwell.head()
#continuous_var
cols = continuous_var.append(df_boxtidwell.iloc[:, -z:].columns)
cols
Index(['age', 'annual_income', 'annual_branch_visits',
'service_satisfaction_level', 'deposit_revenue', 'Transf_age',
'Transf_annual_income', 'Transf_annual_branch_visits',
'Transf_service_satisfaction_level',
'Transf_deposit_revenue'],
dtype='object')
import statsmodels.api as sm
# Redefine independent variables to include interaction terms
#X_lt = df_boxtidwell[cols[-5:]]
X_lt = df_boxtidwell[cols]
#_lt = df_boxtidwell['deposit_revenue']
y_lt = df_boxtidwell['Segment']
# Add constant
X_lt = sm.add_constant(X_lt, prepend=False)
# Build model and fit the data (using statsmodel's Logit)
logit_result = sm.GLM(y_lt, X_lt, family=sm.families.Binomial()).fit()
# Display summary results
print(logit_result.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Segment No. Observations: 488
Model: GLM Df Residuals: 477
Model Family: Binomial Df Model: 10
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: nan
Date: Fri, 10 Dec 2021 Deviance: 34679.
Time: 15:42:50 Pearson chi2: 3.50e+18
No. Iterations: 7
Covariance Type: nonrobust
=====================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
age -8.491e+15 6.72e+06 -1.26e+09 0.000 -8.49e+15 -8.49e+15
annual_income -2.671e+13 1.19e+04 -2.24e+09 0.000 -2.67e+13 -2.67e+13
annual_branch_visits -2.924e+14 3.76e+06 -7.78e+07 0.000 -2.92e+14 -2.92e+14
service_satisfaction_level 3.532e+15 2.36e+07 1.5e+08 0.000 3.53e+15 3.53e+15
deposit_revenue -1.562e+14 8.41e+04 -1.86e+09 0.000 -1.56e+14 -1.56e+14
Transf_age 1.735e+15 1.37e+06 1.27e+09 0.000 1.74e+15 1.74e+15
Transf_annual_income 2.248e+12 1000.974 2.25e+09 0.000 2.25e+12 2.25e+12
Transf_annual_branch_visits 2.267e+14 9.75e+05 2.33e+08 0.000 2.27e+14 2.27e+14
Transf_service_satisfaction_level -1.415e+15 1.13e+07 -1.25e+08 0.000 -1.41e+15 -1.41e+15
Transf_deposit_revenue 2.056e+13 1.01e+04 2.05e+09 0.000 2.06e+13 2.06e+13
const 2.078e+17 7.07e+07 2.94e+09 0.000 2.08e+17 2.08e+17
=====================================================================================================
The GLM logit_result shows that basically all variables are flagged for non-linearity: both the regular and the transformed feature variables have significant p-values. This also applies to the other standard transformations. We therefore have to accept that this data is ill-suited for a logistic regression model and move on, instead of sinking more time into this.
Overview Logistic Regression assumptions check
| # | Assumption | Check |
|---|------------|:-----:|
| 1 | Appropriate outcome type - our case will be multinomial regression, where the dependent variable y has more than two outcome classes. | ☑ |
| 2 | Logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data. | ☑ |
| 3 | Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other. | ☑ |
| 4 | Logistic regression assumes linearity of independent variables and log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds. | ☒ |
| 5 | Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10). | |
Alternative multi-classification with Support Vector Machines extended with cost sensitivity
As an alternative to logistic regression we could pick the popular random forest algorithm, but for a change we go for a classic support vector machine (SVM) model. SVM does not come with a comparable list of assumptions to check, which is very soothing at this point ;)
Rather than focusing on algorithm selection and model tuning, we want to shift attention to how to integrate the costs of misclassification into the model building process, especially with asymmetric multi-class misclassification costs, where some misclassifications are more costly than others. This is where we actually have an optimization problem and a lever to achieve cost savings, which contributes to the overall segmentation goal of increasing efficiency and effectiveness at the same time.
Different costs for different types of errors
The overall objective in classification is to minimize classification errors. With cost-sensitive classification, however, this objective shifts from minimizing the number of errors to minimizing the cost of type I and type II misclassification errors, where each of the multiple classes is associated with its own cost.
We define an error-cost matrix, in which the entry in cell $(\text{row}_a, \text{column}_b)$ specifies how costly it is to predict class $b$ when the true class is $a$.
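To make the later cost evaluation concrete, here is a minimal sketch with a hypothetical 3-class cost matrix and made-up labels: for each observation we look up the row of its true class and the column of its predicted class, then sum. The notebook later does the same thing via the row-expanded array C.
import numpy as np
# hypothetical cost matrix: rows = true class, columns = predicted class
demo_costs = np.array([[0, 1, 5],
[1, 0, 1],
[9, 1, 0]])
y_true_demo = np.array([0, 2, 1, 2])
y_pred_demo = np.array([0, 0, 1, 1])
# fancy indexing picks demo_costs[true, predicted] for every observation
print(demo_costs[y_true_demo, y_pred_demo].sum())  # 0 + 9 + 0 + 1 = 10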
import pandas as pd, numpy as np
# asymmetric, non-reciprocal costs of classification errors
# true class vs. wrong predicted class costs
true_vs_predicted_class_cost = pd.DataFrame(index=segment_names, columns=segment_names)
true_vs_predicted_class_cost.marginalised_middles = [0,10,10,10,10]
true_vs_predicted_class_cost.disengaged_skeptics = [10,0, 10, 10, 10]
true_vs_predicted_class_cost.satisfied_traditionalists = [10,10, 0, 10, 10]
true_vs_predicted_class_cost.struggling_techies = [10,10, 10, 0, 10]
true_vs_predicted_class_cost.sophisticated_opportunists = [200,34, 40, 25, 0]
true_vs_predicted_class_cost = true_vs_predicted_class_cost.T # transpose the column-wise-built DataFrames
true_vs_predicted_class_cost
marginalised_middles | disengaged_skeptics | satisfied_traditionalists | struggling_techies | sophisticated_opportunists | |
---|---|---|---|---|---|
marginalised_middles | 0 | 10 | 10 | 10 | 10 |
disengaged_skeptics | 10 | 0 | 10 | 10 | 10 |
satisfied_traditionalists | 10 | 10 | 0 | 10 | 10 |
struggling_techies | 10 | 10 | 10 | 0 | 10 |
sophisticated_opportunists | 200 | 34 | 40 | 25 | 0 |
# for computation we map the class labels to numeric classes
# map class labels numerically
d = dict([(y,x) for x,y in enumerate((segment_names))])
# convert to numeric indices
true_vs_predicted_class_cost.columns = d.values()
true_vs_predicted_class_cost.index = d.values()
true_vs_predicted_class_cost
import random
true_vs_predicted_class_cost
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 0 | 10 | 10 | 10 | 10 |
1 | 10 | 0 | 10 | 10 | 10 |
2 | 10 | 10 | 0 | 10 | 10 |
3 | 10 | 10 | 10 | 0 | 10 |
4 | 200 | 34 | 40 | 25 | 0 |
# map observations class string labels to number
# convert to np array
cost_matrix = true_vs_predicted_class_cost.values
try:
y = [d[s] for s in y]
except:
print('already done')
# new cost array for package 'costsensitive'
C = np.array([cost_matrix[i] for i in y])
np.random.seed(1)
#C = (np.random.random_sample((500, 5))/100 + 1) * C
#C = np.linalg.inv(C)
# save cost matrix to csv file
# np.savetxt('costmatrix.csv', C, delimiter=',')
C
already done
array([[ 0, 10, 10, 10, 10],
[ 0, 10, 10, 10, 10],
[ 0, 10, 10, 10, 10],
...,
[200, 34, 40, 25, 0],
[200, 34, 40, 25, 0],
[200, 34, 40, 25, 0]], dtype=int64)
We use a machine learning model based on Support Vector Machines to classify the five customer segments from the clients’ predictors, with the goal of generalizing well to new, real clients.
In a first step we fit a conventional SVM on this imbalanced dataset, for which we use an appropriate evaluation metric: the weighted F1 score.
After that we model the different costs of misclassifying clients into incorrect segments. This is important because not every error is equally costly. The metric to be evaluated then will be the cost savings relative to the baseline cost of the conventional SVM model of step 1.
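As a quick reminder of the metric (toy labels below, purely illustrative): the weighted F1 score is the support-weighted average of the per-class F1 scores, which makes it a sensible single number for an imbalanced target.
from sklearn.metrics import f1_score
y_true_toy = [0, 0, 0, 0, 1, 2, 2]
y_pred_toy = [0, 0, 0, 1, 1, 2, 0]
print(f1_score(y_true_toy, y_pred_toy, average=None))        # one F1 per class
print(f1_score(y_true_toy, y_pred_toy, average='weighted'))  # support-weighted mean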
Baseline SVM model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
#segm = segm.values
X = segm.iloc[:, :segm.shape[1] - 1].astype('float64')
y = segm.iloc[:, segm.shape[1] - 1]
"""# numerical labels
try:
y = [d[s] for s in y]
except:
print('already done') """
# numerical labels
nummap = dict(zip(segment_names,range(0,5)))
y = y.map(nummap)
X_train, X_test, C_train, C_test, y_train, y_test = train_test_split(X, C, y, test_size=.5, random_state=0,
stratify=y
)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
[[-0.06506306 -1.03252879 -0.535742 -0.52489066 -0.89210993 0.67795498
0.51868635 0.88640526 -0.0872679 ]
[-0.73825472 0.968496 -0.75466817 -0.52489066 -0.72250728 -1.89005632
0.51868635 0.88640526 -0.22493628]]
[[ 0 10 10 10 10]
[ 0 10 10 10 10]]
71 0
26 0
Name: Segment, dtype: int64
[[ 2.27122869 -1.03252879 0.27819165 -0.52489066 1.56712848 0.67795498
0.51868635 -1.12815215 0.68012859]
[-0.94041437 -1.03252879 1.54138938 1.90515869 -0.46810331 0.67795498
0.51868635 -1.12815215 2.54650874]]
<class 'pandas.core.series.Series'>
(250, 9)
(250, 5)
(250,)
Fitting the classifiers and tracking test set results:
from sklearn import svm
from sklearn.metrics import confusion_matrix
### Keeping track of the results for later
name_algorithm = list()
f1_scores = list()
costs = list()
collect_conf_mat = list()
#### Baseline : Classic SVM with no weights
svm_model = svm.SVC(random_state=1)
svm_model.fit(X_train, y_train)
preds_svm = svm_model.predict(X_test)
from sklearn.metrics import precision_recall_fscore_support, f1_score
conf_mat = confusion_matrix(y_test, preds_svm)
f1 = f1_score(y_test, preds_svm, zero_division=1, average='weighted')
f1
0.9668887073655656
# logging results
name_algorithm.append("Classic Support Vector Machine SVC Method")
f1_scores.append(f1)
# ex post calculation of costs with the baseline model
costs.append( C_test[np.arange(C_test.shape[0]), preds_svm].sum() )
collect_conf_mat.append(confusion_matrix(y_test, preds_svm))
from sklearn.metrics import plot_confusion_matrix
disp = plot_confusion_matrix(svm_model, X_test, y_test,
display_labels=segment_names,
cmap=plt.cm.Blues)
disp.ax_.set_xticklabels(segment_names, size=12, rotation=30, ha='right')
plt.grid(False)
plt.show()
pd.DataFrame(precision_recall_fscore_support(y_test, preds_svm), columns=segment_names, index=['precision', 'recall', 'fscore', 'support'])
marginalised_middles | disengaged_skeptics | satisfied_traditionalists | struggling_techies | sophisticated_opportunists | |
---|---|---|---|---|---|
precision | 0.945652 | 0.937500 | 0.987654 | 1.0 | 1.000000 |
recall | 0.966667 | 1.000000 | 1.000000 | 1.0 | 0.750000 |
fscore | 0.956044 | 0.967742 | 0.993789 | 1.0 | 0.857143 |
support | 90.000000 | 30.000000 | 80.000000 | 30.0 | 20.000000 |
Fitting cost-sensitive classifiers
The algorithms applied here are reduction methods that turn our cost-sensitive classification problem into a series of importance-weighted binary classification sub-problems (multiple oracle calls); for the details, see the documentation: https://costsensitive.readthedocs.io/en/latest/
Github: https://github.com/david-cortes/costsensitive/
One theoretical algorithm implemented in that package originates here: Beygelzimer, A., Dani, V., Hayes, T., Langford, J., & Zadrozny, B. (2005, August). Error limiting reductions between classification tasks. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.2161&rep=rep1&type=pdf
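To build some intuition for these reductions before using the package, here is a rough, hedged sketch of a weighted one-vs-rest scheme. It is not the package’s exact algorithm (see the documentation above for that); the helper names are made up and the weighting rule shown is just one plausible choice: for every class we fit a binary “should we predict this class?” model, weighting each observation by the cost gap between that class and the cheapest alternative.
import numpy as np
from sklearn import svm
def weighted_one_vs_rest_sketch(X, C):
    """X: (n, p) features; C: (n, K) cost of predicting each class per row."""
    classifiers = []
    for k in range(C.shape[1]):
        # positive label: class k is the cheapest (i.e. correct) choice for this row
        y_k = (C.argmin(axis=1) == k).astype(int)
        # weight: how much cost is at stake between class k and the best alternative
        others = np.delete(C, k, axis=1)
        w_k = np.abs(C[:, k] - others.min(axis=1))
        clf = svm.SVC(random_state=1)
        clf.fit(X, y_k, sample_weight=w_k)
        classifiers.append(clf)
    return classifiers
def predict_one_vs_rest_sketch(classifiers, X):
    # predict the class whose binary classifier is most confident
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return scores.argmax(axis=1)
With our data this could be called as weighted_one_vs_rest_sketch(X_train, C_train); the package classes used below implement the weighting schemes more carefully and are what we actually rely on.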
from costsensitive import WeightedAllPairs, WeightedOneVsRest, RegressionOneVsRest, FilterTree, CostProportionateClassifier
from sklearn import svm
### 1: WAP = Weighted All-Pairs as described in "Error limiting reductions between classification tasks"
costsensitive_WAP = WeightedAllPairs(svm.SVC(random_state=1), weigh_by_cost_diff=False)
costsensitive_WAP.fit(X_train, C_train)
preds_WAP = costsensitive_WAP.predict(X_test, method = 'most-wins')
f1_WAP = f1_score(y_test, preds_WAP, zero_division=1, average='weighted')
# log
collect_conf_mat.append(confusion_matrix(y_test, preds_WAP))
name_algorithm.append("Weighted All-Pairs SVM 'Error limiting reductions between classification tasks'")
f1_scores.append(f1_WAP)
costs.append(C_test[np.arange(C_test.shape[0]), preds_WAP].sum())
#### 2: WAP2 = Weighted All-Pairs - cost difference weights enabled
costsensitive_WAP2 = WeightedAllPairs(svm.SVC(random_state=1), weigh_by_cost_diff=True)
costsensitive_WAP2.fit(X_train, C_train)
preds_WAP2 = costsensitive_WAP2.predict(X_test)
f1_WAP2 = f1_score(y_test, preds_WAP2, zero_division=1, average='weighted')
# log
collect_conf_mat.append(confusion_matrix(y_test, preds_WAP2))
name_algorithm.append("Weighted All-Pairs SVM - cost difference weights enabled")
f1_scores.append(f1_WAP2)
costs.append( C_test[np.arange(C_test.shape[0]), preds_WAP2].sum() )
#### 3: WOR = Weighted One-Vs-Rest heuristic with simple cost-weighting schema
costsensitive_WOR = WeightedOneVsRest(svm.SVC(random_state=1), weight_simple_diff=True)
costsensitive_WOR.fit(X_train, C_train)
preds_WOR = costsensitive_WOR.predict(X_test)
f1_WOR = f1_score(y_test, preds_WOR, zero_division=1, average='weighted')
# log
collect_conf_mat.append(confusion_matrix(y_test, preds_WOR))
name_algorithm.append("Weighted One-Vs-Rest SVM, simple cost-weighting schema")
f1_scores.append(f1_WOR)
costs.append( C_test[np.arange(C_test.shape[0]), preds_WOR].sum() )
Optimization Results Comparison
numplots = len(collect_conf_mat)
#print(numplots)
fig, ax = plt.subplots(numplots,1, sharey=True, figsize=(4,17))
#ax[0].get_shared_x_axes().join(ax[1])
for c in range(0,numplots):
#ax[0].get_shared_x_axes().join(ax[c])
sns.heatmap(collect_conf_mat[c], yticklabels=segment_names, annot=True, ax=ax[c], cmap=plt.cm.Blues, cbar=False)
ax[c].set_title(name_algorithm[c],loc='center', fontsize=13)
#plt.title(drop_combos[c], loc='center', fontsize=5)
ax[numplots-1].set_xlabel("Predicted")
ax[c].set_ylabel("True")
ax[numplots-1].set_xticklabels(segment_names, fontsize=12, rotation=35, ha='right')
#fig.tight_layout(h_pad=2)
plt.subplots_adjust(hspace=0.3)
fig.suptitle('Confusion Matrix Comparison', fontsize = 22)
plt.subplots_adjust(top=0.9)
plt.show()
import pandas as pd
results = pd.DataFrame({
'Method' : name_algorithm,
'F1' : f1_scores,
'Total_Cost' : costs
})
results=results[['Method', 'Total_Cost', 'F1']]
results.set_index('Method')
Total_Cost | F1 | |
---|---|---|
Method | ||
Classic Support Vector Machine SVC Method | 1030 | 0.966889 |
Weighted All-Pairs SVM 'Error limiting reductions between classification tasks' | 470 | 0.963998 |
Weighted All-Pairs SVM - cost difference weights enabled | 510 | 0.948454 |
Weighted One-Vs-Rest SVM, simple cost-weighting schema | 640 | 0.971733 |
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
rank = results.F1.sort_values().index
palette = sns.color_palette('Blues', len(rank))
fig = plt.gcf()
fig.set_size_inches(11, 8)
sns.set(font_scale=1.3)
ax = sns.barplot(x = "F1", y = "Method", data = results, palette=np.array(palette)[rank])
ax.set(xlabel='log F1 (higher is better)')
# data bar annotations
for p in ax.patches:
ax.annotate("%.4f" % p.get_width(), xy=(p.get_width(), p.get_y()+p.get_height()/2),
xytext=(5, 0), textcoords='offset points', ha="left", va="center")
ax.set_xscale("log")
plt.title('Cost-Sensitive Classification - F1 Score metric', fontsize=20)
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')
#%matplotlib inline
rank = results['Total_Cost'].sort_values(ascending=False).index
palette = sns.color_palette('Blues', len(rank))
fig = plt.gcf()
fig.set_size_inches(11, 8)
sns.set(font_scale=1.3)
ax = sns.barplot(x = "Total_Cost", y = "Method", data = results, palette=np.array(palette)[rank])
ax.set(xlabel='Total Cost (lower is better)')
# data bar annotations
for p in ax.patches:
ax.annotate("%.0f" % p.get_width(), xy=(p.get_width(), p.get_y()+p.get_height()/2),
xytext=(5, 0), textcoords='offset points', ha="left", va="center")
plt.title('Cost-Sensitive Classification - Total Cost metric', fontsize=20)
plt.show()
results
Method | Total_Cost | F1 | |
---|---|---|---|
0 | Classic Support Vector Machine SVC Method | 1030 | 0.966889 |
1 | Weighted All-Pairs SVM 'Error limiting reducti... | 470 | 0.963998 |
2 | Weighted All-Pairs SVM - cost difference weigh... | 510 | 0.948454 |
3 | Weighted One-Vs-Rest SVM, simple cost-weightin... | 640 | 0.971733 |
results = results.assign(CostSavings = lambda x: (x['Total_Cost'] - results.loc[0,'Total_Cost']) * -1)
results = results.assign(F1_Deviation = lambda x: (x['F1'] - results.loc[0,'F1']))
results
Method | Total_Cost | F1 | CostSavings | F1_Deviation | |
---|---|---|---|---|---|
0 | Classic Support Vector Machine SVC Method | 1030 | 0.966889 | 0 | 0.000000 |
1 | Weighted All-Pairs SVM 'Error limiting reducti... | 470 | 0.963998 | 560 | -0.002891 |
2 | Weighted All-Pairs SVM - cost difference weigh... | 510 | 0.948454 | 520 | -0.018435 |
3 | Weighted One-Vs-Rest SVM, simple cost-weightin... | 640 | 0.971733 | 390 | 0.004845 |
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')
#%matplotlib inline
rank = results.iloc[1:,:]['CostSavings'].sort_values(ascending=True).index
rank2 = results.iloc[1:,:]['F1_Deviation'].sort_values(ascending=True).index
palette = sns.color_palette('Blues', len(rank))
palette2 = sns.color_palette('Reds', len(rank))
fig = plt.gcf()
fig.set_size_inches(11, 8)
sns.set(font_scale=1.3)
ax = sns.barplot(x = "CostSavings", y = "Method", data = results.iloc[1:,:], palette=np.array(palette)[rank-1])
ax.set(xlabel='Cost Savings (higher is better)')
ax2 = ax.twiny()
def change_width(ax, new_value) :
locs = ax.get_yticks()
for i,patch in enumerate(ax.patches):
current_width = patch.get_width()
diff = current_width - new_value
# we change the bar width
patch.set_height(new_value)
# we recenter the bar
patch.set_y(locs[i] - (new_value * .5))
ax2 = sns.barplot(x = "F1_Deviation", y = "Method", data = results.iloc[1:,:], palette=np.array(palette2)[rank2-1], ax=ax2)
ax2.set(xlabel='F1_Deviation (higher is better)')
change_width(ax2, .25)
plt.title('Comparison of Cost Sensitive Methods to the Baseline Model\n', fontsize=20)
plt.legend(['F1 Score Deviation', 'Cost Saving'], labelcolor=['crimson','royalblue'], loc='lower left', edgecolor='white')
plt.axvline(0, 0, color='crimson')
plt.show()
To summarize: cost savings can be significant out of the box and may even be achievable together with a higher F1 score. With further tuning effort, however, this is likely to become much more of a trade-off decision, and a question of how far you want to favor marginal cost savings over a lower F1 score or accuracy.