Non-contractual Customer Lifetime Value estimated probabilistically with the Beta Geometric/Negative Binomial Distribution (BG/NBD) Model

31 March 2022 - 24 mins read time
Tags: Customer Lifetime Value Python Beta Geometric/Negative Binomial Distribution (BG/NBD) Customer Penetration Purchase Frequency

We assume an online service business whose customers continuously purchase our services.

For this business we generate the customer transactions ourselves instead of once again using one of the few publicly available datasets.

We pick a Gamma distribution with shape, scale = 4, 11 for the tenure of our 10000 customers and double the draws:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
sns.set(rc={"figure.figsize":(10, 8)})

rng = np.random.default_rng(2022)

shape, scale = 4, 11
tenure = rng.gamma(shape, scale, 10000) *2
plt.hist(tenure, density=True, edgecolor='black', bins=50)
plt.title('Histogram - Customer Tenure')
plt.xlabel('Tenure')
plt.ylabel('Density')  # density=True normalises the histogram, so the y-axis is not a count
plt.show()

[Figure: histogram of customer tenure]
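As a quick sanity check on the histogram, the moments of the tenure distribution can be worked out by hand. The sketch below uses only the textbook formulas for a Gamma distribution and the factor 2 from the code; the variable names are illustrative.

```python
import math

# Parameters taken from the text; the factor 2 mirrors the `* 2` in the tenure code.
shape, scale, factor = 4, 11, 2

# For X ~ Gamma(shape, scale): E[X] = shape * scale and SD[X] = scale * sqrt(shape).
# Multiplying X by a constant scales both the mean and the standard deviation by it.
mean_tenure = factor * shape * scale
std_tenure = factor * scale * math.sqrt(shape)

print(f"expected tenure: {mean_tenure}, standard deviation: {std_tenure}")
```

The bulk of the histogram should therefore sit around a tenure of 88 time units, with a long right tail.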

Using the Beta Geometric/Negative Binomial Distribution (BG/NBD) Model

The BG/NBD model is based on the Pareto/NBD model. The two differ in how they treat dropout: the beta-geometric (BG) model assumes that a customer can only "die" (drop out) immediately after a transaction, whereas the Pareto/NBD model allows dropout to occur at any point in time.

What else is assumed?

While active, the number of transactions made by a customer follows a Poisson process with transaction rate $\lambda$, which means the time between transactions is exponentially distributed.
Heterogeneity in $\lambda$ follows a gamma distribution.
After every transaction, a customer becomes inactive with probability $p$. Therefore the point at which the customer "dies" follows a (shifted) geometric distribution across transactions.
Heterogeneity in $p$ follows a beta distribution.
The transaction rate $\lambda$ and the dropout probability $p$ vary independently across customers.

Note: $ \lambda $ and $ p $ are both unobserved
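The assumptions above can be turned into a small simulation of a single customer. This is a minimal sketch using only the standard library, not the generator that lifetimes uses; the function name `simulate_customer`, the observation length `T=90` and the seed are hypothetical choices for illustration. It also assumes that dropout is only checked after repeat purchases.

```python
import random

def simulate_customer(r, alpha, a, b, T, rng):
    """Simulate the repeat-purchase times of one customer observed for T time units."""
    lam = rng.gammavariate(r, 1.0 / alpha)  # transaction rate: Gamma(shape=r, scale=1/alpha)
    p = rng.betavariate(a, b)               # dropout probability: Beta(a, b)
    times, now = [], 0.0
    while True:
        now += rng.expovariate(lam)         # exponential waiting time until the next purchase
        if now > T:
            break                           # the observation window is over
        times.append(now)
        if rng.random() < p:
            break                           # customer drops out right after this purchase
    return lam, p, times

rng = random.Random(2022)
lam, p, times = simulate_customer(r=0.25, alpha=4.5, a=0.8, b=2.4, T=90, rng=rng)
print(f"lambda={lam:.4f}, p={p:.3f}, repeat purchases={len(times)}")
```

Only `times` would be visible in real data; `lam` and `p` are returned here purely to make the latent quantities explicit.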

The model assumes that all customers are active at the start of the observation period, so it makes sense to apply it to a cohort of customers who have just made their first purchase.

Let's generate some daily transactional data for a cohort of fresh customers.

%%time
from faker import Faker
import pandas as pd
fake = Faker(['it_IT', 'en_UK', 'fr_FR', 'de_DE', 'uk_UA'])
newcols = fake.simple_profile().keys()

profiles = pd.DataFrame(columns=list(newcols))
for i in range(10000):
    profiles.loc[i] = fake.simple_profile()

profiles['customer_id'] = profiles.index   
first_column = profiles.pop('customer_id')
profiles.insert(0, 'customer_id', first_column)
profiles.head()

CPU times: total: 17.2 s
Wall time: 17.3 s

	customer_id	username	name	sex	address	mail	birthdate
0	0	abramstey	Nuran Trupp-Lachmann	F	Ritterallee 5/3\n23598 Staffelstein	babett14@hotmail.de	1952-02-10
1	1	pichonalex	Marcel Leroy	M	69, rue de Ollivier\n25378 GoncalvesBourg	francoismartin@club-internet.fr	1966-12-13
2	2	le-gallemmanuelle	Monique Roche	F	85, avenue de Thierry\n67649 Maillet	madeleineleroy@dbmail.com	1945-04-22
3	3	jakob73	Nuray Martin B.Sc.	F	Holtallee 7\n19875 Ahaus	hgeisler@yahoo.de	1909-09-26
4	4	shvachkanazar	Тетяна Дергач	F	набережна Лемківська, 2, селище Марʼяна, 101472	leontii90@email.ua	1936-08-12

# model params rounded from CDNOW sample in the paper p. 281
params=dict()

params['r'] = 0.25  # no trailing comma, otherwise the value becomes a one-element tuple
params['alpha'] = 4.5

params['a'] = 0.8
params['b'] = 2.4

params

{'r': 0.25, 'alpha': 4.5, 'a': 0.8, 'b': 2.4}
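To get a feeling for what these four parameters imply, the population means of the two latent quantities can be computed directly. The sketch below relies on the standard means of the gamma distribution (shape r, rate alpha) and the beta distribution; the variable names are illustrative.

```python
r, alpha = 0.25, 4.5   # gamma parameters of the transaction rate lambda
a, b = 0.8, 2.4        # beta parameters of the dropout probability p

mean_rate = r / alpha        # E[lambda]: average purchases per time unit while active
mean_gap = 1 / mean_rate     # waiting time between purchases at that average rate
mean_dropout = a / (a + b)   # E[p]: average dropout probability after a purchase

print(f"E[lambda]={mean_rate:.4f}, gap at E[lambda]={mean_gap:.1f}, E[p]={mean_dropout:.2f}")
```

With these values an average active customer buys roughly once every 18 time units and has a one-in-four chance of dropping out after each purchase. Note that `mean_gap` is the gap at the average rate, not the average gap across customers.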

observation_period_end='2021-12-31'

%%time
from lifetimes.generate_data import beta_geometric_nbd_model_transactional_data

transactions = beta_geometric_nbd_model_transactional_data(tenure, params['r'], params['alpha'], params['a'], params['b'], 
                                                  observation_period_end=observation_period_end, freq='D', size=10000)
transactions.shape

CPU times: total: 1min 52s
Wall time: 1min 52s

(29664, 2)

trans_df = transactions.merge(profiles, on='customer_id')

trans_df.head()

	customer_id	date	username	name	sex	address	mail	birthdate
0	0	2021-04-20 21:30:50.384044800	abramstey	Nuran Trupp-Lachmann	F	Ritterallee 5/3\n23598 Staffelstein	babett14@hotmail.de	1952-02-10
1	0	2021-08-26 06:44:27.043411199	abramstey	Nuran Trupp-Lachmann	F	Ritterallee 5/3\n23598 Staffelstein	babett14@hotmail.de	1952-02-10
2	0	2021-10-21 23:25:51.052166400	abramstey	Nuran Trupp-Lachmann	F	Ritterallee 5/3\n23598 Staffelstein	babett14@hotmail.de	1952-02-10
3	1	2021-06-10 07:05:42.827625600	pichonalex	Marcel Leroy	M	69, rue de Ollivier\n25378 GoncalvesBourg	francoismartin@club-internet.fr	1966-12-13
4	1	2021-12-11 21:52:01.636838400	pichonalex	Marcel Leroy	M	69, rue de Ollivier\n25378 GoncalvesBourg	francoismartin@club-internet.fr	1966-12-13

Aggregate the transactional data just generated into summary data, analogous to RFM segmentation (Recency, Frequency, Monetary)

from lifetimes.utils import summary_data_from_transaction_data

summary = summary_data_from_transaction_data(trans_df, 'customer_id', 'date', observation_period_end=observation_period_end)

summary = pd.concat([profiles, summary], axis=1)
summary

	customer_id	username	name	sex	address	mail	birthdate	frequency	recency	T
0	0	abramstey	Nuran Trupp-Lachmann	F	Ritterallee 5/3\n23598 Staffelstein	babett14@hotmail.de	1952-02-10	2.0	184.0	255.0
1	1	pichonalex	Marcel Leroy	M	69, rue de Ollivier\n25378 GoncalvesBourg	francoismartin@club-internet.fr	1966-12-13	1.0	184.0	204.0
2	2	le-gallemmanuelle	Monique Roche	F	85, avenue de Thierry\n67649 Maillet	madeleineleroy@dbmail.com	1945-04-22	3.0	98.0	99.0
3	3	jakob73	Nuray Martin B.Sc.	F	Holtallee 7\n19875 Ahaus	hgeisler@yahoo.de	1909-09-26	0.0	0.0	15.0
4	4	shvachkanazar	Тетяна Дергач	F	набережна Лемківська, 2, селище Марʼяна, 101472	leontii90@email.ua	1936-08-12	1.0	45.0	109.0
...	...	...	...	...	...	...	...	...	...	...
9995	9995	iarynahavrylyshyn	Олег Рудько	M	набережна Маркіяна Шашкевича, 517, хутір Данил...	khavrylenko@ukr.net	1988-07-07	0.0	0.0	40.0
9996	9996	gisbertriehl	Claire Koch-Anders	F	Zorbachring 7\n70870 Burgdorf	tomas93@hotmail.de	2014-08-13	0.0	0.0	89.0
9997	9997	marta63	Остап Ейбоженко	M	вулиця Бруно Шульца, 915, хутір Ганна, 76375	vdovychenkobohuslav@meta.ua	1976-07-24	0.0	0.0	100.0
9998	9998	opowell	Chelsea Poole	F	Flat 82\nBryan passage\nNorth Luke\nWF1N 0AL	rsmart@hotmail.co.uk	1916-10-24	1.0	47.0	101.0
9999	9999	havrylotsymbaliuk	пан Устим Цибуленко	M	вулиця Василя Симоненка, 628, село Леон, 85627	venedykt92@ukr.net	1975-02-06	2.0	44.0	75.0

10000 rows × 10 columns
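How the three summary columns arise from raw transactions can be reproduced by hand for one customer. The sketch below is a simplified, standard-library re-implementation for daily data, not the lifetimes code itself; the helper name `rfm_summary` is hypothetical. It uses the three purchase dates of customer 0 from the transaction table.

```python
from datetime import date

def rfm_summary(purchase_dates, observation_end):
    """Return (frequency, recency, T) in days for one customer's purchase dates."""
    days = sorted(set(purchase_dates))    # several purchases on one day count once
    first, last = days[0], days[-1]
    frequency = len(days) - 1             # repeat purchases only, the first one is excluded
    recency = (last - first).days         # customer age at the time of the last purchase
    T = (observation_end - first).days    # customer age at the end of the observation period
    return frequency, recency, T

# Purchase dates of customer 0 (time of day dropped).
dates = [date(2021, 4, 20), date(2021, 8, 26), date(2021, 10, 21)]
result = rfm_summary(dates, date(2021, 12, 31))
print(result)  # (2, 184, 255)
```

This matches the first row of the summary table: frequency 2.0, recency 184.0 and T 255.0.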

"""from lifetimes.generate_data import beta_geometric_nbd_model
#lifetimes.generate_data.beta_geometric_nbd_model(T, r, alpha, a, b, size=1)
#Generate artificial data according to the BG/NBD model.

df = beta_geometric_nbd_model(tenure, params['r'], params['alpha'], params['a'], params['b'], size=10000)
"""
"""Parameters:	
T (array_like) – The length of time observing new customers.
r, alpha, a, b (float) – Parameters in the model. See [1]_
size (int, optional) – The number of customers to generate
Returns:	
DataFrame – With index as customer_ids and the following columns: ‘frequency’, ‘recency’, ‘T’, ‘lambda’, ‘p’, ‘alive’, ‘customer_id’"""


from lifetimes import BetaGeoFitter
bgf = BetaGeoFitter(penalizer_coef=0)
bgf.fit(summary['frequency'], summary['recency'], summary['T'])
bgf.summary

	coef	se(coef)	lower 95% bound	upper 95% bound
r	0.289300	0.005818	0.277897	0.300704
alpha	6.476316	0.213714	6.057437	6.895194
a	0.790777	0.053952	0.685031	0.896524
b	2.530743	0.231390	2.077220	2.984267

from lifetimes.plotting import plot_frequency_recency_matrix
sns.set(rc={"figure.figsize":(10, 8)})
sns.set_style('dark')
plot_frequency_recency_matrix(bgf, cmap='plasma')
plt.show()

[Figure: frequency/recency matrix of expected future purchases]

from lifetimes.plotting import plot_probability_alive_matrix

plot_probability_alive_matrix(bgf, cmap='plasma')
plt.show()

[Figure: probability-alive matrix by frequency and recency]
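The probability-alive matrix is driven by a closed-form expression of the BG/NBD model. The sketch below evaluates it with the coefficients fitted above; the function name `probability_alive` and the two example customers are hypothetical, and the code is a plain-Python illustration rather than the library implementation.

```python
def probability_alive(x, t_x, T, r, alpha, a, b):
    """P(alive | x repeat purchases, last purchase at t_x, customer age T) under BG/NBD."""
    if x == 0:
        return 1.0  # without a repeat purchase the model cannot observe a dropout
    ratio = (alpha + T) / (alpha + t_x)
    return 1.0 / (1.0 + (a / (b + x - 1)) * ratio ** (r + x))

# Coefficients taken from the fitted model summary above.
fitted = dict(r=0.289300, alpha=6.476316, a=0.790777, b=2.530743)

# Two hypothetical customers with 5 repeat purchases and age 100 days.
recent = probability_alive(x=5, t_x=95, T=100, **fitted)  # bought 5 days ago
lapsed = probability_alive(x=5, t_x=20, T=100, **fitted)  # silent for 80 days
print(f"recent buyer: {recent:.3f}, lapsed buyer: {lapsed:.3f}")
```

The longer a previously frequent customer stays silent, the faster the probability of being alive collapses, which is the pattern visible in the lower right corner of the matrix.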

Customer ranking

Let us identify the five customers with the highest expected number of purchases within the next seven days (t=7), based on their transaction history.

t = 7
ppcolname = 'predicted_purchases_' + str(t)
summary[ppcolname] = bgf.conditional_expected_number_of_purchases_up_to_time(t, summary['frequency'],summary['recency'], summary['T'])
summary.sort_values(by=ppcolname, ascending=False).head(5)

	customer_id	username	name	sex	address	mail	birthdate	frequency	recency	T	predicted_purchases_14	predicted_purchases_30	predicted_purchases_7	monetary_value_average	Expected_cond_average_revenue	error_rev
2910	2910	carlypalmer	Mrs. Joan Heath	M	Studio 41t\nAbigail port\nEast Jayne\nPO5X 1DL	sally36@yahoo.co.uk	1908-01-15	33.0	55.0	55.0	6.851758	14.839702	3.557546	38.734677	38.622364	-0.112313
3129	3129	jsontag	Zeynep Steckel	F	Biengasse 720\n46808 Ansbach	peukertmeike@aol.de	1976-05-28	20.0	30.0	31.0	6.324447	13.947335	3.344909	31.671466	31.566594	-0.104872
1677	1677	bohodarokhrimenko	Болеслав Дубас	M	вулиця Шота Руставелі, 248, місто Лариса, 11399	artemtymchuk@gmail.com	1974-01-16	19.0	34.0	34.0	5.749406	12.524436	3.029490	16.370829	16.441952	0.071123
3149	3149	ujones	Kieran Jennings	F	Flat 3\nRoger port\nKieranton\nSR3 0HP	abigailphillips@gmail.com	2012-02-22	34.0	71.0	72.0	5.554099	12.087839	2.862532	7.135654	7.237132	0.101478
7500	7500	elombardi	Dott. Melania Toldo	F	Incrocio Bragaglia 3\nCostanzi sardo, 94137 Tr...	cgagliano@tim.it	2016-02-24	23.0	47.0	47.0	5.412396	11.756576	2.821701	12.182490	12.282481	0.099991

t = 30
ppcolname = 'predicted_purchases_' + str(t)
summary[ppcolname] = bgf.conditional_expected_number_of_purchases_up_to_time(t, summary['frequency'],summary['recency'], summary['T'])
sorted_summary = summary.sort_values(by=ppcolname, ascending=False)
sorted_summary.head()

	customer_id	username	name	sex	address	mail	birthdate	frequency	recency	T	predicted_purchases_14	predicted_purchases_30	predicted_purchases_7	monetary_value_average	Expected_cond_average_revenue	error_rev
2910	2910	carlypalmer	Mrs. Joan Heath	M	Studio 41t\nAbigail port\nEast Jayne\nPO5X 1DL	sally36@yahoo.co.uk	1908-01-15	33.0	55.0	55.0	6.851758	13.594318	3.557546	38.734677	38.622364	-0.112313
3129	3129	jsontag	Zeynep Steckel	F	Biengasse 720\n46808 Ansbach	peukertmeike@aol.de	1976-05-28	20.0	30.0	31.0	6.324447	12.149527	3.344909	31.671466	31.566594	-0.104872
3149	3149	ujones	Kieran Jennings	F	Flat 3\nRoger port\nKieranton\nSR3 0HP	abigailphillips@gmail.com	2012-02-22	34.0	71.0	72.0	5.554099	11.173048	2.862532	7.135654	7.237132	0.101478
1677	1677	bohodarokhrimenko	Болеслав Дубас	M	вулиця Шота Руставелі, 248, місто Лариса, 11399	artemtymchuk@gmail.com	1974-01-16	19.0	34.0	34.0	5.749406	11.114488	3.029490	16.370829	16.441952	0.071123
7500	7500	elombardi	Dott. Melania Toldo	F	Incrocio Bragaglia 3\nCostanzi sardo, 94137 Tr...	cgagliano@tim.it	2016-02-24	23.0	47.0	47.0	5.412396	10.659459	2.821701	12.182490	12.282481	0.099991

Model fit assessment

from lifetimes.plotting import plot_period_transactions
plot_period_transactions(bgf)
plt.show()

[Figure: frequency of repeat transactions, actual vs. model]

calibration_period_end='2021-10-10'

from lifetimes.utils import calibration_and_holdout_data

summary_cal_holdout = calibration_and_holdout_data(trans_df, 'customer_id', 'date',
                                        calibration_period_end=calibration_period_end,
                                        observation_period_end=observation_period_end )
summary_cal_holdout.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4998 entries, 0 to 9998
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   frequency_cal      4998 non-null   float64
 1   recency_cal        4998 non-null   float64
 2   T_cal              4998 non-null   float64
 3   frequency_holdout  4998 non-null   float64
 4   duration_holdout   4998 non-null   float64
dtypes: float64(5)
memory usage: 363.3 KB
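What the calibration/holdout split does per customer can be sketched with plain dates. This is a simplified illustration under the assumption that holdout purchases are those after the calibration end and up to the observation end; the helper name `split_summary` is hypothetical and the code is not the lifetimes implementation.

```python
from datetime import date

def split_summary(purchase_dates, calibration_end, observation_end):
    """Summarise one customer's purchases separately for calibration and holdout."""
    days = sorted(set(purchase_dates))
    cal = [d for d in days if d <= calibration_end]
    holdout = [d for d in days if calibration_end < d <= observation_end]
    if not cal:
        return None  # first purchase after the calibration period: customer is left out
    return {
        "frequency_cal": len(cal) - 1,
        "recency_cal": (cal[-1] - cal[0]).days,
        "T_cal": (calibration_end - cal[0]).days,
        "frequency_holdout": len(holdout),
        "duration_holdout": (observation_end - calibration_end).days,
    }

# Purchase dates of customer 0 (time of day dropped).
dates = [date(2021, 4, 20), date(2021, 8, 26), date(2021, 10, 21)]
row = split_summary(dates, date(2021, 10, 10), date(2021, 12, 31))
print(row)
```

Customers whose first purchase falls after the calibration end have no calibration history, which would explain why the table above has fewer rows than there are customers.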

from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary_cal_holdout['frequency_cal'], summary_cal_holdout['recency_cal'], summary_cal_holdout['T_cal'])
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout)
plt.show()

[Figure: purchases in the calibration period vs. purchases in the holdout period]

from lifetimes.plotting import plot_cumulative_transactions

# bgf is already fitted on the calibration data; t=280 days are plotted, t_cal=140 sets the calibration marker
plot_cumulative_transactions(bgf, trans_df, 'date', 'customer_id', 280, 140);

[Figure: cumulative transactions, actual vs. predicted]

from lifetimes.plotting import plot_incremental_transactions

plot_incremental_transactions(bgf, trans_df, 'date', 'customer_id', 280, 140);

[Figure: incremental transactions, actual vs. predicted]

customer_X = sorted_summary[200:201]
customer_X

	customer_id	username	name	sex	address	mail	birthdate	frequency	recency	T	predicted_purchases_14	predicted_purchases_30	predicted_purchases_7	monetary_value_average	Expected_cond_average_revenue	error_rev
8403	8403	trubinguenther	Ing. Hanno Ullmann	M	Gudegasse 8/4\n87671 Grevenbroich	riehlmarlen@yahoo.de	1909-03-31	7.0	44.0	46.0	1.601542	3.179439	0.831359	18.652597	18.769815	0.117218

t = 30 # predict number of purchases in next t periods

individual = summary.loc[customer_X.index]  # select the row by its index label
bgf.predict(t, individual['frequency'], individual['recency'], individual['T'])

8403    3.374978
dtype: float64

from lifetimes.plotting import plot_history_alive

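days_since_birth = 61  # number of days to plot, counted from customer X's first purchase

customer_id_X = customer_X['customer_id'].iloc[0]  # scalar id instead of a one-element Series
sp_trans = trans_df.loc[trans_df['customer_id'] == customer_id_X]
plot_history_alive(bgf, days_since_birth, sp_trans, 'date', title='History of P_alive of customer_X=' + customer_X['username'].iloc[0])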
plt.show()

[Figure: history of P_alive of customer X]

sp_trans

	customer_id	date	username	name	sex	address	mail	birthdate
25086	8403	2021-11-15 11:51:07.696598400	trubinguenther	Ing. Hanno Ullmann	M	Gudegasse 8/4\n87671 Grevenbroich	riehlmarlen@yahoo.de	1909-03-31
25087	8403	2021-11-16 21:17:22.970515200	trubinguenther	Ing. Hanno Ullmann	M	Gudegasse 8/4\n87671 Grevenbroich	riehlmarlen@yahoo.de	1909-03-31
25088	8403	2021-11-24 04:46:07.064832	trubinguenther	Ing. Hanno Ullmann	M	Gudegasse 8/4\n87671 Grevenbroich	riehlmarlen@yahoo.de	1909-03-31
25089	8403	2021-12-02 19:17:03.481641600	trubinguenther	Ing. Hanno Ullmann	M	Gudegasse 8/4\n87671 Grevenbroich	riehlmarlen@yahoo.de	1909-03-31
25090	8403	2021-12-13 04:24:15.312902400	trubinguenther	Ing. Hanno Ullmann	M	Gudegasse 8/4\n87671 Grevenbroich	riehlmarlen@yahoo.de	1909-03-31
25091	8403	2021-12-15 16:12:24.811142400	trubinguenther	Ing. Hanno Ullmann	M	Gudegasse 8/4\n87671 Grevenbroich	riehlmarlen@yahoo.de	1909-03-31
25092	8403	2021-12-23 23:41:56.826096	trubinguenther	Ing. Hanno Ullmann	M	Gudegasse 8/4\n87671 Grevenbroich	riehlmarlen@yahoo.de	1909-03-31
25093	8403	2021-12-29 12:19:29.010403200	trubinguenther	Ing. Hanno Ullmann	M	Gudegasse 8/4\n87671 Grevenbroich	riehlmarlen@yahoo.de	1909-03-31

no_transactions = trans_df.groupby('customer_id').count().sort_values('date', ascending=False)['date']
no_transactions 

customer_id
  88
  87
   75
  66
  57
        ..
   1
   1
   1
   1
   1
Name: date, Length: 10000, dtype: int64

df3 = summary[summary['frequency']>0].copy()  # copy so that new columns do not trigger SettingWithCopyWarning
df3['prob_alive'] = bgf.conditional_probability_alive(df3['frequency'],df3['recency'],df3['T'])
sns.histplot(df3['prob_alive'], kde=True)  # distplot is deprecated in recent seaborn versions
plt.xlim(0,1)
plt.suptitle('Distribution of Alive Probabilities')
plt.show()

df3['churn'] = ['churned' if p < .1 else 'not churned' for p in df3['prob_alive']]
sns.countplot(x=df3['churn'])
plt.suptitle('Grouping of alive probabilities')
plt.show()

print('Grouping of alive probabilities')
# use .loc with a boolean mask instead of chained assignment
df3.loc[(df3['prob_alive']>=.1) & (df3['prob_alive']<.2), 'churn'] = "high risk"
df3['churn'].value_counts()

[Figures: distribution of alive probabilities; grouping of alive probabilities]

Grouping of alive probabilities

not churned    3841
churned         772
high risk       589
Name: churn, dtype: int64
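The grouping rule used above can also be written as a single function, which makes the two thresholds explicit and easy to change. This is a minimal sketch; the function name `churn_group` and its parameter names are hypothetical, while the thresholds 0.1 and 0.2 are the ones used in the code above.

```python
def churn_group(prob_alive, churned_below=0.1, high_risk_below=0.2):
    """Map a probability of being alive to one of three churn groups."""
    if prob_alive < churned_below:
        return "churned"
    if prob_alive < high_risk_below:
        return "high risk"
    return "not churned"

labels = [churn_group(p) for p in (0.05, 0.1, 0.15, 0.2, 0.9)]
print(labels)
```

Applied to a DataFrame column, the same rule could be used via `df3['prob_alive'].map(churn_group)`.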

Modelling the monetary transaction value with the Gamma-Gamma model

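## monetary value data generation

shape, scale = 3, 3.  # Gamma mean = shape*scale = 9, std = scale*sqrt(shape) = 3*sqrt(3)
monetary_value_average = rng.gamma(shape, scale, 10000) * 2  # doubling gives a mean of 18
plt.hist(monetary_value_average, density=True, edgecolor='black', bins=50)  # plot the values as they are used later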
plt.show()

[Figure: histogram of the generated monetary values]

summary['monetary_value_average'] = monetary_value_average  # the values generated above

returning_customers_summary = summary[summary['frequency']>0]

returning_customers_summary.head()

	customer_id	username	name	sex	address	mail	birthdate	frequency	recency	T	predicted_purchases_14	predicted_purchases_30	predicted_purchases_7	monetary_value_average	Expected_cond_average_revenue	error_rev
0	0	abramstey	Nuran Trupp-Lachmann	F	Ritterallee 5/3\n23598 Staffelstein	babett14@hotmail.de	1952-02-10	2.0	184.0	255.0	0.082734	0.174773	0.041632	32.613560	31.565017	-1.048543
1	1	pichonalex	Marcel Leroy	M	69, rue de Ollivier\n25378 GoncalvesBourg	francoismartin@club-internet.fr	1966-12-13	1.0	184.0	204.0	0.062412	0.131735	0.031418	16.727437	17.774840	1.047403
2	2	le-gallemmanuelle	Monique Roche	F	85, avenue de Thierry\n67649 Maillet	madeleineleroy@dbmail.com	1945-04-22	3.0	98.0	99.0	0.357496	0.738755	0.181780	8.692225	9.658282	0.966057
4	4	shvachkanazar	Тетяна Дергач	F	набережна Лемківська, 2, селище Марʼяна, 101472	leontii90@email.ua	1936-08-12	1.0	45.0	109.0	0.080884	0.168804	0.040935	14.807318	16.211255	1.403937
5	5	petro07	Алла Рябець	F	набережна Дністровська, 843, хутір Ілля, 106205	sviatoslavadurdynets@ukr.net	2002-08-12	1.0	6.0	34.0	0.172044	0.345416	0.088817	25.410872	24.845904	-0.564968

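An important assumption of the Gamma-Gamma model is that the monetary value and the purchase frequency are uncorrelated, i.e. their correlation is close to zero.

Our generated data meets this assumption, as the correlation matrix of the returning customers confirms, so we can go on to train the model and start analysing.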

returning_customers_summary[['monetary_value_average', 'frequency']].corr()

	monetary_value_average	frequency
monetary_value_average	1.00000	-0.00149
frequency	-0.00149	1.00000

from lifetimes import GammaGammaFitter

ggf = GammaGammaFitter(penalizer_coef = 0.002)
ggf.fit(returning_customers_summary['frequency'], returning_customers_summary['monetary_value_average'])

<lifetimes.GammaGammaFitter: fitted with 5202 subjects, p: 5.53, q: 2.26, v: 5.10>

ggf.summary

	coef	se(coef)	lower 95% bound	upper 95% bound
p	5.532698	0.115701	5.305924	5.759472
q	2.261587	0.044056	2.175237	2.347938
v	5.100490	0.125329	4.854845	5.346135

summary['Expected_cond_average_revenue'] = ggf.conditional_expected_average_profit(summary['frequency'], summary['monetary_value_average'])

summary['Expected_cond_average_revenue'].describe()

count    10000.000000
mean        20.413614
std          7.062209
min          1.499528
25%         16.409391
50%         22.368232
75%         22.368232
max         75.235044
Name: Expected_cond_average_revenue, dtype: float64
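The repeated value 22.368232 in these statistics has a simple explanation. In the Gamma-Gamma model the conditional expectation is a weighted average of the population mean and the customer's own observed average, and customers without repeat purchases receive the population mean. The sketch below recomputes this by hand from the fitted coefficients; the function name `expected_average_value` is hypothetical and the code is an illustration, not the library implementation.

```python
def expected_average_value(x, m_x, p, q, v):
    """Conditional expected average transaction value under the Gamma-Gamma model."""
    population_mean = v * p / (q - 1)
    weight_own = p * x / (p * x + q - 1)  # grows with the number of repeat purchases x
    return (1 - weight_own) * population_mean + weight_own * m_x

# Coefficients taken from ggf.summary above.
p, q, v = 5.532698, 2.261587, 5.100490

pop_mean = expected_average_value(x=0, m_x=0.0, p=p, q=q, v=v)
# Customer 0: frequency 2, observed average value 32.613560.
expected_customer_0 = expected_average_value(x=2, m_x=32.613560, p=p, q=q, v=v)
print(f"population mean: {pop_mean:.4f}, customer 0: {expected_customer_0:.4f}")
```

Because nearly half of the customers have a frequency of 0, the median and the 75% quantile both coincide with the population mean.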

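# MAPE
from sklearn.metrics import mean_absolute_percentage_error

summary["error_rev"] = summary['Expected_cond_average_revenue'] - summary['monetary_value_average']

# Note: sklearn's signature is (y_true, y_pred). The prediction is passed first here,
# so each absolute error is divided by the predicted value rather than the observed one.
mape = mean_absolute_percentage_error(summary['Expected_cond_average_revenue'], summary["monetary_value_average"])
print("MAPE of predicted revenues:", f'{mape:.3f}')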

MAPE of predicted revenues: 0.244
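The MAPE is not symmetric in its two arguments, so the order in which observed and predicted values are passed matters. The toy example below uses a hand-written `mape` function and made-up numbers purely for illustration; it is not the scikit-learn implementation.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, relative to the values in y_true."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

observed = [10.0, 20.0]    # made-up observed values
predicted = [12.0, 18.0]   # made-up predictions

error_vs_observed = mape(observed, predicted)   # errors relative to the observed values
error_vs_predicted = mape(predicted, observed)  # errors relative to the predictions
print(f"{error_vs_observed:.4f} vs. {error_vs_predicted:.4f}")
```

The absolute errors are identical in both calls; only the denominators change, and with them the reported percentage.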

summary['Expected_cond_average_revenue'].head(20)

   31.565017
   17.774840
    9.658282
   22.368232
   16.211255
   24.845904
   22.368232
   22.368232
   22.368232
    8.200775
  23.078353
   6.811015
   9.628732
  22.368232
  22.368232
  22.368232
  19.063189
  29.568306
  34.173798
  12.974695
Name: Expected_cond_average_revenue, dtype: float64

print("Expected conditional average profit: %s vs. Average profit: %s" % (
    ggf.conditional_expected_average_profit(
        summary['frequency'],
        summary['monetary_value_average']
    ).mean(),
    summary[summary['frequency']>0]['monetary_value_average'].mean()
))

Expected conditional average profit: 20.41361394230678 vs. Average profit: 18.131239279823276

Calculate the Customer Lifetime Value, discounted (DCF) with a monthly rate derived from an annual interest rate

# refit the BG/NBD model on the full summary data (it was last fitted on the calibration period only)
bgf.fit(summary['frequency'], summary['recency'], summary['T'])

# modelling CLV
summary['clv'] = ggf.customer_lifetime_value(
    bgf,
    summary['frequency'],
    summary['recency'],
    summary['T'],
    summary['monetary_value_average'],
    time=12, # lifetime in months
    discount_rate=0.006  # monthly discount rate, roughly 7.4% per year
)
summary['clv'].head(10)

 52.30111846
 20.98403156
 57.38228431
 47.29754080
 21.46389456
 52.60841130
 14.31632335
 27.28834943
 14.72406602
 15.56113740
Name: clv, dtype: float64
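The discounting step can be illustrated in isolation. The sketch below assumes that `discount_rate` is a monthly rate, as the lifetimes documentation describes it, and discounts a stream of monthly cash flows. The helper name `discounted_value` and the constant cash flow of 10 per month are hypothetical; in the real calculation each month's cash flow is the expected number of purchases times the expected transaction value.

```python
def discounted_value(monthly_cash_flows, monthly_rate):
    """Discount cash flows received at the end of months 1, 2, ... to today's value."""
    return sum(
        cash_flow / (1 + monthly_rate) ** month
        for month, cash_flow in enumerate(monthly_cash_flows, start=1)
    )

monthly_rate = 0.006
annual_rate = (1 + monthly_rate) ** 12 - 1  # equivalent annual interest rate

flows = [10.0] * 12                         # hypothetical: 10 per month for 12 months
clv_sketch = discounted_value(flows, monthly_rate)
print(f"annual rate: {annual_rate:.4f}, discounted value of 120: {clv_sketch:.2f}")
```

Later cash flows are worth less today, so the discounted total stays below the undiscounted sum of 120.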

# describe the distribution
pd.options.display.float_format = '{:.8f}'.format
summary['clv'].describe()

count   10000.00000000
mean       56.73016752
std       142.07256046
min         0.00000003
25%        13.83895220
50%        21.00412658
75%        39.37786598
max      3623.81967049
Name: clv, dtype: float64

ax = sns.histplot(summary['clv'])  # distplot is deprecated in recent seaborn versions
sns.rugplot(summary['clv'], ax=ax)
ax.set_yscale('log')
plt.suptitle('Distribution of Customer Lifetime Value (log-scaled counts)')

plt.show()

[Figure: distribution of Customer Lifetime Value with logarithmic y-axis]

Although the 75th percentile is under 40 bucks, a few customers with 4-digit CLV, and the high variance they bring, lift the arithmetic mean of the CLV way above that number.

It makes sense to segment these different customer types so that each can be approached appropriately. The features engineered from this BG/NBD model can be used for such a customer segmentation.

Compare with the original summary df and look at all these new columns:

summary.columns

Index(['customer_id', 'username', 'name', 'sex', 'address', 'mail',
       'birthdate', 'frequency', 'recency', 'T', 'predicted_purchases_14',
       'predicted_purchases_30', 'predicted_purchases_7',
       'monetary_value_average', 'Expected_cond_average_revenue', 'error_rev',
       'clv'],
      dtype='object')

References

[1] Fader, Peter S., Bruce G.S. Hardie, and Jen Shang (2010), “Customer-Base Analysis in a Discrete-Time Noncontractual Setting,” Marketing Science, 29 (6), 1086-1108.