Bayes' Theorem applied in Python incl. A/B Testing example - Notes
| | Frequentist (classical) approach | Bayesian approach |
|---|---|---|
| probability | proportion of outcomes | degree of belief |
| parameters | fixed values/constants | random variables |
| names | confidence interval, p-value, power, significance | credible interval, prior, posterior |
$\Large P(A\mid B)= \frac{P(B\mid A)P(A)}{P(B)}$
$P(A\mid B)$ = posterior
$P(B\mid A)$ = likelihood
$P(A)$ = prior
$P(B)$ = evidence (= scaling factor)
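A minimal numeric sketch of the theorem in action (my own example with hypothetical numbers, not from the course): a diagnostic test with 99% sensitivity and a 5% false-positive rate, for a disease with 1% prevalence.
# Bayes' theorem by hand (my addition, hypothetical numbers)
p_disease = 0.01                      # prior P(A)
p_pos_given_disease = 0.99            # likelihood P(B|A)
p_pos_given_healthy = 0.05            # false-positive rate
# evidence P(B): total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
# posterior P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.167 - a positive test is far from certain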
Bayesian estimation
grid approximation
import numpy as np
import pandas as pd
from scipy.stats import uniform, binom
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
# Create cured patients array from 0 to 10
num_patients_cured = np.arange(0,11)
# Create efficacy rate array from 0 to 1 by 0.01
efficacy_rate = np.arange(0,1.01, 0.01)
# Combine the two arrays in one DataFrame
df = pd.DataFrame([(x, y) for x in num_patients_cured for y in efficacy_rate])
# Name the columns
df.columns = ["num_patients_cured", "efficacy_rate"]
# Print df
print(df)
num_patients_cured efficacy_rate
0 0 0.00
1 0 0.01
2 0 0.02
3 0 0.03
4 0 0.04
... ... ...
1106 10 0.96
1107 10 0.97
1108 10 0.98
1109 10 0.99
1110 10 1.00
[1111 rows x 2 columns]
grid approximation without prior knowledge - prior is uniform distribution
# Calculate the prior over the efficacy rate and the likelihood
df["prior"] = uniform.pdf(df["efficacy_rate"])
df["likelihood"] = binom.pmf(df["num_patients_cured"], 10, df["efficacy_rate"])
# Calculate the posterior efficacy rate and scale it to sum up to one
df["posterior_prob"] = df["prior"] * df["likelihood"]
df["posterior_prob"] /= df["posterior_prob"].sum()
# Compute the posterior probability of observing 9 cured patients
df_9_of_10_cured = df.loc[df["num_patients_cured"] == 9].copy()
df_9_of_10_cured["posterior_prob"] /= df_9_of_10_cured["posterior_prob"].sum()
# Plot the drug's posterior efficacy rate having seen 9 out of 10 patients cured.
sns.set_style("darkgrid")
sns.lineplot(x=df_9_of_10_cured["efficacy_rate"], y=df_9_of_10_cured["posterior_prob"] )
plt.show()
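A quick sanity check (my addition): the posterior mean efficacy rate from the grid. With a uniform prior and 9 of 10 patients cured, it should land near 10/12 ≈ 0.83, the mean of the Beta(10, 2) posterior given by the conjugate formula further below.
# Posterior mean from the grid (my addition)
posterior_mean = (df_9_of_10_cured["efficacy_rate"]
                  * df_9_of_10_cured["posterior_prob"]).sum()
print(posterior_mean)   # ~0.83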
The testing of the drug continues, and an additional group of 12 sick patients has been treated, 10 of whom were cured.
Updating the posterior belief: the previous posterior is used as the new prior.
# Assign old posterior to new prior and calculate likelihood
df["new_prior"] = df["posterior_prob"]
df["new_likelihood"] = binom.pmf(df["num_patients_cured"], 12, df["efficacy_rate"])
# Calculate new posterior and scale it
df["new_posterior_prob"] = df["new_prior"] * df["new_likelihood"]
df["new_posterior_prob"] /= df["new_posterior_prob"].sum()
# Compute the posterior probability of observing 10 cured patients
df_10_of_12_cured = df[df["num_patients_cured"] == 10].copy()
df_10_of_12_cured["new_posterior_prob"] /= df_10_of_12_cured["new_posterior_prob"].sum()
sns.lineplot(x=df_10_of_12_cured["efficacy_rate"],
y=df_10_of_12_cured["new_posterior_prob"],
label="new posterior")
sns.lineplot(x=df_9_of_10_cured["efficacy_rate"],
y=df_9_of_10_cured["posterior_prob"],
label="old posterior = new prior")
plt.show()
# compare old and new likelihood
data = {
'likelihood': binom.pmf(df["num_patients_cured"], 10, df["efficacy_rate"])[0:10],
'new likelihood': binom.pmf(df["num_patients_cured"], 12, df["efficacy_rate"])[0:10]
}
pd.DataFrame(data=data)
| | likelihood | new likelihood |
|---|---|---|
| 0 | 1.000000 | 1.000000 |
| 1 | 0.904382 | 0.886385 |
| 2 | 0.817073 | 0.784717 |
| 3 | 0.737424 | 0.693842 |
| 4 | 0.664833 | 0.612710 |
| 5 | 0.598737 | 0.540360 |
| 6 | 0.538615 | 0.475920 |
| 7 | 0.483982 | 0.418596 |
| 8 | 0.434388 | 0.367666 |
| 9 | 0.389416 | 0.322475 |
binom.pmf(df["num_patients_cured"], 12, df["efficacy_rate"])
array([1. , 0.88638487, 0.78471672, ..., 0.02157072, 0.00596892,
0. ])
Prior distribution
= chosen and stated before we see the data (no cherry picking)
can be
- old prior
- expert opinion
- common knowledge
- previous research
- subjective belief
conjugate priors
= some priors, multiplied with specific likelihoods, yield known posteriors
for example:
# prior: Beta(a, b)
# posterior: Beta(x, y)
# with
# x = NumberOfSuccesses + a
# y = NumberOfObservations - NumberOfSuccesses + b
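A minimal sketch of this conjugate shortcut (my addition): with a uniform Beta(1, 1) prior and 9 of 10 patients cured, the posterior is Beta(10, 2) in closed form, no grid needed.
# Closed-form Beta-Binomial update (my addition)
from scipy.stats import beta
a, b = 1, 1                # uniform prior = Beta(1, 1)
successes, n = 9, 10
posterior = beta(successes + a, n - successes + b)   # Beta(10, 2)
print(posterior.mean())    # ~0.833, matching the grid approximation above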
Two methods to get the posterior
Simulation
draws = np.random.beta(2,4,1000)
draws[0:10]
array([0.24647378, 0.11933372, 0.38872732, 0.4030806 , 0.59654602,
0.01687355, 0.29056646, 0.14239311, 0.17596316, 0.5461129 ])
#sns.kdeplot(draws)
Calculation
if the posterior is not known, we can calculate it using grid approximation
#sns.lineplot(x=df["efficacy_rate"], y=df["posterior_prob"] )
df.head()
| | num_patients_cured | efficacy_rate | prior | likelihood | posterior_prob | new_prior | new_likelihood | new_posterior_prob |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.00 | 1.0 | 1.000000 | 0.009901 | 0.009901 | 1.000000 | 0.050946 |
| 1 | 0 | 0.01 | 1.0 | 0.904382 | 0.008954 | 0.008954 | 0.886385 | 0.040840 |
| 2 | 0 | 0.02 | 1.0 | 0.817073 | 0.008090 | 0.008090 | 0.784717 | 0.032665 |
| 3 | 0 | 0.03 | 1.0 | 0.737424 | 0.007301 | 0.007301 | 0.693842 | 0.026067 |
| 4 | 0 | 0.04 | 1.0 | 0.664833 | 0.006583 | 0.006583 | 0.612710 | 0.020753 |
Simulating posterior draws
Beta(5, 2) prior
# Define the number of patients treated and cured
num_patients_treated = 22
num_patients_cured = 19
# Simulate 10000 draws from the posterior distribution
posterior_draws = np.random.beta(num_patients_cured + 5, num_patients_treated - num_patients_cured + 2, 10000)
# Plot the posterior distribution
sns.kdeplot(data=posterior_draws, shade=True)
plt.show()
Reporting Bayesian results
sometimes information on distributions needs to be condensed and a single number for a parameter is needed
point estimates
- mean
- median
interval estimate
- credible interval (vs. confidence interval [frequentist])
credible interval
a credible interval is a range within which the parameter lies with a given probability (unlike the frequentist confidence interval, which does not allow this direct interpretation)
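A quick equal-tailed 90% credible interval from the posterior draws (my addition); the HPD variant is covered further below.
# Equal-tailed 90% credible interval (my addition)
ci_low, ci_high = np.percentile(posterior_draws, [5, 95])
print(f"90% credible interval: [{ci_low:.2f}, {ci_high:.2f}]")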
# Calculate the expected number of people cured
cured_expected = np.mean(posterior_draws) * 100_000 # 100k infections
# Calculate the minimum number of people cured with 50% probability
min_cured_50_perc = np.median(posterior_draws) * 100_000
# Calculate the minimum number of people cured with 90% probability
min_cured_90_perc = np.percentile(posterior_draws, 10) * 100_000
# Print the filled-in memo
print(f"Based on the experiments carried out by ourselves and neighboring countries, \nshould we distribute the drug, we can expect {int(cured_expected)} infected people to be cured. \nThere is a 50% probability the number of cured infections \nwill amount to at least {int(min_cured_50_perc)}, and with 90% probability \nit will not be less than {int(min_cured_90_perc)}.")
Based on the experiments carried out by ourselves and neighboring countries,
should we distribute the drug, we can expect 82798 infected people to be cured.
There is a 50% probability the number of cured infections
will amount to at least 83550, and with 90% probability
it will not be less than 73539.
Highest Posterior Density (HPD)
another way to calculate credible intervals: the narrowest interval whose bounds contain the parameter with X% probability
# Import pymc3 as pm
import pymc3 as pm
# hpd function deprecated
# using hdi (highest density interval) instead
ci_67 = pm.hdi(posterior_draws, hdi_prob=0.67)
#print(ci_67)
# Calculate HPD credible interval of 90%
ci_90 = pm.hdi(posterior_draws, hdi_prob=0.9)
# Calculate HPD credible interval of 95%
ci_95 = pm.hdi(posterior_draws, hdi_prob=0.95)
pm.plot_posterior(np.array(posterior_draws), hdi_prob=0.95)
# Print the memo
print(f"The experimental results indicate that with a 90% probability \nthe new drug's efficacy rate is between {np.round(ci_90[0], 2)} and {np.round(ci_90[1], 2)}, \nand with a 95% probability it is between {np.round(ci_95[0], 2)} and {np.round(ci_95[1], 2)}.")
The experimental results indicate that with a 90% probability
the new drug's efficacy rate is between 0.72 and 0.94,
and with a 95% probability it is between 0.69 and 0.95.
Bayesian inference
A/B testing
# 1 heads, 0 tails
tosses = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
Simulate beta posterior
With a conjugate prior $Beta(a, b)$, the posterior is $Beta(x, y)$ with
$ x = \text{NumberOfHeads} + a $
$ y = \text{NumberOfTosses} - \text{NumberOfHeads} + b $
# Set prior parameters and calculate number of successes
beta_prior_a = 1
beta_prior_b = 1
num_successes = np.sum(tosses)
# Generate 10000 posterior draws
posterior_draws = np.random.beta(num_successes + beta_prior_a,
len(tosses) - num_successes + beta_prior_b,
10000)
# Plot density of posterior_draws
sns.kdeplot(posterior_draws, shade=True)
plt.show()
# Repeat with a different prior: Beta(1, 10), which favors low heads probabilities
beta_prior_a = 1
beta_prior_b = 10
num_successes = np.sum(tosses)
# Generate 10000 posterior draws
posterior_draws = np.random.beta(num_successes + beta_prior_a,
len(tosses) - num_successes + beta_prior_b,
10000)
# Plot density of posterior_draws
sns.kdeplot(posterior_draws, shade=True)
plt.show()
You decide to run A/B testing, modeling the data using the binomial likelihood. You found out that a typical click-through rate for the previous ads has been around 15% recently, with results varying between 5% and 30%. Based on this, you conclude that Beta(10,50) would be a good prior for the click-through rate.
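A quick check (my addition) that Beta(10, 50) actually encodes this knowledge: its mean is about 15% and nearly all of its mass lies roughly between 5% and 30%.
# Verify the Beta(10, 50) prior against the stated click-through rates (my addition)
from scipy.stats import beta
prior = beta(10, 50)
print(prior.mean())              # ~0.167
print(prior.ppf([0.01, 0.99]))   # roughly [0.07, 0.30]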
A/B testing - Posterior click rates
two pilot advertising campaigns: one for sneakers, and one for clothes.
ads = pd.read_csv('ads.tsv', delimiter='\t', index_col=0)
ads['time'] = pd.to_datetime(ads['time'])
def simulate_beta_posterior(trials, beta_prior_a, beta_prior_b):
    num_successes = np.sum(trials)
    posterior_draws = np.random.beta(num_successes + beta_prior_a,
                                     len(trials) - num_successes + beta_prior_b,
                                     10000)
    return posterior_draws
ads.head()
| id | user_id | product | site_version | time | banner_clicked |
|---|---|---|---|---|---|
| 0 | f500b9f27ac611426935de6f7a52b71f | clothes | desktop | 2019-01-28 16:47:00 | 0 |
| 1 | cb4347c030a063c63a555a354984562f | sneakers | mobile | 2019-03-31 17:34:00 | 0 |
| 2 | 89cec38a654319548af585f4c1c76b51 | clothes | mobile | 2019-06-02 09:22:00 | 0 |
| 3 | 1d4ea406d45686bdbb49476576a1a985 | sneakers | mobile | 2019-05-23 08:07:00 | 0 |
| 4 | d14b9468a1f9a405fa801a64920367fe | clothes | mobile | 2019-01-28 08:16:00 | 0 |
# Generate prior draws
prior_draws = np.random.beta(10, 50, 100000)
# Plot the prior
sns.kdeplot(prior_draws, shade=True, label="prior")
plt.show()
# Extract the banner_clicked column for each product
clothes_clicked = ads.loc[ads["product"] == "clothes"]["banner_clicked"]
sneakers_clicked = ads.loc[ads["product"] == "sneakers"]["banner_clicked"]
# Simulate posterior draws for each product
clothes_posterior = simulate_beta_posterior(clothes_clicked, 10,50)
sneakers_posterior = simulate_beta_posterior(sneakers_clicked, 10,50)
sns.kdeplot(clothes_posterior, shade=True, label="clothes")
sns.kdeplot(sneakers_posterior, shade=True, label="sneakers")
plt.legend(labels=['clothes', 'sneakers'])
plt.show()
It is more probable that the clothes ads are better, but we cannot be completely sure: the two posteriors overlap, so it is still possible that the sneakers campaign is better.
# Calculate posterior difference and plot it
diff = clothes_posterior - sneakers_posterior
sns.kdeplot(diff, shade=True, label="diff")
plt.axvline(x=0, color='red', linestyle='--')
plt.text(0.0002, 32, "zero difference marker", rotation=90, verticalalignment='center')
plt.title('Difference distribution of clothes - sneakers posteriors')
plt.show()
# Calculate and print 90% credible interval of posterior difference
interval = pm.hdi(diff, hdi_prob=0.9)
print(interval)
# Calculate and print probability of clothes ad being better
clothes_better_prob = (diff > 0).mean()
print(clothes_better_prob)
pm.plot_posterior(np.array(diff), hdi_prob=0.9)
plt.show()
[0.00254178 0.02415007]
0.9795
Take a look at the posterior density plot of the difference in click rates: it is very likely positive, indicating that clothes ads are likely better. The credible interval indicates that with 90% probability, the clothes click rate is between roughly 0.25 and 2.4 percentage points higher than the one for sneakers. Finally, the probability that the clothes click rate is higher is 98%. Great! But there is a 2% chance that the sneakers ads are actually better!
Risk analysis
there is a 2% risk that it’s the sneakers ads that are actually better. If that’s the case, how many clicks do we lose if we roll out the clothes campaign?
# Slice diff to take only cases where it is negative
loss = diff[diff < 0]
# Compute and print expected loss
expected_loss = loss.mean()
print(f'expected loss: {round(expected_loss*100,4)} %')
expected loss: -0.2422 %
You can safely roll out the clothes campaign to a larger audience. You are 98% sure it has a higher click rate, and even if the 2% risk of this being a wrong decision materializes, you will only lose about 0.2 percentage points in the click rate, which is a very small risk!
Decision analysis
translate parameters to relevant metrics to inform decision making
From posteriors to decisions
- In order to take strategic decisions, one should know the probabilities of different scenarios.
- Bayesian methods allow us to translate parameters into relevant metrics easily.
- Change the unit in which the decision takes place, e.g. CTR distribution * impressions * revenue per click = revenue distribution (see the sketch below)
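A minimal sketch of this translation (my addition; the impression count and revenue-per-click figures are made up for illustration):
# Translate the CTR posterior into a revenue distribution (my addition)
impressions = 10_000
revenue_per_click = 3.0
# every posterior CTR draw becomes a revenue draw -> a full revenue distribution
clothes_revenue = clothes_posterior * impressions * revenue_per_click
print(np.mean(clothes_revenue), np.percentile(clothes_revenue, [5, 95]))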
# Calculate distributions of the numbers of clicks for clothes and sneakers
clothes_num_clicks = clothes_posterior * 10_000
sneakers_num_clicks = sneakers_posterior * 10_000
# Calculate cost distributions for each product and platform
ads_costs = {
"clothes_mobile": clothes_num_clicks * 2.5,
"sneakers_mobile": sneakers_num_clicks * 2.5,
"clothes_desktop": clothes_num_clicks * 2,
"sneakers_desktop": sneakers_num_clicks * 2,
}
# Draw a forest plot of ads_costs
pm.plot_forest(ads_costs, hdi_prob=0.99, textsize=15)
plt.show()
The ends of the whiskers mark the 99% credible interval, so there is a 1% chance the cost will fall outside of it. It’s very, very unlikely, but there is a slim chance that the clothes-mobile cost will turn out lower. It’s important to stay cautious when communicating possible scenarios – that’s the thing with probability, it’s rarely the case that something is ‘completely impossible’!
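We can quantify that slim chance directly from the draws (my addition): the share of posterior draws in which the clothes-mobile cost comes out below the sneakers-mobile cost.
# Probability that the clothes-mobile cost is lower (my addition)
p_clothes_cheaper = (ads_costs["clothes_mobile"] < ads_costs["sneakers_mobile"]).mean()
print(p_clothes_cheaper)   # small, mirroring the ~2% chance that sneakers ads are better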
# Calculate profit distributions for each product and platform
ads_profit = {
"clothes_mobile": clothes_num_clicks * 3.4 - ads_costs["clothes_mobile"],
"sneakers_mobile": sneakers_num_clicks * 3.4 - ads_costs["sneakers_mobile"],
"clothes_desktop": clothes_num_clicks * 3 - ads_costs["clothes_desktop"],
"sneakers_desktop": sneakers_num_clicks * 3 - ads_costs["sneakers_desktop"],
}
# Draw a forest plot of ads_profit
pm.plot_forest(ads_profit, hdi_prob=0.99)
plt.show()
Notice how shifting focus from costs to profit has changed the optimal decision. The sneakers-desktop campaign which minimizes the cost is not the best choice when you care about the profit. Based on these results, you would be more likely to invest in the clothes-desktop campaign, wouldn’t you? Let’s continue to the final lesson of this chapter, where we look at regression and forecasting, the Bayesian way!
Regression and forecasting
Your linear regression model has four parameters: the intercept, the impact of clothes ads, the impact of sneakers ads, and the variance.
Analyzing regression parameters
#list the files
filelist = ['intercept_draws.txt', 'clothes_draws.txt', 'sneakers_draws.txt', 'sd_draws.txt']
#read them into pandas
df_list = [pd.read_csv(file, names=['col']) for file in filelist]
#concatenate them together
posterior_draws_df = pd.concat(df_list, axis=1)
posterior_draws_df.columns=['intercept_draws', 'clothes_draws', 'sneakers_draws', 'sd_draws']
# Describe parameter posteriors
draws_stats = posterior_draws_df.describe()
print(draws_stats)
# Plot clothes parameter posterior
pm.plot_posterior(np.array(posterior_draws_df.clothes_draws), hdi_prob=0.99)
plt.show()
intercept_draws clothes_draws sneakers_draws sd_draws
count 2000.000000 2000.000000 2000.000000 2000.000000
mean 1.280420 0.104594 0.103594 2.651661
std 0.903845 0.030282 0.031596 0.159491
min -2.088446 -0.007500 0.001084 2.211899
25% 0.712354 0.085381 0.081577 2.543340
50% 1.288362 0.104680 0.103554 2.639033
75% 1.849244 0.123830 0.125466 2.754714
max 4.343638 0.229886 0.211751 3.278124
The impact parameters of both clothes and sneakers look okay: they are positive, most likely around 0.1, indicating 1 additional click from 10 ad impressions, which makes sense. Let’s now use the model to make predictions!
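A follow-up check (my addition): how sure are we that the clothes ads have the larger impact? With two nearly identical posteriors, this should come out close to 50%.
# Probability that the clothes impact exceeds the sneakers impact (my addition)
p_clothes_larger = (posterior_draws_df.clothes_draws
                    > posterior_draws_df.sneakers_draws).mean()
print(p_clothes_larger)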
Predictive distribution
Let’s now use the linear regression model to make predictions. How many clicks can we expect if we decide to show 10 clothes ads and 10 sneaker ads?
# Aggregate posteriors of the parameters to point estimates
intercept_coef = np.mean(posterior_draws_df.intercept_draws)
sneakers_coef = np.mean(posterior_draws_df.sneakers_draws)
clothes_coef = np.mean(posterior_draws_df.clothes_draws)
sd_coef = np.mean(posterior_draws_df.sd_draws)
# Calculate the mean of the predictive distribution
pred_mean = intercept_coef + sneakers_coef * 10 + clothes_coef * 10
# Sample 1000 draws from the predictive distribution
pred_draws = np.random.normal(pred_mean, sd_coef, size=1000)
# Plot the density of the predictive distribution
pm.plot_posterior(pred_draws, hdi_prob=0.99)
plt.show()
It looks like you can expect more or less three or four clicks if you show 10 clothes and 10 sneaker ads.
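Note that plugging in posterior means is a shortcut that ignores parameter uncertainty. A fully Bayesian alternative (my addition) draws one prediction per posterior sample, which typically yields a slightly wider predictive distribution:
# Propagate parameter uncertainty into the predictive distribution (my addition)
pred_mean_draws = (posterior_draws_df.intercept_draws
                   + posterior_draws_df.sneakers_draws * 10
                   + posterior_draws_df.clothes_draws * 10)
pred_draws_full = np.random.normal(pred_mean_draws, posterior_draws_df.sd_draws)
pm.plot_posterior(np.array(pred_draws_full), hdi_prob=0.99)
plt.show()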
Bayesian linear regression with pyMC3
Markov Chain Monte Carlo (MCMC) and model fitting
Problems of Bayesian data analysis in production:
- Grid approximation: inconvenient with many parameters
- Sampling from known posterior: requires conjugate priors
Solution:
- Markov Chain Monte Carlo: sampling from an unknown posterior -> same result as the other methods, but the most flexible
Monte Carlo
Approximating some quantity by generating random numbers
Markov Chains
Markov Chains model a sequence of states, between which one transitions with given probabilities
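A toy illustration (my addition): a two-state Markov chain (0 = sunny, 1 = rainy) simulated with fixed transition probabilities. MCMC samplers build such chains whose long-run distribution over states is the posterior we want.
# Simulate a two-state Markov chain (my addition)
import numpy as np
transition = np.array([[0.9, 0.1],    # P(next state | current = sunny)
                       [0.5, 0.5]])   # P(next state | current = rainy)
rng = np.random.default_rng(42)
state, states = 0, []
for _ in range(10_000):
    state = rng.choice(2, p=transition[state])
    states.append(state)
print(np.mean(states))   # long-run share of rainy days, ~1/6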
New Use Case
predict the number of bikes rented per day to plan staff and repairs accordingly
import pandas as pd
import pymc3 as pm
import seaborn as sns
# datacamp has made an error and only provided the test data, no training file downloadable
bikes_test = pd.read_csv('bikes_test.csv')
# very small dataset of just 10 observations
bikes_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 work_day 10 non-null int64
1 temp 10 non-null float64
2 humidity 10 non-null float64
3 wind_speed 10 non-null float64
4 num_bikes 10 non-null float64
dtypes: float64(4), int64(1)
memory usage: 528.0 bytes
# simple model
formula_first_model = "num_bikes ~ temp + work_day"
# Generate predictive draws
with pm.Model() as model:
    pm.GLM.from_formula(formula_first_model, data=bikes_test)
    print(model)
    posterior_predictive_first_model = pm.sample(draws=18000, tune=18000, chains=12, cores=6, return_inferencedata=True)
The glm module is deprecated and will be removed in version 4.0
We recommend to instead use Bambi https://bambinos.github.io/bambi/
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Intercept ~ Flat
temp ~ Normal
work_day ~ Normal
sd_log__ ~ TransformedDistribution
sd ~ HalfCauchy
y ~ Normal
Multiprocess sampling (12 chains in 6 jobs)
NUTS: [sd, work_day, temp, Intercept]
Sampling 12 chains for 18_000 tune and 18_000 draw iterations (216_000 + 216_000 draws total) took 215 seconds.
There were 1559 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6937498861112372, but should be close to 0.8. Try to increase the number of tuning steps.
There were 1081 divergences after tuning. Increase `target_accept` or reparameterize.
There were 475 divergences after tuning. Increase `target_accept` or reparameterize.
There were 3079 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.5765400788999024, but should be close to 0.8. Try to increase the number of tuning steps.
There were 1956 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6623325935424444, but should be close to 0.8. Try to increase the number of tuning steps.
There were 920 divergences after tuning. Increase `target_accept` or reparameterize.
There were 691 divergences after tuning. Increase `target_accept` or reparameterize.
There were 703 divergences after tuning. Increase `target_accept` or reparameterize.
There were 252 divergences after tuning. Increase `target_accept` or reparameterize.
There were 779 divergences after tuning. Increase `target_accept` or reparameterize.
There were 1530 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7154111570610734, but should be close to 0.8. Try to increase the number of tuning steps.
There were 550 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 10% for some parameters.
pm.summary(posterior_predictive_first_model)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 0.352 | 0.647 | -0.892 | 1.551 | 0.004 | 0.003 | 29035.0 | 52519.0 | 1.0 |
| temp | 9.050 | 2.919 | 3.644 | 14.619 | 0.017 | 0.012 | 26739.0 | 46236.0 | 1.0 |
| work_day | 0.796 | 0.371 | 0.097 | 1.498 | 0.002 | 0.001 | 34369.0 | 54533.0 | 1.0 |
| sd | 0.396 | 0.147 | 0.190 | 0.651 | 0.002 | 0.001 | 3454.0 | 1887.0 | 1.0 |
# Extend the model with the parameter wind speed
# Define the formula
formula = "num_bikes ~ temp + work_day + wind_speed"
# Generate predictive draws
with pm.Model() as model:
    pm.GLM.from_formula(formula, data=bikes_test)
    print(model)
    posterior_predictive = pm.sample(draws=18000, tune=18000, chains=12, cores=6, return_inferencedata=True)
The glm module is deprecated and will be removed in version 4.0
We recommend to instead use Bambi https://bambinos.github.io/bambi/
work_day temp humidity wind_speed num_bikes
0 0 0.265833 0.687917 0.175996 2.947
1 1 0.282609 0.622174 0.153800 3.784
2 1 0.354167 0.496250 0.147379 4.375
3 1 0.256667 0.722917 0.133721 2.802
4 1 0.265000 0.562083 0.194037 3.830
Intercept ~ Flat
temp ~ Normal
work_day ~ Normal
wind_speed ~ Normal
sd_log__ ~ TransformedDistribution
sd ~ HalfCauchy
y ~ Normal
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (12 chains in 6 jobs)
NUTS: [sd, wind_speed, work_day, temp, Intercept]
Sampling 12 chains for 18_000 tune and 18_000 draw iterations (216_000 + 216_000 draws total) took 322 seconds.
There were 6090 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.4535070540886654, but should be close to 0.8. Try to increase the number of tuning steps.
There were 484 divergences after tuning. Increase `target_accept` or reparameterize.
There were 3314 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6322839508323852, but should be close to 0.8. Try to increase the number of tuning steps.
There were 5749 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.4657185678254947, but should be close to 0.8. Try to increase the number of tuning steps.
There were 1988 divergences after tuning. Increase `target_accept` or reparameterize.
There were 1572 divergences after tuning. Increase `target_accept` or reparameterize.
There were 3831 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.603721265287982, but should be close to 0.8. Try to increase the number of tuning steps.
There were 5848 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.4531228873120186, but should be close to 0.8. Try to increase the number of tuning steps.
There were 5826 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.47532685097020505, but should be close to 0.8. Try to increase the number of tuning steps.
There were 2830 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6774930920115295, but should be close to 0.8. Try to increase the number of tuning steps.
There were 4966 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.528125701676579, but should be close to 0.8. Try to increase the number of tuning steps.
There were 2743 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6825571279708497, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 10% for some parameters.
# 313sec
#pm.save_trace(trace=posterior_predictive, directory='C://Users//SD//Bayes_Theorem_applied_in_Python//trace.dump', overwrite=True)
pm.summary(posterior_predictive)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 0.148 | 2.044 | -3.759 | 3.984 | 0.034 | 0.024 | 3097.0 | 5728.0 | 1.01 |
| temp | 9.445 | 4.916 | 0.149 | 18.606 | 0.072 | 0.051 | 3878.0 | 6208.0 | 1.00 |
| work_day | 0.824 | 0.542 | -0.198 | 1.828 | 0.008 | 0.006 | 3356.0 | 10050.0 | 1.01 |
| wind_speed | 0.427 | 4.060 | -7.275 | 8.045 | 0.065 | 0.046 | 3170.0 | 6536.0 | 1.00 |
| sd | 0.471 | 0.190 | 0.224 | 0.793 | 0.006 | 0.004 | 507.0 | 850.0 | 1.03 |
# reference
# https://bookdown.org/rdpeng/advstatcomp/monitoring-convergence.html#gelman-rubin-statistic
# r_hat below ~1.1 (or 1.2) is considered close enough to 1 to mean that all chains have reasonably converged
pm.plot_trace(posterior_predictive)
plt.show()
hdi_prob = 0.94
pm.plot_forest(posterior_predictive, hdi_prob=hdi_prob)
plt.title(f'Forest plot of all the converged chains with credible interval {hdi_prob*100}% ')
plt.show()
Interpret results and compare the two models
We use the Widely Applicable Information Criterion (WAIC) to compare the two models. Here trace_1, our first, simpler model, proves to be a better fit for the data with its higher WAIC score in pymc's comparison function. The weight column shows each model's probability of being the true model; the extended model with wind speed is clearly worse than the simple one.
import warnings
warnings.simplefilter('ignore')
comparison = pm.compare({"trace_1": posterior_predictive_first_model,
"trace_2": posterior_predictive},
ic='waic')
comparison
| | rank | waic | p_waic | d_waic | weight | se | dse | warning | waic_scale |
|---|---|---|---|---|---|---|---|---|---|
| trace_1 | 0 | -5.979267 | 3.269994 | 0.000000 | 1.0 | 2.500034 | 0.000000 | True | log |
| trace_2 | 1 | -7.404534 | 3.447594 | 1.425267 | 0.0 | 1.684752 | 0.974662 | True | log |
pm.compareplot(comparison, textsize=13)
# empty circles are the waic scores
# black bars are the standard errors
<AxesSubplot:xlabel='Log'>
Making predictions
# predict with fresh test data
with pm.Model() as model:
    pm.GLM.from_formula(formula_first_model, data=bikes_test)
    predictive_dist = pm.fast_sample_posterior_predictive(posterior_predictive_first_model)
The glm module is deprecated and will be removed in version 4.0
We recommend to instead use Bambi https://bambinos.github.io/bambi/
predictive_dist['y'].shape
(216000, 10)
import numpy as np
import matplotlib.pyplot as plt
# Initialize errors
errors = []
# Iterate over rows of bikes_test to compute error per row
for index, test_example in bikes_test.iterrows():
error = predictive_dist["y"][:, index] - test_example["num_bikes"]
errors.append(error)
# Reshape errors
error_distribution = np.array(errors).reshape(-1)
# Plot the error distribution
pm.plot_posterior(error_distribution, kind='hist')
plt.xlim([-3, 3])
plt.show()
In practice, you might want to compute the error estimate based on more than just 10 observations, but you can already see some patterns. For example, although the error is mostly marginal, it is more often negative than positive, which means that the model tends to underpredict the number of bikes rented!
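Backing that claim with a number (my addition): the share of negative errors; values above 0.5 mean the model underpredicts more often than it overpredicts.
# Share of negative prediction errors (my addition)
print((error_distribution < 0).mean())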