Udacity A/B Testing: Online Experiment Design and Analysis - Final Project in Python

Experiment Design Project

Experiment Overview: Free Trial Screener

At the time of this experiment, Udacity courses have two options on the course overview page: “start free trial” and “access course materials”. If the student clicks “start free trial”, they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks “access course materials”, they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

In the experiment, Udacity tested a change where if the student clicked “start free trial”, they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. A screenshot in the original project instructions shows what the experiment looked like.

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn’t have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches’ capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

Please refer to the complete Final Project Instructions

To highlight the hypothesis:

A popup screener might set clearer expectations upfront for students about to enroll, thus reducing the number of frustrated students who leave the free trial because they don’t have enough time, while not significantly reducing the number of students who continue past the free trial and eventually complete the course.

Phases of interest

import plotly.express as px
import pandas as pd

df = pd.DataFrame([
    dict(Diversion="cookie diversion", Start='2021-01-01', End='2021-01-02', Phase="unique user cookie course overview page"),
    dict(Diversion="cookie diversion", Start='2021-01-04', End='2021-01-05', Phase="unique user cookie course overview page"),
    dict(Diversion="cookie diversion", Start='2021-01-05', End='2021-01-06', Phase="unique user cookie course overview page"),
    dict(Diversion="cookie diversion", Start='2021-01-06', End='2021-01-07', Phase="unique user cookie course overview page"),
    dict(Diversion="both diversions", Start='2021-01-06', End='2021-01-07', Phase="'Start free trial' button click"),
    dict(Diversion="user-id diversion", Start='2021-01-05', End='2021-02-20', Phase="user-id login"),
    dict(Diversion="user-id diversion", Start='2021-01-07', End='2021-01-21', Phase="checkout & free trial"),
    dict(Diversion="user-id diversion", Start='2021-01-21', End='2021-02-20', Phase="enrolled past 14 days"),
    dict(Diversion="user-id diversion", Start='2021-02-20', End='2021-02-28', Phase="unenrolled"),
])

# fig = px.timeline(df, x_start="Start", x_end="End", y="Phase", color="Diversion",
#                   color_discrete_sequence=px.colors.qualitative.Vivid,
#                  title='Exemplary phases of the Udacity user signup process' )
# fig.show()
df.head()
Diversion Start End Phase
0 cookie diversion 2021-01-01 2021-01-02 unique user cookie course overview page
1 cookie diversion 2021-01-04 2021-01-05 unique user cookie course overview page
2 cookie diversion 2021-01-05 2021-01-06 unique user cookie course overview page
3 cookie diversion 2021-01-06 2021-01-07 unique user cookie course overview page
4 both diversions 2021-01-06 2021-01-07 'Start free trial' button click

Metric Choice

List which metrics you will use as invariant metrics and evaluation metrics here. (These should be the same metrics you chose in the "Choosing Invariant Metrics" and "Choosing Evaluation Metrics" quizzes.)

For each metric on the table, the suitability check and resulting choice:

Number of cookies: number of unique cookies to view the course overview page (dmin=3000). The number of cookies is not affected by the change, since the screener only appears after the “Start free trial” click, so it should be similar between control and experiment groups. It is therefore not usable as an evaluation metric. Choice: invariant.

Number of user-ids: number of users who enroll in the free trial (dmin=50). The count of enrolling user-ids is exactly what the experiment is expected to change, so it is not invariant. As a raw count it is also hard to compare between groups with different amounts of traffic, so it is not used as an evaluation metric either. Choice: none.

Number of clicks: number of unique cookies to click the “Start free trial” button, which happens before the free trial screener is triggered (dmin=240). The number of clicks is therefore unaffected by the experiment and invariant in this design, although as a raw count it is less convenient to compare than a ratio. Choice: invariant.

Click-through-probability: number of unique cookies to click the “Start free trial” button divided by the number of unique cookies to view the course overview page (dmin=0.01). The CTP is a better candidate than the previous metric: it is measured before the screener and, as a ratio, directly comparable. Choice: invariant.

Gross conversion: number of user-ids to complete checkout and enroll in the free trial divided by the number of unique cookies to click the “Start free trial” button (dmin=0.01). This ratio is directly affected by the screener and is expected to decrease if the screener discourages students with too little time; it remains informative even when net conversion shows no change. Choice: evaluation.

Retention: number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of user-ids to complete checkout (dmin=0.01). Retention equals net conversion divided by gross conversion; as a metric derived from the two conversions it is unclear what additional value it adds or which direction to expect. It is not used as an evaluation metric, but it is still calculated and inspected below. Choice: none.

Net conversion: number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the “Start free trial” button (dmin=0.0075). This ratio restricts gross conversion to user-ids who stay past the 14-day boundary. Following the hypothesis, it should not decrease significantly, even though the absolute number of enrolling user-ids may be lower. Choice: evaluation.
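
In terms of the two evaluation metrics, one way to restate the launch criteria (my own summary, not part of the original instructions) is:

\[ GC_{exp} - GC_{cont} < -d_{min}^{GC} = -0.01 \qquad \text{(gross conversion decreases by a practically significant amount)} \]

\[ NC_{exp} - NC_{cont} > -d_{min}^{NC} = -0.0075 \qquad \text{(net conversion does not decrease by a practically significant amount)} \]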

Measuring Standard Deviation

List the standard deviation of each of your evaluation metrics. (These should be the answers from the "Calculating standard deviation" quiz.)

For each of your evaluation metrics, indicate whether you think the analytic estimate would be comparable to the empirical variability, or whether you expect them to be different (in which case it might be worth doing an empirical estimate if there is time). Briefly give your reasoning in each case.

# Baseline Values

# udacity baseline google sheet 
google_sheet_url = 'https://docs.google.com/spreadsheets/d/1MYNUtC47Pg8hdoCjOXaHqF-thheGpUshrFA21BAJnNc/export?format=csv&gid=0'

import pandas as pd
df=pd.read_csv(google_sheet_url, names=['observation', 'baseline'], encoding="utf-8")

#df.iloc[2,0] = 'Enrollments'
df['shortname'] = ['cookies',
              'clicks',
              'enrollments',
              'p_click_through',
              'p_enroll_given_click',
               'p_payment_given_enroll',
               'p_payment_given_click'
              ]
df['metric'] = ['',
              '',
              '',
              '',
              'gross_conversion',
               'retention',
               'net_conversion',
              ]
df['dmin'] = [3000,240,50,0.01,0.01,0.01,0.0075]
df['potent_metric'] = [False,False,False,False,True,True,True]
df = df.set_index('shortname')
#df['n'] = 0
df
observation baseline metric dmin potent_metric
shortname
cookies Unique cookies to view course overview page pe... 40000.000000 3000.0000 False
clicks Unique cookies to click "Start free trial" per... 3200.000000 240.0000 False
enrollments Enrollments per day: 660.000000 50.0000 False
p_click_through Click-through-probability on "Start free trial": 0.080000 0.0100 False
p_enroll_given_click Probability of enrolling, given click: 0.206250 gross_conversion 0.0100 True
p_payment_given_enroll Probability of payment, given enroll: 0.530000 retention 0.0100 True
p_payment_given_click Probability of payment, given click 0.109313 net_conversion 0.0075 True

Standard Deviation based on given probabilities

For a binomially distributed metric expressed as a proportion of successes (as our baseline values are), the analytic standard error is \(SE = \sqrt{\frac{p(1-p)}{n}}\).

The sample is based on n = 5000 pageviews (cookies) with p the probability of success; the n for clicks and enrollments is derived from this below.

df['reference_value'] = ['cookie', 'click', 'enroll', 'cookie', 'click', 'enroll', 'click']
df.loc[df['reference_value'] == 'cookie', 'n'] = 5000
df.loc[df['reference_value'] == 'click', 'n'] = df.loc['cookies','n'] * df.loc['p_click_through','baseline'] 
df.loc[df['reference_value'] == 'enroll', 'n'] = df.loc['clicks','n'] * df.loc['p_enroll_given_click','baseline']
import numpy as np

def StandardError(p,n):
    stdev = np.sqrt(p*(1-p)/n)
    return stdev

df['SE'] = df[df['potent_metric']==True].apply( lambda row: StandardError(row['baseline'], row['n']), axis=1)

df
observation baseline metric dmin potent_metric reference_value n SE
shortname
cookies Unique cookies to view course overview page pe... 40000.000000 3000.0000 False cookie 5000.0 NaN
clicks Unique cookies to click "Start free trial" per... 3200.000000 240.0000 False click 400.0 NaN
enrollments Enrollments per day: 660.000000 50.0000 False enroll 82.5 NaN
p_click_through Click-through-probability on "Start free trial": 0.080000 0.0100 False cookie 5000.0 NaN
p_enroll_given_click Probability of enrolling, given click: 0.206250 gross_conversion 0.0100 True click 400.0 0.020231
p_payment_given_enroll Probability of payment, given enroll: 0.530000 retention 0.0100 True enroll 82.5 0.054949
p_payment_given_click Probability of payment, given click 0.109313 net_conversion 0.0075 True click 400.0 0.015602

The standard errors are calculated analytically. For both evaluation metrics the unit of analysis (the denominator: unique cookies that click “Start free trial”) matches the unit of diversion (a cookie), so the analytic estimate should be close to the empirical variability and an empirical estimate is not required.

The sample proportions are approximated by normal distributions, as all referenced sample sizes are deemed sufficiently large. This could be confirmed formally with the 3-standard-deviation rule for the normal approximation of a binomial distribution; since none of the probabilities is extreme and the sample sizes are large, the formal check is only sketched below.
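
A minimal sketch of the waived check, using the df from above: the normal approximation is considered reasonable if the interval p ± 3·SE stays strictly inside [0, 1].

# 3-standard-deviation rule: p +/- 3*sqrt(p*(1-p)/n) must lie inside (0, 1)
def normal_approx_ok(p, n):
    three_sd = 3 * np.sqrt(p * (1 - p) / n)
    return (p - three_sd > 0) and (p + three_sd < 1)

for name, row in df[df['potent_metric']].iterrows():
    print(name, normal_approx_ok(row['baseline'], row['n']))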

Sizing the experiment

Number of Samples vs. Power

Indicate whether you will use the Bonferroni correction during your analysis phase, and give the number of pageviews you will need to power you experiment appropriately. (These should be the answers from the "Calculating Number of Pageviews" quiz.)

Calculate the minimum sample sizes needed for sufficient statistical power, i.e. enough observations to be able to tell whether there is a significant difference between the control and treatment groups.

$ \text{Gross conversion} = C $

$ \alpha = 0.05 \quad \text{(Type I error)} $

$ \beta = 0.2 \quad \text{(Type II error)} $

$ H_{0}: C_{cont} - C_{treat} = 0 $

$ H_{A}: C_{cont} - C_{treat} \neq 0 $

from scipy.stats import norm

#Inputs: required alpha value
#Returns: z-score for given alpha
def z_score(alpha):
    return norm.ppf(alpha)

# Inputs: p - baseline conversion rate (our estimated p), d - minimum detectable change (dmin)
# Returns: [sd1, sd2] - standard deviation under H0 (baseline in both groups) and under HA (baseline vs. baseline + d)
def sds(p,d):
    sd1=np.sqrt(2*p*(1-p))
    sd2=np.sqrt(p*(1-p)+(p+d)*(1-(p+d)))
    sds=[sd1,sd2]
    return sds

# Inputs: sds - [sd under H0, sd under HA],
#         alpha,
#         beta,
#         d - minimum detectable effect (dmin)
# Returns: the minimum sample size required per group, in units of the metric's denominator
def sampSize(sds,alpha,beta,d):
    size=(z_score(1-alpha/2)*sds[0]+z_score(1-beta)*sds[1])**2/d**2
    return size

# The Bonferroni correction is used to control the family-wise error rate: with two evaluation metrics
# (which are not independent, as both share Clicks as denominator) the chance of at least one false positive
# is higher than alpha. It is assumed that a single significant metric is sufficient to reject the Null Hypothesis.
alpha = 0.05
numtests = 2 # retention metric excluded
bonferroni_individual_alpha = alpha / numtests

beta = 0.2

df['samplesize']=sampSize(sds(df.baseline[3:7],df.dmin[3:7]),bonferroni_individual_alpha, beta, df.dmin[3:7]) # only probab.
df
observation baseline metric dmin potent_metric reference_value n SE samplesize
shortname
cookies Unique cookies to view course overview page pe... 40000.000000 3000.0000 False cookie 5000.0 NaN NaN
clicks Unique cookies to click "Start free trial" per... 3200.000000 240.0000 False click 400.0 NaN NaN
enrollments Enrollments per day: 660.000000 50.0000 False enroll 82.5 NaN NaN
p_click_through Click-through-probability on "Start free trial": 0.080000 0.0100 False cookie 5000.0 NaN 14204.630379
p_enroll_given_click Probability of enrolling, given click: 0.206250 gross_conversion 0.0100 True click 400.0 0.020231 31270.939498
p_payment_given_enroll Probability of payment, given enroll: 0.530000 retention 0.0100 True enroll 82.5 0.054949 47335.925254
p_payment_given_click Probability of payment, given click 0.109313 net_conversion 0.0075 True click 400.0 0.015602 33170.892254
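
As a rough cross-check (not part of the original quiz solution), statsmodels' power utilities can be used to size the gross conversion test; they use the Cohen's-h effect size (a variance-stabilizing transform) instead of the pooled-variance formula above, so the result should land in the same ballpark rather than match exactly.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# effect size for detecting a change of dmin = 0.01 on the gross conversion baseline
p_base = df.loc['p_enroll_given_click', 'baseline']
effect = proportion_effectsize(p_base + 0.01, p_base)

# per-group number of clicks at the Bonferroni-corrected alpha and 80% power
NormalIndPower().solve_power(effect_size=effect, alpha=bonferroni_individual_alpha,
                             power=1 - beta, ratio=1.0, alternative='two-sided')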

Let’s convert the per-group sample sizes, which are denominated in each metric’s reference unit, into the total number of pageviews required for each metric (two groups, A and B).

df.loc[df.metric == 'gross_conversion', 'required_pageviews'] = df.loc[df.metric == 'gross_conversion', 'samplesize'] / df.loc['p_click_through', 'baseline'] * 2 # 2 groups A + B
df.loc[df.metric == 'gross_conversion', 'required_pageviews'] = df.loc[df.metric == 'gross_conversion', 'required_pageviews'].astype(int)

df.loc[df.metric == 'retention', 'required_pageviews'] = df.loc[df.metric == 'retention', 'samplesize'] / df.loc['p_click_through', 'baseline'] / df.loc[df.metric == 'gross_conversion', 'baseline'][0] * 2 # 2 groups A + B
df.loc[df.metric == 'retention', 'required_pageviews'] = df.loc[df.metric == 'retention', 'required_pageviews'].astype(int)
df.loc[df.metric == 'net_conversion', 'required_pageviews'] = df.loc[df.metric == 'net_conversion', 'samplesize'] / df.loc['p_click_through', 'baseline'] * 2 # 2 groups A + B
df.loc[df.metric == 'net_conversion', 'required_pageviews'] = df.loc[df.metric == 'net_conversion', 'required_pageviews'].astype(int)

Duration vs. Exposure

Indicate what fraction of traffic you would divert to this experiment and, given this, how many days you would need to run the experiment. (These should be the answers from the "Choosing Duration and Exposure" quiz.)

Give your reasoning for the fraction you chose to divert. How risky do you think this experiment would be for Udacity?

From the previously calculated number of required pageviews we can estimate the duration in days the experiment has to run, assuming constant website traffic.

df['duration'] = df.apply( lambda row: row.required_pageviews / df.loc['cookies', 'baseline'], axis=1)
determine_duration = df[df.potent_metric & (df.metric != 'retention')][['metric', 'required_pageviews', 'duration']]
determine_duration
metric required_pageviews duration
shortname
p_enroll_given_click gross_conversion 781773.0 19.544325
p_payment_given_click net_conversion 829272.0 20.731800
print(f'the test has to run at least {int(np.ceil(max(determine_duration.duration)))} days assuming constant website traffic')
the test has to run at least 21 days assuming constant website traffic

Also to be taken into account is the 14-day trial period, which delays the enrollment and payment data; these 14 days have to be added, amounting to 35 days in total.

The tested change, the free-trial screener popup, is not considered high or elevated risk, so all traffic can be diverted to the experiment. Since the test already has to run for about five weeks, there is little room to prolong it further by diverting only a fraction of the traffic, as the short sketch below illustrates; otherwise a different choice of metrics would have to be considered.
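
A small sketch, based only on the numbers computed above, of how the data-collection time grows when only a fraction of traffic is diverted:

# duration in days for different traffic fractions, based on the larger pageview
# requirement (net conversion) and the baseline of 40,000 pageviews per day
required = determine_duration.required_pageviews.max()
daily_pageviews = df.loc['cookies', 'baseline']
for fraction in [1.0, 0.75, 0.5]:
    days = int(np.ceil(required / (daily_pageviews * fraction)))
    print(f'traffic fraction {fraction:.2f}: ~{days} days of data collection (+ 14 days trial lag)')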

Experiment Analysis

# https://stackoverflow.com/questions/19611729/
def load_from_gspreadsheet(sheet_name, key):
    url = 'https://docs.google.com/spreadsheets/d/{key}/gviz/tq?tqx=out:csv&sheet={sheet_name}&headers=1'.format(
        key=key, sheet_name=sheet_name.replace(' ', '%20'))

    #log.info('Loading google spreadsheet from {}'.format(url))

    df = pd.read_csv(url)
    return df.drop([col for col in df.columns if col.startswith('Unnamed')], axis=1)
gsheet_results_key = '1Mu5u9GrybDdska-ljPXyBjTpdZIUev_6i7t4LRDfXM8'  # results sheet with 'Control' and 'Experiment' tabs
cont = load_from_gspreadsheet('Control', gsheet_results_key)
cont.head()
Date Pageviews Clicks Enrollments Payments
0 Sat, Oct 11 7723 687 134.0 70.0
1 Sun, Oct 12 9102 779 147.0 70.0
2 Mon, Oct 13 10511 909 167.0 95.0
3 Tue, Oct 14 9871 836 156.0 105.0
4 Wed, Oct 15 10014 837 163.0 64.0
exp = load_from_gspreadsheet('Experiment', gsheet_results_key)
exp.head()
Date Pageviews Clicks Enrollments Payments
0 Sat, Oct 11 7716 686 105.0 34.0
1 Sun, Oct 12 9288 785 116.0 91.0
2 Mon, Oct 13 10480 884 145.0 79.0
3 Tue, Oct 14 9867 827 138.0 92.0
4 Wed, Oct 15 9793 832 140.0 94.0

Sanity Checks

For each of your invariant metrics, give the 95% confidence interval for the value you expect to observe, the actual observed value, and whether the metric passes your sanity check. (These should be the answers from the “Sanity Checks” quiz.)

Sanity check of first invariant metric: pageviews (cookies)

# proportion checking

# H0: p = 0.5 # equal pageview proportions between control and experiment
cont.Pageviews.sum()
345543
exp.Pageviews.sum()
344660
p = 0.5

stdev = np.sqrt(p*(1-p)/(cont.Pageviews.sum() + exp.Pageviews.sum()))
stdev
0.0006018407402943247
# zscore
alpha = 0.05
z= 1.96

# margin of error
margin_of_error = stdev * z
margin_of_error 
0.0011796078509768765
# confidence interval construction

min_ci = p - margin_of_error
max_ci = p + margin_of_error

print('confidence interval:')
print(round(min_ci,4))
print(p)
print(round(max_ci,4))
confidence interval:
0.4988
0.5
0.5012
cont_cookie_prop = round(cont.Pageviews.sum() / (cont.Pageviews.sum() + exp.Pageviews.sum()),4)
if(cont_cookie_prop > min_ci and cont_cookie_prop < max_ci):
    print(f'sanity check passed: {cont_cookie_prop} is inside the confidence interval')
sanity check passed: 0.5006 is inside the confidence interval
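
The clicks check below repeats the same pattern. Purely as an editorial sketch (this helper is not part of the original notebook), the proportion check could be wrapped into a reusable function:

# hypothetical helper wrapping the proportion sanity check shown above
def proportion_sanity_check(count_cont, count_exp, p=0.5, z=1.96):
    se = np.sqrt(p * (1 - p) / (count_cont + count_exp))
    lower, upper = p - z * se, p + z * se
    observed = count_cont / (count_cont + count_exp)
    passed = lower < observed < upper
    print(f'CI: [{lower:.4f}, {upper:.4f}], observed: {observed:.4f}, passed: {passed}')
    return passed

proportion_sanity_check(cont.Pageviews.sum(), exp.Pageviews.sum())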

Sanity check of second invariant metric: clicks

# proportion checking

# H0: p = 0.5 # equal click proportions 
p = 0.5

stdev = np.sqrt(p*(1-p)/(cont.Clicks.sum() + exp.Clicks.sum()))
stdev
0.002099747079699252
# zscore
alpha = 0.05
z= 1.96

# margin of error
margin_of_error = stdev * z

# confidence interval construction

min_ci = p - margin_of_error
max_ci = p + margin_of_error

print('confidence interval:')
print(round(min_ci,4))
print(p)
print(round(max_ci,4))

confidence interval:
0.4959
0.5
0.5041
cont_click_prop = cont.Clicks.sum() / (cont.Clicks.sum() + exp.Clicks.sum())
if(cont_click_prop > min_ci and cont_click_prop < max_ci):
    print('sanity check passed - '+str(cont_click_prop)+' is inside the confidence interval')

sanity check passed - 0.5004673474066628 is inside the confidence interval

Sanity check of third invariant metric: click-through probability (CTP)

# checking differences in probability-type-numbers

# H0: d = 0
# work with Pandas Series
ctp_cont = cont.Clicks.sum() / cont.Pageviews.sum()
ctp_exp = exp.Clicks.sum() / exp.Pageviews.sum()

d_hat = ctp_exp - ctp_cont 
d_hat
5.662709158693602e-05
# pooled click-through probability across both groups
p_pooled_hat = (cont.Clicks.sum() + exp.Clicks.sum()) / (cont.Pageviews.sum() + exp.Pageviews.sum())
round(p_pooled_hat,4)
0.0822
# pooled standard error of the difference in click-through probabilities
SE_pool = np.sqrt(p_pooled_hat*(1-p_pooled_hat) * (1/cont.Pageviews.sum() + 1/exp.Pageviews.sum()) )
SE_pool
0.0006610608156387222
import scipy.stats as st

# two-sided test: use the (1 - alpha/2) = 0.975 quantile of the standard normal
quantile = 0.975  # (1 - 0.05/2)
z = st.norm.ppf(quantile)
z
1.959963984540054
# margin of error
margin_of_error = SE_pool * z
margin_of_error 
0.001295655390242568
# confidence interval construction
# The expected difference d between the two groups is zero. 


min_ci = 0 - margin_of_error
max_ci = 0 + margin_of_error

print('confidence interval:')
print(round(min_ci,4))
print(0)
print(round(max_ci,4))
confidence interval:
-0.0013
0
0.0013
if(d_hat > min_ci and d_hat < max_ci):
    print(f'sanity check passed: d_hat {d_hat} is inside the confidence interval')

sanity check passed: d_hat 5.662709158693602e-05 is inside the confidence interval

All sanity checks have passed successfully - we move on to measure the effect size of the experiment.

Result Analysis

Practical and Statistical Significance - Effect Size Tests

For each of your evaluation metrics, give a 95% confidence interval around the difference between the experiment and control groups. Indicate whether each metric is statistically and practically significant. (These should be the answers from the "Effect Size Tests" quiz.)

Metric: Gross conversion

# disregard the last days of the experiment, for which the 14-day trial window has not yet completed (no enrollment/payment data)
cont_non_trial = cont.loc[~cont.Enrollments.isnull()]
exp_non_trial = exp.loc[~exp.Enrollments.isnull()]

print(len(exp))
print(len(exp_non_trial))

# gross conversions per group
cont_gc = cont_non_trial.Enrollments.sum() / cont_non_trial.Clicks.sum()
exp_gc = exp_non_trial.Enrollments.sum() / exp_non_trial.Clicks.sum()

# gross conversion across both groups
pooled_gc_hat = (cont_non_trial.Enrollments.sum() + exp_non_trial.Enrollments.sum()) / (cont_non_trial.Clicks.sum() + exp_non_trial.Clicks.sum())
pooled_gc_hat
37
23
0.20860706740369866
# SE_pooled
SE_gc_pool = np.sqrt(pooled_gc_hat*(1-pooled_gc_hat) * (1/cont_non_trial.Clicks.sum() + 1/exp_non_trial.Clicks.sum()) )
SE_gc_pool
0.004371675385225936

Null Hypothesis \(H_0: d = 0 \qquad(\hat{p}_{cont} = \hat{p}_{exp})\qquad \hat{d} \sim N(0, SE_{pool})\)


d_hat = exp_gc - cont_gc
d_hat
-0.020554874580361565
# bonferroni correction
alpha = 0.05
numtests = 2 # retention metric excluded
bonferroni_individual_alpha = alpha / numtests

# critical z-value for the Bonferroni-corrected two-sided test
quantile = 1 - (bonferroni_individual_alpha / 2) # two tails
z = st.norm.ppf(quantile)

# margin of error
m = SE_gc_pool * z
m
0.00979868513264882
# confidence interval construction

min_ci = d_hat - m
max_ci = d_hat + m
print('confidence interval:')
print(round(min_ci,4))
print(round(d_hat,4))
print(round(max_ci,4))
confidence interval:
-0.0304
-0.0206
-0.0108
if(0 >= min_ci and 0 <= max_ci):
    print('Null hypothesis cannot be rejected as zero is inside the confidence interval')
else:
    print('Null hypothesis can be rejected as zero is outside the confidence interval')
    print('as per experiment design there is a significant reason to doubt the Null hypothesis and') 
    print('conclude that there is indeed a change in the data between control and treatment groups')

Null hypothesis can be rejected as zero is outside the confidence interval
as per experiment design there is a significant reason to doubt the Null hypothesis and
conclude that there is indeed a change in the data between control and treatment groups

Practical Significance Check

# Gross Conversion
dmin_gc = df.loc[df['metric'] == 'gross_conversion', 'dmin'].iloc[0]

if (abs(d_hat) > dmin_gc):
    print(f'This observed change {round(abs(d_hat),4)} is practically significant, as it is larger than {dmin_gc}, and considered worth the implementation')
else:
    print(f'This observed change {round(abs(d_hat),4)} is not practically significant, as it is smaller than {dmin_gc}, and not considered worth the implementation')
This observed change 0.0206 is practically significant, as it is larger than 0.01, and considered worth the implementation

Metric: Net conversion

# days without a completed 14-day trial window were already excluded above (cont_non_trial / exp_non_trial)
#cont_non_trial = cont.loc[~cont.Enrollments.isnull()]
#exp_non_trial = exp.loc[~exp.Enrollments.isnull()]

# net conversions per group
cont_nc = cont_non_trial.Payments.sum() / cont_non_trial.Clicks.sum()
exp_nc = exp_non_trial.Payments.sum() / exp_non_trial.Clicks.sum()

# pooled net conversion across both groups
pooled_nc_hat = (cont_non_trial.Payments.sum() + exp_non_trial.Payments.sum()) / (cont_non_trial.Clicks.sum() + exp_non_trial.Clicks.sum())
pooled_nc_hat
0.1151274853124186
# sample standard deviations
#stdev_cont = (cont_gc*(1-cont_gc))**0.5
#stdev_exp = (exp_gc*(1-exp_gc))**0.5

# SE_pooled
nc_SE_pool = np.sqrt(pooled_nc_hat*(1-pooled_nc_hat) * (1/cont_non_trial.Clicks.sum() + 1/exp_non_trial.Clicks.sum()) )
nc_SE_pool
0.0034341335129324238
\[H_0: d = 0 \qquad(\hat{p}_{cont} = \hat{p}_{exp})\qquad \hat{d} \sim N(0, SE_{pool})\]

nc_d_hat = exp_nc - cont_nc
nc_d_hat
-0.0048737226745441675
# bonferroni correction
alpha = 0.05
numtests = 2 # retention metric excluded
bonferroni_individual_alpha = alpha / numtests

# critical z-value for the Bonferroni-corrected two-sided test
quantile = 1 - (bonferroni_individual_alpha / 2) # two tails
print(quantile)
z = st.norm.ppf(quantile)
print(z)

# margin of error
m = nc_SE_pool * z
print(m)
0.9875
2.241402727604947
0.007697276222846293
# confidence interval construction

nc_min_ci = nc_d_hat - m
nc_max_ci = nc_d_hat + m
print(round(nc_min_ci,4))
print(round(nc_d_hat, 4))
print(round(nc_max_ci,4)) 
-0.0126
-0.0049
0.0028
if(0 >= nc_min_ci and 0 <= nc_max_ci):
    print('The Null hypothesis cannot be rejected as zero is inside the confidence interval,')
    print('so per the experiment design there is no significant evidence of a difference between control and treatment groups')
else:
    print('Null hypothesis can be rejected as zero is outside the confidence interval')
The Null hypothesis cannot be rejected as zero is inside the confidence interval,
so per the experiment design there is no significant evidence of a difference between control and treatment groups

The practical significance check is waived for net conversion, as the change was not statistically significant. Note, however, that the confidence interval (-0.0126, 0.0028) includes the practical significance boundary of -0.0075, so a practically relevant decrease in net conversion cannot be ruled out.
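
As an additional cross-check (not part of the original analysis), a pooled two-proportion z-test from statsmodels should lead to the same conclusions as the confidence intervals above when its p-values are compared against the Bonferroni-corrected alpha of 0.025:

from statsmodels.stats.proportion import proportions_ztest

# gross conversion: enrollments per click, experiment vs. control
z_gc, p_gc = proportions_ztest(
    count=[exp_non_trial.Enrollments.sum(), cont_non_trial.Enrollments.sum()],
    nobs=[exp_non_trial.Clicks.sum(), cont_non_trial.Clicks.sum()])

# net conversion: payments per click, experiment vs. control
z_nc, p_nc = proportions_ztest(
    count=[exp_non_trial.Payments.sum(), cont_non_trial.Payments.sum()],
    nobs=[exp_non_trial.Clicks.sum(), cont_non_trial.Clicks.sum()])

print(f'gross conversion z-test p-value: {p_gc:.4f}')
print(f'net conversion z-test p-value: {p_nc:.4f}')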

Non-parametric Sign Test

For each of your evaluation metrics, do a sign test using the day-by-day data, and report the p-value of the sign test and whether the result is statistically significant. (These should be the answers from the “Sign Tests” quiz.)

# prepare diff series

cont_gc_daily = cont_non_trial.Enrollments / cont_non_trial.Clicks
exp_gc_daily = exp_non_trial.Enrollments / exp_non_trial.Clicks
diff_hat_gc_daily = exp_gc_daily - cont_gc_daily

cont_nc_daily = cont_non_trial.Payments / cont_non_trial.Clicks
exp_nc_daily = exp_non_trial.Payments / exp_non_trial.Clicks
diff_hat_nc_daily = exp_nc_daily - cont_nc_daily
# first approach: built-in statsmodels function sign_test

from statsmodels.sandbox.descstats import sign_test

# gross conversion
print(f'gc sign test p-value: {sign_test(diff_hat_gc_daily, 0)[1]}')

# net conversion
print(f'nc sign test p-value: {sign_test(diff_hat_nc_daily, 0)[1]}')
gc sign test p-value: 0.002599477767944336
nc sign test p-value: 0.6776394844055176
# second approach: manual calculation, which also reports on how many days the experiment group was higher,
#                  instead of only returning the p-value as in the first approach

def signtest_diffseries(diff_series):
    # two-sided exact sign test against H0: P(positive difference) = 0.5
    from scipy.stats import binom
    pos_diff_count = len([k for k in diff_series if k > 0])
    timeunits = len(diff_series)

    if pos_diff_count >= timeunits / 2:
        p_value = 2 * (1 - binom.cdf(pos_diff_count - 1, timeunits, 0.5))
    else:
        p_value = 2 * binom.cdf(pos_diff_count, timeunits, 0.5)
    p_value = round(min(p_value, 1.0), 4)

    print(f"Number of positive time units: {pos_diff_count} out of a sample size of {timeunits} time units")
    if(p_value < 0.05):
        print(f"The p-value of {p_value} does indicate a significant change in the data. The Null hypothesis H0: d=0 has to be rejected")
    else:
        print(f"The p-value of {p_value} does not indicate a significant change in the data.")
    return p_value
signtest_diffseries(diff_hat_gc_daily)
Number of positive time units: 4 out of a sample size of 23 time units
The p-value of 0.0026 does indicate a significant change in the data. The Null hypothesis H0: d=0 has to be rejected
signtest_diffseries(diff_hat_nc_daily)
Number of positive time units: 10 out of a sample size of 23 time units
The p-value of 0.6776 does not indicate a significant change in the data.

Both sign test implementations confirm that the change in gross conversion is significant, while no significant change is detected for net conversion.
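
For completeness, scipy's exact binomial test (available as scipy.stats.binomtest in scipy >= 1.7) should reproduce the p-values above, since for a symmetric null of 0.5 the two-sided exact test equals the doubled cumulative probability used in the manual approach:

from scipy.stats import binomtest

# exact two-sided binomial test on the number of days the experiment group was higher
gc_days_up = (diff_hat_gc_daily > 0).sum()
nc_days_up = (diff_hat_nc_daily > 0).sum()
n_days = len(diff_hat_gc_daily)

print(f'gc binomtest p-value: {binomtest(int(gc_days_up), n_days, 0.5).pvalue:.4f}')
print(f'nc binomtest p-value: {binomtest(int(nc_days_up), n_days, 0.5).pvalue:.4f}')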

Summary

The Bonferroni correction was used to control the family-wise error rate, as both evaluation metrics are similar and share the same denominator. Sign tests were conducted to double-check the validity of the findings; both sign tests confirmed that gross conversion changed significantly while net conversion did not.

Recommendation

A significant decrease in gross conversion combined with no statistically significant change in net conversion, whose confidence interval still includes a practically relevant decrease, casts doubt on the overall benefit of the implemented change. The implementation is therefore not recommended for the time being, in favor of a new approach and a different view of the problem, which is outlined in the follow-up experiment setup below.

Follow-Up Experiment

Give a high-level description of the follow up experiment you would run, what your hypothesis would be, what metrics you would want to measure, what your unit of diversion would be, and your reasoning for these choices

The change in the current experiment was a query in a modal window shown after the “start free trial” click, which was apparently too timid a change to have a real effect. An alternative could be to track free-trial users (with user-ids as unit of diversion) in their learning progress and check whether the time they devote to the course is in line with the upfront requirement of 5 hours per week. Students trailing behind could be shown this, as text or charts, making them aware of it and letting them consider leaving the trial and instead auditing the course materials for free at their own pace, with the option to re-join the trial at any time afterwards. This could free up the company’s coaching resources more effectively than a single modal window on a “start free trial” click.

Possible invariant metrics:
- Number of enrolled free-trial user-ids
- A few of the previous invariant metrics can be kept to control what happens before enrollment

Possible evaluation metrics:
- Proportion of payments among enrolled user-ids
- Proportion of user-ids on a par with the hour requirement 5 days into the trial
- Proportion of user-ids on a par with the hour requirement 10 days into the trial
- Proportion of user-ids on a par with the hour requirement on the last day of the trial

We would expect the proposed measures to increase the proportion of on-par user-ids in the later stages of the 14-day trial.
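
A minimal sketch of how one of the proposed evaluation metrics could be computed, assuming a hypothetical tracking table with one row per enrolled user-id and their logged study hours (all column names and values here are made up for illustration):

# hypothetical per-user tracking data: user_id, group, hours logged in the first 5 days of the trial
tracking = pd.DataFrame({
    'user_id':     [1, 2, 3, 4, 5, 6],
    'group':       ['control', 'control', 'control', 'experiment', 'experiment', 'experiment'],
    'hours_day_5': [2.0, 4.5, 6.0, 3.5, 5.5, 7.0],
})

# on a par with the 5 hours/week requirement 5 days into the trial (pro-rated: 5 h/week * 5/7 days)
required_hours_day_5 = 5 * 5 / 7
tracking['on_par_day_5'] = tracking['hours_day_5'] >= required_hours_day_5

# proportion of on-par user-ids per group -- one of the proposed evaluation metrics
print(tracking.groupby('group')['on_par_day_5'].mean())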