Table of Contents
- Part I – Probability
- Part II – A/B Test
- Part III – Regression
Introduction
A/B tests are very commonly performed by data analysts and data scientists. For this project, we will be working to understand the results of an A/B test run by an e-commerce website. Our goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.
Part I – Probability
To get started, let’s import our libraries.
In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline

random.seed(42)
Question 1. Now, read in the ab_data.csv data. Store it in df.
a. Read in the dataset and take a look at the top few rows here:
In [2]:
df = pd.read_csv('ab_data.csv')
df.head()
Out[2]:
| | user_id | timestamp | group | landing_page | converted |
---|---|---|---|---|---|
0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |
b. Use the below cell to find the number of rows in the dataset.
In [3]:
df.shape[0]
Out[3]:
294478
c. The number of unique users in the dataset.
In [4]:
df.user_id.nunique()
Out[4]:
290584
d. The proportion of users converted.
In [5]:
df.converted.mean()*100
Out[5]:
11.96591935560551
e. The number of times the new_page and treatment don't line up.
The new_page and treatment values don't line up whenever a row pairs one of them with the other group's page. First, let's check the unique values of each column:
In [6]:
df.group.unique(), df.landing_page.unique()
Out[6]:
(array(['control', 'treatment'], dtype=object), array(['old_page', 'new_page'], dtype=object))
Accordingly, there are two unique values for group (treatment and control) and two for landing_page (old_page and new_page). Hence, we need to count the rows where treatment was paired with old_page or control was paired with new_page:
In [7]:
treat_old = df.query("group == 'treatment' and landing_page == 'old_page'").shape[0]
control_new = df.query("group == 'control' and landing_page == 'new_page'").shape[0]
misalignment = treat_old + control_new
misalignment
Out[7]:
3893
f. Do any of the rows have missing values?
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB
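The info() output shows 294478 non-null entries in every column, so there are no missing values. As a more direct cross-check, a minimal sketch that counts nulls per column (all zeros are expected):

# Count missing values in each column of df
df.isnull().sum()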
Question 2. For the rows where treatment is not aligned with new_page or control is not aligned with old_page, we cannot be sure if this row truly received the new or old page.
a. Using the answer from the previous exercise, we will create a new dataset that meets these specifications, then store the new dataframe in df2.
In [9]:
df.drop(df.query("group == 'treatment' and landing_page == 'old_page'").index, inplace=True)
df.drop(df.query("group == 'control' and landing_page == 'new_page'").index, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 290585 entries, 0 to 294477
Data columns (total 5 columns):
user_id         290585 non-null int64
timestamp       290585 non-null object
group           290585 non-null object
landing_page    290585 non-null object
converted       290585 non-null int64
dtypes: int64(2), object(3)
memory usage: 13.3+ MB
In [10]:
df.to_csv('ab_data2.csv', index=False)
In [11]:
df2 = pd.read_csv('ab_data2.csv')
We can confirm that all the mismatched rows were removed if the following returns 0:
In [12]:
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]
Out[12]:
0
Question 3. Use df2 and the cells below to answer questions for the classroom.
a. How many unique user_ids are in df2?
In [13]:
df2.user_id.nunique()
Out[13]:
290584
b. There is one user_id repeated in df2. What is it?
In [14]:
df2[df2.user_id.duplicated(keep=False)].user_id
Out[14]:
1876    773192
2862    773192
Name: user_id, dtype: int64
c. What is the row information for the repeat user_id?
In [15]:
df2[df2.user_id.duplicated(keep=False)]
Out[15]:
| | user_id | timestamp | group | landing_page | converted |
---|---|---|---|---|---|
1876 | 773192 | 2017-01-09 05:37:58.781806 | treatment | new_page | 0 |
2862 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |
d. Remove one of the rows with a duplicate user_id, but keep your dataframe as df2.
In [16]:
df2.drop_duplicates('user_id', inplace=True)
df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 290584 entries, 0 to 290584
Data columns (total 5 columns):
user_id         290584 non-null int64
timestamp       290584 non-null object
group           290584 non-null object
landing_page    290584 non-null object
converted       290584 non-null int64
dtypes: int64(2), object(3)
memory usage: 13.3+ MB
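As a quick sanity check, a one-line sketch confirming that no duplicated user_id remains (0 is expected):

# Number of user_ids that still appear more than once after the drop
df2.user_id.duplicated().sum()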
Question 4. Use df2 in the cells below to answer questions related to the classroom.
a. What is the probability of an individual converting regardless of the page they receive?
In [17]:
df2.converted.mean()
Out[17]:
0.11959708724499628
b. Given that an individual was in the control group, what is the probability they converted?
In [18]:
df2.query("group == 'control'").converted.mean()
Out[18]:
0.1203863045004612
c. Given that an individual was in the treatment group, what is the probability they converted?
In [19]:
df2.query("group == 'treatment'").converted.mean()
Out[19]:
0.11880806551510564
d. What is the probability that an individual received the new page?
In [20]:
df2.query("landing_page == 'new_page'").shape[0] / df2.landing_page.shape[0]
Out[20]:
0.5000619442226688
e. Use the results in the previous two portions of this question to suggest if you think there is evidence that one page leads to more conversions?
From the above, the conversion rates of the two groups are almost identical, at roughly 12% each. Hence, there is no concrete evidence that either page necessarily leads to more conversions.
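For a side-by-side view of the two rates, a small sketch that groups df2 by group (rates is a hypothetical variable name):

# Conversion rate per group, and the observed treatment-minus-control gap
rates = df2.groupby('group')['converted'].mean()
rates, rates['treatment'] - rates['control']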
Part II – A/B Test
Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each observation was observed.
However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time? How long do you run to render a decision that neither page is better than another?
These questions are the difficult parts associated with A/B tests in general.
Question 1. For now, consider you need to make the decision just based on all the data provided. If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be?
$$H_0: p_{old} - p_{new} \geq 0$$
$$H_1: p_{old} - p_{new} < 0$$
$p_{old}$ and $p_{new}$ are the conversion rates for the old and new pages, respectively.
Question 2. Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have “true” success rates equal to the converted success rate regardless of page – that is, $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the converted rate in ab_data.csv regardless of the page.
Use a sample size for each page equal to the ones in ab_data.csv.
Build the sampling distribution for the difference in converted between the two pages over 10,000 iterations, calculating an estimate from the null in each iteration.
Use the cells below to provide the necessary parts of this simulation.
a. What is the convert rate for $p_{new}$ under the null?
In [21]:
p_new = df2.converted.mean()
p_new
Out[21]:
0.11959708724499628
b. What is the convert rate for $p_{old}$ under the null?
In [22]:
p_old = df2.converted.mean()
p_old
Out[22]:
0.11959708724499628
c. What is $n_{new}$?
In [23]:
n_new = df2.query("group == 'treatment'").shape[0]
n_new
Out[23]:
145310
d. What is $n_{old}$?
In [24]:
n_old = df2.query("group == 'control'").shape[0]
n_old
Out[24]:
145274
e. Simulate $n_{new}$ transactions with a convert rate of $p_{new}$ under the null. Store these $n_{new}$ 1's and 0's in new_page_converted.
In [25]:
# p gives the probabilities of drawing [0, 1]; the conversion rate p_new is the probability of a 1
new_page_converted = np.random.choice([0, 1], size=n_new, p=[1 - p_new, p_new])
f. Simulate $n_{old}$ transactions with a convert rate of $p_{old}$ under the null. Store these $n_{old}$ 1's and 0's in old_page_converted.
In [26]:
# As above: probability 1 - p_old of a 0 and p_old of a 1
old_page_converted = np.random.choice([0, 1], size=n_old, p=[1 - p_old, p_old])
g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f).
In [27]:
p_diff = new_page_converted.mean() - old_page_converted.mean()
p_diff
Out[27]:
0.001536831987102194
Subtracting the two simulated arrays element-wise raises an error because their sizes differ ($n_{new} \neq n_{old}$); hence, the difference of their means is calculated instead.
h. Simulate 10,000 $p_{new}$ - $p_{old}$ values using this same process similarly to the one you calculated in parts a. through g. above. Store all 10,000 values in p_diffs.
In [28]:
p_diffs = []
for _ in range(10000):
    new_page_converted = np.random.choice([0, 1], size=n_new, p=[1 - p_new, p_new]).mean()
    old_page_converted = np.random.choice([0, 1], size=n_old, p=[1 - p_old, p_old]).mean()
    p_diffs.append(new_page_converted - old_page_converted)
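The loop above works, but as a sketch of a faster alternative, the same null distribution could be drawn in a single step with NumPy's binomial sampler (new_rates, old_rates and p_diffs_alt are hypothetical names):

# Draw 10,000 simulated conversion counts per page and convert them to rates
new_rates = np.random.binomial(n_new, p_new, 10000) / n_new
old_rates = np.random.binomial(n_old, p_old, 10000) / n_old
p_diffs_alt = new_rates - old_rates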
i. Plot a histogram of the p_diffs. Does this plot look like what you expected? Use the matching problem in the classroom to assure you fully understand what was computed here.
In [29]:
plt.hist(p_diffs);
plt.ylabel('# of Simulations')
plt.xlabel('p_diffs')
plt.title('Plot of 10,000 Simulated p_diffs');

j. What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?
First, we need to convert p_diffs into a NumPy array:
In [30]:
p_diffs = np.array(p_diffs)
p_diffs
Out[30]:
array([-0.00028686, 0.00142686, 0.00234903, ..., 0.00115874, 0.0010417 , -0.00054158])
Next, we compute the actual difference observed in the dataset:
In [31]:
act_diffs = df2.query('group == "treatment"').converted.mean() - df2.query('group == "control"').converted.mean()
act_diffs
Out[31]:
-0.0015782389853555567
Finally, we compute the proportion of p_diffs greater than act_diffs:
In [32]:
(p_diffs > act_diffs).mean()
Out[32]:
0.9003
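To see where the observed difference falls within the simulated null distribution, a small sketch that redraws the histogram with a vertical line at act_diffs:

# Null distribution of differences, with the observed difference marked in red
plt.hist(p_diffs)
plt.axvline(act_diffs, color='red', linewidth=2)
plt.xlabel('p_diffs')
plt.ylabel('# of Simulations');

The proportion of simulated differences to the right of that line is the 0.9003 computed above.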
k. In words, explain what you just computed in part j. What is this value called in scientific studies? What does this value mean in terms of whether or not there is a difference between the new and old pages?
In the previous part, we calculated the p-value, which is the probability of observing our statistic (or one more extreme) if the null hypothesis is true.
A large p-value indicates that the observed statistic is consistent with the null hypothesis; hence, there is no statistical evidence to reject the null hypothesis, which states that the old page converts at the same rate as, or better than, the new page.
l. We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Let n_old and n_new refer to the number of rows associated with the old page and new page, respectively.
In [33]:
import statsmodels.api as sm

convert_old = df2.query('group == "control"').converted.sum()
convert_new = df2.query('group == "treatment"').converted.sum()
n_old = df2.query("landing_page == 'old_page'").shape[0]
n_new = df2.query("landing_page == 'new_page'").shape[0]
m. Now use stats.proportions_ztest to compute your test statistic and p-value.
In [34]:
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
z_score, p_value
Out[34]:
(1.3109241984234394, 0.9050583127590245)
Next, we import norm from scipy.stats to assess the significance of our z-score:
In [35]:
from scipy.stats import norm

norm.cdf(z_score)
Out[35]:
0.9050583127590245
Next, we check the critical value for a 95% confidence level:
In [36]:
norm.ppf(1-(0.05/2))
Out[36]:
1.959963984540054
n. What do the z-score and p-value computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts j. and k.?
The results above give a z-score of 1.31. Since this value does not exceed the critical value of 1.96, there is no statistical evidence to reject the null hypothesis. Furthermore, the p-value (0.905) is very close to the one obtained in parts j. and k. (0.9003), which likewise fails to reject the null hypothesis.
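For intuition, a minimal sketch of how the pooled two-proportion z-statistic behind proportions_ztest could be reproduced by hand, reusing convert_old, convert_new, n_old and n_new from above (p_pooled and se_pooled are hypothetical names):

# Sample conversion rates for each page
p_old_hat = convert_old / n_old
p_new_hat = convert_new / n_new
# Pooled conversion rate under the null hypothesis
p_pooled = (convert_old + convert_new) / (n_old + n_new)
# Standard error of the difference between the two sample proportions
se_pooled = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_old + 1 / n_new))
# z-statistic for (old rate - new rate), the same ordering passed to proportions_ztest
(p_old_hat - p_new_hat) / se_pooled

This should land close to the 1.31 reported by the built-in.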
Part III – A regression approach
- In the final part, you will see that the result achieved in the previous A/B test can also be achieved by performing regression.
a. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?
We are studying rows that are either conversions or non-conversions, so the model must predict a probability between 0 and 1. Accordingly, logistic regression may be used.
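For intuition, logistic regression models the log-odds of conversion as a linear function of the predictors, and the sigmoid (inverse-logit) function maps that linear combination back to a probability between 0 and 1; a minimal sketch (sigmoid is a hypothetical helper):

def sigmoid(x):
    # Maps any real-valued log-odds x to a probability strictly between 0 and 1
    return 1 / (1 + np.exp(-x))

sigmoid(np.array([-2.0, 0.0, 2.0]))  # roughly 0.12, 0.50, 0.88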
b. The goal is to use statsmodels to fit the regression model you specified in part a. to see if there is a significant difference in conversion based on which page a customer receives. However, first we need to create a column for the intercept, and create a dummy variable column for which page each user received.
We will add an intercept column, as well as an ab_page column, which is 1 when an individual receives the treatment and 0 if control.
In [37]:
df2['intercept'] = 1
df2[['ab_page2', 'ab_page']] = pd.get_dummies(df2['group'])
df2 = df2.drop('ab_page2', axis=1)
df2.head()
Out[37]:
| | user_id | timestamp | group | landing_page | converted | intercept | ab_page |
---|---|---|---|---|---|---|---|
0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 | 1 | 0 |
1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 | 1 | 0 |
2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 | 1 | 1 |
3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 | 1 | 1 |
4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 | 1 | 0 |
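An equivalent, slightly more direct way to build the treatment dummy, shown only as a sketch (ab_page_alt is a hypothetical name), is to compare the group column to 'treatment':

# 1 for treatment rows, 0 for control rows; should match the ab_page column above
ab_page_alt = (df2['group'] == 'treatment').astype(int)
(ab_page_alt == df2['ab_page']).all()  # expect True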
c. Use statsmodels to import the regression model. Instantiate the model, and fit the model using the two columns created in part b. to predict whether or not an individual converts.
In [38]:
log_mod = sm.Logit(df2['converted'], df2[['intercept', 'ab_page']])
d. Provide the summary of your model below, and use it as necessary to answer the following questions.
In [39]:
results = log_mod.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.366118
         Iterations 6
Out[39]:
Dep. Variable: | converted | No. Observations: | 290584 |
---|---|---|---|
Model: | Logit | Df Residuals: | 290582 |
Method: | MLE | Df Model: | 1 |
Date: | Sat, 24 Aug 2019 | Pseudo R-squ.: | 8.077e-06 |
Time: | 18:26:59 | Log-Likelihood: | -1.0639e+05 |
converged: | True | LL-Null: | -1.0639e+05 |
Covariance Type: | nonrobust | LLR p-value: | 0.1899 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
intercept | -1.9888 | 0.008 | -246.669 | 0.000 | -2.005 | -1.973 |
ab_page | -0.0150 | 0.011 | -1.311 | 0.190 | -0.037 | 0.007 |
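Since logistic-regression coefficients are on the log-odds scale, exponentiating them gives odds ratios that are easier to read; a minimal sketch using the fitted results object above:

# exp(-0.0150) is about 0.985, i.e. roughly 1.5% lower conversion odds
# for the treatment (new page) group relative to control
np.exp(results.params)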
e. What is the p-value associated with ab_page? Why does it differ from the value you found in Part II?
The p-value associated with ab_page is 0.19, which is considerably lower than the value of approximately 0.9 found in Part II. The reason for the difference is that the null and alternative hypotheses differ between the two exercises.

Part II (one-sided test):
$$H_0: p_{old} - p_{new} \geq 0$$
$$H_1: p_{old} - p_{new} < 0$$

Regression (two-sided test):
$$H_0: p_{old} = p_{new}$$
$$H_1: p_{old} \neq p_{new}$$

$p_{old}$ and $p_{new}$ are the conversion rates for the old and new pages, respectively.
Because the regression tests a two-sided alternative, its p-value counts deviations in either direction, whereas the one-sided test in Part II only counts evidence that the new page is better; this is why the two values differ.
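As a rough numeric check of that relationship, a one-line sketch reusing the p_value from the z-test in Part II; a two-sided p-value is approximately twice the smaller one-sided tail:

# 2 * min(0.905, 1 - 0.905) is about 0.19, matching the regression output
2 * min(p_value, 1 - p_value)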
f. Now, we shall consider other factors that might influence whether or not an individual converts. The points below discuss why it is a good idea to consider adding other factors to the regression model, along with the disadvantages of adding additional terms:
- Allows for a more sophisticated model to distinguish other factors which may contribute to the outcome.
- May be used to identify outliers.
- May produce inaccurate or unstable results when the added terms are highly correlated with existing predictors or with each other (multicollinearity).
g. Now, along with testing whether the conversion rate changes for different pages, we will also add an effect based on which country a user lives in. First, we need to read in the countries.csv dataset and merge the two datasets on the appropriate rows.
Does it appear that country had an impact on conversion? We will provide the statistical output as well as a written response to answer this question.
In [40]:
countries_df = pd.read_csv('./countries.csv')
df_new = countries_df.set_index('user_id').join(df2.set_index('user_id'), how='inner')
df_new.head()
Out[40]:
| | country | timestamp | group | landing_page | converted | intercept | ab_page |
---|---|---|---|---|---|---|---|
user_id | |||||||
834778 | UK | 2017-01-14 23:08:43.304998 | control | old_page | 0 | 1 | 0 |
928468 | US | 2017-01-23 14:44:16.387854 | treatment | new_page | 0 | 1 | 1 |
822059 | UK | 2017-01-16 14:04:14.719771 | treatment | new_page | 1 | 1 | 1 |
711597 | UK | 2017-01-22 03:14:24.763511 | control | old_page | 0 | 1 | 0 |
710616 | UK | 2017-01-16 13:14:44.000513 | treatment | new_page | 0 | 1 | 1 |
Check the unique values in the country column:
In [41]:
df_new.country.unique()
Out[41]:
array(['UK', 'US', 'CA'], dtype=object)
Since country has three unique values, we only need to include two dummy columns (the omitted CA level serves as the baseline):
In [42]:
df_new[['UK', 'US']] = pd.get_dummies(df_new['country'])[['UK', 'US']]
df_new.head()
Out[42]:
| | country | timestamp | group | landing_page | converted | intercept | ab_page | UK | US |
---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||
834778 | UK | 2017-01-14 23:08:43.304998 | control | old_page | 0 | 1 | 0 | 1 | 0 |
928468 | US | 2017-01-23 14:44:16.387854 | treatment | new_page | 0 | 1 | 1 | 0 | 1 |
822059 | UK | 2017-01-16 14:04:14.719771 | treatment | new_page | 1 | 1 | 1 | 1 | 0 |
711597 | UK | 2017-01-22 03:14:24.763511 | control | old_page | 0 | 1 | 0 | 1 | 0 |
710616 | UK | 2017-01-16 13:14:44.000513 | treatment | new_page | 0 | 1 | 1 | 1 | 0 |
Computing the statistical output:
In [43]:
log_mod = sm.Logit(df_new['converted'], df_new[['intercept', 'UK', 'US']])
In [44]:
results = log_mod.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.366116
         Iterations 6
Out[44]:
Dep. Variable: | converted | No. Observations: | 290584 |
---|---|---|---|
Model: | Logit | Df Residuals: | 290581 |
Method: | MLE | Df Model: | 2 |
Date: | Sat, 24 Aug 2019 | Pseudo R-squ.: | 1.521e-05 |
Time: | 18:27:00 | Log-Likelihood: | -1.0639e+05 |
converged: | True | LL-Null: | -1.0639e+05 |
Covariance Type: | nonrobust | LLR p-value: | 0.1984 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
intercept | -2.0375 | 0.026 | -78.364 | 0.000 | -2.088 | -1.987 |
UK | 0.0507 | 0.028 | 1.786 | 0.074 | -0.005 | 0.106 |
US | 0.0408 | 0.027 | 1.518 | 0.129 | -0.012 | 0.093 |
According to the statistical output, the p-values for both country dummies are larger than 0.05; hence, there is no statistical evidence that country has a significant impact on conversion.
h. Though we have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion.
We will create the necessary additional columns, fit the new model, and then provide the summary results and conclusions.
The ab_page column was already created in part b.; hence, we can fit a model similar to the previous one while also including that column:
In [45]:
log_mod = sm.Logit(df_new['converted'], df_new[['intercept', 'UK', 'US', 'ab_page']])
In [46]:
results = log_mod.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6
Out[46]:
Dep. Variable: | converted | No. Observations: | 290584 |
---|---|---|---|
Model: | Logit | Df Residuals: | 290580 |
Method: | MLE | Df Model: | 3 |
Date: | Sat, 24 Aug 2019 | Pseudo R-squ.: | 2.323e-05 |
Time: | 18:27:01 | Log-Likelihood: | -1.0639e+05 |
converged: | True | LL-Null: | -1.0639e+05 |
Covariance Type: | nonrobust | LLR p-value: | 0.1760 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
intercept | -2.0300 | 0.027 | -76.249 | 0.000 | -2.082 | -1.978 |
UK | 0.0506 | 0.028 | 1.784 | 0.074 | -0.005 | 0.106 |
US | 0.0408 | 0.027 | 1.516 | 0.130 | -0.012 | 0.093 |
ab_page | -0.0149 | 0.011 | -1.307 | 0.191 | -0.037 | 0.007 |
According to the results above, even after adding ab_page alongside the country terms, there is no statistical evidence of an impact on conversion, since all p-values exceed 0.05.
In [47]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Project_2_-_Analyze_Experiment_Results.ipynb'])
Out[47]:
4294967295
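A closing note on part h: the model above includes page and country only as additive terms. If one wanted to test a literal page-by-country interaction, explicit product columns could be added and the model refit; a minimal sketch (UK_page, US_page and interaction_mod are hypothetical names):

# Hypothetical interaction columns: country dummy multiplied by the page indicator
df_new['UK_page'] = df_new['UK'] * df_new['ab_page']
df_new['US_page'] = df_new['US'] * df_new['ab_page']

interaction_mod = sm.Logit(df_new['converted'],
                           df_new[['intercept', 'ab_page', 'UK', 'US', 'UK_page', 'US_page']])
interaction_mod.fit().summary()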