Data Preparation

Preparing your experimental data for NannyML

What data does NannyML need in order to assess an experiment? Conceptually, NannyML needs the following information:

The metric being experimented on. Currently it needs to be a binomial metric.
Whether the results refer to the control or treatment group.
The number of successes and failures.

How should that information be encoded for NannyML to consume it? The easiest way is to show an example dataset. We will use an A/B testing dataset from Kaggle. It comes from a company running an A/B test on its marketing campaign with the goal of improving website sales. Let's have a quick preview of the dataset.

Campaign Name	Date	Spend [USD]	# of Impressions	Reach	# of Website Clicks	# of Searches	# of View Content	# of Add to Cart	# of Purchase
Control Campaign	1.08.2019	2280	82702	56930	7016	2290	2159	1819	618
Control Campaign	2.08.2019	1757	121040	102513	8110	2033	1841	1219	511

NannyML's A/B testing module allows running A/B tests for binomial variables. Hence, from this dataset we will focus on how effectively content views are converted into purchases.
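To make the binomial framing concrete, here is a minimal sketch using the two control rows from the preview above. It treats each content view as a trial that either ends in a purchase (a success) or does not (a failure). The helper name `split_trials` is hypothetical and is not part of NannyML.

```python
# Illustrative only: `split_trials` is a hypothetical helper, not a NannyML function.
def split_trials(views: int, purchases: int) -> tuple[int, int, float]:
    """Return (successes, failures, success rate) for one row of the dataset."""
    failures = views - purchases
    return purchases, failures, purchases / views


# The two control rows shown in the preview: (# of View Content, # of Purchase)
preview_rows = [(2159, 618), (1841, 511)]
results = [split_trials(views, purchases) for views, purchases in preview_rows]

for successes, failures, rate in results:
    print(successes, failures, round(rate, 3))
```

The success and failure counts computed this way are the quantities the preprocessing step stores in the success_count and fail_count columns.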

Here's how we preprocess the dataset so it can be used in NannyML's Experiments Module.

import pandas as pd
control = pd.read_csv("control_group.csv", sep=";")
test = pd.read_csv("test_group.csv", sep=";")
# let's measure the probability of buying after our content has been viewed
selected_cols = ['Campaign Name', '# of View Content', '# of Purchase']
experiment = pd.concat([control[selected_cols], test[selected_cols]], ignore_index=True)
# let's remove missing values.
experiment = experiment.loc[experiment[selected_cols[-1]].notna()]
# Preprocess to make data comply with NannyML requirements
experiment['variable'] = 'Purchases from Views'
experiment = experiment.rename(columns={"# of Purchase": "success_count",})
experiment['fail_count'] = experiment['# of View Content'] - experiment['success_count']
experiment = experiment.drop('# of View Content', axis=1)
# Campaign name values must be control and treatment
experiment = experiment.replace({
    'Control Campaign': 'control',
    'Test Campaign': 'treatment'
})
# shuffling and splitting for demonstration purposes only, final results are the same
experiment = experiment.sample(frac=1, random_state=13).reset_index(drop=True)
experiment[:10].to_parquet('ab_test1.pq', index=False)
experiment[10:].to_parquet('ab_test2.pq', index=False)
experiment.head(3)

Campaign Name	variable	success_count	fail_count
treatment	Purchases from Views	668	1949
control	Purchases from Views	764	485
treatment	Purchases from Views	677	871

You can see that NannyML needs 4 columns.

The first column informs us whether the relevant data are for the control or the treatment group. Those values are necessary for NannyML to understand how to process the data.
The second column tells us for which variable we are getting results.
The third and fourth columns contain the number of successes and failures, respectively.
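Before saving a file, it can help to check that a dataframe follows this layout. The following is a minimal sketch of such a check; `check_experiment_frame` is a hypothetical helper written for this page, not a NannyML function, and it assumes the group column keeps the name Campaign Name as in the example above.

```python
import pandas as pd


# Hypothetical helper, not part of NannyML: checks the four-column layout described above.
def check_experiment_frame(df: pd.DataFrame, group_col: str = "Campaign Name") -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    required = [group_col, "variable", "success_count", "fail_count"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    # Group labels must be exactly 'control' or 'treatment'
    unexpected = set(df[group_col].unique()) - {"control", "treatment"}
    if unexpected:
        problems.append(f"unexpected group labels: {sorted(map(str, unexpected))}")
    # Counts must be present and non-negative
    for col in ("success_count", "fail_count"):
        if df[col].isna().any():
            problems.append(f"{col} contains missing values")
        elif (df[col] < 0).any():
            problems.append(f"{col} contains negative counts")
    return problems


# The three rows shown in the table above
sample = pd.DataFrame({
    "Campaign Name": ["treatment", "control", "treatment"],
    "variable": ["Purchases from Views"] * 3,
    "success_count": [668, 764, 677],
    "fail_count": [1949, 485, 871],
})
problems = check_experiment_frame(sample)
print(problems)
```

Returning a list of problems rather than raising on the first one makes it easier to fix several issues in a single pass.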

Let's go through the preparation steps to explain the reasoning behind them.

We select the columns containing the information we need.
We concatenate the data for the control and treatment campaigns and remove rows where the purchase count is missing.
We add a column containing the name of the variable we are measuring. In this case we call it Purchases from Views.
We rename the # of Purchase column to success_count, since this is the number of purchases made after people saw our content.
We calculate the difference between # of View Content and success_count, because this is the number of people who viewed our content but did not make a purchase. We name this column fail_count.
We then drop the # of View Content column because we no longer need it.
Lastly, we shuffle the dataset and split it in two. This is just for demonstration purposes. The results would be the same even if we didn't perform this step.
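One way to see why shuffling and splitting does not matter, assuming the outcome depends only on the per-group totals of successes and failures, is to compare those totals before and after. The sketch below uses toy values that are not taken from the Kaggle dataset, and `group_totals` is a hypothetical helper.

```python
import pandas as pd

# Toy values for illustration only; they are not taken from the Kaggle dataset.
experiment = pd.DataFrame({
    "Campaign Name": ["control", "treatment", "control", "treatment", "control", "treatment"],
    "success_count": [10, 12, 8, 15, 9, 11],
    "fail_count": [90, 88, 92, 85, 91, 89],
})


def group_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Sum successes and failures per group (hypothetical helper)."""
    return df.groupby("Campaign Name")[["success_count", "fail_count"]].sum()


# Shuffle and split in two, mirroring the preprocessing step
shuffled = experiment.sample(frac=1, random_state=13).reset_index(drop=True)
part1, part2 = shuffled[:3], shuffled[3:]

# Putting the two parts back together yields the same per-group totals
recombined = pd.concat([part1, part2], ignore_index=True)
totals_original = group_totals(experiment)
totals_recombined = group_totals(recombined)
print(totals_original.equals(totals_recombined))
```

Row order and file boundaries change, but the summed counts for control and treatment do not.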

We recommend storing your data as parquet files.

NannyML Cloud supports both parquet and CSV files, but CSV files don't store data type information. CSV files may cause incorrect data types to be inferred. If you later add more data to the model using the SDK or using parquet format, a data type conflict may occur.
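The data type issue can be reproduced with pandas alone. The sketch below writes a small frame to CSV in memory and reads it back; the date and count values echo the preview above, while the `campaign_id` column is a hypothetical addition used only to show how string values can be re-inferred as numbers.

```python
import io

import pandas as pd

frame = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-01", "2019-08-02"]),
    "campaign_id": ["007", "008"],  # hypothetical string identifiers
    "success_count": [618, 511],
})

# Round-trip through CSV using an in-memory buffer instead of a file
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

# Compare the data types before and after the round trip
print(frame.dtypes)
print(roundtrip.dtypes)
```

After the round trip, the date column is no longer a datetime and the identifiers are read back as integers, which drops their leading zeros. A parquet file stores the column types alongside the data, so this kind of re-inference does not happen.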

You can see how to use this data in the tutorial for Running an A/B test.
