Data Preparation

Preparing your experimental data for NannyML

What data does NannyML need in order to assess an experiment? Conceptually NannyML needs the following information:

  • The metric being experimented on. Currently it needs to be a binomial metric.

  • Whether the results refer to the control or treatment group.

  • The number of successes and failures.

How should that information be encoded for NannyML to consume it? The easiest way to explain is with an example dataset. We will use an A/B testing dataset from Kaggle. It comes from a company running A/B tests on its marketing campaign with the goal of improving website sales. Let's have a quick preview of the dataset.

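| Campaign Name    | Date      | Spend [USD] | # of Impressions | Reach  | # of Website Clicks | # of Searches | # of View Content | # of Add to Cart | # of Purchase |
|------------------|-----------|-------------|------------------|--------|---------------------|---------------|-------------------|------------------|---------------|
| Control Campaign | 1.08.2019 | 2280        | 82702            | 56930  | 7016                | 2290          | 2159              | 1819             | 618           |
| Control Campaign | 2.08.2019 | 1757        | 121040           | 102513 | 8110                | 2033          | 1841              | 1219             | 511           |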

NannyML's A/B testing module supports A/B tests on binomial variables. Hence, from this dataset we will focus on how effectively content views are converted into purchases.

Here's how we preprocess the dataset so it can be used in NannyML's Experiments Module.

import pandas as pd
control = pd.read_csv("control_group.csv", sep = ";")
test = pd.read_csv("test_group.csv", sep = ";")
# let's measure the probability of buying after our content has been viewed
selected_cols = ['Campaign Name', '# of View Content', '# of Purchase']
experiment = pd.concat([control[selected_cols], test[selected_cols]], ignore_index=True)
# let's remove missing values.
experiment = experiment.loc[experiment[selected_cols[-1]].notna()]
# Preprocess to make data comply with NannyML requirements
experiment['variable'] = 'Purchases from Views'
experiment = experiment.rename(columns={"# of Purchase": "success_count",})
experiment['fail_count'] = experiment['# of View Content'] - experiment['success_count']
experiment = experiment.drop('# of View Content', axis=1)
# Campaign name values must be control and treatment
experiment = experiment.replace({
    'Control Campaign': 'control',
    'Test Campaign': 'treatment'
})
# shuffling and splitting for demonstration purposes only, final results are the same
experiment = experiment.sample(frac=1, random_state=13).reset_index(drop=True)
experiment[:10].to_parquet('ab_test1.pq', index=False)
experiment[10:].to_parquet('ab_test2.pq', index=False)
experiment.head(3)
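| Campaign Name | variable             | success_count | fail_count |
|---------------|----------------------|---------------|------------|
| treatment     | Purchases from Views | 668           | 1949       |
| control       | Purchases from Views | 764           | 485        |
| treatment     | Purchases from Views | 677           | 871        |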

You can see that NannyML needs 4 columns.

  • The first column informs us whether the relevant data are for the control or the treatment group. Those values are necessary for NannyML to understand how to process the data.

  • The second column tells us for which variable we are getting results.

  • The third and fourth columns contain the number of successes and failures respectively.
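
Before exporting, it can help to sanity-check a DataFrame against the four-column layout described above. The following is a minimal sketch, not part of NannyML: the helper name check_experiment_frame and the rules it applies (required columns, group labels limited to control and treatment, no missing or negative counts) are assumptions drawn from this page. The sample rows are the three rows of the preview.

```python
import pandas as pd

# Hypothetical helper (not a NannyML function): checks the layout described on this page.
def check_experiment_frame(df, group_col="Campaign Name"):
    """Return a list of problems found; an empty list means the frame looks fine."""
    required = [group_col, "variable", "success_count", "fail_count"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    # Group labels are expected to be exactly 'control' or 'treatment'.
    bad_groups = set(df[group_col].unique()) - {"control", "treatment"}
    if bad_groups:
        problems.append(f"unexpected group labels: {sorted(map(str, bad_groups))}")

    # Counts should be present and non-negative (an assumed sanity rule).
    for col in ("success_count", "fail_count"):
        if df[col].isna().any():
            problems.append(f"{col} contains missing values")
        elif (df[col] < 0).any():
            problems.append(f"{col} contains negative values")
    return problems

# The three rows shown in the preview.
sample = pd.DataFrame({
    "Campaign Name": ["treatment", "control", "treatment"],
    "variable": ["Purchases from Views"] * 3,
    "success_count": [668, 764, 677],
    "fail_count": [1949, 485, 871],
})
print(check_experiment_frame(sample))
```

An empty list means no problems were detected. A frame that still carries the original Control Campaign label would be reported as having an unexpected group label.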

Let's go through the preparation steps to explain the reasoning behind them.

  • We select the columns containing the information we need.

  • We concatenate the data for the control and treatment campaigns and remove rows where the # of Purchase value is missing.

  • We add a column containing the name of the variable we are measuring. In this case we call it Purchases from Views.

  • We rename the # of Purchase column to success_count, since this is the number of purchases made after people saw our content.

  • We calculate the difference between # of View Content and success_count, because this is the number of people who viewed our content but did not make a purchase. We name this column fail_count.

  • We then drop # of View Content column because we don't need it.

  • Lastly, we shuffle the dataset and split it in two. This is just for demonstration purposes. The results would be the same even if we didn't perform this step.
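
The claim that shuffling and splitting do not affect the outcome can be illustrated with a small sketch. It assumes that what matters for the experiment is the total number of successes and failures per group, and it uses only the three preview rows shown earlier rather than the full dataset. The totals helper is hypothetical.

```python
import pandas as pd

# The three rows shown in the preview (not the full dataset).
experiment = pd.DataFrame({
    "Campaign Name": ["treatment", "control", "treatment"],
    "variable": ["Purchases from Views"] * 3,
    "success_count": [668, 764, 677],
    "fail_count": [1949, 485, 871],
})

# Hypothetical helper: total successes and failures per group.
def totals(df):
    return df.groupby("Campaign Name")[["success_count", "fail_count"]].sum()

# Shuffle, split in two, then put the parts back together.
shuffled = experiment.sample(frac=1, random_state=13).reset_index(drop=True)
part1, part2 = shuffled[:2], shuffled[2:]
recombined = pd.concat([part1, part2], ignore_index=True)

# Per-group totals are identical before and after shuffling and splitting.
print(totals(experiment).equals(totals(recombined)))
print(totals(recombined))
```

Row order and file boundaries only change how the counts are delivered, not the counts themselves.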

We recommend storing your data as parquet files.

NannyML Cloud supports both parquet and CSV files. However, CSV files don't store data type information, so incorrect data types may be inferred when they are read. If you later add more data to the model using the SDK or in parquet format, a data type conflict may occur.
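
The data type issue can be reproduced with pandas alone. The sketch below uses illustrative values and hypothetical column names (campaign_code, date) that are not part of the Kaggle dataset. It writes a small DataFrame to an in-memory CSV and reads it back with default settings.

```python
import io
import pandas as pd

# Illustrative frame: a string column with leading zeros and a datetime column.
original = pd.DataFrame({
    "campaign_code": ["007", "042"],
    "date": pd.to_datetime(["2019-08-01", "2019-08-02"]),
})

# Round trip through CSV in memory, with default read settings.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The string codes come back as integers and the dates are no longer datetimes.
print(original.dtypes)
print(restored.dtypes)
```

A parquet file stores the column types alongside the data, which is why it avoids this kind of mismatch.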

You can see how to use this data in the tutorial on Running an A/B test.
