Application of Bootstrapping in Advertisement Click Prediction

Yangyin Ke
Nov 20, 2020 · 8 min read

Bootstrap is a popular computer-intensive method for estimating the distribution of a test statistic. With the distribution constructed by bootstrapping, we can conveniently explore the relevant quantiles and probabilities, and therefore test certain statements without spending the time and resources needed to collect numerous samples. Bootstrapping plays a significant role in hypothesis testing, where it estimates the distribution of the test statistic in the population. This article discusses its application to advertisement click prediction, which is largely built on its use in hypothesis testing. In most cases, logistic regression is an efficient way to predict whether an audience member will click an advertising link. With the data preprocessed by a standard scaler, the coefficients of the logistic regression provide a basic feature importance score: if the coefficient of a feature in the logistic function is close to 0, the correlation between that feature and the target value is fairly weak. Bootstrapping helps here by verifying that the coefficient we are interested in is not 0, validating the correlation between the test feature (audience age) and the target value (advertisement clicks).

Introduction to Bootstrapping in Logistic Regression

This section explains the mathematical ideas behind logistic regression and bootstrapping, which will be used later to test the correlation between audience age and advertisement clicks.

What is logistic regression?

Logistic regression is a discriminative model for binary classification. The output of the logistic function can be interpreted as the probability of class 1: if the function returns a value greater than 0.5, the input data point is predicted as class 1; otherwise, the prediction is class 0. Specifically, when exploring the correlation between age and advertisement clicks, the response of interest Yi is the number of people aged agei who click the promotion link, where Yi ~ Bin(ni, pi), with ni the number of people aged agei and pi the proportion of them clicking the link. Focusing on age and advertisement clicks, the logistic regression model can be written as:

log[pi / (1 - pi)] = C0 + C1 * agei,  i = 1, 2, 3, …, n

where C1 is the effect of age on the log odds of clicking the advertisement link.
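
To make the model concrete, here is a minimal sketch of the relation above in Python; the values of C0 and C1 are purely hypothetical and only illustrate how the click probability follows from the log odds.

```python
import numpy as np

def click_probability(age, c0, c1):
    # Invert the log-odds relation: p = exp(c0 + c1*age) / (1 + exp(c0 + c1*age))
    return 1.0 / (1.0 + np.exp(-(c0 + c1 * age)))

ages = np.array([20, 30, 40, 50, 60])
# Hypothetical coefficients, not fitted to any real data
print(click_probability(ages, c0=-3.0, c1=0.08))
```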

Our null hypothesis in this article is that age hardly impacts the probability of an advertisement click, which means C1 = 0. We use bootstrapping to estimate the distribution of C1 under the null hypothesis, then check the corresponding p-value and confidence interval to see whether the null hypothesis can be rejected.

What is bootstrapping?

Basically, bootstrapping randomly resamples from the observations with replacement. In this way, we can get a great number of bootstrap sample groups of the same size as the observation group without actually collecting more data. In hypothesis testing, bootstrapping is used to repeatedly generate new bootstrap data under the assumption that the null hypothesis is correct; the null distribution can then be simulated to compute a p-value. In general, bootstrapping can be divided into three types: the non-parametric bootstrap with no assumptions on the population distribution, the semi-parametric bootstrap with partial assumptions on the population distribution, and the parametric bootstrap with a particular assumed population distribution. This article only discusses the non-parametric bootstrap and the parametric bootstrap.
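
As a quick illustration of the resampling idea (not yet tied to the advertisement data), the following sketch draws bootstrap samples of a small array with replacement and computes the bootstrap distribution of the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
observations = np.array([2.1, 3.5, 4.0, 1.8, 2.9, 3.3])

B = 1000
# Each bootstrap sample has the same size as the original data and is drawn with replacement
boot_means = np.array([
    rng.choice(observations, size=len(observations), replace=True).mean()
    for _ in range(B)
])
print(boot_means.mean(), boot_means.std())  # bootstrap estimate of the mean and its standard error
```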

(Figure: bootstrap resamples)

Non-parametric bootstrap

Without any assumption on the population distribution, the non-parametric bootstrap simply resamples from the empirical distribution, assuming the null hypothesis is correct. First, we extract the observed test statistic from the observations. Then we estimate the bootstrap sample statistics by refitting each bootstrap sample with logistic regression. From these estimates, we get the bootstrap mean and standard error, which can be used to build a bootstrap confidence interval at a chosen confidence level.

In particular, for the correlation between age and advertisement clicks, the first step is to estimate the logistic function parameters from the observed data points and calculate the observed test statistic from that fitted logistic function. Our bootstrap samples are obtained by resampling only the values representing advertisement clicks, while the age data remain the same as in the observations. Under the null hypothesis, age is irrelevant to advertisement clicks, so each age could be matched with any target value. Therefore, it is reasonable to construct bootstrap samples by fixing the ages and randomly resampling response values to match them.

Let tobs be the observed test statistic.

Let the bootstrap datasets be

(x, y*)b = ((x1, y1*), … , (xn, yn*)) for b = 1, 2, 3, … B,

where n is the sample size of the observations and B is the number of bootstrap replications. Compute the test statistic tb for each bootstrap sample. The p-value, which stands for the probability of obtaining data at least as extreme as the observation under the null hypothesis, can be calculated by:

p-value = (1 + #(tb > tobs)) / (1 + B), where # stands for the number of bootstrap samples satisfying the condition.

Compare the p-value with 0.05 to see whether the null hypothesis can be rejected at the 5% significance level.
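
As a small sketch of this formula, assuming an array t_boot of B bootstrap statistics and an observed statistic t_obs (both stand-ins here), the p-value can be computed as follows:

```python
import numpy as np

def bootstrap_p_value(t_boot, t_obs):
    t_boot = np.asarray(t_boot)
    # (1 + number of bootstrap statistics exceeding the observed one) / (1 + B)
    return (1 + np.sum(t_boot > t_obs)) / (1 + len(t_boot))

t_boot = np.random.default_rng(1).normal(size=300)  # stand-in null statistics
t_obs = 2.5                                          # stand-in observed statistic
p = bootstrap_p_value(t_boot, t_obs)
print(p, "reject H0" if p < 0.05 else "fail to reject H0")
```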

Parametric bootstrap

As in the non-parametric bootstrap, the first step is to calculate the test statistic of the observation group. However, the parametric bootstrap resamples from a particular distribution estimated by fitting the observations with maximum likelihood. Maximum likelihood finds the probability mass function (PMF) for a discrete distribution (or probability density function (PDF) for a continuous distribution) that maximizes the likelihood function:

L(X) = f(x1) * f(x2) * … * f(xn)

We then resample the bootstrap data from the fitted f that satisfies the null hypothesis. Fitting each bootstrap sample with logistic regression and calculating the corresponding test statistic, we obtain the bootstrap mean and standard error to estimate a confidence interval at the chosen confidence level, which determines whether or not to reject the null hypothesis.

Specifically, in the case of the correlation between audience age and advertisement clicks, we assume the number of people aged agei who click promotion links follows a binomial distribution, whose parameters are estimated from the observed data:

Pi = exp(C0 + C1 * agei) / (1 + exp(C0 + C1 * agei))

where C0 and C1 are the parameters of the logistic function fitted to the observed data, and Pi is the probability that people aged agei click the promotion link. Under the null hypothesis there is no correlation between age and advertisement clicks; in other words, C1 = 0. Therefore,

P1 = P2 = P3 = … = P = exp(C0) / (1 + exp(C0))

Let the bootstrap datasets be

(x, y*)b = ((x1, y1*), … , (xn, yn*)) for b = 1, 2, 3, … B with yi* ~ Binomial(ni, P),

where ni is the number of subjects aged xi. Compute the test statistic tb from the logistic function fitted to each bootstrap sample. The p-value, which represents the probability of obtaining data at least as extreme as the observation under the null hypothesis, can be calculated by:

p-value = (1 + #(tb > tobs)) / (1 + B), where # stands for the number of bootstrap samples satisfying the condition.

Compare the p-value with 0.05 to see whether the null hypothesis can be rejected at the 5% significance level.

Explore bootstrapping in Python

In this section, I explore an advertisement clicks dataset. The pandas, matplotlib, and numpy packages are used for some basic data analysis and to decide which hypothesis test to use. The machine-learning package scikit-learn is then used to fit the logistic regression and obtain the coefficients and intercepts. Finally, the p-value is checked to assess the strength of the correlation between audience age and advertisement clicks.

Simple EDA

The shape of the dataset is (1000, 10), which means the sample size is 1000. With a sufficiently large dataset, it is safe to use a z-test to check the correlation between audience age and advertisement clicks. The test statistic in this example is the value of C1 in the logistic function fitted to each sample group.
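
The loading step is not shown in the article, but a minimal EDA sketch might look like the following; the file name advertising.csv is an assumption, while the column names "Age" and "Clicked on Ad" are the ones used later in the article.

```python
import pandas as pd

# Load the advertisement clicks dataset (file name is an assumption)
ads = pd.read_csv("advertising.csv")

print(ads.shape)                             # expected (1000, 10)
print(ads["Age"].describe())                 # distribution of audience age
print(ads["Clicked on Ad"].value_counts())   # class balance of the target
```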

Calculate parameters in observations

Fit the observed data with logistic regression. Extract the coefficient and intercept as C1 and C0.

code to calculate parameters of observed data
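
The original code appears as an image in the article; a sketch of the same step with scikit-learn might look like this, assuming the ads DataFrame from the EDA snippet above:

```python
from sklearn.linear_model import LogisticRegression

X_obs = ads[["Age"]].values          # single feature: audience age
y_obs = ads["Clicked on Ad"].values  # binary target: clicked or not

model = LogisticRegression()
model.fit(X_obs, y_obs)

C0_obs = model.intercept_[0]   # intercept of the observed logistic function
C1_obs = model.coef_[0][0]     # observed effect of age on the log odds (our test statistic)
print(C0_obs, C1_obs)
```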

Non-parametric bootstrap

Run the loop 300 times to get 300 bootstrap samples. Each iteration performs one non-parametric bootstrap. To satisfy the null hypothesis that there is no correlation between audience age and advertisement clicks, "Age" is kept fixed in its original order while "Clicked on Ad" is randomly resampled from the observations. In this way, "Age" and "Clicked on Ad" in the bootstrap samples are no longer paired as observed, so any association between them is broken. Fit each bootstrap sample with logistic regression and record every C1, which is used to estimate the distribution of C1 under the null hypothesis.

non-parametric bootstrap code
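
Again, the original code is an image; a sketch of the non-parametric bootstrap described above, reusing ads and C1_obs from the previous snippets, might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
B = 300

ages = ads[["Age"]].values
clicks = ads["Clicked on Ad"].values

c1_boot = np.empty(B)
for b in range(B):
    # Keep "Age" fixed; resample "Clicked on Ad" with replacement to break the pairing
    y_star = rng.choice(clicks, size=len(clicks), replace=True)
    c1_boot[b] = LogisticRegression().fit(ages, y_star).coef_[0][0]

# p-value and 95% interval of C1 under the null hypothesis
p_value = (1 + np.sum(c1_boot > C1_obs)) / (1 + B)
ci_low, ci_high = np.percentile(c1_boot, [2.5, 97.5])
print(p_value, (ci_low, ci_high))
```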

The p-value here is far below 0.05, and the 95% confidence interval of C1 under the null hypothesis excludes our observed C1. Therefore, at the 5% significance level, it is safe to reject the null hypothesis that audience age hardly impacts advertisement clicks.

Parametric bootstrap

Run the loop 300 times to get 300 bootstrap samples. Each iteration performs one parametric bootstrap. "Age" is randomly resampled from the observed data. "Clicked on Ad" is resampled from a Bernoulli distribution with P = exp(C0) / (1 + exp(C0)), which is estimated by maximum likelihood on the observed data under the assumption that C1 = 0. In this way, the null hypothesis is built into the bootstrap: people of all ages share the same ad-clicking probability. Fit each bootstrap sample with logistic regression and record every C1, which is used to estimate the distribution of C1 under the null hypothesis.

code for parametric bootstrap
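
A sketch of the parametric bootstrap described above, again reusing ads, C0_obs, and C1_obs from the earlier snippets, might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
B = 300
n = len(ads)

# Shared click probability under the null hypothesis (C1 = 0)
P_null = np.exp(C0_obs) / (1 + np.exp(C0_obs))

c1_boot = np.empty(B)
for b in range(B):
    age_star = rng.choice(ads["Age"].values, size=n, replace=True).reshape(-1, 1)
    y_star = rng.binomial(1, P_null, size=n)   # Bernoulli draws under the null model
    c1_boot[b] = LogisticRegression().fit(age_star, y_star).coef_[0][0]

p_value = (1 + np.sum(c1_boot > C1_obs)) / (1 + B)
ci_low, ci_high = np.percentile(c1_boot, [2.5, 97.5])
print(p_value, (ci_low, ci_high))
```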

The p-value here is again far below 0.05, and the 95% confidence interval of C1 under the null hypothesis excludes our observed C1. Therefore, at the 5% significance level, it is safe to reject the null hypothesis that audience age hardly impacts advertisement clicks.

Conclusion

Thank you for reading. I hope this article helps you develop a better understanding of how bootstrapping can be applied. Personally, I find it interesting to combine bootstrapping and logistic regression, since both are widely used but the combination is relatively rare. You can also check out the source code and dataset used in this article here.

Sources/Citations

Adjei, I.A. and Karim, R. (2016) An Application of Bootstrapping in Logistic Regression Model. Open Access Library Journal, 3: e3049. http://dx.doi.org/10.4236/oalib.1103049

Wicklin, R. (2018) The essential guide to bootstrapping in SAS. https://blogs.sas.com/content/iml/2018/12/12/essential-guide-bootstrapping-sas.html#prettyPhoto
