1. Introduction
Off-policy evaluation (OPE) aims to estimate the performance of decision-making policies using historical data generated by different policies, without conducting costly online A/B tests (Dudík et al., 2011; Gilotte et al., 2018). Accurate OPE is essential in domains such as healthcare, marketing, and recommender systems to avoid deploying poorly performing policies, as such policies may harm human lives or degrade the user experience. Thus, many theoretically grounded OPE methods have been proposed, including the Direct Method (DM), Inverse Probability Weighting (IPW), and Doubly Robust (DR) estimators. One emerging challenge with this trend is that the most suitable estimator can differ across application settings. For example, DM has low variance but large bias, and thus performs better in small-sample settings. In contrast, IPW has low bias but large variance, and thus performs better in large-sample settings. It is often unclear to practitioners which estimator they should use for their specific applications and purposes. To identify a suitable estimator among many candidates, we use a data-driven estimator selection procedure for off-policy performance estimators as a practical solution. As a proof of concept, we use our procedure to select the best estimator for evaluating coupon treatment policies on a real-world online content delivery service. In the experiment, we first observe that the most suitable estimator can change with the definition of the outcome variable, and thus accurate estimator selection is critical in real-world applications of OPE. We then demonstrate that our estimator selection procedure makes it easy to find a suitable estimator for each purpose. We believe that our estimator selection procedure and case study will help practitioners identify the best OPE method for their environments.
2. Setup and Method
We denote $x \in \mathcal{X}$ as a context vector and $t \in \{0, 1\}$ as a binary treatment assignment indicator (our OPE procedure can easily be extended to the multiple-treatment case). When an individual user receives the treatment, $t = 1$; otherwise, $t = 0$. We assume that there exist two potential outcomes, denoted as $Y(1)$ and $Y(0)$, for each individual. $Y(1)$ is the potential outcome associated with $t = 1$, and $Y(0)$ is associated with $t = 0$. Note that each individual receives only one treatment, and only the potential outcome for the received treatment is observed. We can represent the observed outcome as $Y = t\,Y(1) + (1 - t)\,Y(0)$. A policy automatically assigns treatments to users, aiming to maximize the outcome. We denote a policy as a function $\pi$ that maps a context vector to one of the possible treatments, i.e., $\pi: \mathcal{X} \rightarrow \{0, 1\}$. Then, the performance of a policy $\pi$ is defined as $V(\pi) = \mathbb{E}[Y(\pi(x))]$. The goal of OPE is to estimate $V(\pi)$ for a given new policy $\pi$ using log data collected by an old behavior policy different from $\pi$.
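To make the three candidate estimators concrete, the following is a minimal NumPy sketch of DM, IPW, and DR for the binary-treatment setting above. It assumes logged data of contexts `x`, treatments `t`, outcomes `y`, and known behavior-policy propensities, plus a fitted outcome model `mu(x, t)`; the function names and interfaces are illustrative, not the paper's actual implementation.

```python
import numpy as np

def dm(pi, mu, x):
    """Direct Method: plug the new policy's actions into a fitted
    outcome model mu(x, t) and average its predictions."""
    t_new = pi(x)
    return np.mean(mu(x, t_new))

def ipw(pi, x, t, y, propensity):
    """Inverse Probability Weighting: keep logged outcomes where the
    new policy agrees with the logged action, reweighted by the
    behavior policy's propensity for that action."""
    match = (pi(x) == t).astype(float)
    return np.mean(match * y / propensity)

def dr(pi, mu, x, t, y, propensity):
    """Doubly Robust: DM baseline plus an IPW-style correction on the
    outcome model's residuals."""
    t_new = pi(x)
    baseline = mu(x, t_new)
    match = (pi(x) == t).astype(float)
    correction = match * (y - mu(x, t)) / propensity
    return np.mean(baseline + correction)
```

The bias-variance trade-off noted in the introduction is visible here: DM depends entirely on the model `mu` (low variance, model bias), IPW depends only on the logged outcomes and propensities (unbiased under correct propensities, high variance), and DR combines the two.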
Our strategy to select a suitable estimator is to use two sources of logged bandit feedback collected by two different behavior policies. We denote the log data generated by $\pi_A$ and $\pi_B$ as $\mathcal{D}_A$ and $\mathcal{D}_B$, respectively. To evaluate the performance of an estimator $\hat{V}$, we first estimate the policy performances of $\pi_A$ and $\pi_B$ by $\hat{V}(\pi_A; \mathcal{D}_B)$ and $\hat{V}(\pi_B; \mathcal{D}_A)$. Then, we use the on-policy estimates as the ground-truth policy performances, i.e., $V(\pi_A) \approx \bar{Y}_{\mathcal{D}_A}$ and $V(\pi_B) \approx \bar{Y}_{\mathcal{D}_B}$. Finally, we compare the off-policy estimates $\hat{V}(\pi_A; \mathcal{D}_B)$ and $\hat{V}(\pi_B; \mathcal{D}_A)$ with their ground truths (on-policy estimates) to evaluate the estimation accuracy of the estimator $\hat{V}$. We measure the estimation accuracy of $\hat{V}$ by the relative root mean-squared error, defined as $\mathrm{relative\text{-}RMSE}(\hat{V}; \pi_B) = \sqrt{\frac{1}{S} \sum_{s=1}^{S} \left( \frac{\hat{V}(\pi_B; \mathcal{D}_A^{(s)}) - V(\pi_B)}{V(\pi_B)} \right)^2}$ (and analogously for $\pi_A$), where $\mathcal{D}_A^{(s)}$ denotes a different subsample of the logged bandit feedback made by sample splitting or bootstrap sampling. By applying the above procedure to several candidate estimators, we can select the estimator having the best estimation accuracy among the candidates in a data-driven manner.
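The selection procedure above can be sketched as follows for one direction (estimating the performance of one policy from the other policy's log, with the second log's outcome mean as the on-policy ground truth); the reverse direction is symmetric. This is an illustrative sketch using bootstrap subsampling, not the paper's exact code; the dictionary-based data layout and function names are assumptions.

```python
import numpy as np

def relative_rmse(off_policy_estimates, on_policy_value):
    """Relative RMSE of bootstrap estimates against the on-policy
    ground truth: sqrt(mean((V_hat - V)^2)) / |V|."""
    err = np.asarray(off_policy_estimates) - on_policy_value
    return np.sqrt(np.mean(err ** 2)) / abs(on_policy_value)

def select_estimator(estimators, log_a, log_b, pi_b, n_boot=100, seed=0):
    """Pick the estimator whose bootstrap relative RMSE is lowest when
    estimating pi_b's performance from pi_a's log (log_a).  log_b holds
    on-policy data for pi_b, so its outcome mean serves as ground truth."""
    rng = np.random.default_rng(seed)
    ground_truth = np.mean(log_b["y"])  # on-policy estimate of V(pi_b)
    scores = {}
    for name, est in estimators.items():
        estimates = []
        n = len(log_a["y"])
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)  # bootstrap resample of the log
            boot = {k: v[idx] for k, v in log_a.items()}
            estimates.append(est(pi_b, boot))
        scores[name] = relative_rmse(estimates, ground_truth)
    best = min(scores, key=scores.get)  # lower relative RMSE is better
    return best, scores
```

Each candidate estimator is treated as a black box `est(policy, log_data)`, so DM, IPW, DR, or any other estimator can be plugged in and ranked on equal footing.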
Table 1. Relative RMSE of each OPE estimator (outcome: content consumption indicator).

| OPE Situation | DM | IPW | DR |
|---|---|---|---|
| $\pi_A \rightarrow \pi_B$ | **0.0897** | 0.1958 | 0.0955 |
| $\pi_B \rightarrow \pi_A$ | **0.0653** | 0.1589 | 0.0798 |

Table 2. Relative RMSE of each OPE estimator (outcome: revenue).

| OPE Situation | DM | IPW | DR |
|---|---|---|---|
| $\pi_A \rightarrow \pi_B$ | 0.2231 | 0.1997 | **0.1118** |
| $\pi_B \rightarrow \pi_A$ | 0.4382 | **0.0981** | 0.2936 |

Note: $\pi_A \rightarrow \pi_B$ is a case where we attempt to estimate the performance of $\pi_B$ using log data generated by $\pi_A$. In contrast, $\pi_B \rightarrow \pi_A$ is a case where we attempt to estimate the performance of $\pi_A$ using log data generated by $\pi_B$. The bold fonts represent the best off-policy estimator among DM, IPW, and DR for each setting (a lower value is better).
3. A Case Study
To show the usefulness of our procedure, we constructed $\mathcal{D}_A$ and $\mathcal{D}_B$ by randomly assigning two different policies ($\pi_A$ and $\pi_B$) to users on our content delivery platform. Here, $x$ is a user's context vector, $t$ is a coupon assignment indicator, and $Y$ is either a user's content consumption indicator (binary) or the revenue from each user (continuous).
We report the estimator selection results for each definition of $Y$ in Tables 1 and 2. We used DM, IPW, and DR as candidate off-policy estimators. The tables show that a different estimator should be used for each setting and purpose. This is because the prediction accuracy of the outcome regressor used in DM and DR can differ across definitions of $Y$. We conclude from the results that we should use DM when we want to maximize the users' content consumption probability. In contrast, we should use IPW or DR when we consider the revenue from users as the outcome. After this successful empirical verification, our data-driven estimator selection method has been used to decide which estimators to use when creating coupon allocation policies on our platform.
References
- Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. 2011. Doubly Robust Policy Evaluation and Learning. CoRR abs/1103.4601 (2011). arXiv:1103.4601 http://arxiv.org/abs/1103.4601
- Gilotte et al. (2018) Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 198–206.