Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization

Counterfactual learning to rank (CLTR) relies on exposure-based inverse propensity scoring (IPS), an LTR-specific adaptation of IPS to correct for position bias. While IPS can provide unbiased and consistent estimates, it often suffers from high variance. Especially when little click data is available, this variance can cause CLTR to learn sub-optimal ranking behavior. Consequently, existing CLTR methods bring significant risks with them, as naively deploying their models can result in very negative user experiences. We introduce a novel risk-aware CLTR method with theoretical guarantees for safe deployment. We apply a novel exposure-based concept of risk regularization to IPS estimation for LTR. Our risk regularization penalizes the mismatch between the ranking behavior of a learned model and a given safe model. Thereby, it ensures that learned ranking models stay close to a trusted model when there is high uncertainty in IPS estimation, which greatly reduces the risks during deployment. Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available, while also maintaining high performance at convergence. For the CLTR field, our novel exposure-based risk minimization method enables practitioners to adopt CLTR methods in a safer manner that mitigates many of the risks attached to previous methods.


INTRODUCTION
Learning to rank (LTR) methods optimize ranking systems so that the resulting ranking behavior maximizes a given ranking metric [20]. Traditionally, most LTR methods applied a supervised learning procedure based on manually-created relevance judgements. However, obtaining such judgements is time-consuming, expensive, and does not scale [2, 33]. As an alternative, LTR methods have been developed that rely on clicks, as they are much cheaper to obtain in abundance in the form of user interaction logs [16].
Despite its low costs, click data is generally strongly affected by different forms of interaction bias. Interactions with rankings often suffer from position bias [7]: the position at which an item is shown often affects its click-through rate (CTR) more than its relevance. As a result, the clicks observed in interaction logs are often more reflective of where items were displayed during logging than of how relevant users find them. Thus, naively using this data for LTR, without corrections, can result in heavily biased models with suboptimal ranking performance [18, 44].
To mitigate the bias problem in interaction data, the field of counterfactual learning to rank (CLTR) has developed methods for unbiased estimation [18]. CLTR mainly relies on exposure-based inverse propensity scoring (IPS) [31, 45], an LTR-specific adaptation of the IPS counterfactual estimation method [11, 17, 41]. Standard exposure-IPS weights clicks by the inverse effect of position bias on the clicked item. This procedure gives more weight to clicks on items that are underrepresented due to position bias, and vice versa. In expectation, this removes the effect of position bias from the loss that is optimized.

Unsafe CLTR. Despite enabling unbiased optimization, IPS is also known to suffer from high variance [18, 25]. When click data is scarce or very noisy, this high variance can make IPS-based CLTR unreliable and lead to severely sub-optimal ranking models [12, 30]. The problem can be so severe that the learned ranking model is worse than the model used to log the interaction data; deploying such a model would substantially degrade the user experience. In other words, despite the improvements that IPS-based CLTR can bring, it is an unsafe approach, since it may lead to considerable deteriorations. This safety issue is not unique to IPS-based CLTR. Swaminathan and Joachims [41] address it for contextual bandit problems by applying a generalization bound. Such a bound can provide a high-confidence upper limit on the difference between the true and estimated performance of a bandit policy [39, 42], which allows for safer, conservative optimization. For instance, Wu and Wang [49] introduce a bound based on the divergence between the new policy and the logging policy. This bound avoids policies that stray from the logging policy, unless there is strong evidence that they are actual improvements. This method might appear to be a great fit for CLTR but, unfortunately, it is based on action propensities that do not generalize well to the very large action spaces in CLTR. Therefore, there is a need for a conservative generalization bound that is practical and effective in the CLTR setting.

Safe CLTR. To address this gap, we propose an exposure-based counterfactual risk minimization (CRM) method that is specifically designed for safe CLTR. Similar to how exposure-based IPS deals with the large action spaces in ranking settings, our method is based on an exposure-based alternative to action-based generalization bounds. We first introduce a divergence measure based on differences between the exposure distributions of a new policy and a safe logging policy. We then provide a novel generalization bound and prove that it is a high-confidence lower bound on the performance of a learned policy. When uncertain, this bound defaults to preferring the logging policy, and thus avoids decreases in performance due to variance. In other words, with high confidence, ranking models optimized with this bound are guaranteed to never deteriorate the user experience, even when little data is available.

Main contributions. We are the first to address CRM for CLTR and contribute a novel exposure-based CRM method for safe CLTR. Our experimental results show that our proposed method is effective at avoiding initial periods of bad performance when little data is available, while also maintaining high performance at convergence. Our novel exposure-based CRM method thus enables safe CLTR that mitigates many of the risks attached to previous methods. We hope that our contribution makes the adoption of CLTR methods more attractive to practitioners working on real-world search and recommendation systems.

RELATED WORK
We review related work on CLTR and CRM in off-policy learning.

Counterfactual learning to rank
LTR deals with learning optimal rankings to maximize a pre-defined notion of utility [20]. Traditionally, LTR systems were optimized using supervised learning on manually-created relevance judgements [2]. But the manual curation of relevance judgements is a time-consuming and costly process [2, 33]. Also, manually-graded relevance signals do not always align well with actual user preferences [38]. Due to these shortcomings, LTR from user interactions has become a popular alternative to supervised LTR [3, 15, 18, 40].
Learning from user interactions and click logs was introduced in the pioneering work of Joachims [16]. Click data is relatively cheap to collect and indicative of actual user preferences [34]. In spite of these advantages, click data is known to be a noisy and biased estimate of the true user preferences [7, 31]. Some of the common biases identified in the LTR literature are position bias [7], trust bias [1], and item-selection bias [27].
To counter the effect of bias, Joachims et al. [18] introduced counterfactual learning in the context of LTR. They proposed the application of inverse propensity scoring (IPS), a causal inference technique that is prevalent in the offline bandit learning literature [17]. IPS models the probability of the user examining a document at a given displayed rank. The inverse of the examination probability, i.e., the inverse propensity, is used to correct for position bias. As a result of this inverse weighting scheme, IPS-based LTR optimization is unaffected by position bias, in expectation [18]. Since its introduction, there has been increasing interest in the area, with several applications of IPS in the context of ranking [1, 27, 43, 45]. Recent work has also explored CLTR under a stochastic logging policy, where some exploration is introduced, as opposed to pure exploitation [27, 29, 50].
With regard to safety in learning from user interactions, Jagerman et al. [12] introduced the notion of safe exploration for offline contextual bandit algorithms. They proposed the safe exploration algorithm (SEA), which applies high-confidence performance bounds to safely choose between deploying a logging policy and a learned policy. Oosterhuis and de Rijke [30] applied this concept to LTR and introduced a generalization and specialization framework to safely choose between a generalized feature-based LTR model and a specialized tabular LTR model. The important difference between this prior work and ours is that existing methods safely choose between policies, whereas our method safely optimizes a policy. To the best of our knowledge, we are the first to consider a notion of safety for the optimization of LTR models.

Counterfactual risk minimization for offline learning from logs
A relevant area closely related to CLTR is off-policy learning, or offline learning from bandit feedback data [10, 17, 37, 41]. Off-policy learning tries to bridge the mismatch between the action distributions of a new policy and the logging policy [17]. The most common techniques used to achieve this goal are IPS and importance sampling [11]. However, as noted by Cortes et al. [6], the IPS estimator can have unbounded variance, which can lead to large estimation errors. Consequently, optimization with IPS can result in convergence problems and severely suboptimal policies.
To account for this high-variance problem, Swaminathan and Joachims [41] introduced counterfactual risk minimization (CRM), an off-policy method that explicitly controls for variance during off-policy learning from bandit feedback data. Their learning objective consists of the IPS loss and a variance regularization term that penalizes dissimilarity between the two policies. This variance regularization term represents the risk that stems from the variance of the IPS estimation. Computing it requires a pass over the entire dataset, which does not scale well. As a scalable alternative, Wu and Wang [49] introduced variational counterfactual risk minimization (VCRM), which estimates the risk of the new policy by random sampling from the logged data. The VCRM objective is derived from a generic theoretical analysis of learning from importance sampling [6]. Its risk term is defined in terms of a specific divergence between the logging policy and the new policy, known as the Rényi divergence [35]. To the best of our knowledge, there is no work on CRM in an LTR setting, making ours the first to propose a CRM approach for the LTR task.

BACKGROUND

Learning to rank
The objective of learning to rank methods is to optimize a ranking policy $\pi$, so that for user-issued queries $q$ it provides the optimal ranking of their pre-selected candidate document sets $D_q$ [20].
Formally, this objective can be expressed as the maximization of the following utility function:

$$U(\pi) = \mathbb{E}_q \Big[ \sum_{d \in D_q} \rho(d \mid q, \pi)\, P(R = 1 \mid d, q) \Big], \quad (1)$$

where $\rho(d \mid q, \pi)$ is the weight $\pi$ gives to document $d$ for query $q$.
The choice of $\rho$ determines what metric is optimized; for instance, the DCG weights that underlie the well-known normalized discounted cumulative gain (NDCG) metric [14]:

$$\rho(d \mid q, \pi) = \mathbb{E}_{y \sim \pi(\cdot \mid q)}\big[ \big(\log_2(\text{rank}(d \mid y) + 1)\big)^{-1} \big], \quad (2)$$

where $y$ is a ranking sampled from the policy $\pi$ and $\text{rank}(d \mid y)$ is the rank of $d$ in $y$. For this paper, the aim is to optimize the expected number of clicks; the next subsection explains how we choose $\rho$ accordingly.
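To make the role of $\rho$ concrete, the following sketch estimates the per-document weights of Eq. 2 by Monte Carlo over rankings sampled from a policy. It is illustrative only; the uniform policy, document count, and function names are our own assumptions, not part of the paper's implementation.

```python
import numpy as np

def dcg_weight(rank):
    # DCG-style weight for a 1-indexed display rank (cf. Eq. 2).
    return 1.0 / np.log2(rank + 1)

def expected_weights(sample_ranking, n_docs, n_samples=10_000, seed=0):
    """Monte-Carlo estimate of rho(d | q, pi): the expected DCG weight
    of each document under rankings sampled from the policy."""
    rng = np.random.default_rng(seed)
    rho = np.zeros(n_docs)
    for _ in range(n_samples):
        ranking = sample_ranking(rng)  # a permutation of document ids
        for k, d in enumerate(ranking, start=1):
            rho[d] += dcg_weight(k)
    return rho / n_samples

# A uniformly random policy gives every document the same expected weight.
uniform_policy = lambda rng: rng.permutation(4)
print(expected_weights(uniform_policy, n_docs=4))
```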

Counterfactual learning to rank
Position bias in clicks. Optimizing the LTR objective in Eq. 1 requires access to the true relevance labels $P(R = 1 \mid d, q)$, which is often impossible in real-world ranking settings. As an alternative, CLTR uses clicks, since they are present in abundance as logged user interactions. However, clicks are a biased indicator of relevance; for this paper, we assume the relation between clicks and relevance is determined by a position-based click model [5, 18]. For a document $d$ displayed in ranking $y$ for query $q$, this means the click probability can be decomposed into a rank-based examination probability and a document-based relevance probability:

$$P(C = 1 \mid d, q, y) = P\big(E = 1 \mid \text{rank}(d \mid y)\big)\, P(R = 1 \mid d, q). \quad (3)$$

The key characteristic of the position-based click model is that the probability of examination only depends on the rank at which a document is displayed:

$$P(E = 1 \mid d, q, y) = P\big(E = 1 \mid \text{rank}(d \mid y)\big). \quad (4)$$

Furthermore, this model assumes that clicks only take place when a document is both relevant to a user and examined by them. Consequently, the click signal is an indication of both the relevance and the examination of documents. Thus, the position at which a document is displayed can have a stronger effect on its click probability than its actual relevance [7].
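As an illustration, the following sketch samples clicks under the position-based model of Eq. 3 and 4. The examination probabilities and relevance values here are placeholders chosen for the example, not the paper's experimental settings.

```python
import numpy as np

def simulate_clicks(ranking, relevance_prob, exam_prob, rng):
    """Sample one click vector under the position-based click model:
    a document is clicked iff it is examined (rank-dependent, Eq. 4)
    and relevant (document-dependent), as in Eq. 3."""
    clicks = np.zeros(len(relevance_prob), dtype=int)
    for k, d in enumerate(ranking, start=1):
        examined = rng.random() < exam_prob(k)
        relevant = rng.random() < relevance_prob[d]
        clicks[d] = int(examined and relevant)
    return clicks

rng = np.random.default_rng(0)
exam = lambda k: 1.0 / k                  # illustrative position bias
rels = np.array([0.9, 0.5, 0.5, 0.1])     # illustrative P(R=1 | d, q)
print(simulate_clicks([0, 1, 2, 3], rels, exam, rng))
```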
Inverse propensity scoring for CLTR. We assume a setting where $N$ interactions have been logged using the logging policy $\pi_0$; for each interaction $i$, the query $q_i$, the displayed ranking $y_i$, and the clicks $c_i$ are logged:

$$\mathcal{D} = \big\{ (q_i, y_i, c_i) \big\}_{i=1}^{N}.$$

We use $c_i(d) \in \{0, 1\}$ to denote whether document $d$ was clicked at interaction $i$. Furthermore, we choose $\rho$ to match the examination probabilities under $\pi$:

$$\rho(d \mid q, \pi) = \mathbb{E}_{y \sim \pi(\cdot \mid q)}\big[ P\big(E = 1 \mid \text{rank}(d \mid y)\big) \big]. \quad (5)$$

Hence, our optimization objective $U(\pi)$ is equal to the expected number of clicks (cf. Eq. 1 and 3).
In order to apply IPS, we need the propensity of each document [18]; following Oosterhuis and de Rijke [29], we use:

$$\rho_0(d) = \rho(d \mid q, \pi_0) = \mathbb{E}_{y \sim \pi_0(\cdot \mid q)}\big[ P\big(E = 1 \mid \text{rank}(d \mid y)\big) \big]. \quad (6)$$

Thus, the exposure of $d$ represents how likely it is to be examined when $\pi_0$ is used for logging. Thereby, it indicates how much the clicks on $d$ underrepresent its relevance. For the sake of brevity, we drop $q$, $\pi$ and $\pi_0$ from our notation when their values are clear from the context, i.e., $\rho(d \mid q, \pi) = \rho(d)$ and $\rho(d \mid q, \pi_0) = \rho_0(d)$.
The exposure-based IPS estimator takes each click in $\mathcal{D}$ and weights it inversely to $\rho_0(d)$ to correct for position bias [18, 29]:

$$\hat{U}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{d \in D_{q_i}} \frac{\rho(d)}{\rho_0(d)}\, c_i(d). \quad (7)$$

In other words, to compensate for the fact that position bias lowers the click probability of a document by a factor of $\rho_0(d)$, clicks are weighted by $1/\rho_0(d)$ to correct for this effect in expectation. As a result, clicks on documents that $\pi_0$ is likely to show at positions with low examination probabilities (i.e., the bottom of a ranking) receive a higher IPS weight to compensate.
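A minimal sketch of the estimator in Eq. 7, assuming the exposures of both policies are precomputed per query; the data layout is hypothetical.

```python
import numpy as np

def ips_estimate(logs, rho_new, rho_log):
    """Exposure-based IPS estimate of U(pi) (Eq. 7).
    logs: list of (query_id, clicks) pairs, clicks[d] in {0, 1};
    rho_new[q], rho_log[q]: per-document exposure under pi and pi_0."""
    total = 0.0
    for q, clicks in logs:
        # only clicked documents contribute to the estimate
        total += np.sum(clicks * rho_new[q] / rho_log[q])
    return total / len(logs)
```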
Statistical properties of the IPS estimator. The IPS estimator $\hat{U}(\pi)$ (Eq. 7) is an unbiased and consistent estimate of our LTR objective $U(\pi)$ (Eq. 1) [26]. It is unbiased since its expected value is equal to our objective:

$$\mathbb{E}_{q, y, c}\big[ \hat{U}(\pi) \big] = U(\pi), \quad (8)$$

and it is consistent because this equivalence also holds in the limit of infinite data:

$$P\Big( \lim_{N \to \infty} \hat{U}(\pi) = U(\pi) \Big) = 1. \quad (9)$$

For proofs of these properties, we refer to previous work [18, 23, 27]. Importantly, unbiasedness and consistency do not guarantee that the actual IPS estimates will be reliable, because the estimates produced by IPS are also affected by variance:

$$\text{Var}\big[ \hat{U}(\pi) \big] = \frac{1}{N}\, \mathbb{E}_q \Big[ \sum_{d \in D_q} \frac{\rho(d)^2}{\rho_0(d)^2}\, \text{Var}\big( c(d) \mid q \big) \Big]. \quad (10)$$

The variance is large when some propensities are small, due to the $\rho_0(d)^{-2}$ term. Hence, the actual estimates that IPS produces may contain large errors, especially when $N$ is relatively small or clicks are very noisy. Thus, $\hat{U}(\pi)$ may be far removed from the true $U(\pi)$, and optimization with IPS may be unsafe and lead to unpredictable results.
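The effect of the $\rho_0(d)^{-2}$ term is easy to verify numerically. The toy check below estimates the click rate of a single document whose relevance probability is 0.1, with the new policy giving it full exposure ($\rho(d) = 1$); as the logging exposure $\rho_0(d)$ shrinks, the IPS estimate stays unbiased but its standard deviation grows. The setup is our own illustration, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_rel = 1000, 0.1                     # interactions, P(R=1 | d, q)
for rho0 in [0.5, 0.05, 0.005]:          # logging exposure of the document
    # clicks occur with probability rho0 * p_rel and are weighted 1 / rho0
    estimates = [(rng.random(n) < rho0 * p_rel).sum() / (rho0 * n)
                 for _ in range(2000)]
    print(f"rho0={rho0}: mean={np.mean(estimates):.3f}, "
          f"std={np.std(estimates):.3f}")
```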

Counterfactual risk minimization for offline bandit learning
The foundational work by Swaminathan and Joachims [41] introduced the idea of counterfactual risk minimization (CRM) for off-policy learning in a contextual bandit setup. To avoid the negative effects of high variance in IPS estimation during bandit optimization, they utilize a generalization bound through the addition of a risk term [21]. With a probability of $1 - \delta$, the IPS estimate minus the risk term is a lower bound on the true utility of the policy:

$$P\Big( U(\pi) \geq \hat{U}(\pi) - \text{Risk}(\pi) \Big) \geq 1 - \delta. \quad (11)$$

Therefore, optimization of the lower bound can be more reliable than solely optimizing the IPS estimate $\hat{U}(\pi)$, since it provides a high-confidence guarantee that a lower bound on the true utility of the policy is maximized. Swaminathan and Joachims [41] propose using the sample variance as the risk factor:

$$\hat{U}_{\text{CRM}}(\pi) = \hat{U}(\pi) - \lambda \sqrt{\frac{\widehat{\text{Var}}\big[\hat{U}(\pi)\big]}{N}}, \quad (12)$$

where $\lambda \in \mathbb{R}_{>0}$ is an alternative to the $\delta$ parameter that also determines how probable it is that the term provides a bound on the true utility. Importantly, this bound is based on an action-based IPS estimator. For our LTR setting this would translate to:

$$\hat{U}_{\text{action}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(y_i \mid q_i)}{\pi_0(y_i \mid q_i)} \sum_{d \in D_{q_i}} c_i(d). \quad (13)$$

Action-based IPS estimation does not work well in the LTR setting because the large number of possible rankings results in extremely small action propensities $\pi_0(y_i \mid q_i)$, creating a high-variance problem. As discussed in Section 3.2, CLTR therefore uses exposure-based propensities instead (Eq. 6 and 7), as they effectively avoid extremely small values. As a result, the CRM approach from [41] is not effective for CLTR, since the high variance of its action-based IPS makes the method impractical in the ranking setting. Another downside of this CRM approach is that the computation of the sample variance requires a full pass over the training dataset, which is computationally costly for large-scale datasets. As a solution, Wu and Wang [49] introduce variational CRM (VCRM), which uses an upper bound on the variance term based on the Rényi divergence between the new policy and the logging policy [35]. This Rényi divergence is approximated via random sampling, making the VCRM method suitable for stochastic gradient descent-based training [22]. Nevertheless, this CRM approach still relies on action-based propensities and, therefore, does not provide an effective solution for the high-variance problem in CLTR.

Figure 1: Three rankings and their normalized expected exposure distributions (Eq. 16) based on DCG weights (Eq. 2). According to our exposure-based divergence, ranking 1 and ranking 2 are quite similar despite only agreeing on the placement of document A. In contrast, ranking 1 and ranking 3 also agree on the placement of a single document (C) but have the highest possible dissimilarity, due to their highly mismatched exposure distributions.

A NOVEL EXPOSURE-BASED GENERALIZATION BOUND FOR CLTR
To develop a CRM method for CLTR with safety guarantees, we aim to find a risk term that yields a generalization bound as in Eq. 11. Importantly, this bound has to be effective in the LTR setting; therefore, our approach should avoid action-based propensities. We take inspiration from work by Wu and Wang [49], who use the fact that the Rényi divergence is an upper bound on the variance of an IPS estimator:

$$\text{Var}\big[ \hat{U}(\pi) \big] \leq \frac{1}{N}\, d_2(\pi \,\|\, \pi_0), \quad (14)$$

where $d_2$ is the exponentiated Rényi divergence between the new policy and the logging policy [35]:

$$d_2(\pi \,\|\, \pi_0) = \mathbb{E}_{y \sim \pi_0(\cdot \mid q)}\bigg[ \Big( \frac{\pi(y \mid q)}{\pi_0(y \mid q)} \Big)^2 \bigg]. \quad (15)$$

In other words, the dissimilarity between the logging policy and a new policy can be used to bound the variance of the IPS estimate of the new policy's performance. However, because this divergence is based on action propensities, it is not effective in the LTR setting. Below, we introduce an exposure-based measure of divergence that produces the desired generalization bound for LTR optimization. Section 4.1 introduces the concept of normalized exposure, which treats rankings as exposure distributions. Section 4.2 proves that a Rényi divergence based on normalized exposure can bound the variance of an exposure-based IPS estimator. Section 4.3 uses this variance bound to construct a generalization bound for CLTR.

Normalized expected exposure
Rényi divergence is only valid for probability distributions, e.g., $d_2(\pi \,\|\, \pi_0)$ with the distributions $\pi(y \mid q)$ and $\pi_0(y \mid q)$. However, expected exposure is not a probability distribution: the values of $\rho(d)$ (Eq. 5) or $\rho_0(d)$ (Eq. 6) do not necessarily sum to one over all documents to be ranked. This is because users generally examine more than a single item in a displayed ranking [7]; as a result, expected exposure can be seen as a distribution of multiple examinations. Our insight is that a valid probability distribution can be obtained by normalizing the expected exposure:

$$\rho'(d \mid q, \pi) = \frac{\rho(d \mid q, \pi)}{Z}, \quad (16)$$

where the normalization factor is a constant that only depends on $K$, the (truncated) ranking length:

$$Z = \sum_{d \in D_q} \rho(d \mid q, \pi) = \sum_{k=1}^{K} P(E = 1 \mid k). \quad (17)$$

In this way, $Z$ can be seen as the expected amount of examination that any ranking will receive, and $\rho'$ as the probability distribution that indicates how this examination is expected to spread over documents. An important property is that the ratio between two propensities is always equal to the ratio between their normalized counterparts:

$$\frac{\rho(d)}{\rho_0(d)} = \frac{\rho'(d)}{\rho'_0(d)}. \quad (18)$$

This is relevant to IPS estimation since it only requires the ratios between propensities; the proofs in the remainder of this paper make use of this property.
Finally, using the normalized expected exposure, we can introduce the exponentiated exposure-based Rényi divergence:

$$d_2(\rho' \,\|\, \rho'_0) = \mathbb{E}_q \Big[ \sum_{d \in D_q} \frac{\rho'(d \mid q, \pi)^2}{\rho'_0(d \mid q, \pi_0)} \Big]. \quad (19)$$

The key difference between our exposure-based divergence and action-based divergence is that it allows policies to be very similar, even when they have no overlap in the rankings they produce. As an intuitive example, Figure 1 displays three different rankings and their associated normalized expected exposure distributions; these are the distributions of deterministic policies that give 100% probability to one of the rankings. Under action-based divergence, these policies would have the highest possible dissimilarity, since they have no overlap in their possible actions, i.e., the rankings they give non-zero probability. In contrast, exposure-based divergence gives high similarity to ranking 1 and ranking 2, since the differences in their exposure distributions are minor. We note that these rankings still disagree on the placement of all documents except one. Conversely, for ranking 1 and ranking 3, which also agree on only a single document placement, exposure-based divergence gives the lowest possible similarity score, because their exposure distributions are highly mismatched. Importantly, by solely considering differences in exposure distributions, exposure-based divergence naturally weighs differences at the bottom of rankings as less impactful than changes that affect the top. As a result, exposure-based divergence corresponds more closely with common ranking metrics (Eq. 1) than existing action-based divergences.
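The following sketch reproduces the intuition of Figure 1 for deterministic policies: a swap at the bottom of the ranking yields a smaller exposure-based divergence than a swap at the top. The three-document setup and DCG-style examination weights are assumptions made for illustration.

```python
import numpy as np

K = 3                                           # (truncated) ranking length
exam = 1.0 / np.log2(np.arange(1, K + 1) + 1)   # DCG-style examination weights
Z = exam.sum()                                  # normalization constant (Eq. 17)

def normalized_exposure(ranking):
    """rho'(d) for a deterministic policy that always shows `ranking` (Eq. 16)."""
    rho = np.zeros(K)
    for k, d in enumerate(ranking):
        rho[d] = exam[k]
    return rho / Z

def d2_exposure(rho_new, rho_log):
    # Exponentiated exposure-based Renyi divergence for one query (Eq. 19).
    return np.sum(rho_new ** 2 / rho_log)

# Documents A=0, B=1, C=2.
r1 = normalized_exposure([0, 1, 2])   # A, B, C
r2 = normalized_exposure([0, 2, 1])   # A, C, B: swap at the bottom
r3 = normalized_exposure([1, 0, 2])   # B, A, C: swap at the top
print(d2_exposure(r2, r1))            # approx 1.03: small divergence
print(d2_exposure(r3, r1))            # approx 1.17: larger divergence
```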

Exposure-divergence bound on variance
We now provide proof that exposure-based divergence is an upper bound on the variance of IPS estimators for CLTR.
Theorem 4.1. Given a ranking policy $\pi$ and logging policy $\pi_0$, with expected exposures $\rho(d)$ and $\rho_0(d)$ respectively, the variance of the exposure-based IPS estimate $\hat{U}(\pi)$ is upper-bounded by the exposure-based divergence:

$$\text{Var}\big[ \hat{U}(\pi) \big] \leq \frac{Z}{N}\, d_2(\rho' \,\|\, \rho'_0). \quad (20)$$

Proof. From the definition of $\hat{U}(\pi)$ (Eq. 7) and the assumption that queries $q$ are independent and identically distributed (i.i.d.), the variance of the counterfactual estimator can be rewritten as an expectation over queries [28]:

$$\text{Var}\big[ \hat{U}(\pi) \big] = \frac{1}{N}\, \mathbb{E}_q \Big[ \text{Var}\Big( \sum_{d \in D_q} \frac{\rho(d)}{\rho_0(d)}\, c(d) \,\Big|\, q \Big) \Big]. \quad (21)$$

Since we have assumed a rank-based examination model (Section 3.2), the examinations of documents are independent. This allows us to rewrite the variance conditioned on a single query:

$$\text{Var}\Big( \sum_{d \in D_q} \frac{\rho(d)}{\rho_0(d)}\, c(d) \,\Big|\, q \Big) = \sum_{d \in D_q} \frac{\rho(d)^2}{\rho_0(d)^2}\, \text{Var}\big( c(d) \mid q \big). \quad (22)$$

Since $c(d)^2 = c(d)$, we can further rewrite to:

$$\text{Var}\big( c(d) \mid q \big) \leq \mathbb{E}\big[ c(d)^2 \mid q \big] = P(C = 1 \mid d, q). \quad (23)$$

Next, we use Eq. 3 and 6 to substitute the click probability; subsequently, we replace the examination propensities with their normalized counterparts using Eq. 16 and 18; and lastly, we upper-bound the result using the fact that $P(R = 1 \mid d, q) \leq 1$:

$$\sum_{d \in D_q} \frac{\rho(d)^2}{\rho_0(d)^2}\, \rho_0(d)\, P(R = 1 \mid d, q) \leq \sum_{d \in D_q} \frac{\rho(d)^2}{\rho_0(d)} = Z \sum_{d \in D_q} \frac{\rho'(d)^2}{\rho'_0(d)}. \quad (24)$$

Finally, we place this upper bound for a single query back into the expectation over all queries (Eq. 21):

$$\text{Var}\big[ \hat{U}(\pi) \big] \leq \frac{Z}{N}\, \mathbb{E}_q \Big[ \sum_{d \in D_q} \frac{\rho'(d)^2}{\rho'_0(d)} \Big] = \frac{Z}{N}\, d_2(\rho' \,\|\, \rho'_0). \quad (25)$$

Therefore, by Eq. 21-25 and the definition of the exposure-based divergence in Eq. 19, the divergence is a proven upper bound on the variance. □

Exposure-divergence bound on performance
Using the upper bound on the variance of the CLTR IPS estimator proven in Theorem 4.1, we can now introduce a generalization bound for the CLTR estimator.
Theorem 4.2. Given the true utility $U(\pi)$ (Eq. 1) and its exposure-based IPS estimate $\hat{U}(\pi)$ (Eq. 7), for the ranking policy $\pi$ and the logging policy $\pi_0$ with expected exposures $\rho(d)$ and $\rho_0(d)$, respectively, the following generalization bound holds with probability $1 - \delta$:

$$U(\pi) \geq \hat{U}(\pi) - \sqrt{\frac{Z\, d_2(\rho' \,\|\, \rho'_0)}{\delta N}}. \quad (26)$$

Proof. As per Cantelli's inequality [9], given an estimator $\hat{X}$ with expected value $\mathbb{E}[\hat{X}]$ and variance $\text{Var}[\hat{X}]$, the following tail bound holds:

$$P\Big( \mathbb{E}[\hat{X}] - \hat{X} \geq \lambda \Big) \leq \frac{\text{Var}[\hat{X}]}{\text{Var}[\hat{X}] + \lambda^2}. \quad (27)$$

Since $\lambda > 0$ is a free parameter, we can define $\lambda$ such that:

$$\delta = \frac{\text{Var}[\hat{X}]}{\text{Var}[\hat{X}] + \lambda^2} \;\Longleftrightarrow\; \lambda = \sqrt{\frac{(1 - \delta)\, \text{Var}[\hat{X}]}{\delta}}. \quad (28)$$

Consequently, the following inequality holds:

$$P\bigg( \mathbb{E}[\hat{X}] - \hat{X} < \sqrt{\frac{(1 - \delta)\, \text{Var}[\hat{X}]}{\delta}} \bigg) \geq 1 - \delta. \quad (29)$$

Building on this inequality, with $\hat{X} = \hat{U}(\pi)$ and $\mathbb{E}[\hat{U}(\pi)] = U(\pi)$ (Eq. 8), using $(1 - \delta) \leq 1$ and the variance bound of Theorem 4.1, the following inequality must hold with probability $1 - \delta$:

$$U(\pi) \geq \hat{U}(\pi) - \sqrt{\frac{\text{Var}\big[\hat{U}(\pi)\big]}{\delta}} \geq \hat{U}(\pi) - \sqrt{\frac{Z\, d_2(\rho' \,\|\, \rho'_0)}{\delta N}}. \quad (30)$$

□

The second term of this bound acts as the risk term (cf. Eq. 11). Clearly, this risk is mostly determined by the exposure-based divergence between the new policy and the logging policy. Thereby, it states that the greater the difference between how exposure is spread over documents by the logging policy and by the new policy, the higher the risk involved. Therefore, to optimize this lower bound, one has to balance the maximization of the estimated utility $\hat{U}(\pi)$ against the minimization of risk by not letting $\pi$ differ too much from $\pi_0$ in terms of exposure.
Furthermore, we see that our measure of risk diminishes as $N$ increases. As a result, the risk term will overwhelm the IPS term when $N$ is very low, as there is much risk involved in estimating from only a few interactions. Conversely, when $N$ is very large, the risk term mostly disappears, as the IPS estimate is more reliable when based on large numbers of interactions. Thus, during optimization, the generalization bound is expected to mostly help with avoiding initial decreases in performance, while still converging to the same point as the standard IPS estimator.
Lastly, the $\delta$ parameter determines the safety provided by the risk term: a lower $\delta$ makes it more likely that the generalization bound holds. Accordingly, as $\delta$ increases, the risk term becomes smaller and thus has less effect on optimization.
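Given the bound of Theorem 4.2 as reconstructed above, this behavior is easy to inspect numerically: the risk term shrinks as $1/\sqrt{N}$ and grows as $\delta$ decreases. The constants below are arbitrary illustrations, not values from the paper.

```python
import numpy as np

def lower_bound(u_hat, d2, Z, n, delta):
    """High-confidence lower bound on U(pi) (Theorem 4.2):
    the IPS estimate minus the exposure-based risk term."""
    return u_hat - np.sqrt(Z * d2 / (delta * n))

# More data (larger N) permits more divergence from the logging policy.
for n in [10**2, 10**4, 10**6]:
    print(n, round(lower_bound(u_hat=0.8, d2=1.2, Z=2.13, n=n, delta=0.01), 4))
```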
To the best of our knowledge, this is the first exposure-based generalization bound, which makes it the first method designed for safe optimization in the CLTR setting.

Illustrative comparison. To emphasize the workings and novelty of our exposure-based risk, a comparison of the optimal policies under action-based risk, exposure-based risk, and no risk is shown in Figure 2. We see that IPS without a risk term places the once-clicked document at the first position, with 100% probability. This is very risky, as it greatly impacts the ranking while only being based on a single observation. The action-based risk tries to mitigate this with a probabilistic policy that gives most probability to the logging policy ranking (90%) and the remainder to the IPS ranking (10%). In contrast, with exposure-based risk, the optimal policy makes the risk-utility trade-off in a single ranking that mostly follows the logging policy but places the clicked document slightly higher.
This example illustrates that, because action-based risk has no similarity measure between rankings, it can only produce a probabilistic interpolation between the logging policy ranking and the IPS ranking. Alternatively, because exposure-based risk does have such a measure, it produces a ranking that is neither the logging ranking nor the IPS ranking, but one with an exposure distribution that is similar to both. Thereby, exposure-based risk balances utility maximization and risk minimization in the CLTR setting in a more elegant and natural manner.

A NOVEL COUNTERFACTUAL RISK MINIMIZATION METHOD FOR LTR
Now that we have the proven generalization bound of Section 4.3 (Theorem 4.2), we can propose a novel risk-aware CLTR method that optimizes it. The aim of our method is to find the policy that maximizes this high-confidence lower bound on the true performance. In formal terms, we have the following optimization problem:

$$\pi^* = \arg\max_\pi\; \hat{U}(\pi) - \sqrt{\frac{Z\, d_2(\rho' \,\|\, \rho'_0)}{\delta N}}. \quad (31)$$

We propose to train a stochastic policy $\pi$ via stochastic gradient descent; therefore, we need to derive the gradient and find a method of computing it. For the computation of the gradient w.r.t. the utility $\hat{U}(\pi)$, the first part of Eq. 31, we refer to prior work that discusses this topic extensively [24, 27, 50]. Thus, we can focus our attention on the second part of Eq. 31:

$$\nabla_\pi \sqrt{\frac{Z\, d_2(\rho' \,\|\, \rho'_0)}{\delta N}} = \sqrt{\frac{Z}{\delta N}}\; \frac{\nabla_\pi\, d_2(\rho' \,\|\, \rho'_0)}{2 \sqrt{d_2(\rho' \,\|\, \rho'_0)}}. \quad (32)$$

To derive the gradient of the exposure-based divergence function, we use the relation between $\rho$ and $\rho'$ from Eq. 16 and 18:

$$\nabla_\pi\, d_2(\rho' \,\|\, \rho'_0) = \mathbb{E}_q \Big[ \sum_{d \in D_q} \frac{2\, \rho'(d)}{\rho'_0(d)}\, \nabla_\pi\, \rho'(d) \Big] = \frac{1}{Z}\, \mathbb{E}_q \Big[ \sum_{d \in D_q} \frac{2\, \rho'(d)}{\rho'_0(d)}\, \nabla_\pi\, \rho(d) \Big]. \quad (33)$$

Thus, we only need the gradient w.r.t. the exposure of a document, $\nabla_\pi\, \rho(d)$, to complete our derivation. If $\pi$ is a Plackett-Luce (PL) ranking model, one can make use of the specialized gradient computation algorithm from [24]. However, for this work, we make no further assumptions about $\pi$ and apply the more general log-derivative trick from the REINFORCE algorithm [48]:

$$\nabla_\pi\, \rho(d) = \mathbb{E}_{y \sim \pi(\cdot \mid q)}\Big[ P\big(E = 1 \mid \text{rank}(d \mid y)\big)\, \nabla_\pi \log \pi(y \mid q) \Big]. \quad (34)$$

Putting all of the previous elements back together gives us the gradient w.r.t. the exposure-based risk function:

$$\nabla_\pi\, d_2(\rho' \,\|\, \rho'_0) = \frac{2}{Z}\, \mathbb{E}_q\, \mathbb{E}_{y \sim \pi(\cdot \mid q)}\Big[ \nabla_\pi \log \pi(y \mid q) \sum_{k=1}^{K} \frac{\rho'(y_k)}{\rho'_0(y_k)}\, P(E = 1 \mid k) \Big], \quad (35)$$

where $y_k$ is the document at rank $k$ in ranking $y$. For a close approximation of this gradient, we replace the expectations with the queries from the given dataset and with rankings sampled from $\pi$ during optimization [24, 48]. Similarly, since the exact computation of $d_2(\rho' \,\|\, \rho'_0)$ is infeasible in practice, we introduce a sample-based empirical divergence estimator:

$$\hat{d}_2(\rho' \,\|\, \rho'_0) = \frac{1}{N} \sum_{i=1}^{N} \sum_{d \in D_{q_i}} \frac{\rho'(d \mid q_i, \pi)^2}{\rho'_0(d \mid q_i, \pi_0)}. \quad (36)$$

This is an unbiased estimate of the true divergence, given that the sampling process is truly Monte Carlo [13].
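A minimal sketch of the empirical divergence estimator of Eq. 36, assuming the (estimated) normalized exposures of both policies are available per logged query; in practice $\rho'(d \mid q, \pi)$ would itself be approximated from rankings sampled from $\pi$, with gradients obtained through the log-derivative trick of Eq. 34. The data layout is hypothetical.

```python
import numpy as np

def empirical_d2(logged_queries, rho_new, rho_log, eps=1e-12):
    """Sample-based estimate of d2(rho' || rho0') over logged queries (Eq. 36).
    rho_new[q] / rho_log[q]: normalized exposure of each candidate document
    of query q under the new policy and the (estimated) logging policy."""
    total = 0.0
    for q in logged_queries:
        p, p0 = rho_new[q], np.maximum(rho_log[q], eps)  # guard divisions
        total += np.sum(p ** 2 / p0)
    return total / len(logged_queries)
```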

EXPERIMENTAL SETUP
For our experiments, we follow the semi-synthetic experimental setup that is common in the CLTR literature [18, 29, 30, 43]. We make use of the three largest publicly available LTR datasets: Yahoo! Webscope [2], MSLR-WEB30k [32], and Istella [8]. The datasets consist of queries, a preselected list of documents per query, query-document feature vectors, and manually-graded relevance judgements for each query-document pair. To generate clicks, we follow previous work [29, 30, 43] and train a logging policy on a 3% fraction of the relevance judgements. This simulates a real-world setting where a production ranker, trained on manual judgements, is used to collect click logs, which are then used for subsequent click-based optimization. Typically, in real-world ranking settings, since the production ranker runs on live traffic, it is deemed a safe policy that can be trusted with real users. We simulate a top-$K$ ranking setup [27] where $K = 5$ documents are presented at once. Clicks are generated with our assumed click model (Eq. 3) and a rank-based position bias under which the examination probability decreases with the display rank. In real-world click data, the observed CTR is typically very low [4, 19, 36]; hence, to simulate such a sparse click setting, we apply a transformation from relevance judgements to relevance probabilities that keeps click probabilities low, where $rel(d, q) \in \{0, 1, 2, 3, 4\}$ is the relevance judgement for the query-document pair and 0.2 is added as click noise. During training, the only available data consists of clicks generated on the training and validation sets; no baseline method has access to the underlying relevance judgements (except the skyline).
Furthermore, we assume a setting where the exact logging policy is not available during training. As a result, the $\rho_0$ propensities have to be estimated; we use a simple frequency estimate following [29]:

$$\hat{\rho}_0(d) = \frac{1}{|\{i : q_i = q\}|} \sum_{i : q_i = q} P\big(E = 1 \mid \text{rank}(d \mid y_i)\big).$$

For the action-based baselines, the action propensities $\pi_0(y \mid q)$ are similarly estimated from observed frequencies:

$$\hat{\pi}_0(y \mid q) = \prod_{k=1}^{K} \hat{\pi}_0(y_k \mid k, q),$$

where $\hat{\pi}_0(d \mid k, q)$ is the estimated probability of $d$ appearing at rank $k$ for query $q$. As is common in CLTR [18, 23, 37], we clip propensities by $10/\sqrt{N}$ in the training set, to reduce variance, but not in the validation set.
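A sketch of the frequency-based propensity estimate, under the assumption that the rank-based examination probabilities are known: the exposure of each document is averaged over the rankings logged for its query. The data layout is hypothetical.

```python
import numpy as np
from collections import defaultdict

def estimate_propensities(logs, exam_prob, n_docs):
    """Frequency estimate of rho_0(d): the examination probability of each
    document, averaged over the displayed rankings logged for its query."""
    sums = defaultdict(lambda: np.zeros(n_docs))
    counts = defaultdict(int)
    for q, ranking in logs:            # logs: (query_id, displayed ranking)
        counts[q] += 1
        for k, d in enumerate(ranking, start=1):
            sums[q][d] += exam_prob(k)
    return {q: sums[q] / counts[q] for q in counts}
```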
We optimize neural PL ranking models [24] with early stopping based on validation clicks to prevent overfitting. For the REINFORCE policy gradient, we follow [50] and use the average reward per query as a control variate for variance reduction.
As our evaluation metric, we compute NDCG@5 using the relevance judgements on the test split of each dataset [14]. All reported results are averages over ten independent runs; significance testing is performed with a two-sided Student's t-test.
Finally, the following methods are included in our comparisons:
(i) Naive. As the most basic baseline, we train on the generated clicks without any correction (equivalent to $\forall d,\ \rho_0(d) = 1$).
(ii) Skyline. To show the highest possible performance, this baseline is trained on the actual relevance judgements.
(iii) Action-based IPS. Standard IPS estimation (Eq. 13) that is not designed for ranking and thus uses action-based propensities.
(iv) Action-based CRM. Standard CRM (Eq. 12), also not designed for ranking; for the risk function we use the action-based divergence in Eq. 15.
(v) Exposure-based IPS. The IPS estimator designed for CLTR with exposure-based propensities (Eq. 7); the most important baseline, as it is the prevalent approach in the field [27, 29].
(vi) Exposure-based CRM. Our proposed CRM method (Eq. 31), using a risk function based on exposure-based divergence (Eq. 19).
RESULTS AND DISCUSSION

Comparison with baseline methods

The main results of our experimental comparison are presented in Figure 3 and Table 3. Figure 3 displays the performance curves of the different methods as the number of logged interactions ($N$) increases. Table 3 presents performance at $N \in \{4 \cdot 10^2, 4 \cdot 10^7, 10^9\}$ and indicates whether the observed differences with our exposure-based CRM method are statistically significant. We start by considering the performance curves in Figure 3. We see that both the action-based and exposure-based IPS baselines have an initial period of very similar performance that is far below the logging policy. Around $N \approx 10^4$ their performance is comparable to the logging policy; finally, at $N = 10^9$, exposure-based IPS has reached optimal performance, while the performance of action-based IPS is still far from optimal. We attribute this initial poor performance to the high variance of IPS estimation: when $N$ is small, variance is at its highest, resulting in risky and sub-optimal optimization by the IPS estimators. However, even when $N = 10^9$, the variance of the action-based IPS estimator is too high to reach optimal performance, due to its extremely small propensities. This illustrates why the introduction of exposure-based propensities was so important to the CLTR field, and that even exposure-based IPS produces unsafe optimization when little data is available or variance from interactions is high.
Next, we consider whether action-based CRM is able to mitigate the high-variance problem of action-based IPS. Despite optimizing a proven generalization bound, Figure 3 clearly shows that action-based CRM only decreases performance compared to its IPS counterpart. This appears to happen because the logging policy is not available in our setup and the propensities have to be estimated from logged data. Consequently, the action-based risk pushes optimization to mimic the exact rankings that were observed during logging. Due to the variance introduced by sampling rankings from the logging policy, action-based CRM thus appears to have an even greater variance problem than action-based IPS. As expected, our results clearly indicate that action-based CRM is also unsuited for the CLTR setting; to our surprise, it is substantially worse than its IPS counterpart.
Finally, we examine the performance of our novel exposure-based CRM method. Similar to the other methods, there is an initial period of low performance, but in stark contrast, this period ends very quickly: on Yahoo! logging policy performance is reached when $N \approx 125$, on MSLR-WEB30k when $N \approx 350$, and on Istella when $N \approx 400$. For comparison, exposure-based IPS needs $N \approx 1100$ on Yahoo!, $N \approx 10^4$ on MSLR-WEB30k and $N \approx 1.1 \cdot 10^4$ on Istella to do the same; our CRM method thus needs roughly 89%, 97% and 97% fewer interactions, respectively. In addition, Table 3 indicates that exposure-based CRM matches the logging policy performance on all datasets when $N = 400$, where it also outperforms all baseline methods. We note that there is still an initial period of low performance, because the logging policy is unavailable during training, and thus its behavior still has to be estimated from logged interactions. It is possible that in settings where the logging policy is fully known during training, this initial period is eliminated entirely. Nevertheless, our results show that exposure-based CRM shortens the initial period of variance-induced poor performance by an enormous magnitude.
Furthermore, while the initial period is clearly improved, we should also consider whether there is a trade-off with the rate of convergence. Surprisingly, Figure 3 does not display any noticeable decrease in performance compared with exposure-based IPS. Moreover, Table 3 shows that the differences between exposure-based IPS and CRM are barely measurable and not statistically significant when $N \in \{4 \cdot 10^7, 10^9\}$. We know from the risk formulation in Eq. 31 that the weight of the risk term decreases at a rate of $1/\sqrt{N}$ as $N$ increases. In other words, the more data is available, the more optimization is able to diverge from the logging policy. This appears to balance utility maximization and risk minimization so well that we are unable to observe any downside of applying exposure-based CRM instead of IPS. Therefore, we conclude that, compared to all baseline methods and across all datasets, exposure-based CRM drastically reduces the initial period of low performance, matches the best rate of convergence of all baselines, and has optimal performance at convergence.

Ablation study on the confidence parameter
To gain insight into how the confidence parameter $\delta$ affects the trade-off between safety and utility, we performed an ablation study over various $\delta$ values for both CRM methods. The top row of Figure 4 shows the performance of action-based CRM; contrary to expectation, a decrease in $\delta$ corresponds to considerably worse performance. For clarity: in theory, $\delta$ is inversely tied to safety, so a lower $\delta$ should result in less divergence from the safe logging policy [30]. Yet action-based CRM displays the opposite trend. We think this further confirms our hypothesis that a frequency estimate of action-based divergence has an even greater variance problem than action-based IPS; consequently, a higher weight on the risk function results in worse performance. This further confirms our earlier conclusion that action-based CRM is unsuited for the CLTR setting, regardless of how the $\delta$ parameter is tuned.
In contrast, the bottom row of Figure 4 displays the expected trend for exposure-based CRM: as $\delta$ decreases, the resulting performance gets closer to the logging policy. With $\delta = 0.1$, CRM performs extremely close to its IPS counterpart, as optimization is less constrained to mimic the logging policy. Decreasing $\delta$ appears to have diminishing returns, as the difference between $\delta = 10^{-4}$ and $\delta = 10^{-5}$ is marginal. Importantly, we do not observe any downsides to setting $\delta = 10^{-5}$; thus, we have not reached a point in our experiments where $\delta$ is set too conservatively. This suggests that exposure-based CRM is very robust to the setting of the $\delta$ parameter, and that a sufficiently low $\delta$ does not require fine-tuning. It also shows that the improvements we observed when comparing with baseline methods did not stem from fine-tuning $\delta$. Thus, we conclude that this robustness further increases the safety provided by exposure-based CRM, as there is little risk involved in tuning the $\delta$ parameter.

CONCLUSION
In this paper, we introduced the first counterfactual risk minimization (CRM) method designed for CLTR, which relies on a novel exposure-based divergence function. In contrast with existing action-based CRM methods, exposure-based divergence avoids the enormous combinatorial action space of ranking by measuring the dissimilarity between policies based on how they distribute exposure to documents. As a result, exposure-based CRM optimization produces policies that rank similarly to the logging policy when following IPS is risky, i.e., when little data is available or variance is very high. Consequently, our experimental results show that it almost completely removes initial periods of detrimental performance; to be precise, our method needed 89% to 97% fewer interactions than state-of-the-art IPS to match production system performance. Importantly, we observed no downsides to its application, as it maintained the same rate and point of convergence as IPS in all tested experimental settings. Therefore, we conclude that our exposure-based CRM method provides the safest CLTR method so far, as it almost completely alleviates the risk of decreasing the performance of a production system.
These improvements have large implications for practitioners who work on ranking systems in real-world settings, since the almost complete removal of initial detrimental performance takes away the main risks involved in applying CLTR. In other words, when applying our novel exposure-based CRM, practitioners need to worry significantly less that the resulting policy will perform worse than their production system and hurt the user experience.
We hope future work will further explore the promising applications of exposure-based CRM, for instance, in settings with fast deployment turn-around times, or with large numbers of tail queries [46, 47], where interaction data is limited.

ACKNOWLEDGMENTS

This work used the Dutch national e-infrastructure with the support of the SURF Cooperative, using grant no. EINF-4963. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

REPRODUCIBILITY
All experimental results in this work were obtained using publicly available data. Our implementation is publicly available at https://github.com/shashankg7/crm_ultr.

Figure 2: Example comparison of the optimal policy for a single logged click according to three different risk estimators.

Figure 3: Performance in NDCG@5 of various IPS and CRM methods for CLTR. The top row presents the results when the size of the training data is varied from extremely small ($10^2$) to extremely high ($10^9$). The bottom row is a zoomed-in view, focusing on the low-data region from $10^2$ to $10^5$. Results are averages over 10 runs; shaded areas indicate 80% confidence intervals.

Figure 4: Performance of CRM methods with varying confidence parameter ($\delta$). Top row: action-based CRM baseline; bottom row: our exposure-based CRM method. Results are averages of 10 runs; shaded areas indicate 80% confidence intervals.