Mondrian Conformal Predictor

Mondrian conformal predictors (MCPs) represent a wide class of confidence predictors which is the generalisation of conformal predictors and inductive conformal predictors with a new property of validity, validity within categories; this kind of validity is stronger (for large data sets) than the usual conservative validity for conformal predictors.

Definition

A Mondrian taxonomy is a measurable function $\kappa:N \times Z \to K$, where $Z$ is a set of observations and $K$ is a countable set of elements called categories, that satisfies some natural properties given in Vovk et al. (2005), Section 4.5. The definition involve partitions into rectangles, reminiscent of Piet Mondrian's paintings (see also Scott Locklin's blog). However, in this article we will use a simpler definition given in Balasubramanian et al. (2014), Section 2.2.

A simple Mondrian taxonomy is a function $K$ that maps any finite sequence of observations $z_1,\ldots,z_n$ to a same-length sequence $\kappa_1,\ldots,\kappa_n$ of categories that is equivariant with respect to permutations: \[(\kappa_1,\ldots,\kappa_n) = K(z_1,\ldots,z_n) \Longrightarrow (\kappa_{\pi(1)},\ldots,\kappa_{\pi(n)}) = K(z_{\pi(1)},\ldots,z_{\pi(n)}).\] We refer to $\kappa_i$ as the category of $z_i$. A $K$-conditional conformity measure is a measurable function $A$ that maps every data sequence $z_1,\ldots,z_n$ to a same-length sequence $\alpha_1,\ldots,\alpha_n$ and that is equivariant with respect to permutations leaving the categories intact.

The Mondrian conformal predictor determined by a $K$-conditional conformity measure $A$ and a set of significance levels $\epsilon_k$, $k \in K$, is defined by \[ \Gamma^ {(\epsilon_k: k \in K)} (x_1, y_1, \ldots, x_{n}, y_{n}, x_{n+1}):= \left\{y \in \mathbf{Y} \mid p^y > \epsilon_y\right\},\] where \[ p^y := \frac{|\{i=1,\ldots,n+1 \mid \kappa_i^y = y_{n+1}^y \;\&\; \alpha^y_i \le \alpha^y_{n+1}\}|}{|\{i=1,\ldots,n+1: \kappa^y_i = \kappa^y_{n+1}\}|},\] where $(\kappa^y_1,\ldots,\kappa^y_{n+1}) := K(z_1,\ldots,z_{n+1})$ and \[ (\alpha^y_1,\ldots,\alpha^y_{n+1}) := A(z_1,\ldots,z_n,(x_{n+1},y)).\] (Formally, as defined here, an MCP is a confidence predictor only when all $\epsilon_k$ coincide.)

The standard assumption for MCPs is the randomness assumption (also called the IID assumption).

A simple example is given by label-conditional conformal predictors, which require the label space $\mathbf{Y}$ to be finite. In this case the set of categories $K=\mathbf{Y}$ coincides with the set of labels, and the category of an observation $(x_i,y_i)$ is simply its label, $\kappa_i:=y_i$.

Another example is where we want to make the analysis conditional on a type of objects. For example, if the objects are people, we might want to define $\kappa_i:=\text{male}$ if $x_i$ is male and $\kappa_i:=\text{female}$ if $x_i$ is female.

Validity

All the statements in the section are given under the randomness assumption.

Smoothed MCPs are defined analogously to smoothed conformal predictors: the definition of $p^y$ is now modified to \[ p^y := \frac{|\{i=1,\ldots,n+1 \mid \kappa_i^y = y_{n+1}^y \;\&\; \alpha^y_i < \alpha^y_{n+1}\}| + \eta_y |\{i=1,\ldots,n+1 \mid \kappa_i^y = y_{n+1}^y \;\&\; \alpha^y_i = \alpha^y_{n+1}\}|}{|\{i=1,\ldots,n+1: \kappa^y_i = \kappa^y_{n+1}\}|},\] and the rest of the definition is the same. Here $\eta_y\sim U[0,1]$ are random numbers (chosen independently at different trials and independently of the observations).

In the next theorem we assume that $\kappa_i$ depends only on $z_i$ (without this assumption, we can still guarantee the right coverage probability but the independence of errors may be lost).

Theorem Any smoothed MCP is category-wise exact w.r. to $K$, that is, for all $n$, the conditional probability distribution of $(p_1,p_2,\ldots)$ given $\kappa_1,\kappa_2,\ldots$ is uniform on ${[0,1]}^{\infty}$, where the observations $z_1,z_2,\ldots$ are generated from an exchangeable probability distribution on $Z^{\infty}$ and $p_i:=p_i^{y_i}$ are the p-values for the true labels.

This implies that in the long run the frequency of erroneous predictions given by smoothed MCPs on examples of category {$k$} is equal to {$\epsilon_k$} for each {$k$}. Therefore, the frequency of errors made by MCPs on examples of category $k$ does not exceed $\epsilon_k$ for each $k$.

An important feature of MCPs is that the theorem only requires the observations in the same category to be exchangeable: it will remain true if the equality in the definition of exchangeability is only required to hold for permutations $\pi$ leaving the categories intact. A nice example (due to Daniil Ryabko) showing the significance of this relaxation is recognizing handwritten text (assuming the text has been divided into separate symbols): the label-conditional conformal predictor will be valid provided that different instances of the same symbol are exchangeable; we do not need to require that the sequence of symbols in the text be exchangeable (this assumption would be patently wrong).

Applications

ExCAPE project

Bibliography

Vineeth N. Balasubramanian, Shen-Shyang Ho, and Vladimir Vovk, editors (2014). Conformal Prediction for Reliable Machine Learning: Theory, Adaptations, and Applications. Morgan Kaufmann, Chennai.
Vladimir Vovk, Alexander Gammerman, and Glenn Shafer (2005). Algorithmic learning in a random world. Springer, New York.