Linear Probability Forecasting

This generalization of the online linear regression framework describes a way of predicting outcomes from the probability simplex. Let $\Sigma$ be a set with a finite number of elements $d$; then each outcome belongs to $\Omega = \mathcal{P}(\Sigma)$, the space of all probability measures over $\Sigma$.
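Concretely, identifying the elements of $\Sigma$ with $1,\dots,d$, the outcome space is the probability simplex

$\displaystyle{\Omega = \mathcal{P}(\Sigma) = \left\{(p^1,\dots,p^d) \in \mathbb{R}^d : p^i \ge 0,\ \sum_{i=1}^d p^i = 1\right\}.}$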

In this setting, each outcome has $d$ components. A natural algorithm for predicting such an outcome is the coordinate-wise application of the Aggregating Algorithm Regression (AAR); an equivalent algorithm is obtained if all the experts are mixed directly using the Aggregating Algorithm. However, when the outcome belongs to the simplex, simple linear experts are not very efficient: their predictions range over the whole space. In this paper, more efficient linear experts are considered, whose predictions lie in the hyperplane containing the simplex.

The learner competes with all linear functions (experts) $\xi_t=(\xi^1_t,\dots,\xi^d_t)'$ of $x_t$: \begin{align*} \xi^1_t &= 1/d + \alpha_1 ' x_t\\ &\ldots\\ \xi^{d-1}_t &= 1/d + \alpha_{d-1} ' x_t\\ \xi^d_t &= 1 - \xi^1_t - \dots - \xi^{d-1}_t = 1/d - \left(\sum_{i=1}^{d-1}\alpha_i\right) ' x_t, \end{align*} where $\alpha_i=(\alpha_i^1,\dots,\alpha_i^n)',\enspace i=1,\ldots,d-1$. In this model the prediction for the last component of the outcome is calculated from the predictions for the other components. Denote $\alpha = (\alpha_1',\dots,\alpha_{d-1}')' \in \Theta = \mathbb{R}^{n(d-1)}$. Then any expert can be represented as $\xi_t=\xi_t(\alpha)$.
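As an illustration (not code from the paper; the function name and NumPy usage are ours), the following sketch computes an expert's prediction $\xi_t(\alpha)$ from a parameter matrix whose rows are $\alpha_1,\dots,\alpha_{d-1}$. By construction the components sum to 1, so the prediction lies in the hyperplane containing the simplex, although it need not be non-negative.

    import numpy as np

    def expert_prediction(alpha, x):
        """Prediction xi_t(alpha) of the linear expert parameterised by alpha.

        alpha : array of shape (d-1, n) -- rows alpha_1, ..., alpha_{d-1}
        x     : array of shape (n,)     -- signal x_t
        Returns an array of shape (d,) whose components sum to 1.
        """
        d = alpha.shape[0] + 1
        first = 1.0 / d + alpha @ x      # xi^1_t, ..., xi^{d-1}_t
        last = 1.0 - first.sum()         # xi^d_t = 1 - xi^1_t - ... - xi^{d-1}_t
        return np.append(first, last)

    # Toy example: d = 3 outcomes, n = 2 features
    alpha = np.array([[0.1, -0.2],
                      [0.0,  0.3]])
    x = np.array([1.0, 2.0])
    print(expert_prediction(alpha, x))   # components sum to 1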

Two algorithms are proposed. The first applies the AAR component-wise and then projects the resulting prediction onto the hyperplane. It is proven that this algorithm competes with the experts described above and achieves the following bound.
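For concreteness, here is a minimal NumPy sketch of such a component-wise scheme, assuming the standard ridge-type AAR prediction formula (with the current signal included in the Gram matrix) and the orthogonal projection onto the hyperplane $\{p : \sum_{i=1}^d p^i = 1\}$; the function names and toy data are illustrative, and the projection used in the paper may differ.

    import numpy as np

    def aar_predict(a, X_past, y_past, x_new):
        """AAR prediction for one outcome component.

        a      : regularisation parameter a > 0
        X_past : array of shape (t-1, n) with past signals x_1, ..., x_{t-1}
        y_past : array of shape (t-1,) with past values of this component
        x_new  : array of shape (n,), the current signal x_t

        The current signal enters the Gram matrix but not the b vector.
        """
        n = x_new.shape[0]
        A = a * np.eye(n) + X_past.T @ X_past + np.outer(x_new, x_new)
        b = X_past.T @ y_past if len(y_past) else np.zeros(n)
        return b @ np.linalg.solve(A, x_new)

    def componentwise_prediction(a, X_past, Y_past, x_new):
        """Predict every component with the AAR, then orthogonally project
        onto the hyperplane {p : sum_i p^i = 1} (an illustrative choice)."""
        d = Y_past.shape[1]
        raw = np.array([aar_predict(a, X_past, Y_past[:, i], x_new)
                        for i in range(d)])
        return raw + (1.0 - raw.sum()) / d

    # Toy usage: d = 3, n = 2, two past observations
    X_past = np.array([[1.0, 0.0], [0.0, 1.0]])
    Y_past = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
    print(componentwise_prediction(1.0, X_past, Y_past, np.array([0.5, 0.5])))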

Theorem. If $\|x_t\|_\infty \le X$ for all $t$, then for any $a > 0$, every positive integer $T$, every sequence of outcomes of length $T$, and any $\alpha \in \mathbb{R}^{n(d-1)}$, the loss $L_T$ of the component-wise algorithm satisfies

$\displaystyle{L_T \le L_T(\alpha) + da\|\alpha\|^2_2 + \frac{nd}{4}\ln\left(\frac{TX^2}{a}+1\right).}$

The second algorithm mixes all the experts directly and is called the mAAR (multidimensional AAR). It achieves the following bound.

Theorem. If $\|x_t\|_\infty \le X$ for all $t$, then for any $a > 0$, every positive integer $T$, every sequence of outcomes of length $T$, and any $\alpha \in \mathbb{R}^{n(d-1)}$, the loss of $\mathrm{mAAR}(2a)$ satisfies

$\displaystyle{L_T(\mathrm{mAAR}(2a)) \le L_T(\alpha) + 2a\|\alpha\|^2_2 + \frac{n(d-1)}{2}\ln\left(\frac{TX^2}{a}+1\right).}$

The second upper bound has a worse constant, $\frac{n(d-1)}{2}$, in front of the logarithm than the $\frac{nd}{4}$ of the first bound when $d\ge3$. In this case the second bound is asymptotically worse in $T$ than the first one, but it is better in the beginning, especially when the norm $\|\alpha\|$ of the best expert is large. This can happen in the important case when the dimension of the input vectors is much larger than the number of prediction steps: $n \gg T$.
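For example, for $d = 3$ the regret terms are $3a\|\alpha\|^2_2 + \frac{3n}{4}\ln\left(\frac{TX^2}{a}+1\right)$ for the component-wise algorithm and $2a\|\alpha\|^2_2 + n\ln\left(\frac{TX^2}{a}+1\right)$ for the mAAR, so the mAAR bound is smaller exactly when $a\|\alpha\|^2_2 > \frac{n}{4}\ln\left(\frac{TX^2}{a}+1\right)$, that is, for a large $\|\alpha\|$ or a small $T$.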

Experiments on five data sets show that the online performance of the proposed algorithms is better than that of batch logistic regression (trained once), but worse than that of online logistic regression (retrained at every step). At the same time, they are faster than online logistic regression.

Bibliography

  • F. Zhdanov and Y. Kalnishkan. Linear Probability Forecasting, to appear in AIAI 2010 proceedings.