Joggling

Sparse Moving Averages (SMAs) for Lifelong Open-Ended Probability Prediction

Sparse moving averages (SMAs) are designed for probabilistic prediction and learning on long, open-ended streams of items. In these problems, change is paramount, hence the 'moving' in 'moving average' (and the 'averaging' is what estimates probabilities here). The number of possibilities, i.e., the targets of prediction, is large and possibly unbounded (open-ended). When learning (observing and updating weights), one can imagine the observation as an infinite-dimensional vector with all entries 0 except for a single 1 at the observed item (more generally, a few 1s). The predictions are also 'sparse' (or had better be, for efficiency): at any time point, only a small subset of items (e.g., up to tens or hundreds) get nonzero probabilities from the predictor. The predictor is not given the list of items it will observe beforehand. Instead, it processes a long or infinite stream of items and repeatedly predicts, observes, and updates (learns); this has been referred to as the prequential loop. The predictor has finite space and should be efficient in both updating and predicting. The frequencies of items in the stream can change: a currently frequent item may cease to appear, and a new item (event) can occur for the first time and become highly frequent thereafter, for some span of time. The predictor thus tries to keep track of, and predict, the currently salient items, those with a (recent) probability above a threshold.
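A minimal sketch of this prequential loop with a simple sparse EMA is given below. All names and parameter values (SparseEMA, rate, min_prob) are illustrative assumptions for exposition, not the exact definitions used in the paper.

from collections import defaultdict

# Illustrative sketch: each observed item is treated as a (sparse) one-hot
# target. An exponential moving average of these one-hot vectors estimates a
# (recent) probability for each item; entries that fall below a threshold are
# dropped so the predictor stays sparse.
class SparseEMA:
    def __init__(self, rate=0.01, min_prob=0.001):
        self.rate = rate          # learning rate (how fast old evidence decays)
        self.min_prob = min_prob  # drop items whose estimate falls below this
        self.probs = defaultdict(float)  # item -> estimated (recent) probability

    def predict(self):
        # The currently salient items and their probability estimates.
        return dict(self.probs)

    def update(self, observed_item):
        # Decay all tracked items, then boost the observed one (one-hot target):
        # p_i <- (1 - rate) * p_i + rate * [i == observed_item]
        for item in list(self.probs):
            self.probs[item] *= (1.0 - self.rate)
            if self.probs[item] < self.min_prob:
                del self.probs[item]   # keep the map sparse
        self.probs[observed_item] += self.rate

# The prequential loop: predict, observe, update, repeat.
def run(stream):
    sma = SparseEMA()
    for item in stream:
        prediction = sma.predict()   # use/evaluate the prediction here
        sma.update(item)             # then learn from the observed item
    return sma

The decay-then-boost update keeps the estimates approximately normalized while thresholding keeps only the currently salient items in memory.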

Our underlying, implicit, informal assumption is that over sufficiently long spans of time there exist sufficiently many stable (predictable) items for this task of learning and tracking changing probabilities to be useful (stable meaning the item occurs several times with a roughly fixed and sufficiently large, i.e., detectable, frequency in a given time span). In some input streams, most things may stay the same, but from time to time some of them change (possibly abruptly), and the predictor is not told which, or when. We develop and explore a number of SMAs, both learning-rate based ones such as sparse EMA and counting-based ones, and find that keeping predictand-specific parameters, such as a learning rate per prediction edge, can lead to a superior trade-off between plasticity (timely adaptation) and stability. Empirical evaluation, i.e., comparing different SMA techniques, is itself tricky: which items to evaluate on, and how much history to use, depend on the item's probability. We develop a bounded log-loss technique, and we also compare on synthesized streams (where we know the true probabilities).
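As a rough illustration of these two ideas, the sketch below gives each tracked item its own learning rate (high when the item is newly seen, decaying toward a floor as evidence accumulates) and scores predictions with a bounded log-loss that caps the penalty when the observed item got (near-)zero probability. The rate schedule (roughly 1/count with a floor) and all names are illustrative assumptions, not the exact methods developed in the paper.

import math

class PerItemRateEMA:
    def __init__(self, rate_init=0.5, rate_floor=0.01, min_prob=0.001):
        self.rate_init = rate_init
        self.rate_floor = rate_floor
        self.min_prob = min_prob
        self.probs = {}   # item -> estimated (recent) probability
        self.counts = {}  # item -> times observed (drives the item's rate)

    def rate(self, item):
        # Predictand-specific rate: high for newly seen items (plasticity),
        # decaying toward a floor as evidence accumulates (stability).
        return max(self.rate_floor, self.rate_init / (1 + self.counts.get(item, 0)))

    def update(self, observed_item):
        self.counts[observed_item] = self.counts.get(observed_item, 0) + 1
        self.probs.setdefault(observed_item, 0.0)
        for item in list(self.probs):
            r = self.rate(item)
            target = 1.0 if item == observed_item else 0.0
            # Each item moves toward its own target at its own rate, so the
            # estimates are only approximately normalized.
            self.probs[item] = (1.0 - r) * self.probs[item] + r * target
            if self.probs[item] < self.min_prob:
                del self.probs[item]

# Bounded log-loss: cap the penalty (at -log(floor)) when the predictor gives
# the observed item zero or tiny probability, so single surprises (e.g., a
# brand-new item) do not dominate the comparison of predictors.
def bounded_log_loss(predicted_probs, observed_item, floor=1e-3):
    p = max(predicted_probs.get(observed_item, 0.0), floor)
    return -math.log(p)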

This open-ended prediction problem arises within Prediction Games, where concepts (structured patterns such as n-grams) predict one another (during every interpretation episode), and new concepts are generated from time to time, causing non-stationarities in existing predictions.

SMAs are applicable to other settings, such as continual personalization and behavior modeling, in particular when one needs to predict quickly from among a possibly large number of possibilities, and when non-stationarities are present, e.g., when some aspects of the modeled behavior change.
