Predicting the Future:

AI Approaches to Time-series Problems

The goal of the workshop was to bring together AI researchers who study time-series problems, along with practitioners and researchers from related fields. These problems are of particular interest because of the large number of high-profile applications today that include historical time series (e.g., prediction of market trends, crisis monitoring). The focus was primarily on machine learning and data mining approaches, but we included perspectives on statistical time-series analysis and on state-space analysis (e.g., work on hidden Markov models). These communities arguably overlap significantly (and indeed, much work on state-space analysis, for example, has fallen under the heading of machine learning). Our goal was to make researchers and practitioners not only aware of approaches similar to their own, but also aware of methods from other communities that might be applied effectively to their problems.

Time-series problems include segmentation and labeling of time series as well as prediction. Classical time-series prediction involves fitting a function to a numeric time series in order to predict future values (e.g, the prediction of a stock's price given its past performance). Time-series prediction may also be required for problems involving categorical, rather than numeric, data. For instance, one might be interested in predicting the next Unix command that will be typed by a user of a system, given their history of Unix commands.

In many situations, the problem is not to predict future values of a time series but rather to label the series. For example, cardiologists examine electrocardiograms in order to diagnose arrhythmias. Problems of segmentation and labeling often occur together. The problem may be to assign a label to an entire time series, or to segment the series into subsequences and label them individually. The latter problem is more difficult because there is temporal information to consider both within each subsequence itself and among the various subsequences. In the extreme case, the subsequences may be single atomic events, where temporal information exists among the events, but there is no temporal information within the event itself.

The labeling problems we have described assume that there are fixed boundaries that define the pieces of the time series that are to be labeled. The boundaries, however, may not be fixed. For instance, consider following a pilot's commands as they fly through a multi-stage flight plan. The problem is to identify the stages and to label them. Here the boundaries of the subsequences will not be fixed for all flight plans, but will be variable. The problem becomes not only labeling the stages, but identifying the boundaries themselves.

In other cases the identification of the boundary may be paramount. For example, consider the problem of detecting credit card fraud. Time-series information exists in the form of a stream of credit card transactions for an account, and the problem is to decide at any given time whether the account has been defrauded. In real time this would involve examining (sub)sequences of account activity, and determining, as soon as possible, whether fraudulent activity exists. Here the problem is not only to label accurately, but to identify as closely as possible the point where fraudulent behavior begins, so that usage can be stopped and losses minimized.

The working notes contain 16 papers, seven of which were selected for presentation at the workshop. In keeping with the goal of making the workshop a resource for researchers and practitioners in this area, we included in the working notes a small selection of relevant papers that also appear elsewhere. The papers presented at the workshop span the range of problems described above.

Brian Davison and Haym Hirsh consider the problem of time-series prediction, where the data are categorical.

Two papers look at the problem of time-series classification. Keogh and Pazzani discuss a general method for representing time series, combining a piecewise linear representation of the series with weights to indicate the relative importance of the segments. Tim Oates, David Jensen, and Paul Cohen consider the related problem of clustering events.

Michael Harries, Kim Horn, and Claude Sammut address the problem of time-series subsequence labeling, where the boundaries are not fixed. The problem of flight plan segmentation and classification described earlier is among the problems discussed in their paper.

Three of the papers presented look at the problem of boundary identification. Terran Lane considers the computer security task of anomaly prediction. The task is to characterize the behaviors of a computer user with a profile so that unusual occurrences can be detected. The main point here is not only to detect legitimate and fraudulent behavior accurately, but to minimize the average time to detection -- i.e., to locate the boundary of fraudulent behavior. In their paper, Pierre Geurts and Louis Wehenkel address a similar problem of crisis monitoring. The task they describe is the prediction of power system blackouts. Again, the task is one of labeling events as critical or normal, but the primary focus is on the identification of the boundary between normal and critical events. Gary Weiss and Haym Hirsh more generally address the problem of predicting rare events, such as predicting the failure of a piece of equipment in a telecommunications network.

In addition to presentations based on the papers in the proceedings, there were two invited talks. Leslie Kaelbling gave a brief tutorial on reinforcement learning and how it might be applied to problems in time series. Padhraic Smyth discussed the importance of not dismissing well-understood techniques from statistics, at least as a first method of attack when addressing new problems.

Discussions were held throughout the day, and a number of common issues arose. One recurring issue is the difficulty of obtaining data. While time-series problems are becoming increasingly important and prominent, much of the data of interest (for instance, financial or marketing data) are difficult to obtain due their sensitive nature. User-profiling data are similarly difficult to share because of privacy concerns. A goal of the researchers at the workshop is to try to overcome some of these difficulties and to establish a common repository of time-series data. By making these data commonly available, researchers can test new algorithms more easily, and the field can benefit from replication of experiments and comparison of results.