Back to the future in time series forecasting
Sundial points to a new dawn of foundation models.
A recent paper from China introduces Sundial, a family of time series models that appears to represent a significant leap forward in time series forecasting. To understand Sundial's potential impact, it's helpful to trace the evolution of this field, which has deep roots stretching back centuries. Early applications focused on astronomical observations and agricultural predictions, but the formal development of time series analysis began in the 20th century, driven by advances in statistical theory and the increasing availability of computational resources.
Early in the 20th century, statisticians like Yule and Slutsky laid the foundation for autoregressive (AR) and moving average (MA) models. These models provided a mathematical framework for describing and forecasting time series by capturing relationships between a data point and its preceding values. An autoregressive model predicts a value as a linear combination of past values. A moving average model expresses the current value as a linear combination of recent forecast errors (not to be confused with the simple moving average, which smooths a series by averaging data points over a fixed window). Both are fundamental tools but have limitations in capturing complex patterns.
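To make that concrete, here is a minimal NumPy sketch of a one-step forecast that combines an autoregressive part built from past values with a moving average part built from recent shocks. The coefficients are chosen by hand for illustration, not fitted to any data.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(0.0, 1.0, size=200)   # white-noise shocks
y = np.zeros(200)

# Simulate an ARMA(2, 1)-style series:
#   y_t = 0.6*y_{t-1} + 0.3*y_{t-2} + eps_t + 0.4*eps_{t-1}
for t in range(2, 200):
    y[t] = 0.6 * y[t - 1] + 0.3 * y[t - 2] + eps[t] + 0.4 * eps[t - 1]

# AR part of the one-step forecast: a linear combination of past values.
ar_part = 0.6 * y[-1] + 0.3 * y[-2]
# MA part: a linear combination of recent shocks (the expected new shock is 0).
ma_part = 0.4 * eps[-1]

print(f"one-step-ahead forecast: {ar_part + ma_part:.3f}")
```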
As the 20th century progressed, the Box-Jenkins methodology, centered around ARIMA (Autoregressive Integrated Moving Average) models, emerged. ARIMA models, capable of handling non-stationarity – the presence of trends and seasonality – through differencing, became a workhorse of time series analysis. A time series is stationary if its statistical properties (like mean and variance) remain constant over time. Differencing involves calculating the difference between consecutive data points, a technique that can effectively remove trends and stabilize the statistical properties of the data, making it suitable for modeling. During this era, exponential smoothing methods, which assign exponentially decreasing weights to older observations, prioritizing more recent data, also gained prominence. These methods are computationally efficient and useful for data with clear trends or seasonality.
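A short illustration of both ideas, using only NumPy and a hypothetical trending series: first-order differencing removes the trend, and simple exponential smoothing tracks the current level with geometrically decaying weights. The smoothing factor alpha is an arbitrary choice here.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(120)
y = 0.5 * t + rng.normal(0.0, 1.0, size=120)   # linear trend + noise: non-stationary

# First-order differencing (the "I" in ARIMA with d = 1): consecutive
# differences remove the linear trend, leaving a roughly stationary series.
dy = np.diff(y)

# Simple exponential smoothing: weights decay geometrically with age,
# so recent observations dominate the estimated level.
alpha = 0.3
level = y[0]
for obs in y[1:]:
    level = alpha * obs + (1 - alpha) * level

print(f"mean of raw series:         {y.mean():.2f}")
print(f"mean of differenced series: {dy.mean():.2f}")   # close to the trend slope of 0.5
print(f"smoothed level (forecast):  {level:.2f}")
```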
The latter half of the 20th century saw researchers exploring more complex models. This included non-linear models like ARCH and GARCH, specifically designed for financial time series to handle volatility clustering: the tendency for large price movements to be followed by further large movements, and small movements by small ones. ARCH (Autoregressive Conditional Heteroskedasticity) and GARCH (Generalized ARCH) models are statistical models for time series data that exhibit time-varying volatility. Models incorporating exogenous variables also emerged; these are variables outside of the time series itself that are thought to influence it (e.g., rainfall might be an exogenous variable affecting crop yields). The increasing availability of computing power made these more computationally demanding techniques practical.
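The clustering effect is easy to see in a simulated GARCH(1, 1) process, where the conditional variance at each step depends on the previous squared shock and the previous variance. The parameters below are illustrative, not estimated from any real market data.

```python
import numpy as np

rng = np.random.default_rng(2)
omega, alpha, beta = 0.05, 0.10, 0.85   # illustrative; requires alpha + beta < 1
n = 1000
sigma2 = np.empty(n)
r = np.empty(n)
sigma2[0] = omega / (1 - alpha - beta)  # unconditional variance
r[0] = np.sqrt(sigma2[0]) * rng.standard_normal()

for t in range(1, n):
    # Today's variance responds to yesterday's squared shock and variance.
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

# Volatility clustering shows up as positive autocorrelation in squared returns.
print(f"autocorrelation of squared returns: "
      f"{np.corrcoef(r[:-1] ** 2, r[1:] ** 2)[0, 1]:.3f}")
```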
The 21st century has witnessed an explosion of data and a dramatic increase in computational power, ushering in the era of machine learning and, more recently, deep learning for time series forecasting. Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and, most recently, Transformers have demonstrated, to varying degrees, the ability to capture complex, non-linear dependencies within time series data.
Before the deep learning revolution, statistical models were the dominant approach. While simpler than their modern counterparts, they retain value due to their interpretability and computational efficiency. ARIMA models, for instance, are defined by three key parameters: p, the autoregressive order; d, the degree of differencing; and q, the moving average order. While effective for data exhibiting clear linear patterns and limited non-stationarity, ARIMA models often struggle with the complex, non-linear relationships common in real-world time series. Similarly, exponential smoothing methods can be computationally efficient but often fall short when dealing with intricate patterns.
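As a sketch of how this looks in practice, the snippet below fits an ARIMA model with statsmodels to a toy trending series; the order (2, 1, 1) is an arbitrary illustration, not a recommendation for any particular dataset.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA  # requires statsmodels

rng = np.random.default_rng(3)
y = np.cumsum(rng.normal(0.2, 1.0, size=300))  # toy series with drift

# order = (p, d, q): autoregressive order, degree of differencing, moving average order
model = ARIMA(y, order=(2, 1, 1))
result = model.fit()
print(result.forecast(steps=5))   # five-step-ahead point forecasts
```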
Deep learning models, particularly neural networks, excel at learning complex, non-linear relationships. However, applying them to time series forecasting has presented challenges. Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, are designed for sequential data. Their recurrent structure enables them to maintain an internal "memory" of past observations, making them well-suited for capturing temporal dependencies. However, RNNs can encounter the vanishing gradient problem. During training, neural networks adjust their internal parameters (weights) based on how well they are performing. This adjustment is guided by gradients, which indicate the direction of improvement. In RNNs, these gradients can become extremely small as they are "backpropagated" through the network, making it difficult for the model to learn from earlier parts of the sequence and thus hindering the learning of long-range dependencies.
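A minimal PyTorch sketch of an LSTM forecaster illustrates the idea: the recurrent hidden state summarizes the observations seen so far, and a linear head maps it to a one-step-ahead prediction. All sizes here are placeholders, and the model is untrained.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        # The LSTM carries an internal "memory" (hidden and cell states)
        # forward through the sequence, one time step at a time.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, 1)
        _, (h_n, _) = self.lstm(x)   # h_n: (1, batch, hidden_size), final hidden state
        return self.head(h_n[-1])    # (batch, 1) one-step-ahead forecast

model = LSTMForecaster()
window = torch.randn(8, 48, 1)       # 8 series, 48 past observations each
print(model(window).shape)           # torch.Size([8, 1])
```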
Convolutional Neural Networks (CNNs), typically used for image processing, can be adapted for time series forecasting by applying convolutional filters. A convolutional filter is a small matrix of weights that slides across the input data, performing element-wise multiplication and summing the results. This process helps the network learn to recognize specific patterns in the data. While CNNs can extract local features and patterns, their ability to capture long-range dependencies is limited.
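The sliding-filter idea takes only a few lines to demonstrate. In this sketch, a hand-picked three-weight kernel slides along a short series and responds wherever the values are rising locally.

```python
import numpy as np

series = np.array([1.0, 1.2, 1.1, 1.5, 2.0, 2.6, 2.5, 2.4])
kernel = np.array([-1.0, 0.0, 1.0])   # illustrative weights: a local slope detector

# Slide the filter across the series: at each position, multiply element-wise
# with the local window and sum the results.
feature_map = np.array([
    np.sum(series[i:i + len(kernel)] * kernel)
    for i in range(len(series) - len(kernel) + 1)
])
print(feature_map)   # large values where the series is rising locally
```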
Enter Transformers. Initially developed for natural language processing, Transformers have since been applied to time series data. Their core innovation, the self-attention mechanism, allows the model to weigh the importance of different parts of the input sequence in parallel. This parallel processing is a significant advantage over the sequential processing of RNNs, improving computational efficiency. Furthermore, self-attention calculates the relationship between all pairs of elements in the sequence, determining which elements are most relevant for predictions. This enables the model to capture long-range dependencies far more effectively than RNNs and avoids the vanishing gradient problem.
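A bare-bones NumPy version of scaled dot-product self-attention shows the mechanism: every position's query is compared against every other position's key, the resulting weights are normalized with a softmax, and the values are mixed accordingly. The dimensions and random projection matrices here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))          # e.g. embeddings of 6 time steps

w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

scores = q @ k.T / np.sqrt(d_model)              # pairwise relevance of all positions
scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attended = weights @ v                           # context-aware representations

print(weights.shape, attended.shape)             # (6, 6) (6, 8)
```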
A crucial aspect of using deep learning for time series is tokenization - how the data is represented for the model. Traditional deep learning models use discrete tokenization, where the continuous time series is broken down into discrete, categorical units. This process introduces several problems: fine-grained information can be lost, discretization can introduce quantization errors (errors introduced by approximating continuous values with discrete ones), and the model may struggle with unseen patterns.
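A toy example makes the quantization problem tangible: binning a continuous series into a small vocabulary of tokens and decoding each token back to its bin centre loses fine-grained detail, and anything outside the bin range is clipped.

```python
import numpy as np

rng = np.random.default_rng(5)
series = rng.normal(0.0, 1.0, size=1000)

# Discrete tokenization: map continuous values onto a fixed set of bins (tokens).
n_bins = 32
edges = np.linspace(-3, 3, n_bins + 1)
centres = (edges[:-1] + edges[1:]) / 2

tokens = np.clip(np.digitize(series, edges) - 1, 0, n_bins - 1)  # token ids
reconstructed = centres[tokens]                                   # decode tokens

print(f"mean absolute quantization error: "
      f"{np.abs(series - reconstructed).mean():.4f}")
```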
The Sundial family of foundation models - explicitly designed to learn from time series data at a massive scale - appears to be a major advancement, directly addressing many of these limitations. Sundial employs continuous tokenization, representing the time series as a continuous signal, preserving valuable information lost in discrete tokenization. Native patching, processing data in overlapping segments, enhances this by providing context while retaining fine-grained detail. Sundial utilizes TimeFlow Loss, a flow-matching-based generative training objective, to learn the underlying probability distribution of the time series directly from the data. This avoids potentially inaccurate assumptions about the data’s distribution, a common limitation of other models. Flow-matching is a technique that aims to learn a transformation that maps a simple distribution (e.g., a standard normal distribution) to the complex data distribution.
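While the exact architecture is described in the paper, the basic idea of slicing a continuous series into overlapping patches, without any binning, can be sketched in a few lines. The patch length and stride below are arbitrary choices for illustration, not Sundial's actual tokenizer.

```python
import numpy as np

def patch(series: np.ndarray, patch_len: int = 16, stride: int = 8) -> np.ndarray:
    """Slice a 1-D series into overlapping patches of raw continuous values."""
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

rng = np.random.default_rng(6)
series = np.sin(np.linspace(0, 12 * np.pi, 256)) + 0.1 * rng.normal(size=256)

patches = patch(series)
print(patches.shape)   # (31, 16): overlapping segments, fine-grained values preserved
```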
Sundial models are trained on TimeBench, a massive, trillion-point dataset spanning diverse domains. This extensive pre-training allows for zero-shot or few-shot transfer learning, enabling the models to generalize far more effectively than models trained on smaller, task-specific datasets.
The models have achieved significant advances on established forecasting benchmarks, showcasing the potential of this approach. It is hoped that TimeBench and the insights gained from training Sundial will inspire future research into time series foundation models, ultimately leading to more robust and widely applicable solutions for real-world forecasting challenges.
Further reading: [2502.00816] Sundial: A Family of Highly Capable Time Series Foundation Models