
Representation Learning applied to time series forecasting
How to get the most out of real-world data that is not always well organised
Use cases
Representation Learning is a subfield of Machine Learning and a key component of data preprocessing in many contexts:
- Text embeddings: Transforming text into embeddings, with models like Word2Vec, BERT, etc.
- Visual embeddings: Transforming hard-to-handle images and video frames into machine-interpretable vectors, with models like ResNet, SimCLR, CLIP, etc.
- Audio & speech embeddings: wav2vec, OpenL3, etc.
- Graph embeddings: Node2Vec, DeepWalk, etc.
Use cases in time series
Anomaly detection: Representation Learning helps algorithms understand the structure of the data and filter out unhelpful noise, enabling models such as LSTM autoencoders to flag anomalous data points.
Sequence generation: Algorithms such as the VRAE (Variational Recurrent Autoencoder) can generate time series similar to the ones they were trained on.
Classification and clustering: TS2Vec applies Representation Learning to unlabelled data, building multi-scale, timestamp-level embeddings via contrastive learning and delivering strong performance on classification, clustering, and anomaly detection.
Forecasting: Transformer-based architectures such as PatchTST, Autoformer, and Informer rely on learned representations (often obtained through self-supervised pre-training) to ingest the data effectively and outperform benchmark models.
Deep dive into Representation Learning
What is Representation Learning?
Representation Learning is a subfield of Machine Learning in which a model is trained to capture the structure and semantics of data through a learned representation. A simple example is turning words into vectors, the vectors being the representation of the words.
Representation Learning was developed to address the challenges associated with manual feature engineering in Machine Learning. Traditionally, creating effective models required domain experts to manually identify and extract relevant features from raw data, a process that is both time-consuming and prone to human bias. Representation Learning automates this process by enabling systems to learn useful features directly from raw data, thereby improving efficiency and model performance.
By choosing the appropriate representation, we can train our models to identify relationships between structures, in the same way that encoders use vectors to represent the interactions between words in a sentence, allowing a mathematical engine (the Transformer) to understand how the different elements are built and how they interact.
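As a toy illustration of turning words into vectors, here is a minimal sketch using the gensim library; the tiny corpus and the parameters are placeholders rather than a realistic training setup.

```python
from gensim.models import Word2Vec

# A toy corpus; real applications train on millions of sentences.
sentences = [
    ["the", "price", "of", "oil", "rose", "today"],
    ["the", "price", "of", "gas", "fell", "today"],
    ["markets", "reacted", "to", "the", "oil", "news"],
]

# Train tiny 16-dimensional skip-gram embeddings.
model = Word2Vec(sentences, vector_size=16, window=3, min_count=1, sg=1, epochs=50)

vector = model.wv["oil"]             # the 16-dimensional representation of "oil"
print(vector.shape)                  # (16,)
print(model.wv.most_similar("oil"))  # words whose vectors are closest
```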
Why was it invented?
Raw data is often too complex to be processed directly; an image is a good example. Once turned into a vector, however, an image becomes something a machine can work with.
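To make this concrete, here is a minimal sketch (assuming PyTorch and a recent torchvision) that turns an image into a 512-dimensional vector with a pre-trained ResNet-18; the random tensor stands in for a real, preprocessed picture.

```python
import torch
from torchvision import models

# Load a pre-trained ResNet-18 and drop its classification head,
# keeping the 512-dimensional feature extractor.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

# A dummy 224x224 RGB image; in practice this would be a real, normalised photo.
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    embedding = model(image)  # shape: (1, 512), the image's vector representation
print(embedding.shape)
```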

What is a good representation?
An effective representation should exhibit three key properties: informativeness, compactness, and generalisation.
- Informativeness
The embedding must capture and encode the critical aspects of the original data in a more concise form.
- Compactness
- Dimensionality Reduction: The learned features should occupy far fewer dimensions than the raw input. This not only streamlines storage and retrieval but also filters out noise, enabling the model to train faster and focus on the most relevant patterns.
- Information Retention: Even after reducing dimensions, the representation must preserve the core signals needed for accurate downstream performance. Striking the right balance between brevity and fidelity is crucial.
- Generalisation (Transferability)
The goal is to produce versatile embeddings that can be reused across tasks. In practice, one often starts with a model pre-trained on a large dataset (e.g. ImageNet in computer vision) and then fine-tunes it on a smaller, task-specific dataset, leveraging the broad knowledge encoded in the original representation to achieve strong results with limited new data.
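As a crude illustration of the compactness / information-retention trade-off, the sketch below uses PCA (a classical technique rather than a learned neural representation) to compress synthetic 100-dimensional windows into 8 components and checks how much variance survives.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 8 hidden factors observed through 100 noisy dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8))
raw_windows = latent @ rng.normal(size=(8, 100)) + 0.05 * rng.normal(size=(500, 100))

# Compactness: 100 dimensions -> 8 components.
pca = PCA(n_components=8).fit(raw_windows)
compact = pca.transform(raw_windows)  # shape: (500, 8)

# Information retention: fraction of the original variance preserved.
print(compact.shape, round(pca.explained_variance_ratio_.sum(), 3))  # close to 1.0
```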
Step-by-step conceptual implementation
- Data preparation
- Choose a sliding window size, which corresponds to the amount of context the algorithm will be able to use: e.g. 100 data points.
- Choose a stride, i.e. the offset between the starts of two consecutive windows. For example, if window 1 covers points 1 to 100 and window 2 covers points 11 to 110, the stride is 10 (a code sketch of the full walkthrough is given after this section).
- Model architecture
- Choose your encoder:
LSTM
- Three gates (input, forget, output) plus a cell state.
- Excellent at capturing long-range dependencies and remembering information over many time steps.
- More parameters → higher memory footprint and slower training.
GRU (Gated Recurrent Unit)
- An update gate and a reset gate (the update gate merges the roles of LSTM’s input and forget gates).
- Fewer parameters → faster to train, smaller model, similar performance in practice.
- Slightly less flexible than LSTM on very long dependencies, but almost as effective in most tasks.
1D CNN
- No recurrent connections; learns local temporal filters (motifs) of fixed kernel size.
- Inherently parallel (GPU-friendly) and very fast.
- Receptive field grows with depth or via dilation — good at capturing mid-range patterns but requires careful architecture design to see very long contexts.
- Mirror it with a corresponding decoder, capable of reconstructing the original window from the encoder’s output.
- Choose your training parameters
For the sake of simplicity, we used mean squared error (MSE) between the original sequence and the reconstructed output as the loss.
You could optionally add regularisers like KL divergence (for variational autoencoders) or contrastive losses if you want to use self-supervision.
- Embedding extraction: remove the decoder
Once trained, discard the decoder. For any new sequence, feed it through the encoder and grab the d-dimensional code.
You now have your encoder, enabling you to turn traditional time series into a vector representing the series.
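To tie the steps together, here is a minimal PyTorch sketch of the walkthrough above: windowing, an LSTM encoder/decoder pair trained with an MSE reconstruction loss, and embedding extraction. The window size, stride, embedding dimension and synthetic series are illustrative choices only.

```python
import torch
import torch.nn as nn

def make_windows(series: torch.Tensor, window: int = 100, stride: int = 10) -> torch.Tensor:
    """Slice a 1-D series into overlapping windows of shape (n_windows, window, 1)."""
    return series.unfold(0, window, stride).unsqueeze(-1).contiguous()

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features: int = 1, d_embedding: int = 16):
        super().__init__()
        self.encoder = nn.LSTM(n_features, d_embedding, batch_first=True)
        self.decoder = nn.LSTM(d_embedding, d_embedding, batch_first=True)
        self.output = nn.Linear(d_embedding, n_features)

    def encode(self, x):
        _, (h, _) = self.encoder(x)   # the last hidden state is the window's embedding
        return h[-1]                  # shape: (batch, d_embedding)

    def forward(self, x):
        z = self.encode(x)
        # Repeat the embedding at every time step and let the decoder rebuild the window.
        z_seq = z.unsqueeze(1).repeat(1, x.size(1), 1)
        decoded, _ = self.decoder(z_seq)
        return self.output(decoded)

# A noisy sine wave stands in for a real time series.
series = torch.sin(torch.linspace(0, 60, 2000)) + 0.1 * torch.randn(2000)
windows = make_windows(series)        # window=100, stride=10

model = LSTMAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                # reconstruction loss, as in the text

for epoch in range(5):                # a handful of epochs, for illustration only
    optimizer.zero_grad()
    loss = loss_fn(model(windows), windows)
    loss.backward()
    optimizer.step()

# Embedding extraction: discard the decoder and keep only the encoder.
with torch.no_grad():
    embeddings = model.encode(windows)  # shape: (n_windows, 16)
print(embeddings.shape)
```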
A time series is already data. Why, then, a representation?
Indeed, a time series is already a vector. Even though time-series data are “just numbers”, those raw streams are often high-dimensional, noisy, and laden with complex temporal patterns (trends, seasonality, cross-correlations, anomalies, etc.) that are hard to hand-craft into features. Representation Learning automatically distils each window or sequence into a compact vector that:
- Captures the key temporal dynamics (e.g. repeating motifs, long-range dependencies).
- Denoises and filters out irrelevant fluctuations.
- Yields a fixed-size embedding even if the original series varies in length.
- Improves downstream tasks (classification, clustering, forecasting) by operating in a more “semantically meaningful” space.
At DatumLocus, we are currently exploring the representation of a data point
The current representation:
If we consider a univariate time series with no exogenous variables, we can deconstruct all the information it carries as follows:
In traditional time series, a data point is defined by:
- The value.
- The corresponding date, which carries ordering information (this date comes after that one), distance-in-time information, and insights into seasonality and cyclical patterns.
At DatumLocus, we intend to represent our data points differently:
- The value.
- The date: place inside the sequence, position inside the cycles, distance in time.
But also:
- The value range (High, Mid, Low), which captures where the current value sits relative to the series’ elasticity and its average value. Indeed, prices tend to evolve differently when they are already high than when they are low.
- The speed (growth rate): for how long has the series been increasing, and what is the current growth rate? These features carry information about the current uncertainty of the market and its confidence in further price increases.
- The regime: a time series can be decomposed into regimes, i.e. periods during which a given equation models the series well; in econometrics, such equations stop working as the macro- or micro-economic influences change.
- Close-window volatility: measuring volatility over a recent window lets us anticipate whether the market is likely to behave erratically in the near future.
- Specific to trading:
- Number of recent trades: gives an indication of current trading activity.
- Number of open positions: gives insight into future trading activity.
Today, all of these features have to be calculated with hand-written scripts (a sketch of such a script is given below), so we end up with only the information we already know how to describe, unable to identify representation metrics beyond what our data team is able to think of.
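For illustration, a hypothetical version of such a script might look like the pandas sketch below; the synthetic price series, window lengths and quantile thresholds are placeholder choices, not our production parameters.

```python
import numpy as np
import pandas as pd

# A synthetic daily price series stands in for real market data.
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 500).cumsum(),
                   index=pd.date_range("2024-01-01", periods=500, freq="D"),
                   name="price")

features = pd.DataFrame(index=prices.index)
features["value"] = prices

# Value range: Low / Mid / High relative to the series' own distribution.
features["value_range"] = pd.cut(
    prices,
    bins=[-np.inf, prices.quantile(0.33), prices.quantile(0.66), np.inf],
    labels=["Low", "Mid", "High"],
)

# Speed: current growth rate and how long the series has been rising.
returns = prices.pct_change()
features["growth_rate"] = returns
rising = returns > 0
features["rise_streak"] = rising.groupby((~rising).cumsum()).cumsum()

# Close-window volatility over the last 20 observations.
features["volatility_20"] = returns.rolling(20).std()

print(features.tail())
```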
Representation Learning: Automating the search for the appropriate representation of a number
As we started working more and more with Transformers, we realised we had the opportunity to go beyond the metrics we already use. Are there other patterns, systems, or components that our data team has not thought of, and that the model could use to outperform our current models?
Conclusion
Representation Learning in time series is a paradigm shift: instead of painstakingly engineering features, we let neural networks discover the essential temporal patterns on their own. Whether using LSTM autoencoders, contrastive predictors, or Transformer-based sequence models, the goal remains the same: convert raw, noisy, variable-length signals into compact, information-rich embeddings. These representations boost performance on downstream tasks like anomaly detection, classification, clustering, and forecasting, while filtering out irrelevant noise. As you explore this frontier, remember to experiment with different architectures, self-supervised objectives, and hyperparameters — your data’s hidden rhythms are waiting to be uncovered.