PJM Hourly Load Forecasting with TimeXer¶
The dominant family of models that has largely "replaced" RNNs (including LSTMs and GRUs) for sequential data, especially in NLP, is the Transformer architecture (Vaswani et al., 2017).
Why Transformers supersede RNNs:¶
- Parallelism: they process every time step simultaneously rather than sequentially, unlocking full GPU/TPU utilization.
- Long-Range Dependencies: self-attention creates direct, weighted connections between any two positions, making it easier to capture distant relationships than gated recurrent units (a minimal attention sketch follows this list).
- Flexibility of Attention: instead of a fixed hidden-state memory, they dynamically attend to whichever past (or future) positions contribute most to the current representation.
- Simplicity of Architecture: by eliminating recurrence, they avoid vanishing/exploding gradients and train more stably with residual connections and layer normalization.
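To make the self-attention point concrete, here is a minimal, illustrative PyTorch sketch (not part of the TimeXer implementation later in this notebook) of scaled dot-product self-attention over a toy sequence; every pair of time steps is scored in a single matrix product, which is what enables the parallelism described above.
import torch
import torch.nn.functional as F

# Toy scaled dot-product self-attention over a 6-step sequence (illustrative only).
torch.manual_seed(0)
seq_len, d_model = 6, 8
x = torch.randn(1, seq_len, d_model)               # [batch, seq_len, d_model]

q, k, v = x, x, x                                  # self-attention: queries = keys = values
scores = q @ k.transpose(-2, -1) / d_model ** 0.5  # [1, 6, 6] pairwise scores, computed at once
weights = F.softmax(scores, dim=-1)                # direct weighted links between any two steps
out = weights @ v                                  # [1, 6, 8] attended representation
print(out.shape)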
This analysis compares three different modeling approaches applied to hourly energy consumption forecasting for the PJM East region: XGBoost, a hybrid ARIMA–LSTM–CNN model, and TimeXer (a Transformer-based model). The goal is to understand their methodologies, strengths, weaknesses, and potential future improvements.
1. Model Methodologies and Feature Handling¶
XGBoost (Gradient Boosted Trees):
- Methodology: Uses an ensemble of decision trees (gradient boosting).
- Feature Handling: Relies on explicit feature engineering: calendar features, averaged weather variables, and, critically, lagged load values (a minimal lag-feature sketch follows this list).
- Interpretability: SHAP values identified temperature, dew point, `is_dayoff`, hour, and `lag_1hr` as key drivers.
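For illustration only, a small pandas sketch of this kind of lag/calendar feature engineering (the `lag_24hr` column is a hypothetical extra lag, not one reported by the SHAP analysis):
import pandas as pd

# Hypothetical sketch of explicit lag/calendar feature engineering for a tree model.
df = pd.DataFrame(
    {"PJME_MW": [27008.0, 25591.0, 24235.0, 23121.0]},
    index=pd.date_range("2003-01-01", periods=4, freq="h"),
)
df["lag_1hr"] = df["PJME_MW"].shift(1)    # load one hour earlier (a key SHAP driver)
df["lag_24hr"] = df["PJME_MW"].shift(24)  # load one day earlier (hypothetical extra lag)
df["hour"] = df.index.hour                # calendar feature
print(df)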
ARIMA-LSTM-CNN (Hybrid Statistical/Deep Learning):
- Methodology: A multi-stage approach combining ARIMA (for linear patterns/seasonality like SARIMA(2,0,1)x(1,0,0,24)) with an LSTM-CNN model (to learn non-linear patterns from ARIMA residuals).
- Feature Handling: ARIMA handles linear time dependencies. The LSTM-CNN uses sequences of exogenous features (calendar, weather) and ARIMA residuals.
- Prediction: Final forecast = ARIMA prediction + LSTM-CNN residual prediction (a schematic sketch follows below).
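A schematic sketch of this additive combination (illustrative only; neither the ARIMA nor the LSTM-CNN model is implemented in this notebook, and the arrays below are placeholders):
import numpy as np

# The hybrid's final forecast is the sum of the linear (ARIMA) forecast and the
# deep model's prediction of the ARIMA residuals (non-linear correction).
def hybrid_forecast(arima_pred: np.ndarray, residual_pred: np.ndarray) -> np.ndarray:
    return arima_pred + residual_pred

arima_pred = np.array([30000.0, 31000.0, 32000.0])   # placeholder linear/seasonal component (MW)
residual_pred = np.array([-450.0, 120.0, 300.0])     # placeholder non-linear correction (MW)
print(hybrid_forecast(arima_pred, residual_pred))    # [29550. 31120. 32300.]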
TimeXer (Transformer-based):
- Citation: Wang, Y., Wu, H., Dong, J., Qin, G., Zhang, H., Liu, Y., Qiu, Y., Wang, J., & Long, M. (2024). TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables. arXiv preprint arXiv:2402.19072. https://arxiv.org/abs/2402.19072
- Methodology: Treats forecasting as a unified sequence modeling problem using a Transformer architecture, specifically designed for time series with exogenous variables. Key components include:
- Endogenous Patching & Global Token: The input time series (energy load) is divided into patches, embedded, and processed alongside a learnable global token that summarizes information across all patches (a toy patching example follows this list).
- Cross-Attention: Embeddings of exogenous variables (weather, calendar) interact with the global endogenous token via cross-attention, allowing the model to learn the influence of external factors.
- Self-Attention: Multi-head self-attention processes all time steps (patches) in parallel, capturing both short-term and long-range dependencies simultaneously within the energy load series.
- End-to-End Training: The entire network is trained jointly, optimizing all components from input embedding to the final prediction head.
- Feature Handling: Designed explicitly to integrate exogenous variables with the target time series using attention mechanisms, processing them in a shared latent space. Aims to automatically learn relevant interactions without requiring the manual decomposition seen in the ARIMA-hybrid or the explicit lagging of XGBoost.
- Relationship to ChatGPT: Both TimeXer and large language models like ChatGPT are built upon the Transformer architecture, leveraging its core self-attention mechanism. However, they are applied to vastly different tasks (time series forecasting vs. natural language processing). While the fundamental principle of using attention to weigh the importance of different parts of an input sequence is shared, the specific model implementations (input processing like patching vs. tokenization, positional encodings, output layers, training objectives) are highly specialized for their respective domains. TimeXer is adapted for numerical sequences and exogenous variable integration, whereas ChatGPT is designed for text generation and understanding.
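Before the workflow overview, a toy demonstration of the patching idea (this is not the notebook's EnEmbedding, just a plain unfold over a 168-hour window using the same 24-hour patch length adopted later):
import torch

# Split a 168-hour load window into non-overlapping 24-hour patches with unfold.
seq_len, patch_len = 168, 24
load = torch.arange(seq_len, dtype=torch.float32).view(1, 1, seq_len)  # [batch, n_vars, seq_len]

patches = load.unfold(dimension=-1, size=patch_len, step=patch_len)
print(patches.shape)  # torch.Size([1, 1, 7, 24]) -> 7 patches, each later embedded to d_model
# TimeXer appends a learnable global token to these patch tokens (8 tokens in total) and lets
# that global token cross-attend to the exogenous embeddings (weather, calendar).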
Overview of Workflow:
In this notebook, we implement TimeXer, a transformer-style model for short-term forecasting of the PJM grid load (PJME). We combine energy consumption data with weather features to improve accuracy, and we’ll walk through:
- Environment setup and imports
- Data loading, cleaning & feature engineering
- Train / val / test split & scaling
- Model architecture and configuration
- Dataset construction & DataLoader
- Training with early stopping
- Evaluation and visualization
- Hyperparameter tuning with Optuna
- Final retraining with best parameters
0. Environment Setup & Imports¶
Here we import all necessary libraries—standard Python modules, data manipulation tools, plotting libraries, ML utilities, and (optionally) the Meteostat API for weather data. We then configure warning filters, plotting style, and detect GPU availability.
# -*- coding: utf-8 -*-
# --- Standard Libraries ---
import copy
import time
import warnings
from datetime import datetime, timedelta
import os
import json
from IPython.display import Video
# --- Data Manipulation ---
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
# --- Visualization ---
import matplotlib.pyplot as plt
import seaborn as sns
# --- Machine Learning ---
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
# --- PyTorch ---
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
# --- Weather Data ---
# Ensure meteostat is installed: pip install meteostat
try:
from meteostat import Stations, Hourly
except ImportError:
print("Meteostat library not found. Please install it: pip install meteostat")
# You might want to exit or handle this case appropriately
# For now, we'll proceed assuming it might be installed later or data is cached
pass
# --- Configure Warnings and Plot Style ---
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
# --- Check for GPU Availability ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# --- Helper Functions/Classes ---
def mean_absolute_percentage_error(y_true, y_pred):
"""Calculates MAPE given y_true and y_pred"""
y_true, y_pred = np.array(y_true), np.array(y_pred)
# Avoid division by zero
non_zero_mask = y_true != 0
# Handle cases where all true values are zero
if not np.any(non_zero_mask):
return 0.0
return np.mean(np.abs((y_true[non_zero_mask] - y_pred[non_zero_mask]) / y_true[non_zero_mask])) * 100
class EarlyStopping:
"""Early stops the training if validation loss doesn't improve after a given patience."""
def __init__(self, patience=7, verbose=False, delta=0, path='checkpoint.pt', trace_func=print):
"""
Args:
patience (int): How long to wait after last time validation loss improved.
Default: 7
verbose (bool): If True, prints a message for each validation loss improvement.
Default: False
delta (float): Minimum change in the monitored quantity to qualify as an improvement.
Default: 0
path (str): Path for the checkpoint to be saved to.
Default: 'checkpoint.pt'
trace_func (function): trace print function.
Default: print
"""
self.patience = patience
self.verbose = verbose
self.counter = 0
self.best_score = None
self.early_stop = False
        self.val_loss_min = np.inf  # use np.inf (the np.Inf alias was removed in NumPy 2.0)
self.delta = delta
self.path = path
self.trace_func = trace_func
def __call__(self, val_loss, model):
score = -val_loss
if self.best_score is None:
self.best_score = score
self.save_checkpoint(val_loss, model)
elif score < self.best_score + self.delta:
self.counter += 1
if self.verbose: # Only print counter if verbose
self.trace_func(f'EarlyStopping counter: {self.counter} out of {self.patience}')
if self.counter >= self.patience:
self.early_stop = True
else:
self.best_score = score
self.save_checkpoint(val_loss, model)
self.counter = 0
def save_checkpoint(self, val_loss, model):
        '''Saves the model when validation loss decreases.'''
if self.verbose:
self.trace_func(f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}). Saving model ...')
try:
torch.save(model.state_dict(), self.path)
except Exception as e:
self.trace_func(f"Error saving checkpoint: {e}") # Report errors during saving
self.val_loss_min = val_loss
# Placeholder for Configs class
class Configs:
def __init__(self, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
Using device: cuda
1. Data Loading & Feature Engineering¶
- Load the PJME hourly load CSV and restrict to our time window.
- Engineer calendar features (hour, day‐of‐week, holidays, weekends) to capture temporal patterns.
- Fetch nearby weather station data (temperature, humidity, precipitation, wind) and average across stations for each timestamp.
- Join energy and weather data, forward/backfill missing values, and verify data integrity.
csv_path = '../data/3/PJME_hourly.csv'
if not os.path.exists(csv_path):
print(f"Error: CSV file not found at {csv_path}")
print("Please download the PJME hourly energy data and place it as PJME_hourly.csv in the current directory.")
pjme = None # Set to None to prevent further errors if file is missing
else:
pjme = (
pd.read_csv(
csv_path,
index_col='Datetime',
parse_dates=['Datetime']
)
.sort_index()
.loc['2003-01-01':'2018-08-02'] # Using the same date range as provided
)
if pjme is not None:
# --- Feature Engineering: Holidays and Calendar Features ---
cal = USFederalHolidayCalendar()
fed_hols = cal.holidays(start=pjme.index.min(), end=pjme.index.max())
# Extend July 4th holiday based on weekday
extended_hols = set(fed_hols)
for year in range(pjme.index.year.min(), pjme.index.year.max() + 1):
try:
july4 = datetime(year, 7, 4)
wd = july4.weekday()
if wd == 1: # If July 4th is a Tuesday, add Monday
extended_hols.add(july4 - timedelta(days=1))
elif wd == 2: # If July 4th is a Wednesday, add Thursday and Friday
extended_hols.add(july4 + timedelta(days=1))
extended_hols.add(july4 + timedelta(days=2))
elif wd == 3: # If July 4th is a Thursday, add Friday
extended_hols.add(july4 + timedelta(days=1))
# Add July 4th itself regardless of the day
extended_hols.add(july4)
except ValueError: # Handle cases like leap years if necessary, though July 4th is fixed
pass
all_hols = pd.DatetimeIndex(sorted(list(extended_hols))) # Convert set to list before sorting
pjme['is_holiday'] = pjme.index.normalize().isin(all_hols)
pjme['is_weekend'] = pjme.index.weekday >= 5
pjme['is_dayoff'] = (pjme['is_holiday'] | pjme['is_weekend']).astype(int) # Convert boolean to int
pjme.drop(columns=['is_holiday', 'is_weekend'], inplace=True)
# Calendar Features
pjme['hour'] = pjme.index.hour
pjme['dayofweek'] = pjme.index.dayofweek # Monday=0, Sunday=6
pjme['dayofmonth'] = pjme.index.day
pjme['month'] = pjme.index.month
pjme['year'] = pjme.index.year
pjme['dayofyear'] = pjme.index.dayofyear
pjme['weekofyear'] = pjme.index.isocalendar().week.astype(int)
# --- Fetch and Process Weather Data ---
# Define date range for weather data (slightly wider to ensure coverage)
start_dt = pjme.index.min() - timedelta(days=1)
end_dt = pjme.index.max() + timedelta(days=1)
# Coordinates for PJME region center (approximate)
lat, lon = 39.95, -75.17
# Target ICAOs based on the original script
target_icaos = ['KPHL', 'KEWR', 'KBWI', 'KDCA'] # Philadelphia, Newark, Baltimore, Washington DC area airports
average_weather = None
try:
# Find nearby stations (using a reasonable radius, e.g., 200km)
stations_query = Stations()
# stations_query = stations_query.nearby(lat, lon, 200000) # 200 km radius
stations_query = stations_query.inventory('hourly', (start_dt, end_dt))
nearby_stations_df = stations_query.fetch()
# Filter for target ICAOs if found, otherwise use closest available
target_stations_df = nearby_stations_df[nearby_stations_df['icao'].isin(target_icaos)]
if target_stations_df.empty:
print(f"Warning: None of the target ICAOs {target_icaos} found nearby or had data. Using the closest available stations.")
# Optionally, select the top N closest stations if target ones are missing
station_ids = nearby_stations_df.head(4).index.tolist() # Example: use top 4 closest
else:
station_ids = target_stations_df.index.tolist()
if station_ids:
print(f"Fetching weather data for stations: {station_ids}")
# Fetch hourly data
weather_all = Hourly(station_ids, start_dt, end_dt).fetch()
# Select relevant weather columns and calculate average
weather_cols = ['temp', 'dwpt', 'rhum', 'prcp', 'wspd'] # Temperature (°C), Dew Point (°C), Relative Humidity (%), Precipitation (mm), Wind Speed (km/h)
# Ensure only selected columns exist before grouping
valid_cols = [col for col in weather_cols if col in weather_all.columns]
average_weather = weather_all.groupby(level='time').mean(numeric_only=True)[valid_cols].ffill().bfill() # Forward fill then backfill
print("Weather data fetched and processed.")
else:
print("Warning: No suitable weather stations found.")
except Exception as e:
print(f"An error occurred during weather data fetching: {e}")
print("Proceeding without weather data.")
# --- Combine Energy and Weather Data ---
if average_weather is not None:
pjme_weather = pjme.join(average_weather, how='left')
# Fill missing weather data (e.g., at the beginning/end or if fetching failed partially)
# Using forward fill first, then backward fill
pjme_weather[average_weather.columns] = pjme_weather[average_weather.columns].ffill().bfill()
print("Energy and weather data joined.")
else:
pjme_weather = pjme.copy()
# Add placeholder columns if weather data is missing
weather_cols = ['temp', 'dwpt', 'rhum', 'prcp', 'wspd']
for col in weather_cols:
if col not in pjme_weather.columns:
pjme_weather[col] = 0 # Or another suitable default
print("Proceeding with energy data only.")
# Ensure no NaNs remain after joins and fills
# Check for NaNs and report if any
nan_counts = pjme_weather.isna().sum()
if nan_counts.sum() > 0:
print("Warning: NaNs found after processing:")
print(nan_counts[nan_counts > 0])
# Option 1: Drop rows with NaNs (might lose data)
# pjme_weather.dropna(inplace=True)
# Option 2: Fill with a specific value (like 0 or mean) - already tried ffill/bfill
# Consider more sophisticated imputation if necessary
print("Attempting final fill with 0 for any remaining NaNs...")
pjme_weather.fillna(0, inplace=True)
print(f'Combined data shape: {pjme_weather.shape}')
# Check index monotonicity again
pjme_weather = pjme_weather.sort_index()
print(f'Index monotonic? {pjme_weather.index.is_monotonic_increasing}')
print("Data Head:")
print(pjme_weather.head())
print("\nData Tail:")
print(pjme_weather.tail())
print("\nData Info:")
pjme_weather.info()
print("\nNaN check after processing:")
print(pjme_weather.isna().sum())
# --- Define Target and Features ---
TARGET = 'PJME_MW'
# Features = All columns except the target
FEATURES = [col for col in pjme_weather.columns if col != TARGET]
print(f"\nTarget variable: {TARGET}")
print(f"Feature columns: {FEATURES}")
print(f"Number of features: {len(FEATURES)}")
else:
print("Execution stopped due to missing data file.")
Fetching weather data for stations: ['72405', '72406', '72408', '72502'] Weather data fetched and processed. Energy and weather data joined. Combined data shape: (136608, 14) Index monotonic? True Data Head: PJME_MW is_dayoff hour dayofweek dayofmonth month \ 2003-01-01 00:00:00 27008.0 1 0 2 1 1 2003-01-01 01:00:00 25591.0 1 1 2 1 1 2003-01-01 02:00:00 24235.0 1 2 2 1 1 2003-01-01 03:00:00 23121.0 1 3 2 1 1 2003-01-01 04:00:00 22445.0 1 4 2 1 1 year dayofyear weekofyear temp dwpt rhum prcp \ 2003-01-01 00:00:00 2003 1 1 9.450 4.675 73.25 0.0 2003-01-01 01:00:00 2003 1 1 9.300 4.750 74.00 0.0 2003-01-01 02:00:00 2003 1 1 7.800 4.400 79.25 0.0 2003-01-01 03:00:00 2003 1 1 7.775 4.200 79.00 0.0 2003-01-01 04:00:00 2003 1 1 7.350 4.050 80.50 0.0 wspd 2003-01-01 00:00:00 7.950 2003-01-01 01:00:00 7.400 2003-01-01 02:00:00 8.825 2003-01-01 03:00:00 6.950 2003-01-01 04:00:00 6.050 Data Tail: PJME_MW is_dayoff hour dayofweek dayofmonth month \ 2018-08-02 19:00:00 45641.0 0 19 3 2 8 2018-08-02 20:00:00 44057.0 0 20 3 2 8 2018-08-02 21:00:00 43256.0 0 21 3 2 8 2018-08-02 22:00:00 41552.0 0 22 3 2 8 2018-08-02 23:00:00 38500.0 0 23 3 2 8 year dayofyear weekofyear temp dwpt rhum \ 2018-08-02 19:00:00 2018 214 31 28.500 22.000 68.75 2018-08-02 20:00:00 2018 214 31 27.625 22.525 75.75 2018-08-02 21:00:00 2018 214 31 27.375 22.100 75.00 2018-08-02 22:00:00 2018 214 31 27.375 22.175 75.00 2018-08-02 23:00:00 2018 214 31 26.400 22.450 80.00 prcp wspd 2018-08-02 19:00:00 0.066667 18.100 2018-08-02 20:00:00 0.933333 17.100 2018-08-02 21:00:00 1.100000 13.850 2018-08-02 22:00:00 0.250000 11.525 2018-08-02 23:00:00 1.366667 12.100 Data Info: <class 'pandas.core.frame.DataFrame'> DatetimeIndex: 136608 entries, 2003-01-01 00:00:00 to 2018-08-02 23:00:00 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PJME_MW 136608 non-null float64 1 is_dayoff 136608 non-null int64 2 hour 136608 non-null int32 3 dayofweek 136608 non-null int32 4 dayofmonth 136608 non-null int32 5 month 136608 non-null int32 6 year 136608 non-null int32 7 dayofyear 136608 non-null int32 8 weekofyear 136608 non-null int64 9 temp 136608 non-null float64 10 dwpt 136608 non-null float64 11 rhum 136608 non-null float64 12 prcp 136608 non-null float64 13 wspd 136608 non-null float64 dtypes: float64(6), int32(6), int64(2) memory usage: 12.5 MB NaN check after processing: PJME_MW 0 is_dayoff 0 hour 0 dayofweek 0 dayofmonth 0 month 0 year 0 dayofyear 0 weekofyear 0 temp 0 dwpt 0 rhum 0 prcp 0 wspd 0 dtype: int64 Target variable: PJME_MW Feature columns: ['is_dayoff', 'hour', 'dayofweek', 'dayofmonth', 'month', 'year', 'dayofyear', 'weekofyear', 'temp', 'dwpt', 'rhum', 'prcp', 'wspd'] Number of features: 13
2. Train / Validation / Test Split¶
We split chronologically to avoid leakage:
- 80 % for training
- 10 % for validation
- 10 % for testing
We also plot the PJM load time series colored by split to visually confirm the partition.
if pjme_weather is not None:
total_hours = len(pjme_weather)
# Using the same 80/10/10 split ratios
test_split_idx = int(total_hours * 0.9)
val_split_idx = int(total_hours * 0.8)
train_df = pjme_weather.iloc[:val_split_idx].copy()
val_df = pjme_weather.iloc[val_split_idx:test_split_idx].copy()
test_df = pjme_weather.iloc[test_split_idx:].copy()
print(f"Train: {len(train_df)} rows ({train_df.index.min()} to {train_df.index.max()})")
print(f"Val : {len(val_df)} rows ({val_df.index.min()} to {val_df.index.max()})")
print(f"Test : {len(test_df)} rows ({test_df.index.min()} to {test_df.index.max()})")
# Visualize the split
plt.figure(figsize=(15, 5))
plt.plot(train_df.index, train_df[TARGET], label='Train')
plt.plot(val_df.index, val_df[TARGET], label='Validation')
plt.plot(test_df.index, test_df[TARGET], label='Test')
plt.title('PJME Energy Consumption - Train/Val/Test Split')
plt.xlabel('Date')
plt.ylabel('MW')
plt.legend()
plt.show()
else:
print("Skipping split due to missing data.")
Train: 109286 rows (2003-01-01 00:00:00 to 2015-06-21 13:00:00)
Val : 13661 rows (2015-06-21 14:00:00 to 2017-01-10 17:00:00)
Test : 13661 rows (2017-01-10 18:00:00 to 2018-08-02 23:00:00)
3. Data Scaling¶
We apply StandardScaler separately to features and the target so that our model trains more stably.
- Fit on the train set
- Transform train, val, and test
- Keep the target scaler for the later inverse-transform (a short usage sketch follows the scaling cell below)
if pjme_weather is not None:
# Scale features (exogenous variables)
feature_scaler = StandardScaler()
train_df[FEATURES] = feature_scaler.fit_transform(train_df[FEATURES])
val_df[FEATURES] = feature_scaler.transform(val_df[FEATURES])
test_df[FEATURES] = feature_scaler.transform(test_df[FEATURES])
# Scale target variable separately (important for inverse transform later)
target_scaler = StandardScaler()
train_df[[TARGET]] = target_scaler.fit_transform(train_df[[TARGET]])
val_df[[TARGET]] = target_scaler.transform(val_df[[TARGET]])
test_df[[TARGET]] = target_scaler.transform(test_df[[TARGET]])
print("Features and target variable scaled.")
print("Train DataFrame head after scaling:")
print(train_df.head())
else:
print("Skipping scaling due to missing data.")
Features and target variable scaled. Train DataFrame head after scaling: PJME_MW is_dayoff hour dayofweek dayofmonth \ 2003-01-01 00:00:00 -0.830722 1.477475 -1.661665 -0.500253 -1.672229 2003-01-01 01:00:00 -1.051358 1.477475 -1.517189 -0.500253 -1.672229 2003-01-01 02:00:00 -1.262495 1.477475 -1.372713 -0.500253 -1.672229 2003-01-01 03:00:00 -1.435951 1.477475 -1.228238 -0.500253 -1.672229 2003-01-01 04:00:00 -1.541209 1.477475 -1.083762 -0.500253 -1.672229 month year dayofyear weekofyear temp \ 2003-01-01 00:00:00 -1.566074 -1.593315 -1.691714 -1.665309 -0.405299 2003-01-01 01:00:00 -1.566074 -1.593315 -1.691714 -1.665309 -0.420108 2003-01-01 02:00:00 -1.566074 -1.593315 -1.691714 -1.665309 -0.568199 2003-01-01 03:00:00 -1.566074 -1.593315 -1.691714 -1.665309 -0.570667 2003-01-01 04:00:00 -1.566074 -1.593315 -1.691714 -1.665309 -0.612627 dwpt rhum prcp wspd 2003-01-01 00:00:00 -0.119048 0.547299 -0.221923 -0.763367 2003-01-01 01:00:00 -0.112001 0.587985 -0.221923 -0.840180 2003-01-01 02:00:00 -0.144889 0.872787 -0.221923 -0.641165 2003-01-01 03:00:00 -0.163682 0.859225 -0.221923 -0.903027 2003-01-01 04:00:00 -0.177777 0.940597 -0.221923 -1.028720
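As a quick reminder for the later evaluation steps, a short usage sketch of mapping model outputs back to megawatts (this assumes the `target_scaler` fitted above; the prediction array is just a placeholder):
import numpy as np

# Placeholder scaled predictions; in the evaluation step these come from the model.
y_pred_scaled = np.array([-0.83, -1.05, -1.26])

# StandardScaler expects a 2D array, so reshape, inverse-transform, then flatten back to MW.
y_pred_mw = target_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()
print(y_pred_mw)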
4. Model Building Blocks¶
PositionalEmbedding
- Role: Injects information about token position (i.e. time step) into the model, so it can distinguish "first hour" from "tenth hour."
- Key points:
  - Precomputes a fixed `(max_len × d_model)` matrix of sine/cosine values in log space (the standard formula is given just below this list).
  - No trainable weights, just a registered buffer.
  - On `forward(x)`, slices out the first `seq_len` rows, returning a `[seq_len, d_model]` tensor that is broadcast across the batch.
  - Guards against sequence lengths exceeding `max_len`.
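For reference, the sinusoidal encoding computed in the code cell below follows the standard Transformer formulation, with the frequency term evaluated in log space for numerical stability:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right)
$$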
TimeFeatureEmbedding
- Role: Projects raw calendar/time-of-day features into the model's embedding dimension.
- Key points:
  - For hourly data (`freq='h'`), expects 6 inputs:
    - hour of day
    - day of week
    - day of month
    - day of year
    - month
    - ISO week of year
  - Implements a single linear layer, `nn.Linear(d_inp, d_model, bias=False)` with `d_inp = 6`.
  - `forward(x)` simply applies that projection to each time step.
TokenEmbedding
- Role: Embeds the numeric values of each feature vector (exogenous or endogenous) into `d_model` dimensions.
- Key points:
  - Uses `self.value_embedding = nn.Linear(c_in, d_model)`.
  - On `forward(x)`, maps `[batch, seq_len, c_in]` → `[batch, seq_len, d_model]`.
  - (A commented-out Conv1d alternative hints at other possible architectures.)
FullAttention
- Role: Implements scaled dot-product attention with optional masking and dropout.
- Key points:
  - Computes raw scores via `torch.einsum("blhe,bshe->bhls", queries, keys)`.
  - Applies a mask (if `mask_flag=True`) to prevent attending to future/padded tokens.
  - Scales by `1/√E` (or a user-provided `scale`), then applies `softmax` + `Dropout`.
  - Returns the attended output of shape `[batch, seq_len, H*E]`, plus the attention map if `output_attention=True`.
AttentionLayer
- Role: Wraps a `FullAttention` module to handle multi-head projections and output reshaping.
- Key points:
  - Contains three linear layers that project the input into query/key/value spaces: `query_projection` maps `d_model → n_heads × d_keys`, with `key_projection` and `value_projection` doing likewise.
  - Calls `self.inner_attention`, then concatenates the `H` heads and applies `self.out_projection` back to `d_model`.
  - Ensures seamless integration of multi-head attention into the Transformer.
EarlyStopping
- Role: Monitors validation loss and halts training once no improvement is seen for `patience` epochs, saving the best checkpoint along the way so it can be restored afterwards (a minimal usage sketch follows this list).
- Key points:
  - Tracks the best validation loss (`val_loss_min`) and a counter of unimproved epochs.
  - On each `__call__(val_loss, model)`:
    - If the loss improved by at least `delta`, saves a new checkpoint via `torch.save(model.state_dict(), path)` and resets `counter`.
    - Otherwise increments `counter` and, if it reaches `patience`, sets `early_stop=True`.
  - Optional verbose logging traces when checkpoints are saved or early stopping is triggered.
Together, these building blocks—embeddings for time, value, and position, plus the core attention mechanics and a training‐time safeguard—form the foundation of TimeXer’s transformer‐style encoder and downstream prediction head.
# --- Positional Embedding ---
class PositionalEmbedding(nn.Module):
def __init__(self, d_model, max_len=5000):
super(PositionalEmbedding, self).__init__()
# Compute the positional encodings once in log space.
pe = torch.zeros(max_len, d_model).float()
        pe.requires_grad = False  # fixed sinusoidal table; no gradients needed (registered as a buffer below)
position = torch.arange(0, max_len).float().unsqueeze(1)
div_term = (torch.arange(0, d_model, 2).float() * -(np.log(10000.0) / d_model)).exp()
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
# Keep pe shape as [max_len, d_model] for easier slicing
# pe = pe.unsqueeze(0) # Remove batch dim here
self.register_buffer('pe', pe)
def forward(self, x):
# Input x shape: Tensor for which we need positional embedding, used to determine seq_len
# Output shape: [seq_len, d_model] - The calling function will handle batch expansion
seq_len = x.size(1) # Assuming x has shape [B, L, D] or similar where L is seq_len
if seq_len > self.pe.size(0):
raise ValueError(f"Sequence length {seq_len} exceeds maximum positional embedding length {self.pe.size(0)}")
# Return the slice corresponding to the sequence length
return self.pe[:seq_len, :]
# --- Time Feature Embedding (Part of DataEmbedding_inverted) ---
class TimeFeatureEmbedding(nn.Module):
def __init__(self, d_model, embed_type='timeF', freq='h'):
super(TimeFeatureEmbedding, self).__init__()
# Determine input dimension based on frequency
# Adjust based on the actual features computed in TimeSeriesDataset._compute_time_features
# Current features: hour, dayofweek, day, dayofyear, month, weekofyear -> 6 features
if freq == 'h':
# Match the number of features created in the Dataset class
d_inp = 6 # hour, dayofweek, dayofmonth, dayofyear, month, weekofyear
# Add other frequencies if needed
# elif freq == 't': d_inp = ...
else:
d_inp = 2 # Default fallback
self.embed = nn.Linear(d_inp, d_model, bias=False)
def forward(self, x):
# x shape: [Batch, SeqLen, NumTimeFeatures]
return self.embed(x) # Output: [Batch, SeqLen, d_model]
# --- Value Embedding (Part of DataEmbedding_inverted) ---
# Simple Linear layer for value embedding
class TokenEmbedding(nn.Module):
def __init__(self, c_in, d_model):
super(TokenEmbedding, self).__init__()
        padding = 1 if torch.__version__ >= '1.5.0' else 2  # only needed by the commented-out Conv1d alternative below
# Use Linear layer for value embedding per feature, seems more common than Conv1d here
# self.tokenConv = nn.Conv1d(in_channels=c_in, out_channels=d_model,
# kernel_size=3, padding=padding, padding_mode='circular', bias=False)
self.value_embedding = nn.Linear(c_in, d_model) # Embeds features directly
# Initialize weights
# for m in self.modules():
# if isinstance(m, nn.Conv1d):
# nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')
def forward(self, x):
# x shape: [Batch, SeqLen, Features]
# x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2) # If using Conv1d
x = self.value_embedding(x) # Using Linear
return x # Shape: [Batch, SeqLen, d_model]
# --- Data Embedding (Combined - Simplified version of DataEmbedding_inverted) ---
# This embedding is used for the *exogenous* variables in TimeXer
class DataEmbeddingExog(nn.Module):
def __init__(self, c_in_exog, c_in_time, d_model, embed_type='timeF', freq='h', dropout=0.1):
super(DataEmbeddingExog, self).__init__()
# Embeds the exogenous variable values
self.value_embedding = TokenEmbedding(c_in=c_in_exog, d_model=d_model)
# Embeds the time features (like hour, dayofweek)
self.temporal_embedding = TimeFeatureEmbedding(d_model=d_model, embed_type=embed_type, freq=freq)
# Positional embedding (sinusoidal) - Added here for consistency
self.position_embedding = PositionalEmbedding(d_model=d_model)
self.dropout = nn.Dropout(p=dropout)
def forward(self, x, x_mark):
# x: Exogenous variable values [Batch, SeqLen, NumExogVars (c_in_exog)]
# x_mark: Time features [Batch, SeqLen, NumTimeFeatures (c_in_time)]
# Get positional encoding slice [SeqLen, d_model]
# Need to pass a tensor with correct seq_len dimension (dim 1)
pos_enc_slice = self.position_embedding(x) # Pass x to get seq_len
# Apply embeddings and add them
# Expand pos_enc_slice to [Batch, SeqLen, d_model] for addition
x = self.value_embedding(x) + self.temporal_embedding(x_mark) + pos_enc_slice.unsqueeze(0).expand(x.shape[0], -1, -1)
return self.dropout(x) # Shape: [Batch, SeqLen, d_model]
# --- Attention Layers (standard scaled dot-product attention and multi-head wrapper) ---
class FullAttention(nn.Module):
def __init__(self, mask_flag=True, factor=5, scale=None, attention_dropout=0.1, output_attention=False):
super(FullAttention, self).__init__()
self.scale = scale
self.mask_flag = mask_flag
self.output_attention = output_attention
self.dropout = nn.Dropout(attention_dropout)
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
B, L, H, E = queries.shape
_, S, _, D = values.shape
scale = self.scale or 1. / np.sqrt(E)
scores = torch.einsum("blhe,bshe->bhls", queries, keys)
if self.mask_flag and attn_mask is not None:
# Ensure attn_mask is broadcastable
if attn_mask.dim() == 2: # [L, S]
attn_mask = attn_mask.unsqueeze(0).unsqueeze(0) # [1, 1, L, S]
elif attn_mask.dim() == 3: # [B, L, S]
attn_mask = attn_mask.unsqueeze(1) # [B, 1, L, S]
scores = scores.masked_fill(attn_mask == 0, -float('inf')) # Use float('-inf')
A = self.dropout(torch.softmax(scores * scale, dim=-1))
V = torch.einsum("bhls,bshe->blhe", A, values).contiguous() # Use contiguous here
if self.output_attention:
return (V.view(B, L, -1), A) # Reshape V back
else:
return (V.view(B, L, -1), None) # Reshape V back
class AttentionLayer(nn.Module):
def __init__(self, attention, d_model, n_heads, d_keys=None, d_values=None):
super(AttentionLayer, self).__init__()
d_keys = d_keys or (d_model // n_heads)
d_values = d_values or (d_model // n_heads)
self.inner_attention = attention
self.query_projection = nn.Linear(d_model, d_keys * n_heads)
self.key_projection = nn.Linear(d_model, d_keys * n_heads)
self.value_projection = nn.Linear(d_model, d_values * n_heads)
self.out_projection = nn.Linear(d_values * n_heads, d_model)
self.n_heads = n_heads
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
B, L, _ = queries.shape
_, S, _ = keys.shape
H = self.n_heads
# Project queries, keys, values
queries = self.query_projection(queries).view(B, L, H, -1)
keys = self.key_projection(keys).view(B, S, H, -1)
values = self.value_projection(values).view(B, S, H, -1)
# Apply attention mechanism
out, attn = self.inner_attention(
queries,
keys,
values,
attn_mask,
tau=tau, delta=delta
)
# Concatenate heads and apply final projection
out = self.out_projection(out) # Shape [B, L, d_model]
return out, attn
5. TimeXer Model Definition¶
EnEmbedding (Endogenous Embedding)
- Role: Converts the historical target series into a sequence of patch tokens plus a learnable "global" summary token, so the model can attend both locally (within each patch) and globally (across the entire history).
- Key points:
  - Patching: The input series is split into fixed-length windows (patches). Each patch is linearly projected into the model dimension.
  - Global token: A dedicated, trainable vector is appended to the patch tokens (it is the last token in the implementation below), allowing the encoder to gather summary information.
  - Positional encoding: Sinusoidal embeddings are added so the model knows each patch's relative position in time.
  - Output shape: Flattens to `[batch_size × n_vars, num_patches + 1, d_model]` for the downstream attention layers.
DataEmbeddingExog (Exogenous Embedding)
- Role: Embeds the auxiliary inputs (weather, calendar flags, etc.) both as raw values and as time features, then combines them into a unified token sequence for cross-attention.
- Key points:
  - Value embedding: A linear layer maps each exogenous feature vector at a given time step into `d_model` dimensions.
  - Time feature embedding: The same calendar features used elsewhere (hour, day-of-week, etc.) are projected into `d_model` and added.
  - Positional encoding: Sinusoidal embeddings ensure the model can distinguish the ordering of exogenous steps.
  - Dropout: Applied after summing all embeddings to regularize.
  - Result: Produces a `[batch_size, seq_len, d_model]` tensor that serves as key/value tokens in cross-attention.
Encoder (Stacked Transformer Layers with Cross-Attention)
- Role: Iteratively refines the endogenous patch+global tokens by attending to themselves (self-attention) and then letting the global token query the exogenous token set (cross-attention).
- Key points:
- Self-attention block: Each layer first applies multi-head attention over the patch+global sequence, capturing intra-series dependencies.
- Cross-attention block: The global token only attends to the exogenous embeddings (keys/values), allowing the model to inject weather/calendar context into the global summary.
- Feed-forward network: A two-stage 1×1 convolution (i.e. position-wise feed-forward) with residual connections and LayerNorm after each sub-block.
- Layer stacking: Multiple identical layers are chained, followed by a final normalization.
FlattenHead (Prediction Head)
- Role: Takes the final patch+global representations and collapses them into a fixed-size forecast window (e.g. the next 24 hours); a worked shape example follows this list.
- Key points:
  - Flattening: The last layer's output is reshaped so that the embedding and patch dimensions form a single feature vector per variable.
  - Linear mapping: A fully connected layer projects this flattened vector to the desired `pred_len`.
  - Dropout: Applied before the output to prevent overfitting.
  - Output shape: Produces `[batch_size, n_vars, pred_len]`, which for a univariate target reduces to `[batch_size, pred_len, 1]`.
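A worked shape example using the defaults configured later in Section 6 (seq_len=168, patch_len=24, d_model=128, pred_len=24), mirroring the `patch_num` and `head_nf` arithmetic in the model code below:
import numpy as np

seq_len, patch_len, d_model, pred_len = 168, 24, 128, 24

patch_num = int(np.ceil(seq_len / patch_len))   # 168 / 24 -> 7 patches
tokens_per_series = patch_num + 1               # + 1 learnable global token -> 8 tokens
head_nf = d_model * tokens_per_series           # 128 * 8 = 1024 flattened features per variable

print(patch_num, tokens_per_series, head_nf)    # 7 8 1024
# FlattenHead maps this 1024-dimensional vector to the pred_len = 24 hourly forecasts.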
Video("media/videos/manim/480p15/EmbeddingProcessVisualization.mp4", embed=True, width=720, height=405)
Video("media/videos/manim/480p15/PatchingConcept.mp4", embed=True, width=720, height=405)
class FlattenHead(nn.Module):
def __init__(self, n_vars, nf, target_window, head_dropout=0):
super().__init__()
self.n_vars = n_vars
self.flatten = nn.Flatten(start_dim=-2) # Flattens d_model and patch_num dimensions
self.linear = nn.Linear(nf, target_window)
self.dropout = nn.Dropout(head_dropout)
def forward(self, x): # x: [bs, nvars, d_model, patch_num]
# Flatten along dimensions d_model and patch_num
# Input shape to flatten: [bs * nvars, d_model, patch_num] (if reshaped before)
# Or handle [bs, nvars, d_model, patch_num] directly
# Let's assume input x is [bs, nvars, d_model, patch_num] based on comment
# Flatten d_model and patch_num dims. Input is [bs, nvars, d_model, patch_num]
# Result shape: [bs, nvars, d_model * patch_num]
x = self.flatten(x)
# Apply linear layer. Input is [bs, nvars, nf (=d_model*patch_num)]
# Output shape: [bs, nvars, target_window]
x = self.linear(x)
x = self.dropout(x)
return x
class EnEmbedding(nn.Module):
"""
Embedding for Endogenous variable(s).
Applies patching, value embedding, positional embedding, and adds a global token.
"""
def __init__(self, n_vars, d_model, patch_len, dropout):
super(EnEmbedding, self).__init__()
self.patch_len = patch_len
self.n_vars = n_vars
self.d_model = d_model
# Linear layer to embed each patch
self.value_embedding = nn.Linear(patch_len, d_model, bias=False)
# Learnable global token (one per endogenous variable)
self.glb_token = nn.Parameter(torch.randn(1, n_vars, 1, d_model))
# Positional embedding for patches
self.position_embedding = PositionalEmbedding(d_model) # Instantiate corrected PositionalEmbedding
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# Input x shape: [batch_size, seq_len, n_vars] -> expecting n_vars=1 based on paper for univariate endogenous
# Transpose to [batch_size, n_vars, seq_len] for unfolding
x = x.permute(0, 2, 1)
batch_size, n_vars_in, seq_len = x.shape
assert n_vars_in == self.n_vars, f"Input n_vars ({n_vars_in}) doesn't match model n_vars ({self.n_vars})"
# Patching using unfold
x_unfolded = x.unfold(dimension=-1, size=self.patch_len, step=self.patch_len)
num_patches = x_unfolded.shape[2] # Shape after unfold: [batch_size, n_vars, num_patches, patch_len]
# Reshape for linear embedding: [batch_size * n_vars * num_patches, patch_len]
x_reshaped = x_unfolded.permute(0, 1, 2, 3).reshape(-1, self.patch_len)
# Apply value embedding
x_embedded = self.value_embedding(x_reshaped) # Shape: [batch_size * n_vars * num_patches, d_model]
# Reshape back to [batch_size, n_vars, num_patches, d_model]
x_embedded = x_embedded.reshape(batch_size, n_vars_in, num_patches, self.d_model)
# --- Get and Add Positional Embedding ---
# Get positional encoding slice [num_patches, d_model]
# Pass a dummy tensor with correct seq_len (num_patches) to positional embedding
dummy_pos_input = torch.zeros(1, num_patches, self.d_model, device=x.device)
pos_encoding_slice = self.position_embedding(dummy_pos_input) # Shape: [num_patches, d_model]
# Expand slice and add to x_embedded
# Expand to [batch_size, n_vars, num_patches, d_model]
pos_embed = pos_encoding_slice.unsqueeze(0).unsqueeze(0).expand(batch_size, n_vars_in, -1, -1)
x_patched_embedded = x_embedded + pos_embed # Add positional embedding
# Prepare global token: repeat for batch and concatenate
glb = self.glb_token.repeat((batch_size, 1, 1, 1)) # Shape: [batch_size, n_vars, 1, d_model]
# Concatenate global token with patch tokens along the sequence (num_patches) dimension
x_with_glb = torch.cat([x_patched_embedded, glb], dim=2) # Shape: [batch_size, n_vars, num_patches + 1, d_model]
# Reshape for Transformer layers: [batch_size * n_vars, num_patches + 1, d_model]
x_final = x_with_glb.reshape(batch_size * n_vars_in, num_patches + 1, self.d_model)
return self.dropout(x_final), n_vars_in # Return reshaped tokens and original number of variables
class Encoder(nn.Module):
def __init__(self, layers, norm_layer=None):
super(Encoder, self).__init__()
self.layers = nn.ModuleList(layers)
self.norm = norm_layer
def forward(self, x, cross, x_mask=None, cross_mask=None, tau=None, delta=None):
# x: Endogenous tokens [B*n_vars_en, N+1, D]
# cross: Exogenous tokens [B, C, D] or potentially [B, SeqLen_exog, D] if using DataEmbeddingExog output
for layer in self.layers:
x = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask, tau=tau, delta=delta)
if self.norm is not None:
x = self.norm(x)
return x
class EncoderLayer(nn.Module):
def __init__(self, self_attention, cross_attention, d_model, n_vars_en, c_exog, d_ff=None,
dropout=0.1, activation="relu"):
super(EncoderLayer, self).__init__()
d_ff = d_ff or 4 * d_model
self.n_vars_en = n_vars_en # Number of endogenous variables
self.c_exog = c_exog # Number of exogenous variables
self.self_attention = self_attention # Attends over endogenous patches + global token
self.cross_attention = cross_attention # Attends global token (query) to exogenous variate tokens (key/value)
self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.norm3 = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
self.activation = F.relu if activation == "relu" else F.gelu
def forward(self, x, cross, x_mask=None, cross_mask=None, tau=None, delta=None):
# x: Endogenous tokens (Patch + Global) [Batch * n_vars_en, N+1, D]
# cross: Exogenous tokens [Batch, C, D] or [Batch, SeqLen_exog, D]
# --- Self-Attention over Endogenous Tokens (Patches + Global) ---
x_res = x # Store residual connection input
x_attn, _ = self.self_attention(
x, x, x,
attn_mask=x_mask,
tau=tau, delta=None
)
x = x_res + self.dropout(x_attn) # Add residual
x = self.norm1(x)
# x shape: [B * n_vars_en, N+1, D]
# Store output for later FFN residual connection
x_after_self_attn = x
# --- Cross-Attention: Endogenous Global Token <-> Exogenous Tokens ---
x_glb_en = x[:, -1, :] # Get current global token (output of norm1)
x_glb_en_query = x_glb_en.unsqueeze(1) # Shape: [B*n_vars_en, 1, D]
# Prepare exogenous 'cross' input
batch_size_x = x.shape[0]
batch_size_cross = cross.shape[0]
if batch_size_x != batch_size_cross:
if batch_size_x == batch_size_cross * self.n_vars_en:
cross_repeated = cross.repeat_interleave(self.n_vars_en, dim=0)
else:
raise ValueError("Batch size mismatch between endogenous and exogenous tokens.")
else:
cross_repeated = cross
# Apply cross-attention
x_glb_attn, _ = self.cross_attention(
x_glb_en_query, cross_repeated, cross_repeated,
attn_mask=cross_mask,
tau=tau, delta=delta
) # Shape: [B*n_vars_en, 1, D]
# Add result back to original global token (residual connection for cross-attention)
x_glb_updated = x_glb_en + self.dropout(x_glb_attn.squeeze(1)) # Remove seq dim 1
x_glb_updated = self.norm2(x_glb_updated) # Shape: [B*n_vars_en, D]
# --- Combine Patches and Updated Global Token ---
# *** CORRECTION: Avoid inplace modification ***
x_patches = x_after_self_attn[:, :-1, :] # Get patch tokens [B*n_vars_en, N, D]
x_glb_updated_unsqueezed = x_glb_updated.unsqueeze(1) # Shape [B*n_vars_en, 1, D]
# Create the new tensor 'x' for the FFN input
x_ffn_input = torch.cat([x_patches, x_glb_updated_unsqueezed], dim=1) # Shape [B*n_vars_en, N+1, D]
# --- Feed Forward Network ---
# Apply FFN to the combined tensor
y = x_ffn_input
y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1)))) # [B*n_vars_en, d_ff, N+1]
y = self.dropout(self.conv2(y).transpose(-1, 1)) # [B*n_vars_en, N+1, d_model]
# Final residual connection and norm
return self.norm3(x_ffn_input + y) # Add residual from *before* FFN
class Model(nn.Module):
"""
TimeXer Model for Forecasting with Exogenous Variables.
"""
def __init__(self, configs):
super(Model, self).__init__()
self.task_name = configs.task_name
self.features = configs.features # 'M' (multivariate) or 'S' (univariate target) or 'MS' (multi-variate input, single target)
self.seq_len = configs.seq_len
self.pred_len = configs.pred_len
self.use_norm = configs.use_norm # Instance Normalization
self.patch_len = configs.patch_len
self.patch_num = int(np.ceil(configs.seq_len / configs.patch_len)) # Use ceil for potential last partial patch
self.d_model = configs.d_model
# Determine number of endogenous and exogenous variables based on 'features' config
if self.features == 'S': # Univariate target, no exogenous in input tensor `x_enc`
self.n_vars_en = 1
self.n_vars_exog = 0
self.c_exog = configs.c_exog # Number of separate exogenous features passed potentially
elif self.features == 'M': # Multivariate targets, all vars in `x_enc` are endogenous
self.n_vars_en = configs.enc_in
self.n_vars_exog = 0
self.c_exog = configs.c_exog
elif self.features == 'MS': # Single endogenous target, others in `x_enc` are exogenous (TimeXer's intended main use)
self.n_vars_en = 1 # Typically 1 target variable
self.n_vars_exog = configs.enc_in - self.n_vars_en # Rest are exogenous in the input tensor `x_enc`
self.c_exog = self.n_vars_exog # Exogenous variables are handled by ExogEmbedding
# Store total number of input variables (endog + exog potentially)
self.total_vars = configs.enc_in
else:
raise ValueError(f"Unknown features type: {self.features}")
# --- Embeddings ---
# Endogenous Embedding (Patching + Global Token)
self.en_embedding = EnEmbedding(self.n_vars_en, configs.d_model, self.patch_len, configs.dropout)
# Exogenous Embedding (Handles time features and exogenous values)
# c_in should be the number of exogenous features + time features if concatenated
# Let's refine this: Exog embedding should probably take exogenous series values and time features separately.
# Using DataEmbeddingExog which combines value, temporal, and positional embedding.
# It expects `x` (exog values) and `x_mark` (time features).
# Exogenous Embedding (Handles time features and exogenous values)
if self.c_exog > 0:
# Get number of time features from dummy data (or config)
# Let's assume TimeSeriesDataset computes 6 time features for 'h' freq
num_time_features = 6 # Hardcoded based on TimeSeriesDataset._compute_time_features for freq='h'
# OR, get dynamically:
# temp_dataset = TimeSeriesDataset(train_df[:10], TARGET, pjme_weather.columns.tolist(), configs.seq_len, configs.pred_len, configs.freq)
# num_time_features = temp_dataset.time_features.shape[1]
# c_in_exog: number of exogenous variables
# c_in_time: number of time features
self.ex_embedding = DataEmbeddingExog(self.c_exog, num_time_features, configs.d_model, configs.embed, configs.freq, configs.dropout)
else:
self.ex_embedding = None # No exogenous variables
# --- Encoder ---
self.encoder = Encoder(
[
EncoderLayer(
# Self-Attention for Endogenous
AttentionLayer(
FullAttention(mask_flag=False, factor=configs.factor, attention_dropout=configs.dropout, output_attention=False),
configs.d_model, configs.n_heads),
# Cross-Attention for Endog Global <-> Exog Variate
AttentionLayer(
FullAttention(mask_flag=False, factor=configs.factor, attention_dropout=configs.dropout, output_attention=False),
configs.d_model, configs.n_heads),
configs.d_model,
n_vars_en = self.n_vars_en, # Pass number of endogenous vars
c_exog = self.c_exog, # Pass number of exogenous vars
d_ff=configs.d_ff,
dropout=configs.dropout,
activation=configs.activation
) for _ in range(configs.e_layers)
],
norm_layer=torch.nn.LayerNorm(configs.d_model)
)
# --- Prediction Head ---
# nf = output features from encoder = d_model * (patch_num + 1) # Flattened features per variable
self.head_nf = configs.d_model * (self.patch_num + 1) # Patches + Global token
# The head takes the output for *each* endogenous variable and predicts its future
self.head = FlattenHead(self.n_vars_en, self.head_nf, configs.pred_len, head_dropout=configs.dropout)
def forecast(self, x_enc, x_mark_enc, x_dec, x_mark_dec):
# x_enc: [Batch, SeqLen, NumFeatures] - Contains target and potentially exogenous features based on `self.features`
# x_mark_enc: [Batch, SeqLen, NumTimeFeatures] - Time features for input sequence
# x_dec: [Batch, PredLen (or SeqLen), NumFeatures] - Often used for decoder input, maybe just placeholder here
# x_mark_dec: [Batch, PredLen, NumTimeFeatures] - Time features for prediction window
# --- Instance Normalization ---
if self.use_norm:
# Normalize the target variable (last feature assuming 'MS' or 'S')
target_idx = -1 # Assuming target is the last column in x_enc for 'S'/'MS'
means = x_enc[:, :, target_idx].mean(1, keepdim=True).detach() # Mean across seq_len
x_enc_target = x_enc[:, :, target_idx] - means
stdev = torch.sqrt(torch.var(x_enc_target, dim=1, keepdim=True, unbiased=False) + 1e-5).detach()
x_enc_target_norm = x_enc_target / stdev
# Keep original means and stdev for denormalization
means = means.squeeze(1) # Shape [B, 1] -> [B]
stdev = stdev.squeeze(1) # Shape [B, 1] -> [B]
else:
x_enc_target_norm = x_enc[:, :, -1] # Use original target if no norm
# --- Prepare Endogenous and Exogenous Data ---
# Assuming 'MS' mode: x_enc has shape [B, L, total_vars], target is last column
x_endog = x_enc_target_norm.unsqueeze(-1) # [B, L, 1] - Normalized target variable
if self.c_exog > 0:
x_exog = x_enc[:, :, :-1] # [B, L, C] - Exogenous variables (already scaled)
else:
x_exog = None
# --- Embeddings ---
# Endogenous Embedding (Patching + Global Token)
# Input shape: [B, L, 1]
en_embed, n_vars_processed = self.en_embedding(x_endog)
# en_embed shape: [B * n_vars_en (1), N+1, D] = [B, N+1, D]
# Exogenous Embedding (Value + Time + Positional)
if self.ex_embedding is not None and x_exog is not None:
# Input shapes: x_exog [B, L, C], x_mark_enc [B, L, NumTimeFeatures]
ex_embed = self.ex_embedding(x_exog, x_mark_enc) # [B, L, D]
# How to convert [B, L, D] time-step embeddings to Variate Tokens [B, C, D]?
# Option 1: Average Pooling over sequence length
# ex_variate_tokens = torch.mean(ex_embed, dim=1) # Shape [B, D] - Problem: Loses variate info
# Option 2: Linear projection from [L, D] to [1, D] for each var? Seems complex.
# Option 3: Paper implies VariateEmbed(z^(i)) -> R^D. This suggests embedding *each* exog series *independently* of time features.
# Let's redefine Exogenous Embedding to match the paper's likely intent: Embed each series to a single vector.
# --- Revised Exogenous Embedding Strategy ---
# We need a VariateEmbed layer applied *before* the main forward pass, maybe in the Dataset or training loop.
# For now, let's *simulate* Variate Tokens by averaging the ExEmbedding output. This is a simplification.
# A better approach would be a separate embedding layer for variate tokens.
# Let's proceed with the simpler averaging for now, acknowledging it's not exactly the paper's description.
# A simpler ExEmbedding: Linear layer on flattened series?
# Let's try using the TimeFeatureEmbedding output as the 'cross' input, shape [B, L, D]. CrossAttention needs Query [B,1,D], Key/Value [B,L,D]
cross_input = ex_embed # Use time-step embeddings as key/value for cross-attn. Shape [B, L, D]
else:
# Handle case with no exogenous vars - cross_input should be dummy or None
# The EncoderLayer needs a valid 'cross' input. Let's pass zeros.
cross_input = torch.zeros(x_enc.shape[0], self.seq_len, self.d_model, device=x_enc.device)
# --- Encoder Forward Pass ---
# Input: en_embed [B, N+1, D], cross_input [B, L, D]
enc_out = self.encoder(en_embed, cross_input)
# enc_out shape: [B, N+1, D] (since n_vars_en=1)
# --- Prediction Head ---
# Reshape encoder output for the head
# Head expects [bs, nvars, d_model, patch_num+1] or similar
# Current enc_out: [B, N+1, D]
# Reshape to [B, n_vars_en (1), D, N+1] to match FlattenHead expectation if start_dim=-2 flattens last two dims
enc_out_reshaped = enc_out.unsqueeze(1).permute(0, 1, 3, 2) # [B, 1, D, N+1]
# FlattenHead expects nf = d_model * (patch_num + 1)
# Input to FlattenHead: [B, 1, D, N+1], should be flattened to [B, 1, D*(N+1)]
# Let's adjust FlattenHead or the input shape.
# If FlattenHead flattens last two dims: input [B, 1, D, N+1] -> output [B, 1, pred_len]
# Check FlattenHead nf calculation vs input shape
# Recalculating head_nf based on reshaped input [B, 1, D, N+1]
# Flattened dim = D * (N+1)
# Adjust head_nf if different from initial calculation
current_head_nf = self.d_model * (self.patch_num + 1)
# assert self.head_nf == current_head_nf, "Head NF mismatch" # Ensure consistency
# Apply head
dec_out = self.head(enc_out_reshaped) # Output: [B, 1, pred_len]
# Permute to [B, pred_len, 1]
dec_out = dec_out.permute(0, 2, 1)
# --- De-Normalization ---
if self.use_norm:
dec_out = dec_out * stdev.unsqueeze(1).unsqueeze(2).repeat(1, self.pred_len, 1) # stdev shape [B] -> [B, 1, 1]
dec_out = dec_out + means.unsqueeze(1).unsqueeze(2).repeat(1, self.pred_len, 1) # means shape [B] -> [B, 1, 1]
return dec_out # Shape: [B, pred_len, 1]
# forecast_multi implementation needs adjustment based on how multivariate inputs/outputs are handled.
# For 'MS' (single target), forecast_multi might not be needed, or should behave like forecast.
# If 'M' (multiple targets), the embeddings and head need modification.
# Sticking to 'MS' case as implied by the project description (forecasting grid load).
def forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec, mask=None):
# This forward function calls the appropriate forecast method based on task and features.
# For the current setup ('MS' features, forecasting task), we use the 'forecast' method.
if self.task_name == 'long_term_forecast' or self.task_name == 'short_term_forecast':
if self.features == 'MS' or self.features == 'S':
dec_out = self.forecast(x_enc, x_mark_enc, x_dec, x_mark_dec)
return dec_out[:, -self.pred_len:, :] # Ensure only pred_len steps are returned [B, L, D]
# elif self.features == 'M':
# dec_out = self.forecast_multi(x_enc, x_mark_enc, x_dec, x_mark_dec) # Requires forecast_multi implementation
# return dec_out[:, -self.pred_len:, :]
else:
raise ValueError(f"Features type {self.features} not implemented in forward pass.")
else:
# Handle other tasks like imputation or classification if needed
return None
6. Configuration & Hyperparameters¶
Set up both the model hyperparameters (e.g. `d_model`, `n_heads`, `e_layers`) and the training settings (`learning_rate`, `batch_size`, epochs, patience).
These defaults give us a strong starting point but will later be tuned via Optuna.
if pjme_weather is not None:
# Model Hyperparameters (adjust as needed)
# These are examples; tuning is likely required.
configs_dict = {
'task_name': 'short_term_forecast', # Or 'long_term_forecast'
'features': 'MS', # Multivariate input, Single target ('PJME_MW')
        'seq_len': 168,          # Input sequence length (e.g., 7 days of hourly data)
        'pred_len': 24,          # Output prediction length (e.g., 1 day ahead)
        'patch_len': 24,         # Patch length
'enc_in': pjme_weather.shape[1], # Total number of features (target + exogenous)
'c_exog': len(FEATURES), # Number of exogenous features
        'd_model': 128,          # Model dimension (kept small due to potential memory constraints)
'n_heads': 8, # Number of attention heads
        'e_layers': 2,           # Number of encoder layers
'd_ff': 256, # Feedforward dimension (adjust based on d_model)
'dropout': 0.1, # Dropout rate
'activation': 'gelu', # Activation function
'use_norm': True, # Use Instance Normalization
'embed': 'timeF', # Type of time embedding ('timeF', 'fixed', 'learned')
'freq': 'h', # Frequency of data ('h' for hourly)
'factor': 3 # Attention factor (for potential future use in attention variants)
}
# Training Hyperparameters
    learning_rate = 0.0001
batch_size = 32
num_epochs = 50 # Increased epochs, but early stopping will likely trigger
patience = 5 # Early stopping patience
configs = Configs(**configs_dict)
print("Configurations set:")
for key, value in configs_dict.items():
print(f"- {key}: {value}")
else:
print("Skipping configuration due to missing data.")
Configurations set:
- task_name: short_term_forecast
- features: MS
- seq_len: 168
- pred_len: 24
- patch_len: 24
- enc_in: 14
- c_exog: 13
- d_model: 128
- n_heads: 8
- e_layers: 2
- d_ff: 256
- dropout: 0.1
- activation: gelu
- use_norm: True
- embed: timeF
- freq: h
- factor: 3
7. Dataset & DataLoader¶
- TimeSeriesDataset
  - Builds sliding input windows of length `seq_len` plus prediction windows of length `pred_len`
  - Precomputes normalized time features for each timestamp
- Wrap these in PyTorch DataLoaders for batching and shuffling
class TimeSeriesDataset(Dataset):
def __init__(self, data_df, target_col, feature_cols, sequence_length, prediction_length, freq='h'):
self.sequence_length = sequence_length
self.prediction_length = prediction_length
self.target_col = target_col
self.feature_cols = feature_cols # All columns (target + exog)
self.freq = freq
# Ensure target is the last column for easy slicing later if needed by norm/model
if data_df.columns[-1] != target_col:
cols = [col for col in data_df.columns if col != target_col] + [target_col]
data_df = data_df[cols]
print(f"Reordered columns. Target '{target_col}' is now last.")
self.feature_cols = data_df.columns.tolist() # Update feature list
self.data = data_df[self.feature_cols].values # Use all columns for x_enc
self.target_data = data_df[[self.target_col]].values # Explicitly target
self.index = data_df.index
# Precompute time features
self.time_features = self._compute_time_features(self.index, freq=self.freq)
def _compute_time_features(self, dt_index, freq='h'):
# Based on freq, generate time features
# Example for hourly frequency: Hour, DayOfWeek, DayOfMonth, DayOfYear, Month, WeekOfYear
if freq == 'h':
features = [
dt_index.hour.to_numpy(),
dt_index.dayofweek.to_numpy(),
dt_index.day.to_numpy(),
dt_index.dayofyear.to_numpy(),
dt_index.month.to_numpy(),
#dt_index.weekofyear is deprecated, use isocalendar().week
dt_index.isocalendar().week.to_numpy(dtype=np.float32) # Ensure correct dtype
]
# Add more frequencies if needed (e.g., 't' for minutely)
# elif freq == 't':
# features = [...]
else:
# Default minimal features
features = [dt_index.dayofyear.to_numpy(), dt_index.month.to_numpy()]
# Stack features and normalize (e.g., scale to [0, 1] or [-0.5, 0.5])
time_features = np.vstack(features).transpose().astype(np.float32)
# Simple normalization example (min-max to [0, 1]) - apply per feature
min_vals = time_features.min(axis=0)
max_vals = time_features.max(axis=0)
range_vals = max_vals - min_vals
range_vals[range_vals == 0] = 1.0 # Avoid division by zero for constant features
time_features = (time_features - min_vals) / range_vals
return time_features
def __len__(self):
# Total length minus sequence length and prediction length
return len(self.data) - self.sequence_length - self.prediction_length + 1
def __getitem__(self, idx):
# Indices for input sequence
seq_start = idx
seq_end = idx + self.sequence_length
# Indices for target sequence (prediction window)
pred_start = seq_end
pred_end = pred_start + self.prediction_length
# --- Input Data ---
# x_enc: Input sequence including target and exogenous features
x_enc = self.data[seq_start:seq_end]
# x_mark_enc: Time features for the input sequence
x_mark_enc = self.time_features[seq_start:seq_end]
# --- Target Data ---
# y: True target values for the prediction window
y = self.target_data[pred_start:pred_end]
# y_mark / x_mark_dec: Time features for the prediction window
x_mark_dec = self.time_features[pred_start:pred_end]
# --- Decoder Input (Placeholders for Encoder-Only Model) ---
# x_dec: Often zeros or last part of input sequence + zeros
# Let's use zeros for simplicity
x_dec = np.zeros((self.prediction_length, x_enc.shape[1])) # Match feature dim
return {
'x_enc': torch.tensor(x_enc, dtype=torch.float32),
'x_mark_enc': torch.tensor(x_mark_enc, dtype=torch.float32),
'x_dec': torch.tensor(x_dec, dtype=torch.float32), # Placeholder
'x_mark_dec': torch.tensor(x_mark_dec, dtype=torch.float32),
'y': torch.tensor(y, dtype=torch.float32) # Target values
}
if pjme_weather is not None:
# Create datasets
train_dataset = TimeSeriesDataset(train_df, TARGET, pjme_weather.columns.tolist(), configs.seq_len, configs.pred_len, configs.freq)
val_dataset = TimeSeriesDataset(val_df, TARGET, pjme_weather.columns.tolist(), configs.seq_len, configs.pred_len, configs.freq)
test_dataset = TimeSeriesDataset(test_df, TARGET, pjme_weather.columns.tolist(), configs.seq_len, configs.pred_len, configs.freq)
# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0) # num_workers=0 for main process
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=0)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=0)
print(f"Datasets created: Train={len(train_dataset)}, Val={len(val_dataset)}, Test={len(test_dataset)}")
print(f"DataLoaders created with batch size: {batch_size}")
# Example batch check
try:
batch = next(iter(train_loader))
print("\nSample batch shapes:")
for key, value in batch.items():
print(f"- {key}: {value.shape}")
print(f"Target feature index in x_enc: -1 (assumed)")
except StopIteration:
print("\nWarning: Train loader is empty. Check data splitting and dataset length.")
except Exception as e:
print(f"\nError checking batch: {e}")
else:
print("Skipping Dataset/DataLoader creation due to missing data.")
Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Datasets created: Train=109095, Val=13470, Test=13470 DataLoaders created with batch size: 32 Sample batch shapes: - x_enc: torch.Size([32, 168, 14]) - x_mark_enc: torch.Size([32, 168, 6]) - x_dec: torch.Size([32, 24, 14]) - x_mark_dec: torch.Size([32, 24, 6]) - y: torch.Size([32, 24, 1]) Target feature index in x_enc: -1 (assumed)
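As a quick sanity check on the printed sizes, each split yields len(df) - seq_len - pred_len + 1 sliding windows; the small arithmetic below reverses that formula (the implied row counts are derived here, not printed by the notebook).
# Quick sanity check on the printed dataset sizes: each split yields
# len(df) - seq_len - pred_len + 1 sliding windows, so reversing the formula
# gives the implied number of hourly rows per split (derived, not printed).
seq_len, pred_len = 168, 24
for name, n_windows in [("train", 109_095), ("val", 13_470), ("test", 13_470)]:
    implied_rows = n_windows + seq_len + pred_len - 1
    print(f"{name}: {n_windows} windows -> {implied_rows} implied hourly rows")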
8. Training Loop¶
- Initialize the Model, MSELoss, Adam optimizer, and EarlyStopping
- For each epoch:
- Train over batches, accumulate loss
- Evaluate on validation set
- Check early stopping
- Plot train vs. validation loss to verify convergence
if pjme_weather is not None:
# Initialize Model, Loss, Optimizer
model = Model(configs).to(device)
criterion = nn.MSELoss() # L2 Loss [cite: 109]
optimizer = optim.Adam(model.parameters(), lr=learning_rate) # [cite: 289]
early_stopping = EarlyStopping(patience=patience, verbose=True, path='timexer_checkpoint.pt')
print("Model initialized:")
print(model)
# Count parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal trainable parameters: {total_params:,}")
train_losses = []
val_losses = []
print("\nStarting Training...")
start_time = time.time()
for epoch in range(num_epochs):
epoch_start_time = time.time()
model.train()
running_loss = 0.0
batch_count = 0
for i, batch in enumerate(train_loader):
# Move batch to device
x_enc = batch['x_enc'].to(device)
x_mark_enc = batch['x_mark_enc'].to(device)
x_dec = batch['x_dec'].to(device)
x_mark_dec = batch['x_mark_dec'].to(device)
y_true = batch['y'].to(device)
optimizer.zero_grad()
# Forward pass
outputs = model(x_enc, x_mark_enc, x_dec, x_mark_dec)
# Ensure output and target shapes match
# Output: [B, pred_len, 1], Target: [B, pred_len, 1]
loss = criterion(outputs, y_true)
# Backward pass and optimize
loss.backward()
optimizer.step()
running_loss += loss.item()
batch_count += 1
# Optional: Print progress within epoch
# if (i + 1) % 100 == 0:
# print(f' Epoch {epoch+1}, Batch {i+1}/{len(train_loader)}, Loss: {loss.item():.6f}')
avg_train_loss = running_loss / batch_count
train_losses.append(avg_train_loss)
# --- Validation ---
model.eval()
running_val_loss = 0.0
val_batch_count = 0
with torch.no_grad():
for batch in val_loader:
x_enc = batch['x_enc'].to(device)
x_mark_enc = batch['x_mark_enc'].to(device)
x_dec = batch['x_dec'].to(device)
x_mark_dec = batch['x_mark_dec'].to(device)
y_true = batch['y'].to(device)
outputs = model(x_enc, x_mark_enc, x_dec, x_mark_dec)
loss = criterion(outputs, y_true)
running_val_loss += loss.item()
val_batch_count += 1
avg_val_loss = running_val_loss / val_batch_count
val_losses.append(avg_val_loss)
epoch_time = time.time() - epoch_start_time
print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_train_loss:.6f}, Val Loss: {avg_val_loss:.6f}, Time: {epoch_time:.2f}s')
# --- Early Stopping ---
early_stopping(avg_val_loss, model)
if early_stopping.early_stop:
print("Early stopping triggered.")
break
total_training_time = time.time() - start_time
print(f'\nTraining Finished. Total time: {total_training_time:.2f}s')
# Load the best model saved by early stopping
print("Loading best model checkpoint...")
try:
model.load_state_dict(torch.load('timexer_checkpoint.pt'))
print("Best model loaded successfully.")
except FileNotFoundError:
print("Warning: Checkpoint file not found. Using the model from the last epoch.")
except Exception as e:
print(f"Error loading checkpoint: {e}. Using the model from the last epoch.")
# Plot training and validation loss
plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.title('Model Loss During Training')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.show()
else:
print("Skipping Training due to missing data.")
model = None # Ensure model is None if not trained
Model initialized: Model( (en_embedding): EnEmbedding( (value_embedding): Linear(in_features=24, out_features=128, bias=False) (position_embedding): PositionalEmbedding() (dropout): Dropout(p=0.1, inplace=False) ) (ex_embedding): DataEmbeddingExog( (value_embedding): TokenEmbedding( (value_embedding): Linear(in_features=13, out_features=128, bias=True) ) (temporal_embedding): TimeFeatureEmbedding( (embed): Linear(in_features=6, out_features=128, bias=False) ) (position_embedding): PositionalEmbedding() (dropout): Dropout(p=0.1, inplace=False) ) (encoder): Encoder( (layers): ModuleList( (0-1): 2 x EncoderLayer( (self_attention): AttentionLayer( (inner_attention): FullAttention( (dropout): Dropout(p=0.1, inplace=False) ) (query_projection): Linear(in_features=128, out_features=128, bias=True) (key_projection): Linear(in_features=128, out_features=128, bias=True) (value_projection): Linear(in_features=128, out_features=128, bias=True) (out_projection): Linear(in_features=128, out_features=128, bias=True) ) (cross_attention): AttentionLayer( (inner_attention): FullAttention( (dropout): Dropout(p=0.1, inplace=False) ) (query_projection): Linear(in_features=128, out_features=128, bias=True) (key_projection): Linear(in_features=128, out_features=128, bias=True) (value_projection): Linear(in_features=128, out_features=128, bias=True) (out_projection): Linear(in_features=128, out_features=128, bias=True) ) (conv1): Conv1d(128, 256, kernel_size=(1,), stride=(1,)) (conv2): Conv1d(256, 128, kernel_size=(1,), stride=(1,)) (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True) (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True) (norm3): LayerNorm((128,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True) ) (head): FlattenHead( (flatten): Flatten(start_dim=-2, end_dim=-1) (linear): Linear(in_features=1024, out_features=24, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) ) Total trainable parameters: 428,184 Starting Training... Epoch 1/50, Train Loss: 0.179126, Val Loss: 0.090551, Time: 74.61s Validation loss decreased (inf --> 0.090551). Saving model ... Epoch 2/50, Train Loss: 0.147890, Val Loss: 0.086291, Time: 75.00s Validation loss decreased (0.090551 --> 0.086291). Saving model ... Epoch 3/50, Train Loss: 0.139661, Val Loss: 0.079674, Time: 78.05s Validation loss decreased (0.086291 --> 0.079674). Saving model ... Epoch 4/50, Train Loss: 0.134318, Val Loss: 0.076228, Time: 72.99s Validation loss decreased (0.079674 --> 0.076228). Saving model ... Epoch 5/50, Train Loss: 0.130756, Val Loss: 0.076344, Time: 77.78s EarlyStopping counter: 1 out of 5 Epoch 6/50, Train Loss: 0.127994, Val Loss: 0.072837, Time: 72.98s Validation loss decreased (0.076228 --> 0.072837). Saving model ... Epoch 7/50, Train Loss: 0.125897, Val Loss: 0.075359, Time: 70.22s EarlyStopping counter: 1 out of 5 Epoch 8/50, Train Loss: 0.123730, Val Loss: 0.070576, Time: 77.31s Validation loss decreased (0.072837 --> 0.070576). Saving model ... 
Epoch 9/50, Train Loss: 0.122007, Val Loss: 0.072582, Time: 78.92s EarlyStopping counter: 1 out of 5 Epoch 10/50, Train Loss: 0.120174, Val Loss: 0.078044, Time: 75.69s EarlyStopping counter: 2 out of 5 Epoch 11/50, Train Loss: 0.118237, Val Loss: 0.078026, Time: 77.04s EarlyStopping counter: 3 out of 5 Epoch 12/50, Train Loss: 0.115938, Val Loss: 0.074494, Time: 85.69s EarlyStopping counter: 4 out of 5 Epoch 13/50, Train Loss: 0.114572, Val Loss: 0.077988, Time: 78.00s EarlyStopping counter: 5 out of 5 Early stopping triggered. Training Finished. Total time: 994.42s Loading best model checkpoint... Best model loaded successfully.
Over thirteen epochs of training, we see a steady decline in both training and validation MSE, with a few key observations:
- Rapid initial improvement: in the first two epochs, training loss falls sharply from ~0.18 to ~0.15 and validation loss drops from ~0.091 to ~0.086, showing the model quickly learns the largest, easiest-to-capture patterns.
- Slower, steady convergence: from Epoch 3 onward, training loss decreases more gradually (down to ~0.115 by Epoch 13), indicating the model is fine-tuning its internal representations. Validation loss bottoms out around Epoch 8 at ~0.071, then fluctuates between ~0.071 and ~0.078 thereafter.
- Generalization gap: the training curve remains above the validation curve throughout, suggesting we are not overfitting; if anything, the model may still have capacity to squeeze out a bit more train-set performance without harming generalization.
- Early stopping behavior: because validation loss did not improve for five consecutive epochs after its minimum, early stopping kicked in, preventing wasted computation and guarding against potential overfitting.
- Takeaway: the model converged in roughly a dozen epochs to a stable validation MSE around 0.07. This stability, plus a modest generalization gap, indicates the chosen architecture and hyperparameters yield a robust fit; further gains would likely require architectural tweaks (deeper layers, alternate embedding strategies) or more data rather than simply longer training.
9. Model Evaluation¶
- Run the best-saved model on the test set
- Inverse-transform predictions and actuals
- Compute RMSE, MAE, and MAPE to quantify forecasting accuracy
if model is not None and pjme_weather is not None:
print("\nStarting Evaluation on Test Set...")
model.eval()
predictions = []
actuals = []
with torch.no_grad():
for batch in test_loader:
x_enc = batch['x_enc'].to(device)
x_mark_enc = batch['x_mark_enc'].to(device)
x_dec = batch['x_dec'].to(device)
x_mark_dec = batch['x_mark_dec'].to(device)
y_true = batch['y'].to(device) # Scaled actuals
outputs = model(x_enc, x_mark_enc, x_dec, x_mark_dec) # Scaled predictions
# Inverse transform using the target scaler
# Output shape: [B, pred_len, 1]
# Need to reshape to 2D for scaler: [B * pred_len, 1]
preds_inv = target_scaler.inverse_transform(outputs.cpu().numpy().reshape(-1, 1))
actuals_inv = target_scaler.inverse_transform(y_true.cpu().numpy().reshape(-1, 1))
predictions.extend(preds_inv.flatten().tolist())
actuals.extend(actuals_inv.flatten().tolist())
# Calculate metrics
predictions = np.array(predictions)
actuals = np.array(actuals)
# Ensure shapes are reasonable (e.g., matching length)
print(f"Shape of predictions array: {predictions.shape}")
print(f"Shape of actuals array: {actuals.shape}")
if len(predictions) == len(actuals) and len(predictions) > 0:
rmse = np.sqrt(mean_squared_error(actuals, predictions))
mae = mean_absolute_error(actuals, predictions)
mape = mean_absolute_percentage_error(actuals, predictions)
print("\nEvaluation Metrics on Test Set (Inverse Scaled):")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"MAPE: {mape:.4f}%")
# Store results for comparison if needed
results = {'RMSE': rmse, 'MAE': mae, 'MAPE': mape}
else:
print("Error: Predictions and actuals lengths mismatch or are empty. Cannot calculate metrics.")
results = None
else:
print("Skipping Evaluation because the model was not trained or data was missing.")
results = None
Starting Evaluation on Test Set... Shape of predictions array: (323280,) Shape of actuals array: (323280,) Evaluation Metrics on Test Set (Inverse Scaled): RMSE: 1941.5523 MAE: 1351.1955 MAPE: 4.2878%
After loading the best-saved model and running it on the held-out test set, we obtain the following key performance metrics (inverse-scaled back to megawatts):
- RMSE (Root Mean Squared Error): ~1,942 MW
- MAE (Mean Absolute Error): ~1,351 MW
- MAPE (Mean Absolute Percentage Error): 4.29 %
Interpretation:
- An MAE of ~1,350 MW means that on average our one-day-ahead predictions miss the true load by about 1.3 GW. Given that PJM peak demand often exceeds 50 GW, this corresponds to a relatively small absolute error.
- The MAPE of 4.3 % confirms that, on average, errors remain within a tight ±5 % band, which is competitive for short-term grid forecasting.
- The RMSE is naturally higher than the MAE (squaring penalizes large deviations more heavily), but the gap is moderate, indicating that very large errors are comparatively rare.
Generalization & Robustness:
- Test-set errors closely mirror the validation-set behavior (a validation MSE of ≈ 0.07 corresponds to roughly 1,640 MW RMSE once converted back to megawatts; see the conversion sketch after this list), demonstrating that the model generalizes well and did not overfit.
- The consistency between training, validation, and test losses indicates stable learning: the model’s inductive biases (patching plus cross-attention) are well-suited to this problem.
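The conversion referenced above can be sanity-checked with the fitted target scaler. A minimal sketch, assuming target_scaler is a StandardScaler fitted on the PJME_MW column (an assumption; a MinMaxScaler would require a different factor):
# Minimal conversion sketch. For a StandardScaler, errors in scaled units map
# back to MW by multiplying by the fitted standard deviation (scale_).
import numpy as np
val_mse_scaled = 0.0706                         # best validation MSE (scaled units)
mw_per_scaled_unit = float(target_scaler.scale_[0])
approx_val_rmse_mw = np.sqrt(val_mse_scaled) * mw_per_scaled_unit
print(f"Approximate validation RMSE: {approx_val_rmse_mw:,.0f} MW")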
Next Steps:
- Error analysis: Drill into timestamps or conditions (e.g. extreme weather, holidays, ramp events) where residuals spike, to identify missing drivers or feature gaps.
- Benchmarking: Run the same RMSE/MAE/MAPE calculations on your previous XGBoost and ARIMA–LSTM–CNN baselines to quantify TimeXer’s relative gains.
- Extended hyperparameter optimization with Optuna: Now that the core pipeline is working, expand your Optuna study to include architecture parameters—number of encoder layers, attention heads, dropout rate, even patch length—and increase the trial budget to explore a richer search space.
- Ensemble strategies: Combine TimeXer predictions with a simpler persistence model or a classical time-series forecaster to smooth out extreme errors, potentially reducing RMSE on peak-demand hours (a minimal persistence-baseline sketch follows this list).
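As a concrete starting point for the benchmarking and ensembling ideas above, here is a minimal sketch of a 24-hour persistence baseline; raw_test_load is a hypothetical pandas Series of unscaled hourly PJME_MW values over the test period, so adapt the name to the notebook's own variables before running.
# Hypothetical persistence baseline: `raw_test_load` is assumed to be a pandas
# Series of unscaled hourly PJME_MW values over the test period. The forecast
# is simply "same hour yesterday".
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
persistence_pred = raw_test_load.shift(24).dropna()
aligned_actual = raw_test_load.loc[persistence_pred.index]
rmse = np.sqrt(mean_squared_error(aligned_actual, persistence_pred))
mae = mean_absolute_error(aligned_actual, persistence_pred)
print(f"Persistence baseline -> RMSE: {rmse:,.0f} MW, MAE: {mae:,.0f} MW")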
10. Forecast Visualization¶
Plot a sample week and the full test‐period predictions vs. actuals to visually assess model performance.
Save the full-period plot for reporting.
if results is not None and pjme_weather is not None:
print("\nVisualizing Predictions vs Actuals (Sample from Test Set)...")
# We need the timestamps for the test predictions
# The number of predictions might not exactly match the test set length
# due to the sequence generation (len(test_set) - seq_len - pred_len + 1 batches)
# Let's get the start index of the test predictions
test_pred_start_idx = len(train_df) + len(val_df) + configs.seq_len
test_pred_end_idx = test_pred_start_idx + len(predictions) # Assuming predictions cover the possible range
if test_pred_end_idx > len(pjme_weather):
test_pred_end_idx = len(pjme_weather)
print(f"Adjusting prediction end index to match data length: {test_pred_end_idx}")
if test_pred_start_idx < test_pred_end_idx and len(predictions) > 0:
# Select the corresponding timestamps from the original dataframe
pred_index = pjme_weather.index[test_pred_start_idx:test_pred_end_idx]
# Ensure the length matches the number of predictions
if len(pred_index) != len(predictions):
print(f"Warning: Length mismatch between prediction index ({len(pred_index)}) and predictions ({len(predictions)}). Truncating index.")
min_len = min(len(pred_index), len(predictions))
pred_index = pred_index[:min_len]
predictions_to_plot = predictions[:min_len]
actuals_to_plot = actuals[:min_len]
else:
predictions_to_plot = predictions
actuals_to_plot = actuals
# Create a DataFrame for plotting
plot_df = pd.DataFrame({
'Actual': actuals_to_plot,
'Predicted': predictions_to_plot
}, index=pred_index)
# Plot a sample (e.g., first N hours or a specific period)
sample_size = 24 * 7 * 1 # Plot one week
plt.figure(figsize=(18, 6))
plt.plot(plot_df.index[:sample_size], plot_df['Actual'][:sample_size], label='Actual', marker='.', linestyle='-')
plt.plot(plot_df.index[:sample_size], plot_df['Predicted'][:sample_size], label='Predicted', marker='.', linestyle='--')
plt.title(f'TimeXer Forecast vs Actuals (First {sample_size} hours of Test Predictions)')
plt.xlabel('Timestamp')
plt.ylabel(f'{TARGET} (MW)')
plt.legend()
plt.grid(True)
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
else:
print("Cannot generate plot due to index/prediction length issues.")
elif pjme_weather is None:
print("Skipping Visualization due to missing data.")
else:
print("Skipping Visualization because evaluation metrics could not be calculated.")
Visualizing Predictions vs Actuals (Sample from Test Set)... Adjusting prediction end index to match data length: 136608 Warning: Length mismatch between prediction index (13493) and predictions (323280). Truncating index.
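The length-mismatch warning above is expected: the evaluation loop flattens overlapping 24-hour windows, so predictions holds 24 values per forecast origin rather than one per timestamp, and the plotted sample therefore mixes horizons. A minimal sketch of one way to recover a single series per timestamp is to keep only the first forecast step of each window (this assumes the window-major flattening order used in the evaluation loop above):
# One value per timestamp: keep only the first forecast step (h = 1) of each
# 24-hour window. Assumes `predictions`/`actuals` were flattened window by
# window, as in the evaluation loop above.
import numpy as np
pred_len = 24
preds_by_window = np.asarray(predictions).reshape(-1, pred_len)
actuals_by_window = np.asarray(actuals).reshape(-1, pred_len)
one_step_preds = preds_by_window[:, 0]      # h = 1 forecast for each origin
one_step_actuals = actuals_by_window[:, 0]
print(one_step_preds.shape)                 # one value per test window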
Forecast vs. Actuals (First 168 Hours of Test Set)
- Tight alignment of diurnal cycles: Over each 24-hour block you can see the strong daily “peak→valley→peak” pattern in PJM load. TimeXer closely tracks both the timing and magnitude of those peaks and troughs, indicating it has learned the underlying calendar- and weather-driven rhythm.
- Slight underestimation of peaks: On most days the dashed line (prediction) reaches marginally lower than the solid line (actual) at the daily high point, typically by a few hundred MW. This suggests a small negative bias at extreme load conditions.
- Mild over-smoothing of troughs: Conversely, the model’s valley forecasts tend to sit a bit above the true minima. In other words, it doesn’t drop quite as low overnight, implying a touch of over-regularization or insufficient sensitivity to the cold-weather dip.
- No systematic phase shift: Peaks and valleys line up almost exactly in time; there is no multi-hour lag. That confirms TimeXer’s positional embeddings and attention mechanisms are correctly capturing temporal ordering.
- Consistency through the week: Across weekdays and weekend hours alike, the model’s error remains stable. There is no particular day where performance suddenly degrades, which speaks to robust generalization.
Implications
- The small peak-under and trough-over biases could be remedied by adding a calibration layer or by augmenting extreme‐condition features (e.g., holiday flags or rapid temperature drops).
- Overall, sub-5 % MAPE and visually tight overlay demonstrate that TimeXer produces highly reliable one-day-ahead forecasts, making it suitable for operational scheduling and dispatch support.
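If the peak/trough biases noted above persist, a lightweight first step is a post-hoc affine calibration fitted on validation forecasts. The sketch below is illustrative only; val_preds_mw, val_actuals_mw, and test_preds_mw are hypothetical arrays of inverse-scaled (MW) forecasts and actuals, not variables defined in this notebook.
# Illustrative post-hoc calibration: fit y_true ~ a * y_pred + b on validation
# forecasts, then apply the same affine correction to test forecasts.
# `val_preds_mw`, `val_actuals_mw`, and `test_preds_mw` are hypothetical arrays.
import numpy as np
from sklearn.linear_model import LinearRegression
calib = LinearRegression()
calib.fit(np.asarray(val_preds_mw).reshape(-1, 1), np.asarray(val_actuals_mw))
test_preds_calibrated = calib.predict(np.asarray(test_preds_mw).reshape(-1, 1))
print(f"Calibration: y_cal = {calib.coef_[0]:.3f} * y_pred + {calib.intercept_:.1f}")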
11. Hyperparameter Tuning with Optuna¶
Optuna is a modern, efficient hyperparameter optimization framework that differs from traditional grid or random search in several key ways:
- Adaptive, sequential sampling: Rather than exhaustively evaluating every combination on a fixed grid, Optuna uses smart samplers (e.g., TPE) to propose new trials based on past performance, focusing computational effort on the most promising regions of your search space.
- Dynamic search spaces: You can define nested or conditional hyperparameters (e.g., only tune n_heads if d_model is divisible by it), something that grid search cannot handle without combinatorial explosion.
- Integrated pruning: Optuna can stop unpromising trials early via “pruners” (more on this below), saving both time and GPU hours.
In our setup, we focus on four high-impact hyperparameters:
- learning_rate (continuous, log scale) – Controls the optimizer’s step size and directly affects convergence speed and stability.
- d_model (categorical: 64 / 128 / 256) – The width of all embeddings and attention projections; larger values let the model capture richer patterns but come with higher computational cost.
- seq_len (categorical: 168 / 240 / 336) – How many past hourly steps the model “sees”; tuning this balances short-term detail vs. long-term context.
- patch_len (categorical: 8 / 12 / 16 / 24) – The size of each patch in EnEmbedding; smaller patches give finer resolution, while larger patches reduce the number of tokens and speed up attention (see the short arithmetic sketch after this list).
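To make the patching trade-off concrete, the quick arithmetic below shows how patch_len changes the number of endogenous tokens and the width of the flattened prediction head. The head-width formula (d_model times the patch count plus one global token) is inferred from the model summary printed earlier (1024 = 128 × 8 for seq_len=168, patch_len=24), so treat it as an assumption about this implementation.
# Rough patching arithmetic (the head-width formula is inferred from the
# printed model summary, so treat it as an assumption about this implementation).
d_model, seq_len = 128, 168
for patch_len in (8, 12, 16, 24):
    n_patches = seq_len / patch_len
    head_width = d_model * (int(n_patches) + 1)       # patches + 1 global token
    note = "" if seq_len % patch_len == 0 else "  <- seq_len not divisible"
    print(f"patch_len={patch_len:2d}: {n_patches:5.2f} patches, "
          f"head in_features ~ {head_width}{note}")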
To accelerate the search, we employ Optuna’s MedianPruner:
- At each intermediate checkpoint (after every epoch), the pruner compares the trial’s validation loss against the median validation loss of all completed trials at the same step.
- If the current trial is performing worse than that median, it is “pruned” (i.e. halted) early—freeing up resources to explore other configurations.
- This strategy achieves a balance between exploration (letting new trials run long enough to show promise) and exploitation (stopping clearly underperforming ones), dramatically reducing total tuning time compared to fully training every single candidate.
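Conceptually, the pruning decision reduces to a median comparison at each reported step. The snippet below is a simplified sketch of that logic, not Optuna's actual implementation.
# Simplified sketch of the MedianPruner decision (not Optuna's real code):
# a running trial is pruned at a given epoch if its reported validation loss
# is worse than the median of what completed trials reported at that epoch.
import statistics
def should_prune(current_val_loss, completed_losses_at_step, startup_done=True):
    if not startup_done or not completed_losses_at_step:
        return False  # let early trials run to completion
    return current_val_loss > statistics.median(completed_losses_at_step)
print(should_prune(0.095, [0.082, 0.079, 0.088, 0.074, 0.081]))  # True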
import optuna
def objective(trial):
"""Optuna objective function for hyperparameter optimization."""
global train_df, val_df, TARGET, FEATURES, pjme_weather # Access global dataframes/configs if needed, or pass them
global device, target_scaler # Access device and scaler
# --- 1. Suggest Hyperparameters ---
# Use suggested values or fall back to defaults if needed
try:
cfg_learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
cfg_d_model = trial.suggest_categorical('d_model', [64, 128, 256])
cfg_seq_len = trial.suggest_categorical('seq_len', [24*7, 24*10, 24*14]) # 7, 10, 14 days
cfg_patch_len = trial.suggest_categorical('patch_len', [8, 12, 16, 24])
# Fixed hyperparameters (can be defined globally or passed as args)
fixed_params = {
'task_name': 'short_term_forecast',
'features': 'MS',
'pred_len': 24,
'enc_in': pjme_weather.shape[1],
'c_exog': len(FEATURES),
'n_heads': 8, # Keep fixed for now, ensure d_model % n_heads == 0
'e_layers': 2, # Keep fixed for now
'dropout': 0.1, # Keep fixed for now
'activation': 'gelu',
'use_norm': True,
'embed': 'timeF',
'freq': 'h',
'factor': 3,
# Dependent HPs
'd_ff': cfg_d_model * 4 # Often set relative to d_model
}
# Check compatibility
if cfg_d_model % fixed_params['n_heads'] != 0:
# Skip this trial if incompatible
print(f"Skipping trial: d_model={cfg_d_model} not divisible by n_heads={fixed_params['n_heads']}")
raise optuna.exceptions.TrialPruned("Incompatible d_model and n_heads")
# Combine suggested and fixed params
current_configs_dict = {
'learning_rate': cfg_learning_rate,
'd_model': cfg_d_model,
'seq_len': cfg_seq_len,
'patch_len': cfg_patch_len,
**fixed_params # Add fixed parameters
}
configs = Configs(**current_configs_dict)
# --- 2. Recreate Datasets & DataLoaders (Essential for varying seq_len) ---
# Make sure train_df, val_df exist and are scaled
if 'train_df' not in globals() or train_df is None:
print("Error: train_df not found or is None in objective function.")
return float('inf') # Return high value if data is missing
try:
# Use original scaled dataframes
train_dataset = TimeSeriesDataset(train_df, TARGET, pjme_weather.columns.tolist(), configs.seq_len, configs.pred_len, configs.freq)
val_dataset = TimeSeriesDataset(val_df, TARGET, pjme_weather.columns.tolist(), configs.seq_len, configs.pred_len, configs.freq)
# Use a reasonable batch size (can also be tuned)
current_batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=current_batch_size, shuffle=True, num_workers=0)
val_loader = DataLoader(val_dataset, batch_size=current_batch_size, shuffle=False, num_workers=0)
# Handle potential empty dataloaders if seq_len is too large for val_df
if len(val_loader) == 0 or len(train_loader) == 0:
print(f"Warning: DataLoader empty for seq_len={configs.seq_len}. Pruning trial.")
raise optuna.exceptions.TrialPruned("DataLoader empty for this seq_len")
except Exception as e:
print(f"Error creating datasets/dataloaders: {e}. Pruning trial.")
raise optuna.exceptions.TrialPruned(f"Dataset/Dataloader error: {e}")
# --- 3. Initialize Model, Loss, Optimizer ---
model = Model(configs).to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=configs.learning_rate)
# Use a simpler stopping mechanism for Optuna trials or fewer epochs
num_optuna_epochs = 15 # Train for fewer epochs per trial
patience_optuna = 3 # Stricter patience for early stopping
# --- 4. Training & Validation Loop (Condensed) ---
best_val_loss = float('inf')
stopping_counter = 0
for epoch in range(num_optuna_epochs):
model.train()
train_loss_epoch = 0.0
train_batches = 0
for batch in train_loader:
x_enc, x_mark_enc, x_dec, x_mark_dec, y_true = [b.to(device) for b in batch.values()]
optimizer.zero_grad()
outputs = model(x_enc, x_mark_enc, x_dec, x_mark_dec)
loss = criterion(outputs, y_true)
loss.backward()
optimizer.step()
train_loss_epoch += loss.item()
train_batches += 1
avg_train_loss = train_loss_epoch / train_batches if train_batches > 0 else float('inf')
model.eval()
val_loss_epoch = 0.0
val_batches = 0
with torch.no_grad():
for batch in val_loader:
x_enc, x_mark_enc, x_dec, x_mark_dec, y_true = [b.to(device) for b in batch.values()]
outputs = model(x_enc, x_mark_enc, x_dec, x_mark_dec)
loss = criterion(outputs, y_true)
val_loss_epoch += loss.item()
val_batches += 1
avg_val_loss = val_loss_epoch / val_batches if val_batches > 0 else float('inf')
print(f" Trial {trial.number} Epoch {epoch+1}/{num_optuna_epochs} | Train Loss: {avg_train_loss:.6f} | Val Loss: {avg_val_loss:.6f}")
# Update best validation loss
if avg_val_loss < best_val_loss:
best_val_loss = avg_val_loss
stopping_counter = 0 # Reset counter
else:
stopping_counter += 1
# --- Optuna Pruning & Early Stopping ---
trial.report(avg_val_loss, epoch) # Report intermediate value to Optuna
if trial.should_prune():
print(f" Trial {trial.number} pruned at epoch {epoch+1}.")
raise optuna.exceptions.TrialPruned()
# Simple early stopping within the trial
if stopping_counter >= patience_optuna:
print(f" Trial {trial.number} stopped early at epoch {epoch+1}.")
break
# --- 5. Return Metric ---
# Optuna aims to minimize this value
return best_val_loss
except optuna.exceptions.TrialPruned as e:
# Important: Re-raise TrialPruned exceptions
raise e
except Exception as e:
# Handle other errors like CUDA OOM, etc.
print(f"Trial {trial.number} failed with error: {e}")
# Return a large value or NaN to indicate failure
return float('inf')
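One avoidable failure mode shows up in the run below: trials 3 and 4 crash with a matrix-shape error because seq_len=168 is not an integer multiple of patch_len=16, so the flattened head expects a different number of patch tokens than the embedding produces. A guard analogous to the existing d_model/n_heads check would turn such combinations into pruned trials; the lines below are our suggested addition to the objective above, not part of the original code.
# Suggested addition to the objective above (ours, not in the original code):
# prune combinations where seq_len is not a multiple of patch_len, which
# otherwise fail with a matmul shape error (as seen in trials 3 and 4 below).
if cfg_seq_len % cfg_patch_len != 0:
    print(f"Skipping trial: seq_len={cfg_seq_len} not divisible by patch_len={cfg_patch_len}")
    raise optuna.exceptions.TrialPruned("Incompatible seq_len and patch_len")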
if pjme_weather is not None:
# Create Optuna study
# Consider using a pruner to stop unpromising trials early
# MedianPruner is a common choice
study = optuna.create_study(direction='minimize',
pruner=optuna.pruners.MedianPruner(n_startup_trials=5, # Allow first 5 trials to complete
n_warmup_steps=5, # Prune based on first 5 epochs
interval_steps=1)) # Prune every epoch after warmup
# Start optimization
# n_trials: Number of different hyperparameter combinations to test
# Increase this number for a more thorough search (e.g., 50, 100)
# Be mindful of computation time! Each trial trains a model.
n_trials_optuna = 20 # Start with a smaller number (e.g., 20-30) to test
try:
study.optimize(objective, n_trials=n_trials_optuna, timeout=None) # Add timeout in seconds if needed
except KeyboardInterrupt:
print("Optimization stopped manually.")
# --- 5. Get Best Results ---
if study.best_trial:
print("\nOptimization Finished!")
print(f"Number of finished trials: {len(study.trials)}")
print(f"Best trial value (min validation loss): {study.best_value:.6f}")
print("Best trial hyperparameters:")
for key, value in study.best_params.items():
print(f" {key}: {value}")
# Store best params for retraining
best_hyperparams = study.best_params
else:
print("\nOptimization did not complete successfully or no trials finished.")
best_hyperparams = None
# Optional: Visualize optimization history
try:
fig1 = optuna.visualization.plot_optimization_history(study)
fig1.show()
fig2 = optuna.visualization.plot_param_importances(study)
fig2.show()
# fig3 = optuna.visualization.plot_slice(study, params=['learning_rate', 'd_model', 'seq_len', 'patch_len'])
# fig3.show()
except Exception as e:
print(f"Could not generate Optuna plots: {e}. Ensure plotly is installed ('pip install plotly').")
else:
print("Skipping Optuna study due to missing data.")
best_hyperparams = None
[I 2025-04-30 22:25:23,391] A new study created in memory with name: no-name-fd15d67a-1b2b-4e96-965c-f97783154be9
Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 0 Epoch 1/15 | Train Loss: 0.217030 | Val Loss: 0.105313 Trial 0 Epoch 2/15 | Train Loss: 0.166503 | Val Loss: 0.093077 Trial 0 Epoch 3/15 | Train Loss: 0.157144 | Val Loss: 0.087432 Trial 0 Epoch 4/15 | Train Loss: 0.152093 | Val Loss: 0.083669 Trial 0 Epoch 5/15 | Train Loss: 0.147551 | Val Loss: 0.082294 Trial 0 Epoch 6/15 | Train Loss: 0.144979 | Val Loss: 0.082872 Trial 0 Epoch 7/15 | Train Loss: 0.142083 | Val Loss: 0.079418 Trial 0 Epoch 8/15 | Train Loss: 0.140346 | Val Loss: 0.080104 Trial 0 Epoch 9/15 | Train Loss: 0.138132 | Val Loss: 0.077621 Trial 0 Epoch 10/15 | Train Loss: 0.136381 | Val Loss: 0.077658 Trial 0 Epoch 11/15 | Train Loss: 0.134699 | Val Loss: 0.074820 Trial 0 Epoch 12/15 | Train Loss: 0.133442 | Val Loss: 0.076419 Trial 0 Epoch 13/15 | Train Loss: 0.132338 | Val Loss: 0.077062 Trial 0 Epoch 14/15 | Train Loss: 0.131231 | Val Loss: 0.073753
[I 2025-04-30 22:47:28,791] Trial 0 finished with value: 0.07375267870016754 and parameters: {'learning_rate': 6.341291657814797e-05, 'd_model': 64, 'seq_len': 336, 'patch_len': 24}. Best is trial 0 with value: 0.07375267870016754.
Trial 0 Epoch 15/15 | Train Loss: 0.130404 | Val Loss: 0.075435 Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 1 Epoch 1/15 | Train Loss: 0.198532 | Val Loss: 0.097690 Trial 1 Epoch 2/15 | Train Loss: 0.159336 | Val Loss: 0.088491 Trial 1 Epoch 3/15 | Train Loss: 0.152304 | Val Loss: 0.087903 Trial 1 Epoch 4/15 | Train Loss: 0.147917 | Val Loss: 0.082525 Trial 1 Epoch 5/15 | Train Loss: 0.143585 | Val Loss: 0.084831 Trial 1 Epoch 6/15 | Train Loss: 0.139828 | Val Loss: 0.081442 Trial 1 Epoch 7/15 | Train Loss: 0.136908 | Val Loss: 0.079589 Trial 1 Epoch 8/15 | Train Loss: 0.133869 | Val Loss: 0.076668 Trial 1 Epoch 9/15 | Train Loss: 0.131673 | Val Loss: 0.078508 Trial 1 Epoch 10/15 | Train Loss: 0.129862 | Val Loss: 0.074957 Trial 1 Epoch 11/15 | Train Loss: 0.127300 | Val Loss: 0.076290 Trial 1 Epoch 12/15 | Train Loss: 0.126343 | Val Loss: 0.077393
[I 2025-04-30 23:09:15,026] Trial 1 finished with value: 0.0749570621807764 and parameters: {'learning_rate': 1.720702918879371e-05, 'd_model': 256, 'seq_len': 336, 'patch_len': 8}. Best is trial 0 with value: 0.07375267870016754.
Trial 1 Epoch 13/15 | Train Loss: 0.124359 | Val Loss: 0.084127 Trial 1 stopped early at epoch 13. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 2 Epoch 1/15 | Train Loss: 0.170220 | Val Loss: 0.089017 Trial 2 Epoch 2/15 | Train Loss: 0.140639 | Val Loss: 0.075147 Trial 2 Epoch 3/15 | Train Loss: 0.133560 | Val Loss: 0.076382 Trial 2 Epoch 4/15 | Train Loss: 0.128867 | Val Loss: 0.071336 Trial 2 Epoch 5/15 | Train Loss: 0.125996 | Val Loss: 0.071917 Trial 2 Epoch 6/15 | Train Loss: 0.122622 | Val Loss: 0.079919
[I 2025-04-30 23:18:42,101] Trial 2 finished with value: 0.07133583417560489 and parameters: {'learning_rate': 0.00015954349007674337, 'd_model': 128, 'seq_len': 168, 'patch_len': 12}. Best is trial 2 with value: 0.07133583417560489. [I 2025-04-30 23:18:42,215] Trial 3 finished with value: inf and parameters: {'learning_rate': 0.00013597006517053505, 'd_model': 256, 'seq_len': 168, 'patch_len': 16}. Best is trial 2 with value: 0.07133583417560489.
Trial 2 Epoch 7/15 | Train Loss: 0.119697 | Val Loss: 0.080204 Trial 2 stopped early at epoch 7. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 3 failed with error: mat1 and mat2 shapes cannot be multiplied (32x2816 and 3072x24) Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last.
[I 2025-04-30 23:18:42,323] Trial 4 finished with value: inf and parameters: {'learning_rate': 1.791413337548682e-05, 'd_model': 256, 'seq_len': 168, 'patch_len': 16}. Best is trial 2 with value: 0.07133583417560489.
Trial 4 failed with error: mat1 and mat2 shapes cannot be multiplied (32x2816 and 3072x24) Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 5 Epoch 1/15 | Train Loss: 0.205242 | Val Loss: 0.099044 Trial 5 Epoch 2/15 | Train Loss: 0.161388 | Val Loss: 0.092211 Trial 5 Epoch 3/15 | Train Loss: 0.153695 | Val Loss: 0.087069 Trial 5 Epoch 4/15 | Train Loss: 0.149091 | Val Loss: 0.080829 Trial 5 Epoch 5/15 | Train Loss: 0.145358 | Val Loss: 0.082344 Trial 5 Epoch 6/15 | Train Loss: 0.142136 | Val Loss: 0.081631 Trial 5 Epoch 7/15 | Train Loss: 0.138700 | Val Loss: 0.076245 Trial 5 Epoch 8/15 | Train Loss: 0.136049 | Val Loss: 0.076145 Trial 5 Epoch 9/15 | Train Loss: 0.134296 | Val Loss: 0.073640 Trial 5 Epoch 10/15 | Train Loss: 0.132466 | Val Loss: 0.078079 Trial 5 Epoch 11/15 | Train Loss: 0.130479 | Val Loss: 0.081748
[I 2025-04-30 23:35:41,595] Trial 5 finished with value: 0.07364010569905916 and parameters: {'learning_rate': 1.6082486125805137e-05, 'd_model': 256, 'seq_len': 336, 'patch_len': 12}. Best is trial 2 with value: 0.07133583417560489.
Trial 5 Epoch 12/15 | Train Loss: 0.128479 | Val Loss: 0.074652 Trial 5 stopped early at epoch 12. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 6 Epoch 1/15 | Train Loss: 0.233103 | Val Loss: 0.111835 Trial 6 Epoch 2/15 | Train Loss: 0.172858 | Val Loss: 0.095762 Trial 6 Epoch 3/15 | Train Loss: 0.162869 | Val Loss: 0.093020 Trial 6 Epoch 4/15 | Train Loss: 0.157885 | Val Loss: 0.091139 Trial 6 Epoch 5/15 | Train Loss: 0.153701 | Val Loss: 0.089646
[I 2025-04-30 23:43:59,677] Trial 6 pruned.
Trial 6 Epoch 6/15 | Train Loss: 0.151395 | Val Loss: 0.086596 Trial 6 pruned at epoch 6. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 7 Epoch 1/15 | Train Loss: 0.168159 | Val Loss: 0.086578 Trial 7 Epoch 2/15 | Train Loss: 0.136736 | Val Loss: 0.074795 Trial 7 Epoch 3/15 | Train Loss: 0.126888 | Val Loss: 0.091021 Trial 7 Epoch 4/15 | Train Loss: 0.117699 | Val Loss: 0.087926
[I 2025-04-30 23:51:41,345] Trial 7 finished with value: 0.07479503777890037 and parameters: {'learning_rate': 0.00031245023531273905, 'd_model': 256, 'seq_len': 336, 'patch_len': 8}. Best is trial 2 with value: 0.07133583417560489.
Trial 7 Epoch 5/15 | Train Loss: 0.109179 | Val Loss: 0.096866 Trial 7 stopped early at epoch 5. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 8 Epoch 1/15 | Train Loss: 0.286276 | Val Loss: 0.140365 Trial 8 Epoch 2/15 | Train Loss: 0.196941 | Val Loss: 0.114998 Trial 8 Epoch 3/15 | Train Loss: 0.180577 | Val Loss: 0.107400 Trial 8 Epoch 4/15 | Train Loss: 0.171931 | Val Loss: 0.100810 Trial 8 Epoch 5/15 | Train Loss: 0.167407 | Val Loss: 0.097292
[I 2025-04-30 23:59:40,632] Trial 8 pruned.
Trial 8 Epoch 6/15 | Train Loss: 0.162764 | Val Loss: 0.095328 Trial 8 pruned at epoch 6. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 9 Epoch 1/15 | Train Loss: 0.177636 | Val Loss: 0.105077 Trial 9 Epoch 2/15 | Train Loss: 0.147343 | Val Loss: 0.082209 Trial 9 Epoch 3/15 | Train Loss: 0.139580 | Val Loss: 0.079834 Trial 9 Epoch 4/15 | Train Loss: 0.134609 | Val Loss: 0.078326 Trial 9 Epoch 5/15 | Train Loss: 0.131337 | Val Loss: 0.075658 Trial 9 Epoch 6/15 | Train Loss: 0.128551 | Val Loss: 0.077714 Trial 9 Epoch 7/15 | Train Loss: 0.126433 | Val Loss: 0.071876 Trial 9 Epoch 8/15 | Train Loss: 0.123921 | Val Loss: 0.075029 Trial 9 Epoch 9/15 | Train Loss: 0.121844 | Val Loss: 0.077695
[I 2025-05-01 00:11:46,776] Trial 9 finished with value: 0.07187600426824857 and parameters: {'learning_rate': 0.00024168022291156543, 'd_model': 64, 'seq_len': 240, 'patch_len': 24}. Best is trial 2 with value: 0.07133583417560489.
Trial 9 Epoch 10/15 | Train Loss: 0.120445 | Val Loss: 0.083428 Trial 9 stopped early at epoch 10. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 10 Epoch 1/15 | Train Loss: 0.162128 | Val Loss: 0.079997 Trial 10 Epoch 2/15 | Train Loss: 0.137483 | Val Loss: 0.077912 Trial 10 Epoch 3/15 | Train Loss: 0.131355 | Val Loss: 0.078057 Trial 10 Epoch 4/15 | Train Loss: 0.125778 | Val Loss: 0.082622
[I 2025-05-01 00:17:56,654] Trial 10 finished with value: 0.07791236970520318 and parameters: {'learning_rate': 0.0009450947081291244, 'd_model': 128, 'seq_len': 240, 'patch_len': 12}. Best is trial 2 with value: 0.07133583417560489.
Trial 10 Epoch 5/15 | Train Loss: 0.121640 | Val Loss: 0.095089 Trial 10 stopped early at epoch 5. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 11 Epoch 1/15 | Train Loss: 0.182188 | Val Loss: 0.089412 Trial 11 Epoch 2/15 | Train Loss: 0.148267 | Val Loss: 0.082619 Trial 11 Epoch 3/15 | Train Loss: 0.139735 | Val Loss: 0.080502 Trial 11 Epoch 4/15 | Train Loss: 0.135052 | Val Loss: 0.074031 Trial 11 Epoch 5/15 | Train Loss: 0.131240 | Val Loss: 0.071659 Trial 11 Epoch 6/15 | Train Loss: 0.128803 | Val Loss: 0.071719 Trial 11 Epoch 7/15 | Train Loss: 0.127040 | Val Loss: 0.072205
[I 2025-05-01 00:28:28,316] Trial 11 finished with value: 0.07165894482555957 and parameters: {'learning_rate': 0.0002155736769169591, 'd_model': 64, 'seq_len': 240, 'patch_len': 24}. Best is trial 2 with value: 0.07133583417560489.
Trial 11 Epoch 8/15 | Train Loss: 0.125402 | Val Loss: 0.075571 Trial 11 stopped early at epoch 8. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 12 Epoch 1/15 | Train Loss: 0.189130 | Val Loss: 0.096687 Trial 12 Epoch 2/15 | Train Loss: 0.154851 | Val Loss: 0.090245 Trial 12 Epoch 3/15 | Train Loss: 0.145509 | Val Loss: 0.078062 Trial 12 Epoch 4/15 | Train Loss: 0.139179 | Val Loss: 0.074279 Trial 12 Epoch 5/15 | Train Loss: 0.136055 | Val Loss: 0.076238 Trial 12 Epoch 6/15 | Train Loss: 0.132167 | Val Loss: 0.078876 Trial 12 Epoch 7/15 | Train Loss: 0.130268 | Val Loss: 0.073849 Trial 12 Epoch 8/15 | Train Loss: 0.128431 | Val Loss: 0.073856 Trial 12 Epoch 9/15 | Train Loss: 0.126351 | Val Loss: 0.076988
[I 2025-05-01 00:42:26,678] Trial 12 finished with value: 0.0738487767310133 and parameters: {'learning_rate': 6.203308845293041e-05, 'd_model': 128, 'seq_len': 240, 'patch_len': 12}. Best is trial 2 with value: 0.07133583417560489.
Trial 12 Epoch 10/15 | Train Loss: 0.123894 | Val Loss: 0.076027 Trial 12 stopped early at epoch 10. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 13 Epoch 1/15 | Train Loss: 0.160815 | Val Loss: 0.082948 Trial 13 Epoch 2/15 | Train Loss: 0.134985 | Val Loss: 0.081752 Trial 13 Epoch 3/15 | Train Loss: 0.126029 | Val Loss: 0.085098 Trial 13 Epoch 4/15 | Train Loss: 0.120106 | Val Loss: 0.079287 Trial 13 Epoch 5/15 | Train Loss: 0.113961 | Val Loss: 0.087723 Trial 13 Epoch 6/15 | Train Loss: 0.109125 | Val Loss: 0.087211
[I 2025-05-01 00:51:32,023] Trial 13 pruned.
Trial 13 Epoch 7/15 | Train Loss: 0.104063 | Val Loss: 0.087541 Trial 13 pruned at epoch 7. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 14 Epoch 1/15 | Train Loss: 0.187422 | Val Loss: 0.093800 Trial 14 Epoch 2/15 | Train Loss: 0.153150 | Val Loss: 0.084173 Trial 14 Epoch 3/15 | Train Loss: 0.143894 | Val Loss: 0.085084 Trial 14 Epoch 4/15 | Train Loss: 0.138855 | Val Loss: 0.079369 Trial 14 Epoch 5/15 | Train Loss: 0.134907 | Val Loss: 0.079042 Trial 14 Epoch 6/15 | Train Loss: 0.132797 | Val Loss: 0.071883 Trial 14 Epoch 7/15 | Train Loss: 0.130519 | Val Loss: 0.074876 Trial 14 Epoch 8/15 | Train Loss: 0.128328 | Val Loss: 0.073569
[I 2025-05-01 01:03:32,338] Trial 14 finished with value: 0.07188307942773908 and parameters: {'learning_rate': 0.00012480547675824255, 'd_model': 64, 'seq_len': 168, 'patch_len': 12}. Best is trial 2 with value: 0.07133583417560489.
Trial 14 Epoch 9/15 | Train Loss: 0.127702 | Val Loss: 0.072702 Trial 14 stopped early at epoch 9. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 15 Epoch 1/15 | Train Loss: 0.190396 | Val Loss: 0.091506 Trial 15 Epoch 2/15 | Train Loss: 0.154860 | Val Loss: 0.088904 Trial 15 Epoch 3/15 | Train Loss: 0.146340 | Val Loss: 0.080972 Trial 15 Epoch 4/15 | Train Loss: 0.140094 | Val Loss: 0.080357 Trial 15 Epoch 5/15 | Train Loss: 0.136314 | Val Loss: 0.077312 Trial 15 Epoch 6/15 | Train Loss: 0.132801 | Val Loss: 0.076494 Trial 15 Epoch 7/15 | Train Loss: 0.130643 | Val Loss: 0.073950 Trial 15 Epoch 8/15 | Train Loss: 0.129136 | Val Loss: 0.077303 Trial 15 Epoch 9/15 | Train Loss: 0.126655 | Val Loss: 0.070565 Trial 15 Epoch 10/15 | Train Loss: 0.125462 | Val Loss: 0.074086 Trial 15 Epoch 11/15 | Train Loss: 0.124415 | Val Loss: 0.075793
[I 2025-05-01 01:18:52,587] Trial 15 finished with value: 0.07056483737835416 and parameters: {'learning_rate': 5.27710700948135e-05, 'd_model': 128, 'seq_len': 240, 'patch_len': 12}. Best is trial 15 with value: 0.07056483737835416.
Trial 15 Epoch 12/15 | Train Loss: 0.122864 | Val Loss: 0.074947 Trial 15 stopped early at epoch 12. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 16 Epoch 1/15 | Train Loss: 0.198032 | Val Loss: 0.098984 Trial 16 Epoch 2/15 | Train Loss: 0.157285 | Val Loss: 0.088364 Trial 16 Epoch 3/15 | Train Loss: 0.148938 | Val Loss: 0.084494 Trial 16 Epoch 4/15 | Train Loss: 0.143787 | Val Loss: 0.081376 Trial 16 Epoch 5/15 | Train Loss: 0.140250 | Val Loss: 0.081572 Trial 16 Epoch 6/15 | Train Loss: 0.137182 | Val Loss: 0.078261
[I 2025-05-01 01:27:29,815] Trial 16 pruned.
Trial 16 Epoch 7/15 | Train Loss: 0.134607 | Val Loss: 0.078029 Trial 16 pruned at epoch 7. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 17 Epoch 1/15 | Train Loss: 0.202073 | Val Loss: 0.100184 Trial 17 Epoch 2/15 | Train Loss: 0.161606 | Val Loss: 0.092835 Trial 17 Epoch 3/15 | Train Loss: 0.153531 | Val Loss: 0.087507 Trial 17 Epoch 4/15 | Train Loss: 0.146929 | Val Loss: 0.081228 Trial 17 Epoch 5/15 | Train Loss: 0.142306 | Val Loss: 0.079445 Trial 17 Epoch 6/15 | Train Loss: 0.139078 | Val Loss: 0.078415 Trial 17 Epoch 7/15 | Train Loss: 0.135589 | Val Loss: 0.074867 Trial 17 Epoch 8/15 | Train Loss: 0.133140 | Val Loss: 0.074816 Trial 17 Epoch 9/15 | Train Loss: 0.131480 | Val Loss: 0.081162 Trial 17 Epoch 10/15 | Train Loss: 0.129618 | Val Loss: 0.071886 Trial 17 Epoch 11/15 | Train Loss: 0.128850 | Val Loss: 0.074443 Trial 17 Epoch 12/15 | Train Loss: 0.126891 | Val Loss: 0.072666
[I 2025-05-01 01:43:44,619] Trial 17 finished with value: 0.07188633667540721 and parameters: {'learning_rate': 3.66572108863409e-05, 'd_model': 128, 'seq_len': 240, 'patch_len': 12}. Best is trial 15 with value: 0.07056483737835416.
Trial 17 Epoch 13/15 | Train Loss: 0.126043 | Val Loss: 0.073819 Trial 17 stopped early at epoch 13. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 18 Epoch 1/15 | Train Loss: 0.179379 | Val Loss: 0.090640 Trial 18 Epoch 2/15 | Train Loss: 0.149656 | Val Loss: 0.077207 Trial 18 Epoch 3/15 | Train Loss: 0.141371 | Val Loss: 0.079171 Trial 18 Epoch 4/15 | Train Loss: 0.135509 | Val Loss: 0.074513 Trial 18 Epoch 5/15 | Train Loss: 0.131553 | Val Loss: 0.075282 Trial 18 Epoch 6/15 | Train Loss: 0.128919 | Val Loss: 0.073538 Trial 18 Epoch 7/15 | Train Loss: 0.126779 | Val Loss: 0.069181 Trial 18 Epoch 8/15 | Train Loss: 0.124703 | Val Loss: 0.079108 Trial 18 Epoch 9/15 | Train Loss: 0.123207 | Val Loss: 0.079796
[I 2025-05-01 01:56:40,466] Trial 18 finished with value: 0.06918144787135236 and parameters: {'learning_rate': 7.944348310121449e-05, 'd_model': 128, 'seq_len': 168, 'patch_len': 12}. Best is trial 18 with value: 0.06918144787135236.
Trial 18 Epoch 10/15 | Train Loss: 0.121458 | Val Loss: 0.071176 Trial 18 stopped early at epoch 10. Reordered columns. Target 'PJME_MW' is now last. Reordered columns. Target 'PJME_MW' is now last. Trial 19 Epoch 1/15 | Train Loss: 0.179243 | Val Loss: 0.087415 Trial 19 Epoch 2/15 | Train Loss: 0.148350 | Val Loss: 0.081714 Trial 19 Epoch 3/15 | Train Loss: 0.140637 | Val Loss: 0.081502 Trial 19 Epoch 4/15 | Train Loss: 0.135071 | Val Loss: 0.079963 Trial 19 Epoch 5/15 | Train Loss: 0.131675 | Val Loss: 0.077892 Trial 19 Epoch 6/15 | Train Loss: 0.129098 | Val Loss: 0.074014 Trial 19 Epoch 7/15 | Train Loss: 0.127219 | Val Loss: 0.076046 Trial 19 Epoch 8/15 | Train Loss: 0.124946 | Val Loss: 0.070340 Trial 19 Epoch 9/15 | Train Loss: 0.123624 | Val Loss: 0.073225 Trial 19 Epoch 10/15 | Train Loss: 0.121641 | Val Loss: 0.077458
[I 2025-05-01 02:10:51,207] Trial 19 finished with value: 0.07034039436817399 and parameters: {'learning_rate': 7.474771034226343e-05, 'd_model': 128, 'seq_len': 168, 'patch_len': 12}. Best is trial 18 with value: 0.06918144787135236.
Trial 19 Epoch 11/15 | Train Loss: 0.120694 | Val Loss: 0.075269 Trial 19 stopped early at epoch 11. Optimization Finished! Number of finished trials: 20 Best trial value (min validation loss): 0.069181 Best trial hyperparameters: learning_rate: 7.944348310121449e-05 d_model: 128 seq_len: 168 patch_len: 12 Could not generate Optuna plots: Tried to import 'plotly' but failed. Please make sure that the package is installed correctly to use this feature. Actual error: No module named 'plotly'.. Ensure plotly is installed ('pip install plotly').
import matplotlib.pyplot as plt
from optuna.visualization.matplotlib import (
plot_optimization_history as mplt_opt_hist,
plot_param_importances as mplt_param_imp,
)
# History
ax1 = mplt_opt_hist(study)
ax1.set_title("Optuna Optimization History")
plt.show()
# Importances
ax2 = mplt_param_imp(study)
# ax2.set_title("Optuna Parameter Importances")
plt.show()
[W 2025-05-01 08:15:04,804] Trial 3 is omitted in visualization because its objective value is inf or nan. [W 2025-05-01 08:15:04,804] Trial 4 is omitted in visualization because its objective value is inf or nan.
Optuna Optimization History
- The blue dots show each trial’s validation loss (objective value), and the red line traces the best‐so‐far.
- Rapid early gains: By Trial 2 the best loss had already dropped from ~0.074 to ~0.071, reflecting that even a modest search quickly found much better configurations than the defaults.
- Plateau period: From Trials 2–14 the best-so-far loss stayed flat at ~0.0713, indicating diminishing returns in that region of hyperparameter space.
- Final improvements: Around Trial 15, a new suggestion lowered the best loss to ~0.0706, and by Trial 18 we achieved the overall minimum of ~0.0692.
- Pruning effect: Underperforming trials were stopped early (thanks to the MedianPruner), so you see fewer full-duration points among the higher-loss trials—saving compute on clearly suboptimal settings.
Hyperparameter Importances
Optuna’s built-in importance analysis assigns each hyperparameter a “weight” reflecting how much variation in validation loss it explained:
- Learning rate (0.51): By far the most influential—adjusting the optimizer’s step size accounted for over half of the mapped performance gains.
- Sequence length (0.39): The number of past hours fed into the model was the next most critical factor. Longer or shorter contexts materially shifted accuracy.
- Model dimension (0.05) & Patch length (0.05): Within the ranges we tested, these architectural choices had only a small effect on validation loss.
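The same numbers can be pulled programmatically (using Optuna's default importance evaluator), which is handy for logging them alongside the study.
# Retrieve the numeric importances behind the plot (Optuna's default evaluator).
import optuna
importances = optuna.importance.get_param_importances(study)
for name, weight in importances.items():
    print(f"{name}: {weight:.2f}")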
Takeaway: Future tuning efforts should concentrate primarily on finer sweeps of learning rate and sequence length (perhaps even conditional schedules), while d_model and patch_len can often be fixed at their current near-optimal values for efficiency.
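A follow-up study could therefore fix d_model and patch_len at their near-optimal values and sweep only the two dominant parameters. Below is a minimal sketch of such a narrowed objective; run_trial is a hypothetical helper standing in for the train/validate logic of the objective above.
# Sketch of a narrowed follow-up search: fix d_model / patch_len and sweep only
# learning_rate and seq_len. `run_trial(configs_dict, lr)` is a hypothetical
# helper standing in for the train/validate logic of the objective above.
def narrowed_objective(trial):
    lr = trial.suggest_float("learning_rate", 2e-5, 3e-4, log=True)
    seq_len = trial.suggest_categorical("seq_len", [24 * 7, 24 * 10, 24 * 14])
    configs_dict = {"d_model": 128, "patch_len": 12, "d_ff": 512,
                    "seq_len": seq_len, "pred_len": 24}
    return run_trial(configs_dict, lr)  # best validation loss for this trial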
12. Final Retraining with Best Hyperparameters¶
- Retrieve the best trial’s parameters
- Rebuild the model and DataLoaders with those settings
- Train to convergence with early stopping
- Evaluate final model on the test set and report metrics
# --- Save the best hyperparameters found by Optuna ---
# Check if the study ran successfully and found best parameters
if 'best_hyperparams' in locals() and best_hyperparams is not None:
# Define a filename for storing the parameters
params_filename = "timexer_best_hyperparams.json"
# Save the dictionary to a JSON file
try:
with open(params_filename, 'w') as f:
# Use indent for readability
json.dump(best_hyperparams, f, indent=4)
print(f"Best hyperparameters successfully saved to {params_filename}")
print("Saved parameters:")
# Pretty print the parameters that were saved
print(json.dumps(best_hyperparams, indent=4))
except Exception as e:
print(f"Error saving best hyperparameters to {params_filename}: {e}")
elif 'study' in locals() and not study.best_trial:
print("Optuna study finished, but no best trial found (e.g., all trials failed or were pruned). Cannot save parameters.")
else:
print("Optuna study did not run or `best_hyperparams` variable not found. Cannot save parameters.")
Best hyperparameters successfully saved to timexer_best_hyperparams.json Saved parameters: { "learning_rate": 7.944348310121449e-05, "d_model": 128, "seq_len": 168, "patch_len": 12 }
# --- Load the best hyperparameters from the saved file ---
# Define the filename where parameters were saved
params_filename = "timexer_best_hyperparams.json"
loaded_best_params = None # Initialize variable
try:
with open(params_filename, 'r') as f:
loaded_best_params = json.load(f)
print(f"Successfully loaded hyperparameters from {params_filename}:")
# Pretty print the loaded parameters
print(json.dumps(loaded_best_params, indent=4))
# Now you can use the 'loaded_best_params' dictionary, for example:
# best_hyperparams = loaded_best_params # Assign it back if needed for subsequent cells
# Or use it directly in the final model configuration
except FileNotFoundError:
print(f"Error: Could not find the parameter file {params_filename}.")
print("Please ensure the Optuna study ran and saved the parameters, or check the filename.")
except Exception as e:
print(f"Error loading hyperparameters from {params_filename}: {e}")
# Example of assigning it back for use in the next cell (Retraining)
if loaded_best_params is not None:
best_hyperparams = loaded_best_params
else:
# Handle the case where loading failed, maybe fall back to defaults or stop
print("WARNING: Using default or potentially undefined 'best_hyperparams' as loading failed.")
# best_hyperparams = default_params # Define default_params if needed
pass # Or raise an error
Successfully loaded hyperparameters from timexer_best_hyperparams.json: { "learning_rate": 7.944348310121449e-05, "d_model": 128, "seq_len": 168, "patch_len": 12 }
if best_hyperparams is not None and pjme_weather is not None:
print("\nRetraining final model using best hyperparameters found by Optuna...")
# --- Setup with best params ---
final_configs_dict = {
# Fixed parameters (repeat from objective or load from saved config)
'task_name': 'short_term_forecast', 'features': 'MS', 'pred_len': 24,
'enc_in': pjme_weather.shape[1], 'c_exog': len(FEATURES), 'n_heads': 8,
'e_layers': 2, 'dropout': 0.1, 'activation': 'gelu',
'use_norm': True, 'embed': 'timeF', 'freq': 'h', 'factor': 3,
# Best hyperparameters from Optuna study
**best_hyperparams,
# Derived parameter - ensure d_ff matches best d_model
'd_ff': best_hyperparams['d_model'] * 4
}
# Ensure n_heads compatibility if it wasn't tuned
if final_configs_dict['d_model'] % final_configs_dict['n_heads'] != 0:
print(f"Warning: Best d_model ({final_configs_dict['d_model']}) not divisible by n_heads ({final_configs_dict['n_heads']}). Adjusting n_heads or choose different params.")
# Example: Find largest divisor <= 8
# best_n_heads = max(h for h in [1, 2, 4, 8] if final_configs_dict['d_model'] % h == 0)
# final_configs_dict['n_heads'] = best_n_heads
# Or simply report error and stop
raise ValueError("Incompatible best d_model and fixed n_heads")
final_configs = Configs(**final_configs_dict)
print("\nFinal Model Configuration:")
for key, value in final_configs_dict.items():
# Skip printing learning_rate here as it's used for optimizer, not model init
if key != 'learning_rate':
print(f"- {key}: {getattr(final_configs, key, 'N/A')}")
# Recreate datasets with the best seq_len
print(f"Recreating datasets with seq_len = {final_configs.seq_len}...")
final_train_dataset = TimeSeriesDataset(train_df, TARGET, pjme_weather.columns.tolist(), final_configs.seq_len, final_configs.pred_len, final_configs.freq)
final_val_dataset = TimeSeriesDataset(val_df, TARGET, pjme_weather.columns.tolist(), final_configs.seq_len, final_configs.pred_len, final_configs.freq)
final_test_dataset = TimeSeriesDataset(test_df, TARGET, pjme_weather.columns.tolist(), final_configs.seq_len, final_configs.pred_len, final_configs.freq)
# Use the same batch size as before, or potentially the one used in Optuna trial
final_batch_size = 32
final_train_loader = DataLoader(final_train_dataset, batch_size=final_batch_size, shuffle=True, num_workers=0)
final_val_loader = DataLoader(final_val_dataset, batch_size=final_batch_size, shuffle=False, num_workers=0)
final_test_loader = DataLoader(final_test_dataset, batch_size=final_batch_size, shuffle=False, num_workers=0)
print(f"Final Datasets created: Train={len(final_train_dataset)}, Val={len(final_val_dataset)}, Test={len(final_test_dataset)}")
# --- Initialize and Train Final Model ---
final_model = Model(final_configs).to(device)
final_criterion = nn.MSELoss()
final_optimizer = optim.Adam(final_model.parameters(), lr=best_hyperparams['learning_rate'])
# Use original training settings (epochs, patience)
final_num_epochs = 50 # Original number of epochs
final_patience = 5 # Original patience
final_early_stopping = EarlyStopping(patience=final_patience, verbose=True, path='timexer_final_best_model.pt')
print(f"\nStarting Final Training with {final_num_epochs} epochs and patience {final_patience}...")
final_train_losses = []
final_val_losses = []
final_start_time = time.time()
# --- Re-run the FULL Training Loop (copy/paste or refactor into a function) ---
for epoch in range(final_num_epochs):
epoch_start_time = time.time()
final_model.train()
running_loss = 0.0
batch_count = 0
for i, batch in enumerate(final_train_loader):
x_enc, x_mark_enc, x_dec, x_mark_dec, y_true = [b.to(device) for b in batch.values()]
final_optimizer.zero_grad()
outputs = final_model(x_enc, x_mark_enc, x_dec, x_mark_dec)
loss = final_criterion(outputs, y_true)
loss.backward()
final_optimizer.step()
running_loss += loss.item()
batch_count += 1
avg_train_loss = running_loss / batch_count if batch_count > 0 else float('inf')
final_train_losses.append(avg_train_loss)
# Validation
final_model.eval()
running_val_loss = 0.0
val_batch_count = 0
with torch.no_grad():
for batch in final_val_loader:
x_enc, x_mark_enc, x_dec, x_mark_dec, y_true = [b.to(device) for b in batch.values()]
outputs = final_model(x_enc, x_mark_enc, x_dec, x_mark_dec)
loss = final_criterion(outputs, y_true)
running_val_loss += loss.item()
val_batch_count += 1
avg_val_loss = running_val_loss / val_batch_count if val_batch_count > 0 else float('inf')
final_val_losses.append(avg_val_loss)
epoch_time = time.time() - epoch_start_time
print(f'Final Training Epoch {epoch+1}/{final_num_epochs}, Train Loss: {avg_train_loss:.6f}, Val Loss: {avg_val_loss:.6f}, Time: {epoch_time:.2f}s')
# Early Stopping
final_early_stopping(avg_val_loss, final_model)
if final_early_stopping.early_stop:
print("Final training early stopping triggered.")
break
# --- End of Training Loop ---
total_final_training_time = time.time() - final_start_time
print(f'\nFinal Training Finished. Total time: {total_final_training_time:.2f}s')
# Load best model from final training
print("Loading best model from final training...")
try:
final_model.load_state_dict(torch.load('timexer_final_best_model.pt'))
print("Best final model loaded successfully.")
except Exception as e:
print(f"Warning: Could not load final best model checkpoint: {e}. Using model from last epoch.")
# --- Final Evaluation on Test Set ---
print("\nEvaluating final model on Test Set...")
final_model.eval()
final_predictions = []
final_actuals = []
with torch.no_grad():
for batch in final_test_loader:
x_enc, x_mark_enc, x_dec, x_mark_dec, y_true = [b.to(device) for b in batch.values()]
outputs = final_model(x_enc, x_mark_enc, x_dec, x_mark_dec)
preds_inv = target_scaler.inverse_transform(outputs.cpu().numpy().reshape(-1, 1))
actuals_inv = target_scaler.inverse_transform(y_true.cpu().numpy().reshape(-1, 1))
final_predictions.extend(preds_inv.flatten().tolist())
final_actuals.extend(actuals_inv.flatten().tolist())
final_predictions = np.array(final_predictions)
final_actuals = np.array(final_actuals)
if len(final_predictions) == len(final_actuals) and len(final_predictions) > 0:
final_rmse = np.sqrt(mean_squared_error(final_actuals, final_predictions))
final_mae = mean_absolute_error(final_actuals, final_predictions)
final_mape = mean_absolute_percentage_error(final_actuals, final_predictions)
print("\nFinal Model Evaluation Metrics on Test Set:")
print(f"RMSE: {final_rmse:.4f}")
print(f"MAE: {final_mae:.4f}")
print(f"MAPE: {final_mape:.4f}%")
else:
print("Error calculating final metrics.")
# (Optional) Add final visualization code here if needed
else:
print("\nSkipping final model retraining as best hyperparameters were not found or data is missing.")
Retraining final model using best hyperparameters found by Optuna...

Final Model Configuration:
- task_name: short_term_forecast
- features: MS
- pred_len: 24
- enc_in: 14
- c_exog: 13
- n_heads: 8
- e_layers: 2
- dropout: 0.1
- activation: gelu
- use_norm: True
- embed: timeF
- freq: h
- factor: 3
- d_model: 128
- seq_len: 168
- patch_len: 12
- d_ff: 512
Recreating datasets with seq_len = 168...
Reordered columns. Target 'PJME_MW' is now last.
Reordered columns. Target 'PJME_MW' is now last.
Reordered columns. Target 'PJME_MW' is now last.
Final Datasets created: Train=109095, Val=13470, Test=13470

Starting Final Training with 50 epochs and patience 5...
Final Training Epoch 1/50, Train Loss: 0.179199, Val Loss: 0.090186, Time: 79.89s
Validation loss decreased (inf --> 0.090186). Saving model ...
Final Training Epoch 2/50, Train Loss: 0.148138, Val Loss: 0.084014, Time: 84.02s
Validation loss decreased (0.090186 --> 0.084014). Saving model ...
Final Training Epoch 3/50, Train Loss: 0.138947, Val Loss: 0.085302, Time: 82.61s
EarlyStopping counter: 1 out of 5
Final Training Epoch 4/50, Train Loss: 0.133927, Val Loss: 0.080707, Time: 78.12s
Validation loss decreased (0.084014 --> 0.080707). Saving model ...
Final Training Epoch 5/50, Train Loss: 0.130665, Val Loss: 0.079094, Time: 72.91s
Validation loss decreased (0.080707 --> 0.079094). Saving model ...
Final Training Epoch 6/50, Train Loss: 0.128490, Val Loss: 0.069685, Time: 71.45s
Validation loss decreased (0.079094 --> 0.069685). Saving model ...
Final Training Epoch 7/50, Train Loss: 0.126014, Val Loss: 0.074268, Time: 76.96s
EarlyStopping counter: 1 out of 5
Final Training Epoch 8/50, Train Loss: 0.124683, Val Loss: 0.071808, Time: 74.11s
EarlyStopping counter: 2 out of 5
Final Training Epoch 9/50, Train Loss: 0.122875, Val Loss: 0.077030, Time: 74.97s
EarlyStopping counter: 3 out of 5
Final Training Epoch 10/50, Train Loss: 0.121887, Val Loss: 0.074584, Time: 75.89s
EarlyStopping counter: 4 out of 5
Final Training Epoch 11/50, Train Loss: 0.120177, Val Loss: 0.072910, Time: 74.66s
EarlyStopping counter: 5 out of 5
Final training early stopping triggered.

Final Training Finished. Total time: 847.11s
Loading best model from final training...
Best final model loaded successfully.

Evaluating final model on Test Set...

Final Model Evaluation Metrics on Test Set:
RMSE: 1951.3168
MAE: 1362.5760
MAPE: 4.2847%
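Before moving on to the comparison, note that the train/validation loop above duplicates the earlier tuning code (as the in-cell comment suggests, it could be refactored). A minimal sketch of such a helper, assuming the same dict-style batches and `EarlyStopping` class used above; the name `run_training` is hypothetical:

import time
import torch

def run_training(model, train_loader, val_loader, criterion, optimizer,
                 early_stopping, num_epochs, device):
    """Generic train/validate loop with early stopping (hypothetical helper)."""
    train_losses, val_losses = [], []
    for epoch in range(num_epochs):
        epoch_start = time.time()
        # Training pass
        model.train()
        running = 0.0
        for batch in train_loader:
            x_enc, x_mark_enc, x_dec, x_mark_dec, y_true = [b.to(device) for b in batch.values()]
            optimizer.zero_grad()
            loss = criterion(model(x_enc, x_mark_enc, x_dec, x_mark_dec), y_true)
            loss.backward()
            optimizer.step()
            running += loss.item()
        train_losses.append(running / max(len(train_loader), 1))
        # Validation pass
        model.eval()
        running = 0.0
        with torch.no_grad():
            for batch in val_loader:
                x_enc, x_mark_enc, x_dec, x_mark_dec, y_true = [b.to(device) for b in batch.values()]
                running += criterion(model(x_enc, x_mark_enc, x_dec, x_mark_dec), y_true).item()
        val_losses.append(running / max(len(val_loader), 1))
        print(f"Epoch {epoch + 1}/{num_epochs}, Train: {train_losses[-1]:.6f}, "
              f"Val: {val_losses[-1]:.6f}, Time: {time.time() - epoch_start:.2f}s")
        # Early stopping on validation loss
        early_stopping(val_losses[-1], model)
        if early_stopping.early_stop:
            print("Early stopping triggered.")
            break
    return train_losses, val_losses

Both the Optuna objective and the final retraining could then call this helper with their own loaders, optimizer, and early-stopping settings.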
13. Comparison of Time Series Models for Hourly Energy Forecasting (PJM East)¶
How do the three modeling approaches applied to hourly energy consumption forecasting for the PJM East region compare to each other? We have tried XGBoost, a hybrid ARIMA-LSTM-CNN model, and a Transformer-based model (TimeXer). Let's dig into their methodologies, strengths, weaknesses, and potential future improvements.
1. Model Methodologies and Feature Handling¶
XGBoost (Gradient Boosted Trees):
- Methodology: Uses an ensemble of decision trees (gradient boosting) to model the relationship between features and energy consumption. It learns iteratively, with new trees correcting the errors of previous ones.
- Feature Handling: Relies heavily on explicit feature engineering. The described workflow incorporates:
- Calendar Features: Hour, day of week, month, year, day of year, holidays (including custom "bridge" days), weekends, and a combined `is_dayoff` flag.
- Weather Features: Averaged hourly temperature, dew point, humidity, precipitation, and wind speed from multiple nearby stations.
- Lagged Load Features: Crucially, adding past energy consumption values (e.g., 1, 2, 3, 24, 168 hours prior) significantly improved performance by capturing autocorrelation (inertia).
- Interpretability: SHAP values were used to quantify feature importance, showing temperature, dew point, `is_dayoff`, hour, and especially `lag_1hr` (when included) as key drivers.
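A minimal sketch of the lag-based setup described above (illustrative only; the notebook's actual feature pipeline may differ, and the column name `PJME_MW` follows the dataset used here):

import pandas as pd

def add_lag_features(df, target='PJME_MW', lags=(1, 2, 3, 24, 168)):
    # Create autoregressive columns (lag_1hr, lag_24hr, ...) by shifting the target.
    out = df.copy()
    for lag in lags:
        out[f'lag_{lag}hr'] = out[target].shift(lag)
    return out.dropna()

# Hypothetical usage: calendar/weather columns plus the lag columns as model inputs, e.g.
# import xgboost as xgb
# reg = xgb.XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=50)
# reg.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)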
ARIMA-LSTM-CNN (Hybrid Statistical/Deep Learning):
- Methodology: A multi-stage approach:
- ARIMA (or SARIMA): Models the linear components, trends, and seasonality (like the 24-hour cycle) in the time series first. A SARIMA(2,0,1)x(1,0,0,24) model was identified as a good fit for the linear patterns.
- Residual Calculation: The errors (residuals) from the ARIMA model, which contain non-linear patterns, are calculated.
- LSTM-CNN: A deep learning model processes these residuals along with exogenous features.
- CNN: Extracts local patterns and features from sequences of residuals and exogenous variables.
- LSTM: Models longer-term temporal dependencies within the sequence data processed by the CNN.
- Final Forecast: The prediction from ARIMA is combined with the prediction from the LSTM-CNN (which predicts the residuals) to produce the final forecast.
- Feature Handling: ARIMA inherently handles time dependencies (autocorrelation, seasonality). The LSTM-CNN component explicitly uses sequences of scaled exogenous features (calendar, weather) alongside the ARIMA residuals to predict the non-linear component.
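A minimal sketch of this ARIMA-plus-residual decomposition (illustrative only; `residual_model` stands in for the notebook's LSTM-CNN and is assumed to expose sklearn-style `fit`/`predict`):

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def hybrid_forecast(y_train, exog_train, exog_future, residual_model, steps=24):
    # 1) SARIMA captures the linear structure and the 24-hour seasonality.
    sarima = SARIMAX(y_train, order=(2, 0, 1), seasonal_order=(1, 0, 0, 24)).fit(disp=False)
    # 2) The residuals hold the non-linear component for the deep model to learn.
    residuals = y_train - sarima.fittedvalues
    residual_model.fit(exog_train, residuals)
    # 3) Final forecast = linear SARIMA forecast + predicted residual correction.
    linear_part = sarima.forecast(steps=steps)
    nonlinear_part = residual_model.predict(exog_future)
    return np.asarray(linear_part) + np.asarray(nonlinear_part)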
TimeXer (Transformer-based):
- Methodology: Uses the self-attention mechanism characteristic of Transformer models to weigh the importance of different past time steps when forecasting a future time step. This allows it to capture complex and long-range dependencies more effectively than RNNs or simple lagged features.
- Feature Handling: Transformers process sequences of input data, incorporating time series values along with associated features (weather, calendar information) via embedding layers. They can learn complex interactions between features over time without the extensive manual feature engineering required by models like XGBoost, and have the potential to handle larger datasets and incorporate additional factors such as power generation.
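As a toy illustration of the attention idea (not the TimeXer code itself, which appears earlier in this notebook), every position in a 168-hour window can attend directly to every other position:

import torch
import torch.nn as nn

seq_len, d_model, n_heads = 168, 128, 8                  # mirrors the tuned config above
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
x = torch.randn(1, seq_len, d_model)                     # an embedded hourly load window
out, weights = attn(x, x, x, need_weights=True)          # self-attention over all hours
print(weights.shape)                                     # torch.Size([1, 168, 168])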
2. Strengths and Weaknesses¶
XGBoost:
- Strengths: Relatively simple to implement and tune, computationally efficient compared to deep learning models, robust performance (especially with well-engineered features like lags), good interpretability via SHAP. Achieved very low errors (MAPE 0.87%) once lagged features were included.
- Weaknesses: Performance heavily depends on feature engineering quality; might not capture very complex or long-range temporal patterns as effectively as sequence models like LSTMs or Transformers without extensive lagging/windowing features.
ARIMA-LSTM-CNN:
- Strengths: Explicitly decomposes linear and non-linear patterns, potentially leading to robust forecasts by leveraging the strengths of both statistical (ARIMA) and deep learning (LSTM/CNN) methods. The hybrid approach yielded results (RMSE: 374.63 MW, MAPE: 0.89%) comparable to the lagged XGBoost model.
- Weaknesses: More complex to implement, tune, and computationally intensive than XGBoost. ARIMA component relies on stationarity assumptions. Requires careful data preparation for the sequence model (scaling, windowing).
TimeXer:
- Strengths: State-of-the-art potential for sequence modeling tasks, ability to capture long-range dependencies and complex interactions via self-attention, potentially better scalability with very large datasets and diverse features (e.g., power generation data).
- Weaknesses: Computationally expensive (training and inference), often requires large amounts of data to reach peak performance, can be harder to interpret ("black box" nature compared to SHAP on XGBoost), potentially more sensitive to hyperparameters.
3. Discussion and Model Preference¶
The results show that both the feature-engineered XGBoost model (with lags) and the complex ARIMA-LSTM-CNN hybrid achieved excellent performance (MAPE < 1%) on this dataset, significantly outperforming simpler baseline models (like calendar-only XGBoost or ARIMA-only).
A preference for XGBoost on the grounds of computational simplicity and robustness is well justified, especially given its strong performance after adding lagged features. It often provides a good balance between performance and practicality.
The Transformer (TimeXer) represents a more advanced approach. While potentially powerful for learning intricate patterns in large, multi-faceted datasets (for example, incorporating power generation and other factors), it comes with higher computational costs and data requirements. In this notebook's run it reached a test MAPE of 4.28% (RMSE about 1951 MW), well above the sub-1% errors of the tuned XGBoost and hybrid models, which suggests further tuning and richer inputs are needed before it becomes competitive here.
4. Next Steps¶
Data Enrichment (all models): Incorporate these four high-level, localized inputs to give every model a richer feature set:
- Local Weather Conditions: Temperature, humidity, solar irradiance and cloud cover, wind speed, etc., to capture both HVAC demand and on-site generation dynamics.
- Population & Usage Patterns: Population density, residential vs. commercial mix, EV-charging penetration, building-type footprints; helps predict baseline and peak-hour behavior.
- Renewable Generation Forecasts: Behind-the-meter and utility-scale solar + wind output projections (weather-driven), which directly reduce net load.
- Temporal & Operational Signals: Calendar effects (hour, weekday/weekend, holidays), scheduled demand-response or outages, and market-price signals that shift consumption timing.
For XGBoost:
Feature Engineering:
- Add derived weather features (Heating/Cooling Degree Days/Hours).
- Refine holiday modeling (multi-day flags, error analysis around specific holidays like Presidents’ Day).
- Experiment with different lag combinations or rolling-window statistics.
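A minimal sketch of the degree-hour and rolling-window ideas (the column name `temp` and the 18 °C / 24 °C thresholds are assumptions, not values from the notebook):

import pandas as pd

def add_degree_hour_features(df, temp_col='temp', target='PJME_MW',
                             base_heat=18.0, base_cool=24.0):
    # Heating/cooling degree hours: how far each hourly temperature sits
    # below/above illustrative comfort thresholds.
    out = df.copy()
    out['hdh'] = (base_heat - out[temp_col]).clip(lower=0)
    out['cdh'] = (out[temp_col] - base_cool).clip(lower=0)
    # Rolling-window statistic on past load (shifted to avoid leakage).
    out['load_roll24_mean'] = out[target].shift(1).rolling(24).mean()
    return out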
Modeling: Explore hybrid approaches (e.g., LSTM-derived features fed into XGBoost).
Deployment: Implement dynamic retraining (rolling window) to adapt to evolving patterns.
For ARIMA–LSTM–CNN:
Feature Engineering:
- Enhance holiday representation (flags for days adjacent to holidays).
- Add features for temperature extremes (degree-hour thresholds, quadratic terms) to better capture AC/heating spikes.
Error Analysis: Hour-by-hour breakdown on worst forecast days to pinpoint missed dynamics (e.g., ramp-up/down).
Model Refinement: Introduce interaction features (e.g., `hour * is_dayoff`) or a lightweight corrective model for high-error conditions (adjacent to holiday + high temp).
Validation: Rigorously re-evaluate performance, especially around historically challenging dates (e.g., July 4th).
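A minimal sketch combining the holiday-adjacency, temperature-extreme, and interaction ideas above (it assumes an hourly DatetimeIndex and the `is_holiday`/`is_dayoff` flags built earlier in the workflow; exact column names are assumptions):

import pandas as pd

def add_holiday_and_extreme_features(df, temp_col='temp'):
    out = df.copy()
    holiday = out['is_holiday'].astype(int)
    # Flag hours belonging to the day before / after a holiday via 24-hour shifts.
    out['day_before_holiday'] = holiday.shift(-24, fill_value=0).astype(int)
    out['day_after_holiday'] = holiday.shift(24, fill_value=0).astype(int)
    # Quadratic temperature term to capture non-linear AC/heating response at extremes.
    out['temp_sq'] = out[temp_col] ** 2
    # Interaction between hour of day and day-off status (hour * is_dayoff).
    out['hour_x_dayoff'] = out.index.hour * out['is_dayoff']
    return out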
For TimeXer (Transformer):
Feature Engineering & Data:
- Localized weather, population density, and the four high-level factors above.
- Add other relevant signals if available (electricity price, economic indices, local generation outages).
Model Tuning: Grid-search attention heads, layers, embedding dimensions, and alternative attention mechanisms.
Scalability: Benchmark training and inference on larger datasets or extended historical windows.
Interpretability: Apply attention‐visualization and attribution methods to understand which time steps and features drive predictions.
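For the model-tuning point above, one option is simply to widen the Optuna search space already used in this notebook. A sketch under that assumption (parameter names mirror the config keys; the training/validation body from the earlier objective would be reused):

import optuna

def objective(trial):
    d_model = trial.suggest_categorical('d_model', [64, 128, 256])
    n_heads = trial.suggest_categorical('n_heads', [2, 4, 8])
    if d_model % n_heads != 0:
        raise optuna.TrialPruned()  # skip incompatible head/embedding combinations
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True),
        'd_model': d_model,
        'n_heads': n_heads,
        'e_layers': trial.suggest_int('e_layers', 1, 4),
        'patch_len': trial.suggest_categorical('patch_len', [8, 12, 24]),
        'seq_len': trial.suggest_categorical('seq_len', [96, 168, 336]),
    }
    # ... rebuild Configs/datasets with params, train for a few epochs, and
    # return the validation loss, as in the earlier objective ...
    val_loss = float('inf')  # placeholder; replace with the measured validation loss
    return val_loss

Attention-weight visualization (for example via forward hooks on the attention modules) would then address the interpretability point in a similar incremental way.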
By systematically addressing these next steps, particularly improved feature engineering for holidays and weather extremes and more granular data for the TimeXer model, the forecasting accuracy of all three approaches can likely be improved further.