The Temporal Fusion Transformer (TFT) is an advanced model for time series forecasting. One of its key components is the Variable Selection Network (VSN), which is designed to automatically identify and focus on the most relevant features in a dataset. It does this by assigning learned weights to each input variable, effectively highlighting which features contribute most to the predictive task.
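To make the idea concrete, here is a minimal, hypothetical sketch of the weighting mechanism at the heart of a VSN: per-variable scores are passed through a softmax to produce importance weights that sum to one, and the inputs are combined as a weighted sum. (This is an illustration only, not the TFT implementation, which produces the scores with gated residual networks over embedded inputs.)

```python
import math

def select_variables(inputs, scores):
    """Toy variable selection: softmax the per-variable scores into
    importance weights, then combine the inputs as a weighted sum."""
    max_score = max(scores)
    # Numerically stable softmax over the variable scores.
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    combined = sum(w * x for w, x in zip(weights, inputs))
    return weights, combined

# Three input variables at one time step; the (made-up) learned scores
# favor the first variable.
inputs = [0.5, -1.2, 2.0]
scores = [3.0, 0.1, 0.5]
weights, combined = select_variables(inputs, scores)
print(weights)    # sums to 1; the first variable dominates
print(combined)
```

The weights are what we will inspect later: averaged over a dataset, they tell us which variables the model actually relies on.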

This VSN-based approach will be our second reduction technique. We’ll implement it using PyTorch Forecasting, which allows us to leverage the Variable Selection Network from the TFT model.

We’ll use a basic configuration. Our goal isn’t to create the highest-performing model possible, but rather to identify the most relevant features while using minimal resources.

```python
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.metrics import QuantileLoss
from lightning.pytorch.callbacks import EarlyStopping
import lightning.pytorch as pl
import torch

pl.seed_everything(42)

max_encoder_length = 32
max_prediction_length = 1
VAL_SIZE = .2
VARIABLES_IMPORTANCE = .8

model_data_feature_sel = initial_model_train.join(stationary_df_train)
model_data_feature_sel = model_data_feature_sel.join(pca_df_train)
model_data_feature_sel['price'] = model_data_feature_sel['price'].astype(float)
model_data_feature_sel['y'] = model_data_feature_sel['price'].pct_change()
model_data_feature_sel = model_data_feature_sel.iloc[1:].reset_index(drop=True)
model_data_feature_sel['group'] = 'spy'
model_data_feature_sel['time_idx'] = range(len(model_data_feature_sel))

train_size_vsn = int((1 - VAL_SIZE) * len(model_data_feature_sel))
train_data_feature = model_data_feature_sel[:train_size_vsn]
val_data_feature = model_data_feature_sel[train_size_vsn:]

unknown_reals_origin = [col for col in model_data_feature_sel.columns if col.startswith('value_')] + ['y']

timeseries_config = {
    "time_idx": "time_idx",
    "target": "y",
    "group_ids": ["group"],
    "max_encoder_length": max_encoder_length,
    "max_prediction_length": max_prediction_length,
    "time_varying_unknown_reals": unknown_reals_origin,
    "add_relative_time_idx": True,
    "add_target_scales": True,
    "add_encoder_length": True
}

training_ts = TimeSeriesDataSet(
    train_data_feature,
    **timeseries_config
)
```

The `VARIABLES_IMPORTANCE` threshold is set to 0.8, meaning we'll keep the features that together account for 80% of the total importance assigned by the Variable Selection Network (VSN). For more information about the Temporal Fusion Transformer (TFT) and its parameters, please refer to the documentation.
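As a quick illustration of the cumulative-importance rule (with made-up weights): sort the learned importances in descending order, accumulate them, and keep every feature up to and including the first one that pushes the running total past the 0.8 threshold.

```python
# Hypothetical VSN importance weights, already normalized to sum to 1.
importances = {"feat_a": 0.45, "feat_b": 0.25, "feat_c": 0.15,
               "feat_d": 0.10, "feat_e": 0.05}
THRESHOLD = 0.8

kept, running_total = [], 0.0
for name, weight in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    kept.append(name)
    running_total += weight
    if running_total > THRESHOLD:   # the first feature crossing 80% closes the set
        break

print(kept)  # ['feat_a', 'feat_b', 'feat_c'] -> 0.45 + 0.25 + 0.15 = 0.85 > 0.8
```

With these weights, two features already cover 70%, so a third is needed to cross the threshold; the remaining two are dropped.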

Next, we’ll train the TFT model.

```python
if torch.cuda.is_available():
    accelerator = 'gpu'
    num_workers = 2
else:
    accelerator = 'auto'
    num_workers = 0

validation = TimeSeriesDataSet.from_dataset(training_ts, val_data_feature, predict=True, stop_randomization=True)

train_dataloader = training_ts.to_dataloader(train=True, batch_size=64, num_workers=num_workers)
val_dataloader = validation.to_dataloader(train=False, batch_size=64 * 5, num_workers=num_workers)

tft = TemporalFusionTransformer.from_dataset(
    training_ts,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=2,
    dropout=0.1,
    loss=QuantileLoss()
)

early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-5, patience=5, verbose=False, mode="min")
trainer = pl.Trainer(max_epochs=20, accelerator=accelerator, gradient_clip_val=.5, callbacks=[early_stop_callback])

trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader
)
```

We intentionally set `max_epochs=20` so the model doesn't train too long. Additionally, we implemented an `early_stop_callback` that halts training if the model shows no improvement for 5 consecutive epochs (`patience=5`).

Finally, using the best model obtained, we keep the features that together account for 80% of the total importance as determined by the VSN.

```python
best_model_path = trainer.checkpoint_callback.best_model_path
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)

raw_predictions = best_tft.predict(val_dataloader, mode="raw", return_x=True)

def get_top_encoder_variables(best_tft, interpretation):
    encoder_importances = interpretation["encoder_variables"]
    sorted_importances, indices = torch.sort(encoder_importances, descending=True)
    cumulative_importances = torch.cumsum(sorted_importances, dim=0)
    threshold_index = torch.where(cumulative_importances > VARIABLES_IMPORTANCE)[0][0]
    top_variables = [best_tft.encoder_variables[i] for i in indices[:threshold_index + 1]]
    if 'relative_time_idx' in top_variables:
        top_variables.remove('relative_time_idx')
    return top_variables

interpretation = best_tft.interpret_output(raw_predictions.output, reduction="sum")
top_encoder_vars = get_top_encoder_variables(best_tft, interpretation)

print(f"\nOriginal number of features: {stationary_df_train.shape[1]}")
print(f"Number of features after Variable Selection Network (VSN): {len(top_encoder_vars)}\n")
```

The original dataset contained 438 features, which the VSN method reduced to just one! This drastic reduction suggests several possibilities:

- Many of the original features may have been redundant.
- The feature selection process may have oversimplified the data.
- Using only the target variable’s historical values (autoregressive approach) might perform as well as, or possibly better than, models incorporating exogenous variables.

In this final section, we compare our reduction techniques applied to our model. Each method is tested with identical model configurations, varying only the feature set.

We'll use TiDE (Time-series Dense Encoder), a compact state-of-the-art MLP-based model, with the implementation provided by NeuralForecast. Any NeuralForecast model that supports historical exogenous variables would work here.

We’ll train and test two models using daily SPY (S&P 500 ETF) data. Both models will have the same:

- Train-test split ratio
- Hyperparameters
- Single time series (SPY)
- Forecasting horizon of 1 step ahead

The only difference between the models will be the feature reduction technique. That’s it!

- First model: Original features (no feature reduction)
- Second model: Feature reduction using PCA
- Third model: Feature reduction using VSN

This setup allows us to isolate the impact of each feature reduction technique on model performance.

First, we train the three models with the same configuration except for the features.

```python
from neuralforecast.models import TiDE
from neuralforecast import NeuralForecast

train_data = initial_model_train.join(stationary_df_train)
train_data = train_data.join(pca_df_train)
test_data = initial_model_test.join(stationary_df_test)
test_data = test_data.join(pca_df_test)

hist_exog_list_origin = [col for col in train_data.columns if col.startswith('value_')] + ['y']
hist_exog_list_pca = [col for col in train_data.columns if col.startswith('PC')] + ['y']
hist_exog_list_vsn = top_encoder_vars

tide_params = {
    "h": 1,
    "input_size": 32,
    "scaler_type": "robust",
    "max_steps": 500,
    "val_check_steps": 20,
    "early_stop_patience_steps": 5
}

model_original = TiDE(
    **tide_params,
    hist_exog_list=hist_exog_list_origin,
)

model_pca = TiDE(
    **tide_params,
    hist_exog_list=hist_exog_list_pca,
)

model_vsn = TiDE(
    **tide_params,
    hist_exog_list=hist_exog_list_vsn,
)

nf = NeuralForecast(
    models=[model_original, model_pca, model_vsn],
    freq='D'
)

val_size = int(train_size * VAL_SIZE)
nf.fit(df=train_data, val_size=val_size, use_init_models=True)
```

Then, we make the predictions.

```python
from tabulate import tabulate

y_hat_test_ret = pd.DataFrame()
current_train_data = train_data.copy()

y_hat_ret = nf.predict(current_train_data)
y_hat_test_ret = pd.concat([y_hat_test_ret, y_hat_ret.iloc[[-1]]])

for i in range(len(test_data) - 1):
    combined_data = pd.concat([current_train_data, test_data.iloc[[i]]])
    y_hat_ret = nf.predict(combined_data)
    y_hat_test_ret = pd.concat([y_hat_test_ret, y_hat_ret.iloc[[-1]]])
    current_train_data = combined_data

predicted_returns_original = y_hat_test_ret['TiDE'].values
predicted_returns_pca = y_hat_test_ret['TiDE1'].values
predicted_returns_vsn = y_hat_test_ret['TiDE2'].values

predicted_prices_original = []
predicted_prices_pca = []
predicted_prices_vsn = []

for i in range(len(predicted_returns_pca)):
    if i == 0:
        last_true_price = train_data['price'].iloc[-1]
    else:
        last_true_price = test_data['price'].iloc[i - 1]
    predicted_prices_original.append(last_true_price * (1 + predicted_returns_original[i]))
    predicted_prices_pca.append(last_true_price * (1 + predicted_returns_pca[i]))
    predicted_prices_vsn.append(last_true_price * (1 + predicted_returns_vsn[i]))

true_values = test_data['price']
methods = ['Original', 'PCA', 'VSN']
predicted_prices = [predicted_prices_original, predicted_prices_pca, predicted_prices_vsn]

results = []
for method, prices in zip(methods, predicted_prices):
    mse = np.mean((np.array(prices) - true_values) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(np.array(prices) - true_values))
    results.append([method, mse, rmse, mae])

headers = ["Method", "MSE", "RMSE", "MAE"]
table = tabulate(results, headers=headers, floatfmt=".4f", tablefmt="grid")

print("\nPrediction Errors Comparison:")
print(table)

with open("prediction_errors_comparison.txt", "w") as f:
    f.write("Prediction Errors Comparison:\n")
    f.write(table)
```

We forecast daily returns with each model, then convert the returns back to prices. This lets us compute prediction errors in price terms and plot the actual prices against the forecasts.

The similar performance of the TiDE model across the original and reduced feature sets reveals a crucial insight: feature reduction did not lead to the improved predictions one might expect. This points to several possible issues:

- Information loss: despite aiming to preserve essential data, dimensionality reduction techniques discarded information relevant to the prediction task, explaining the lack of improvement with fewer features.
- Generalization struggles: consistent performance across feature sets indicates the model’s difficulty in capturing underlying patterns, regardless of feature count.
- Complexity overkill: similar results with fewer features suggest TiDE’s sophisticated architecture may be unnecessarily complex. A simpler model, like ARIMA, could potentially perform just as well.
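On that last point, a useful sanity check is the naive random-walk forecast, which is what ARIMA(0,1,0) without drift reduces to: predict tomorrow's price as today's price. A minimal sketch of this baseline and its MAE, using made-up prices:

```python
# Hypothetical daily prices: the last training price followed by a short test segment.
last_train_price = 471.8
test_prices = [473.2, 474.0, 472.9, 475.1]

# Naive random-walk forecast (ARIMA(0,1,0) without drift): the prediction
# for each day is simply the previous day's observed price.
naive_forecast = [last_train_price] + test_prices[:-1]

# Mean absolute error of the baseline against the observed test prices.
mae = sum(abs(f - p) for f, p in zip(naive_forecast, test_prices)) / len(test_prices)
print(naive_forecast)  # [471.8, 473.2, 474.0, 472.9]
print(round(mae, 3))   # 1.375
```

If a deep model can't beat this baseline on the same metric, the exogenous features (reduced or not) are adding little predictive value.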

Then, let’s examine the chart to see if we can observe any significant differences among the three forecasting methods and the actual prices.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(train_data['ds'], train_data['price'], label='Training Data', color='blue')
plt.plot(test_data['ds'], true_values, label='True Prices', color='green')
plt.plot(test_data['ds'], predicted_prices_original, label='Predicted Prices', color='red')
plt.legend()
plt.title('SPY Price Forecast Using All Original Features')
plt.xlabel('Date')
plt.ylabel('SPY Price')
plt.savefig('spy_forecast_chart_original.png', dpi=300, bbox_inches='tight')
plt.close()

plt.figure(figsize=(12, 6))
plt.plot(train_data['ds'], train_data['price'], label='Training Data', color='blue')
plt.plot(test_data['ds'], true_values, label='True Prices', color='green')
plt.plot(test_data['ds'], predicted_prices_pca, label='Predicted Prices', color='red')
plt.legend()
plt.title('SPY Price Forecast Using PCA Dimensionality Reduction')
plt.xlabel('Date')
plt.ylabel('SPY Price')
plt.savefig('spy_forecast_chart_pca.png', dpi=300, bbox_inches='tight')
plt.close()

plt.figure(figsize=(12, 6))
plt.plot(train_data['ds'], train_data['price'], label='Training Data', color='blue')
plt.plot(test_data['ds'], true_values, label='True Prices', color='green')
plt.plot(test_data['ds'], predicted_prices_vsn, label='Predicted Prices', color='red')
plt.legend()
plt.title('SPY Price Forecast Using VSN')
plt.xlabel('Date')
plt.ylabel('SPY Price')
plt.savefig('spy_forecast_chart_vsn.png', dpi=300, bbox_inches='tight')
plt.close()
```

The difference between true and predicted prices appears consistent across all three models, with no noticeable variation in performance between them.

We did it! We explored the importance of feature reduction in time series analysis and provided a practical implementation guide:

- Feature reduction aims to simplify models while maintaining predictive power. Benefits include reduced complexity, improved generalization, easier interpretation, and computational efficiency.
- We demonstrated two reduction techniques using FRED data:
  - Principal Component Analysis (PCA), a linear dimensionality reduction method, reduced features from 438 to 76 while retaining 90% of explained variance.
  - The Variable Selection Network (VSN) from the Temporal Fusion Transformer, a non-linear approach, drastically reduced features to just 1 using an 80% cumulative importance threshold.

- Evaluation using TiDE models showed similar performance across original and reduced feature sets, suggesting feature reduction may not always improve forecasting performance. This could be due to information loss during reduction, the model’s difficulty in capturing underlying patterns, or the possibility that a simpler model might be equally effective for this particular forecasting task.

On a final note, we didn't cover every feature reduction technique, such as SHAP (SHapley Additive exPlanations), which provides a unified measure of feature importance across model types. Even when feature reduction doesn't improve the model, it's still worth curating features and comparing performance across reduction methods: this helps ensure you aren't discarding valuable information while optimizing your model's efficiency and interpretability.

In future articles, we’ll apply these feature reduction techniques to more complex models, comparing their impact on performance and interpretability. Stay tuned!

Ready to put these concepts into action? You can find the complete code implementation here.
