How do I do time series forecasting with XGBoost?

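Cloud2016 · 8 days ago

XGBoost is powerful, so I won't go on about that. I'm still finding my way with it, and I hit a beginner's question right after getting started. I'm using the well-worn AirPassengers time series and plan to benchmark a few methods on it.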

air_passengers <- data.frame(
  y = as.vector(AirPassengers),
  month = rep(1:12, 12),
  year = rep(1949:1960, each = 12)
)
# Date column, used later for plotting with ggplot2
air_passengers$date <- as.Date(paste(air_passengers$year, air_passengers$month, "01", sep = "-"))

Load the xgboost package, split the data, train the model, then fit and predict.

library(xgboost)
data_size <- nrow(air_passengers)
# Split the dataset
train_size <- floor(data_size * 0.67)
# Treat the forecasting problem as a regression task
mod_xgb <- xgboost(
  y = air_passengers[, 1],
  x = air_passengers[, -c(1, 4)],
  eval_set = (train_size + 1):data_size, # validation set
  early_stopping_rounds = 50,
  verbosity = 0 # do not print the training log
)
# Fit the history and forecast the next 12 periods
pred_xgb <- predict(mod_xgb, newdata = data.frame(
    month = c(air_passengers$month, 1:12),
    year = c(air_passengers$year, rep(1961, 12))
  ), validate_features = TRUE)
# Tidy up the results
air_passengers_xgb <- data.frame(
  y = pred_xgb,
  month = c(air_passengers$month, 1:12),
  year = c(air_passengers$year, rep(1961, 12))
)
air_passengers_xgb$date <- as.Date(paste(air_passengers_xgb$year, air_passengers_xgb$month, "01", sep = "-"))

Finally, plot the fit to the history together with the forecast.

library(ggplot2)
ggplot() +
  geom_point(data = air_passengers, aes(x = date, y = y), size = 1) +
  geom_line(data = air_passengers_xgb, aes(x = date, y = y), color = "red") +
  labs(x = "", y = "")

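In the plot, the black points are the original data and the red line is the fitted and forecast values. The result is way off: the model passes through every training point (clearly overfitting), and on the test set it has picked up the seasonality but not the trend or the growing variability. Given how much I trust XGBoost, my first suspicion was that I'm using it wrong somewhere, but I've gone through the xgboost documentation and still can't work out what to do. Hence this request for help.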

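nan.xiao · 7 days ago

What makes time series data special is that it is a sequence of observations over continuous time, and those observations are highly autocorrelated, so regressing on it directly is bound to work poorly. To model it with a regression method you need something like a time-domain to frequency-domain transformation, that is, a way to turn the values of the series into features.

You don't have to write this by hand; just use a package. Darts, for example, wraps a bunch of models, including various GBDTs: https://unit8co.github.io/darts/ If you're interested, have a look at how it implements them.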

fenguoerbian · 7 days ago

mod_xgb <- xgboost(
    y = air_passengers[, 1],
    x = air_passengers[, -c(1, 4)],
    eval_set = (train_size + 1):data_size, # validation set
    early_stopping_rounds = 50,
    nrounds = 10
    # verbosity = 0 # do not print the training log
)

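With verbosity = 0 commented out, the printed training log shows that the model overfits very early.
early_stopping_rounds is the number of further rounds training is allowed to run after the loss on the validation set stops decreasing. In your data the validation loss is still creeping down at the end, so this parameter never kicked in.
nrounds is the number of training rounds, 100 by default. The training log shows that is far too many, so I manually set it to 10.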
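To make the stopping rule concrete, here is a minimal sketch in base R. `stop_round()` is a hypothetical helper written for illustration, not an xgboost function, and the loss values are made up; it only mimics the "no improvement for `patience` rounds" idea.

```r
# Hypothetical helper (not part of xgboost): given per-round validation losses,
# return the round at which a patience-based rule would stop training.
stop_round <- function(val_loss, patience) {
  best <- Inf
  best_iter <- 0
  for (i in seq_along(val_loss)) {
    if (val_loss[i] < best) {
      # Any improvement, however small, resets the counter
      best <- val_loss[i]
      best_iter <- i
    } else if (i - best_iter >= patience) {
      return(i)
    }
  }
  length(val_loss) # rule never triggered: all rounds are run
}

# Made-up loss curves for illustration
flat <- c(5, 3, 2, 2.1, 2.2, 2.3, 2.4)     # stops improving after round 3
slow <- c(5, 3, 2, 1.99, 1.98, 1.97, 1.96) # keeps improving by a hair

stopped_flat <- stop_round(flat, patience = 3)
stopped_slow <- stop_round(slow, patience = 3)
```

With the second curve the rule never fires, which is the situation described above: a tiny but steady decrease in validation loss keeps resetting the counter.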

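Even after all that, it still doesn't help much. From what I found searching around, xgboost itself is simply not good at linear extrapolation (its trees are piecewise constant in the features, so a year beyond the training range gets the same prediction as the last training year). So for time series you have to remove the trend yourself first, get a stationary series, and only then do the rest.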

You can refer to this, and also to this.
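A minimal sketch of the detrend-first idea, in base R only so it runs without xgboost. The linear trend on the log scale and the per-month mean (standing in for the tree model) are assumptions made for illustration; they are not taken from the references above.

```r
# Sketch: remove the trend first, let the second-stage model learn only the remainder.
y <- as.vector(AirPassengers)
n <- length(y)
train_size <- floor(n * 0.67)   # same split as in the original post
t_idx <- seq_len(n)             # time index 1..n
month <- rep(1:12, length.out = n)

# Log transform to stabilise the growing seasonal amplitude
log_y <- log(y)

# Fit a linear trend on the training window only (no peeking at the hold-out)
train <- data.frame(log_y = log_y[1:train_size], t = t_idx[1:train_size])
trend_fit <- lm(log_y ~ t, data = train)

# A linear trend extrapolates to any future time index
trend_all <- as.numeric(predict(trend_fit, newdata = data.frame(t = t_idx)))
remainder <- log_y - trend_all  # mostly seasonal, no strong trend left

# `remainder` is what would be handed to xgboost() as y, with month as a feature.
# Stand-in for the tree model here: per-month mean of the training remainder.
seasonal <- tapply(remainder[1:train_size], month[1:train_size], mean)

# Recombine trend and seasonal part, then undo the log
pred <- exp(trend_all + as.numeric(seasonal[month]))

test_idx <- (train_size + 1):n
rmse <- sqrt(mean((y[test_idx] - pred[test_idx])^2))
```

Because the trend is carried by `lm()` rather than by the trees, the forecast keeps rising past the training window, which a model fitted on year and month alone cannot do.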

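Cloud2016 · 7 days ago

Thanks nan.xiao and fenguoerbian. I get the feeling the road of hyperparameter tuning is a long one.

I've noticed there are many Python modules for forecasting, and some of them bundle a bunch of models so you can just call the package and tune parameters. Still, I'd like to understand roughly what goes on behind the model, so I want to start from XGBoost and build things by hand. I looked at Darts, mentioned above, and some of its models do use xgboost. Besides Darts, the Python module NeuralForecast, for example, collects a large number of neural-network-based time series forecasting models. I'd like to learn about and try out most of the tree-based and neural-network-based ones, and finally run them on a real dataset.

@fenguoerbian I know that taking the log of the original series and then differencing leaves only the seasonality. But I assumed XGBoost would let me feed it the series and the features (year and month) directly and learn the rest by itself. I tried two neural network models from NeuralForecast, LSTM and NHITS. I only ran the examples, but apart from the raw series I didn't have to build any features myself.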
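For reference, the log-then-difference step and its inverse fit in a few lines of base R. This is only a sketch of the transformation itself; the variable names are made up for the example.

```r
y <- as.vector(AirPassengers)

# Log, then first difference: month-over-month change on the log scale
dlog_y <- diff(log(y))          # length is one less than the original series

# Inverting the transformation: cumulate the differences from the first
# observed value, then undo the log
y_back <- exp(log(y[1]) + cumsum(dlog_y))

# y_back reproduces y[2], ..., y[144] up to floating-point error
round_trip_ok <- isTRUE(all.equal(y_back, y[-1]))
```

The same inversion applies to forecasts: predicted differences are cumulated starting from the last observed log value to get back to passenger counts.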

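Cloud2016 · 7 days ago

nan.xiao The XGB model wrapped by Darts actually forecasts through a sliding window: for example, to forecast N future periods it shifts the original series by N periods. In essence it is a sophisticated tree-based smoothing model (as I understand it, smoothing can also be seen as a kind of regression). It builds a matrix from the original series; R has a function, embed, that can do this.

For exactly how it does this, I'd still have to read its code.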
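A minimal sketch of building such a lag matrix with embed(). The choice of 12 lags and of working on the differenced log series are assumptions for illustration; this is not how Darts is implemented.

```r
# Differenced log series, so the lag features carry no strong trend
d <- diff(log(as.vector(AirPassengers)))
n_lags <- 12  # assumed window length: one year of monthly lags

# embed(d, k) returns a matrix whose row i is d[i+k-1], d[i+k-2], ..., d[i]
lagged <- embed(d, n_lags + 1)

target <- lagged[, 1]                     # value to predict at time t
features <- lagged[, -1, drop = FALSE]    # values at t-1, ..., t-n_lags
colnames(features) <- paste0("lag", seq_len(n_lags))

# The pair could then be passed to a regression model, for example
# (untested here, using the same interface as earlier in this thread):
# mod <- xgboost(x = features, y = target, nrounds = 10)
```

Each row is one training example: the first column of embed() output is the target and the remaining columns are its most recent past values, most recent first.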