Abstract
A tight schedule is desirable and cost-efficient for freight transport companies, but it also increases the impact of delays and cancellations. Predicting delays is difficult because they depend on many influencing factors. To address these issues, this work presents an approach to forecasting delays of freight trains using data mining and machine learning methods. For this purpose, an international rail freight transport company provided us with a large amount of historical data on freight train runs. To obtain a suitable prediction model, we apply a knowledge discovery in databases (KDD) process, which comprises the steps of data selection, data preprocessing, data transformation, data mining, and interpretation/evaluation. After data selection and data preprocessing, we transform categorical features via one-hot encoding as well as via embeddings of various sizes. In addition, we apply a dedicated data transformation method to cyclical features such as the weekday. In the data mining step itself, we use the preprocessed historical data to perform a regression analysis that forecasts the delays of freight trains, and we compare several regression models: decision tree, random forest, extra trees, and gradient boosting regression. An adequate prediction model will be integrated into an agent-based model that tests the robustness of optimized locomotive schedules for freight trains.
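The transformation of cyclical features mentioned above can be illustrated with a minimal sketch. The abstract does not state which method is used, so the sine/cosine mapping below is an assumption (it is one common choice), and the names `encode_weekday` and `circle_distance` are hypothetical. The idea is that Sunday and Monday should end up as close to each other as any other pair of adjacent days, which a plain integer index 0 to 6 does not achieve.

```python
import math
from typing import Tuple

DAYS_PER_WEEK = 7


def encode_weekday(weekday: int) -> Tuple[float, float]:
    """Map a weekday index (0 = Monday ... 6 = Sunday) to a point on the unit circle.

    Assumption: sine/cosine encoding; the source does not name its method.
    """
    angle = 2.0 * math.pi * (weekday % DAYS_PER_WEEK) / DAYS_PER_WEEK
    return math.sin(angle), math.cos(angle)


def circle_distance(a: int, b: int) -> float:
    """Euclidean distance between two weekdays after encoding."""
    return math.dist(encode_weekday(a), encode_weekday(b))


if __name__ == "__main__":
    # Print the two derived features for every weekday.
    for day in range(DAYS_PER_WEEK):
        s, c = encode_weekday(day)
        print(f"weekday={day} sin={s:+.3f} cos={c:+.3f}")
```

With this encoding the distance between Sunday (6) and Monday (0) equals the distance between Monday (0) and Tuesday (1), whereas the raw indices differ by 6 and 1 respectively. The two resulting columns would replace the single weekday column before the data is passed to a regression model.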