Nonlinear regression with hierarchical recurrent neural networks under missing data
Abstract
We investigate nonlinear regression for variable-length sequential data with missing inputs. We introduce the hierarchical-LSTM network, a novel hierarchical architecture based on LSTM networks. The hierarchical-LSTM architecture contains a set of LSTM networks, each trained as an expert for processing inputs that follow a particular presence-pattern; i.e., we hierarchically partition the input space into subspaces based on the presence-patterns and assign a specific LSTM network to each subpattern. At each time step, we adaptively combine the outputs of these LSTM networks according to the observed presence-pattern to construct the final output. The introduced algorithm protects the LSTM networks against performance losses due to: 1) statistical mismatches commonly faced by the widely used imputation methods; and 2) imputation drift, since our architecture uses only the existing inputs without any assumption on the missing data. In addition, our algorithm has a lower computational load than the conventional algorithms in terms of the number of multiplication operations, particularly under high missingness ratios. We emphasize that our architecture can be readily applied to other recurrent architectures such as RNNs and GRU networks. The hierarchical-LSTM network achieves significant performance improvements over the state-of-the-art methods on several well-known real-life and financial datasets. We also openly share the source code of our algorithm to facilitate other studies and the reproducibility of our results. Future work may explore selecting a subset of presence-patterns instead of using all presence-patterns, so that the hierarchical-LSTM architecture can be used with large window lengths while keeping the number of parameters and the computational load at the same level.
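To make the presence-pattern routing concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the core idea: each binary presence-pattern over the input features gets its own small recurrent expert that consumes only the observed components, so no imputation is ever performed. Plain NumPy recurrent cells stand in for the LSTM experts, and all weights, dimensions, and pattern lists here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, h = 2, 4                      # input dim, hidden dim (illustrative)
patterns = [(0,), (1,), (0, 1)]  # observed-index tuples; all-missing is skipped

# One expert per presence-pattern; the input weight is shaped to the
# observed subset only, so missing entries never enter the computation.
experts = {
    p: {"W": rng.standard_normal((h, len(p))) * 0.1,
        "U": rng.standard_normal((h, h)) * 0.1}
    for p in patterns
}

def step(x, mask, states):
    """Route one time step to the expert matching its presence-pattern
    and update only that expert's hidden state."""
    p = tuple(i for i in range(d) if mask[i])
    if not p:                          # nothing observed: keep all states
        return states, np.zeros(h)
    w = experts[p]
    s = np.tanh(w["W"] @ x[list(p)] + w["U"] @ states[p])
    states = {**states, p: s}
    return states, s                   # active expert's state as the output

states = {p: np.zeros(h) for p in patterns}
seq = rng.standard_normal((5, d))
masks = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0)]

outputs = []
for x, m in zip(seq, masks):
    states, y = step(x, m, states)
    outputs.append(y)
```

The full architecture additionally combines the experts' outputs adaptively and uses LSTM cells; this sketch uses hard routing to a single expert per step purely to illustrate how presence-patterns partition the input space.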