{"title": "Stationarity and Stability of Autoregressive Neural Network Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 267, "page_last": 273, "abstract": null, "full_text": "Stationarity and Stability of \n\nAutoregressive Neural Network Processes \n\nFriedrich Leisch\\ Adrian Trapletti 2 & Kurt Hornik l \n\n1 Institut fur Statistik \n\nTechnische UniversiUit Wien \n\nWiedner Hauptstrafie 8-10 / 1071 \n\nA-1040 Wien, Austria \n\nfirstname.lastname@ci.tuwien.ac.at \n\n2 Institut fiir Unternehmensfiihrung \n\nWirtschaftsuniversi tat Wien \n\nAugasse 2-6 \n\nA-lOgO Wien, Austria \n\nadrian. trapletti@wu-wien.ac.at \n\nAbstract \n\nWe analyze the asymptotic behavior of autoregressive neural net(cid:173)\nwork (AR-NN) processes using techniques from Markov chains and \nnon-linear time series analysis. It is shown that standard AR-NNs \nwithout shortcut connections are asymptotically stationary. If lin(cid:173)\near shortcut connections are allowed, only the shortcut weights \ndetermine whether the overall system is stationary, hence standard \nconditions for linear AR processes can be used. \n\n1 \n\nIntroduction \n\nIn this paper we consider the popular class of nonlinear autoregressive processes \ndriven by additive noise, which are defined by stochastic difference equations of \nform \n\n(1) \nwhere ft is an iid. noise process. If g( . .. , (J) is a feedforward neural network with \nparameter (\"weight\") vector (J, we call Equation 1 an autoregressive neural network \nprocess of order p, short AR-NN(p) in the following. \n\nAR-NNs are a natural generalization of the classic linear autoregressive AR(p) pro(cid:173)\ncess \n\n(2) \nSee, e.g., Brockwell & Davis (1987) for a comprehensive introduction into AR and \nARMA (autoregressive moving average) models. \n\n\f268 \n\nF. Leisch, A. Trapletti and K. 
One of the most central questions in linear time series theory is the stationarity of the model, i.e., whether the probabilistic structure of the series is constant over time or at least asymptotically constant (when not started in equilibrium). Surprisingly, this question has not gained much interest in the NN literature; in particular, to our knowledge there are no results giving conditions for the stationarity of AR-NN models. There are results on the stationarity of Hopfield nets (Wang & Sheng, 1996), but these nets cannot be used to estimate conditional expectations for time series prediction.\n\nThe rest of this paper is organized as follows: In Section 2 we recall some results from time series analysis and Markov chain theory defining the relationship between a time series and its associated Markov chain. In Section 3 we use these results to establish that standard AR-NN models without shortcut connections are stationary. We also give conditions for AR-NN models with shortcut connections to be stationary. Section 4 examines the NN modeling of an important class of non-stationary time series, namely integrated series. All proofs are deferred to the appendix.\n\n2 Some Time Series and Markov Chain Theory\n\n2.1 Stationarity\n\nLet ξ_t denote a time series generated by a (possibly nonlinear) autoregressive process as defined in (1). If Eε_t = 0, then g equals the conditional expectation E(ξ_t | ξ_{t-1}, ..., ξ_{t-p}), and g(ξ_{t-1}, ..., ξ_{t-p}) is the best prediction for ξ_t in the mean square sense.\n\nIf we are interested in the long term properties of the series, we may ask whether certain features such as mean or variance change over time or remain constant. The time series is called weakly stationary if Eξ_t = μ and cov(ξ_t, ξ_{t+h}) = γ_h for all t, i.e., mean and covariances do not depend on the time t.
A stronger criterion is that the whole distribution (and not only mean and covariance) of the process does not depend on time; in this case the series is called strictly stationary. Strict stationarity implies weak stationarity if the second moments of the series exist. For details see standard time series textbooks such as Brockwell & Davis (1987).\n\nIf ξ_t is strictly stationary, then P(ξ_t ∈ A) = π(A) for all t, and π(·) is called the stationary distribution of the series. Obviously the series can only be stationary from the beginning if it is started with the stationary distribution, i.e., ξ_0 ~ π. If it is not started with π, e.g., because ξ_0 is a constant, then we call the series asymptotically stationary if it converges to its stationary distribution:\n\nlim_{t→∞} P(ξ_t ∈ A) = π(A)\n\n2.2 Time Series as Markov Chains\n\nUsing the notation\n\nx_{t-1} = (ξ_{t-1}, ..., ξ_{t-p})'    (3)\nG(x_{t-1}) = (g(x_{t-1}), ξ_{t-1}, ..., ξ_{t-p+1})'    (4)\ne_t = (ε_t, 0, ..., 0)'    (5)\n\nwe can write scalar autoregressive models of order p such as (1) or (2) as the first order vector model\n\nx_t = G(x_{t-1}) + e_t    (6)\n\nwith x_t, e_t ∈ R^p (e.g., Chan & Tong, 1985). If we write\n\nP^n(x, A) = P(x_{t+n} ∈ A | x_t = x),  P(x, A) = P^1(x, A)\n\nfor the probability of going from point x to set A ∈ B in n steps, then {x_t} with P(x, A) forms a Markov chain with state space (R^p, B, λ), where B are the Borel sets on R^p and λ is the usual Lebesgue measure.\n\nThe Markov chain {x_t} is called φ-irreducible if, for some σ-finite measure φ on (R^p, B, λ),\n\nΣ_{n=1}^{∞} P^n(x, A) > 0\n\nwhenever φ(A) > 0. This means essentially that all parts of the state space can be reached by the Markov chain irrespective of the starting point. Another important property of Markov chains is aperiodicity, which loosely speaking means that there are no (infinitely often repeated) cycles. See, e.g., Tong (1990) for details.
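As a concrete illustration of the embedding (3)-(6), the sketch below iterates a linear AR(2) in its first-order vector form; the coefficient values are illustrative and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative AR(2): xi_t = 0.5 xi_{t-1} - 0.3 xi_{t-2} + eps_t,
# rewritten as the first-order vector model x_t = G(x_{t-1}) + e_t of (6).
phi = np.array([0.5, -0.3])
p = len(phi)

def G(x):
    # Eq. (4): stack the new value g(x) = phi'x on top of the shifted old state.
    return np.concatenate(([phi @ x], x[:-1]))

def e(eps):
    # Eq. (5): the noise enters only in the first coordinate.
    return np.concatenate(([eps], np.zeros(p - 1)))

x = np.zeros(p)            # x_0
xi = []
for _ in range(1000):
    x = G(x) + e(rng.normal())
    xi.append(x[0])        # the scalar series xi_t is the first coordinate
```

The scalar recursion and the vector recursion generate exactly the same series; the vector form is what makes the Markov chain machinery of this section applicable.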
The Markov chain {x_t} is called geometrically ergodic if there exists a probability measure π(A) on (R^p, B, λ) and a ρ > 1 such that\n\nfor all x ∈ R^p:  lim_{n→∞} ρ^n ||P^n(x, ·) − π(·)|| = 0\n\nwhere ||·|| denotes the total variation norm. Then π satisfies the invariance equation\n\nπ(A) = ∫ P(x, A) π(dx),  for all A ∈ B\n\nThere is a close relationship between a time series and its associated Markov chain. If the Markov chain is geometrically ergodic, then its distribution converges to π and the time series is asymptotically stationary. If the time series is started with distribution π, i.e., x_0 ~ π, then the series {ξ_t} is strictly stationary.\n\n3 Stationarity of AR-NN Models\n\nWe now apply the concepts defined in Section 2 to the case where g is defined by a neural network. Let x denote a p-dimensional input vector; then we consider the following standard network architectures:\n\nSingle hidden layer perceptrons:\n\ng(x) = γ_0 + Σ_i β_i σ(a_i + α_i'x)    (7)\n\nwhere a_i, β_i and γ_0 are scalar weights, α_i are p-dimensional weight vectors, and σ(·) is a bounded sigmoid function such as tanh(·).\n\nSingle hidden layer perceptrons with shortcut connections:\n\ng(x) = γ_0 + c'x + Σ_i β_i σ(a_i + α_i'x)    (8)\n\nwhere c is an additional weight vector for shortcut connections between inputs and output. In this case we define the characteristic polynomial c(z) associated with the linear shortcuts as\n\nc(z) = 1 − c_1 z − c_2 z^2 − ... − c_p z^p,  z ∈ C.\n\nRadial basis function networks:\n\ng(x) = γ_0 + Σ_j β_j φ(||x − m_j||)    (9)\n\nwhere the m_j are center vectors and φ(·) is one of the usual bounded radial basis functions such as φ(x) = exp(−x^2).\n\nLemma 1  Let {x_t} be defined by (6), let E|ε_t| < ∞, and let the PDF of ε_t be positive everywhere in R. Then, if g is defined by any of (7), (8) or (9), the Markov chain {x_t} is φ-irreducible and aperiodic.
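To make architecture (7) concrete, here is a minimal AR-NN(1) simulation; all weight values are made up for illustration. The bounded hidden layer pulls even an extreme starting value back toward a stable regime, while the unbounded noise keeps the chain irreducible:

```python
import numpy as np

rng = np.random.default_rng(1)

# Single hidden layer perceptron as in (7), p = 1, two tanh units.
# gamma0, beta, a, alpha are illustrative weights, not from the paper.
gamma0 = 0.1
beta = np.array([1.0, -0.5])
a = np.array([0.0, 1.0])
alpha = np.array([2.0, -1.0])

def g(x):
    return gamma0 + beta @ np.tanh(a + alpha * x)

xi = 100.0                      # start far away from equilibrium
path = []
for _ in range(2000):
    xi = g(xi) + rng.normal()   # xi_t = g(xi_{t-1}) + eps_t, Eq. (1)
    path.append(xi)
```

Because |g(x)| ≤ |γ_0| + Σ_i |β_i| for every input, the deterministic part cannot make the chain explode; after a short burn-in the trajectory forgets the starting value 100 and fluctuates in a fixed band, which is the behavior the ergodicity results of this section formalize.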
\n\nLemma 1 basically says that the state space of the Markov chain, i.e., the points that \ncan be reached, cannot be reduced depending on the starting point. An example \n\nfor a reducible Markov chain would be a series that is always positive if only Xo > \u00b0 \n\n(and negative otherwise). This cannot happen in the AR-NN(p) case due to the \nunbounded additive noise term. \n\nTheorem 1 Let {~tl be defined by (1), {xtl by (6), further let IEktl < 00 and the \nPDF of f:t be positive everywhere in JR. Then \n\n1. If 9 is a network without linear shortcuts as defined in (7) and (9), then \n\n{ x tl is geometrically ergodic and {~tl is asymptotically stationary. \n\n2. If 9 is a network with linear shortcuts as defined in (8) and additionally \nc(z) f 0, Vz E C : Izl ~ 1, then {xtl is geometrically ergodic and {~tl is \nasymptotically stationary. \n\nThe time series {~t} remains stationary if we allow for more than one hidden layer \n(-+ multi layer perceptron, MLP) or non-linear output units, as long as the overall \nmapping has bounded range. An MLP with shortcut connections combines a (pos(cid:173)\nsibly non-stationary) linear AR(p) process with a non-linear stationary NN part. \nThus, the NN part can be used to model non-linear fluctuations around a linear \nprocess like a random walk. \nThe only part of the network that controls whether the overall process is stationary \nare the linear shortcut connections (if present). If there are no shortcuts, then the \nprocess is always stationary. With shortcuts, the usual test for stability of a linear \nsystem applies. \n\n4 \n\nIntegrated Models \n\nAn important method in classic time series analysis is to. first transform a non(cid:173)\nstationary series into a stationary one and then model the remainder by a stationary \nprocess. 
Probably the most popular models of this kind are autoregressive integrated moving average (ARIMA) models, which can be transformed into stationary ARMA processes by simple differencing.\n\nLet Δ^k denote the k-th order difference operator:\n\nΔξ_t = ξ_t − ξ_{t-1}    (10)\nΔ^2 ξ_t = Δ(ξ_t − ξ_{t-1}) = ξ_t − 2ξ_{t-1} + ξ_{t-2}    (11)\nΔ^k ξ_t = Σ_{n=0}^{k} (−1)^n (k choose n) ξ_{t-n}    (12)\n\nwith Δ^1 = Δ. E.g., a standard random walk ξ_t = ξ_{t-1} + ε_t is non-stationary because of its growing variance, but can be transformed into the iid (and hence stationary) noise process ε_t by taking first differences.\n\nIf a time series is non-stationary, but can be transformed into a stationary series by taking k-th differences, we call the series integrated of order k. Standard MLPs or RBFs without shortcuts are asymptotically stationary. It is therefore important to take care that these networks are only used to model stationary processes. Of course the network can be trained to mimic a non-stationary process on a finite time interval, but the out-of-sample or prediction performance will be poor, because the network inherently cannot capture some important features of the process. One way to overcome this problem is to first transform the process into a stationary series (e.g., by differencing an integrated series) and train the network on the transformed series (Chng et al., 1996).\n\nAs differencing is a linear operation, this transformation can also be easily incorporated into the network by choosing the shortcut connections and the weights from input to hidden units accordingly. Assume we want to model an integrated series of integration order k, such that\n\nΔ^k ξ_t = g(Δ^k ξ_{t-1}, ..., Δ^k ξ_{t-p}) + ε_t\n\nwhere Δ^k ξ_t is stationary. By (12) this is equivalent to\n\nξ_t = Σ_{n=1}^{k} (−1)^{n-1} (k choose n) ξ_{t-n} + g(Δ^k ξ_{t-1}, ..., Δ^k ξ_{t-p}) + ε_t\n    = Σ_{n=1}^{k} (−1)^{n-1} (k choose n) ξ_{t-n} + g̃(ξ_{t-1}, ..., ξ_{t-p-k}) + ε_t\n\nwhich (for p > k) can be modeled by an MLP with shortcut connections as defined by (8), where the shortcut weight vector c is fixed to\n\nc_n = (−1)^{n-1} (k choose n),  with (k choose n) := 0 for n > k,\n\nand g̃ is such that g̃(ξ_{t-1}, ..., ξ_{t-p-k}) = g(Δ^k x_{t-1}). This is always possible and can basically be obtained by adding c to all weights between the input and the first hidden layer of g.\n\nAn AR-NN(p) can model integrated series up to integration order p. If the order of integration is known, the shortcut weights can either be fixed, or the differenced series can be used as input. If the order is unknown, we can also train the complete network including the shortcut connections and implicitly estimate the order of integration. After training, the final model can be checked for stationarity by looking at the characteristic roots of the polynomial defined by the shortcut connections.\n\n4.1 Fractional Integration\n\nUp to now we have only considered integrated series with a positive integer order of integration, i.e., k ∈ N. In recent years, models with fractional integration order have become very popular (again). Series with integration order 0.5 < k < 1 can be shown to exhibit self-similar or fractal behavior, and have long memory. These types of processes were introduced by Mandelbrot in a series of papers modeling river flows; see, e.g., Mandelbrot & Ness (1968). More recently, self-similar processes were used to model Ethernet traffic by Leland et al. (1994). Some financial time series, such as foreign exchange data, also exhibit long memory and self-similarity.\n\nThe fractional differencing operator Δ^k, k ∈ [−1, 1], is defined by the series expansion\n\nΔ^k ξ_t = Σ_{n=0}^{∞} [Γ(n − k) / (Γ(−k) Γ(n + 1))] ξ_{t-n}    (13)\n\nwhich is obtained from the Taylor series of (1 − z)^k. For k > 1 we first use Equation (12) and then the above series for the fractional remainder.
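The coefficients in (13) satisfy the recursion w_0 = 1, w_n = w_{n-1} (n − 1 − k)/n, which follows from Γ(n − k) = (n − 1 − k) Γ(n − 1 − k) and avoids evaluating Γ(−k) directly. A minimal sketch (function names are ours):

```python
def frac_diff_weights(k, N):
    # Coefficients of (1 - z)^k up to order N, via the stable recursion
    # w_0 = 1, w_n = w_{n-1} * (n - 1 - k) / n  (no Gamma evaluations needed).
    w = [1.0]
    for n in range(1, N + 1):
        w.append(w[-1] * (n - 1 - k) / n)
    return w

def frac_diff(series, k, N):
    # Truncated fractional difference of (13): sum_{n=0}^{N} w_n * xi_{t-n}.
    w = frac_diff_weights(k, N)
    return [sum(w[n] * series[t - n] for n in range(N + 1))
            for t in range(N, len(series))]

# Sanity check: k = 1 recovers ordinary first differences
# (weights 1, -1 and then zeros).
print(frac_diff([1, 2, 4, 7], 1.0, 1))
```

For fractional k the weights decay slowly (hyperbolically rather than being eventually zero), which is why these series have long memory and why the truncation point N matters in practice.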
For practical computation, the series (13) is of course truncated at some term n = N. An AR-NN(p) model with shortcut connections can approximate the series up to the first p terms.\n\n5 Summary\n\nWe have shown that AR-NN models using standard NN architectures without shortcuts are asymptotically stationary. If linear shortcuts between inputs and outputs are included (which many popular software packages have already implemented), then only the weights of the shortcut connections determine whether the overall system is stationary. It is also possible to model many integrated time series with this kind of network. The asymptotic behavior of AR-NNs is especially important for parameter estimation, for predictions over larger intervals of time, and when using the network to generate artificial time series. Limiting (normal) distributions of parameter estimates are only guaranteed for stationary series. We therefore recommend always transforming a non-stationary series into a stationary one if possible (e.g., by differencing) before training a network on it.\n\nAnother important aspect of stationarity is that a single trajectory displays the complete probability law of the process. If we have observed one sufficiently long trajectory of the process, we can (in theory) estimate all interesting quantities of the process by averaging over time. This need not be true for non-stationary processes in general, where some quantities may only be estimated by averaging over several independent trajectories. E.g., one might train the network on an available sample and then use the trained network afterwards, driven by artificial noise from a random number generator, to generate new data with properties similar to the training sample. Asymptotic stationarity guarantees that the AR-NN model cannot show \"explosive\" behavior or growing variance over time.
\n\nWe currently are working on extensions of this paper in several directions. AR-NN \nprocesses can be shown to be strong mixing (the memory of the process vanishes \nexponentially fast) and have autocorrelations going to zero at an exponential rate. \nAnother question is a thorough analysis of the properties of parameter estimates \n(weights) and tests for the order of integration. Finally we want to extend the uni(cid:173)\nvariate results to the multivariate case with a special interest towards cointegrated \nprocesses. \n\nAcknowledgement \n\nThis piece of research was supported by the Austrian Science Foundation (FWF) under \ngrant SFB#OlO ('Adaptive Information Systems and Modeling in Economics and Man(cid:173)\nagement Science'). \n\n\fStationarity and Stability of Autoregressive Neural Network Processes \n\n273 \n\nAppendix: Mathematical Proofs \n\nProof of Lemma 1 \n\nIt can easily be shown that {xe} is 0 and pn+l (x, A) > 0 for all x E A (Tong, \n1990, p. 455). In our case this is true for all n due to the unbounded additive noise. \n\nProof of Theorem 1 \n\nWe use the following result from nonlinear time series theory: \n\nTheorem 2 (Chan & Tong 1985) Let {Xt} be defined by (1), (6) and let G be compact, \ni.e. preserve compact sets. IfG can be decomposedasG = Gh+Gd andGd(-) is of bounded \nrange, G h(-) is continuous and homogeneous, i.e., Gh(ax) = aGh(x), the origin is a fixed \npoint of G h and Gh is uniform asymptotically stable, IEI\u20actl < 00 and the PDF of \u20act \nis \npositive everywhere in IR, then {Xt} is geometrically ergodic. \n\nfulfills the conditions by assumption. Clearly all networks are con(cid:173)\n\nThe noise process \u20act \ntinuous compact functions. Standard MLPs without shortcut connections and RBFs have \na bounded range, hence G h == 0 and G == Gd , and the series {ee} is asymptotically sta(cid:173)\ntionary. 
If we allow for linear shortcut connections between the input and the outputs, \nwe get G h = c'x and G d = 70 + l.:i (3i(T(ai + aix) i.e., G h is the linear shortcut part \nof the network, and Gd is a standard MLP without shortcut connections. Clearly, G h is \ncontinuous, homogeneous and has the origin as a fixed point. Hence, the series {eel is \nasymptotically stationary if G h is asymptotically stable, i.e., when all characteristic roots \nof Gh have a magnitude less than unity. Obviously the same is true for RBFs with shortcut \nconnections. Note that the model reduces to a standard linear AR(p) model if Gd == O. \n\nReferences \nBrockwell, P. J. & Davis, R. A. (1987). Time Series: Theory and Methods. Springer Series \n\nin Statistics. New York, USA: Springer Verlag. \n\nChan, K. S. & Tong, H. (1985). On the use of the deterministic Lyapunov function for \nthe ergodicity of stochastic difference equations. Advances in Applied Probability, 17, \n666-678. \n\nChng, E . S., Chen, S., & Mulgrew, B. (1996). Gradient radial basis function networks \nfor nonlinear and nonstationary time series prediction. IEEE Transactions on Neural \nNetworks, 7(1), 190- 194. \n\nHusmeier, D. & Taylor, J. G. (1997). Predicting conditional probability densities of sta(cid:173)\n\ntionary stochastic time series. Neural Networks, 10(3),479-497. \n\nJones, D. A. (1978). Nonlinear autoregressive processes. Proceedings of the Royal Society \n\nLondon A, 360, 71- 95. \n\nLeland, W. E., Taqqu, M. S., Willinger, W., & Wilson, D. V. (1994) . On the self-similar \nnature of ethernet traffic (extended version). IEEE/ACM Transactions on Networking, \n2(1), 1- 15. \n\nMandelbrot, B. B. & Ness, J . W. V. (1968). Fractional brownian motions, fractional noises \n\nand applications. SIAM Review, 10(4), 422-437. \n\nTong, H. (1990). Non-linear time series: A dynamical system approach. New York, USA: \n\nOxford University Press. \n\nWang, T. & Sheng, Z. (1996). 
Asymptotic stationarity of discrete-time stochastic neural networks. Neural Networks, 9(6), 957-963.\n", "award": [], "sourceid": 1529, "authors": [{"given_name": "Friedrich", "family_name": "Leisch", "institution": null}, {"given_name": "Adrian", "family_name": "Trapletti", "institution": null}, {"given_name": "Kurt", "family_name": "Hornik", "institution": null}]}