
Data-driven modelling

In hydrology different types of models are used, such as physical or scale models, mathematical models, lumped conceptual models, distributed physically based models and empirical models; see the section on catchment modelling for a classification of hydrological models. During the last decade the area of empirical modelling has received an important boost due to developments in machine learning. It can be said that it has entered a new phase and deserves a special name: data-driven modelling.

Data-driven modelling (DDM) is based on the analysis of all the data characterising the system under study. A model can then be defined on the basis of connections between the system state variables (input, internal and output variables), with only a limited number of assumptions about the "physical" behaviour of the system. The methods used nowadays go much further than those of conventional empirical modelling: they allow for solving numerical prediction problems, reconstructing highly non-linear functions, performing classification, grouping data and building rule-based systems.

Data-driven modelling has developed with contributions from artificial intelligence, data mining, knowledge discovery in databases, computational intelligence, machine learning, intelligent data analysis, soft computing, pattern recognition, etc. There is a large overlap between these disciplines. We see data-driven modelling as a modelling approach that focuses on using machine learning methods to build models of physical processes. These models can complement or replace knowledge-driven models describing the behaviour of physical systems. Examples of the most popular methods used in data-driven modelling of hydrological systems are statistical methods, artificial neural networks and fuzzy rule-based systems.

Observations of a physical process are used in data-driven modelling to learn a mapping y = f(x). During the learning phase the prediction error (y − ŷ) is used to modify the model parameters.
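The learning loop just described can be sketched in a few lines of code. This is a minimal, assumed example (the linear model, learning rate and synthetic data are not from the text): a model ŷ = wx + b is fitted to observed (x, y) pairs by using the prediction error (y − ŷ) to adjust the parameters, i.e. stochastic gradient descent.

```python
# Sketch of error-driven learning of a mapping y = f(x).
# The linear model and learning rate are illustrative assumptions.

def train_linear(samples, lr=0.01, epochs=1000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            y_hat = w * x + b
            error = y - y_hat          # prediction error drives the update
            w += lr * error * x
            b += lr * error
    return w, b

# Synthetic observations of the "process" y = 2x + 1
data = [(x, 2 * x + 1) for x in [0, 1, 2, 3, 4]]
w, b = train_linear(data)
print(round(w, 2), round(b, 2))
```

With noiseless data the parameters converge to the generating values; real hydrological measurements are noisy, which is what makes the overfitting issue discussed below relevant.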

Learning

In machine learning an unknown mapping (or dependency) between a system's inputs and its outputs is determined from the available data (Mitchell, 1997) (see the following figure). By data we understand known samples: combinations of inputs and corresponding outputs. Once such a dependency (viz. a mapping, or model) is discovered, it can be used to predict the future system's outputs from known input values.

Learning tasks: The learning tasks in data-driven modelling can be of the following four types:

Classification: assigning a class to an input data point.
Association: identifying associations between the variables characterising the system, which are used in subsequent prediction.
Regression: predicting a real value associated with an input data point.
Clustering: determining groups of data points with within-group similarity.

The task of learning is often also characterised as supervised or unsupervised. Algorithms that require a set of data points with known outputs are referred to as supervised learning algorithms; examples are regression and classification. These contrast with unsupervised algorithms, where the target outputs are not known; an example is clustering.
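To make the unsupervised case concrete, here is a minimal sketch of clustering: a tiny one-dimensional k-means with k = 2. No target outputs are given; the algorithm groups the points by within-group similarity alone. The sample values and the naive initialisation are assumptions made for the example.

```python
# Illustrative unsupervised learning: 1-D k-means clustering.

def kmeans_1d(points, k=2, iterations=20):
    centroids = points[:k]                      # naive initialisation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups: low values around 1 and high values around 10
centroids, clusters = kmeans_1d([0.9, 1.1, 1.0, 9.8, 10.2, 10.0])
print(sorted(round(c, 1) for c in centroids))   # the two group centres
```

The same alternation of "assign points to the nearest centre, then recompute the centres" underlies the clustering methods used, for example, to delineate hydrologically homogeneous regions.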

Data

Data is usually split into three datasets. The first is the training dataset, which is used to train the model (to determine the optimal parameter set). The testing dataset must not contain patterns from the training dataset. The predictive capability of the model is tested with the testing dataset during a procedure called validation (or testing). If the model's performance on the testing dataset is satisfactory, it can be put into operation.

As measurements may often be noisy, an attempt to maximise the fit to the training data may lead to the model capturing not only the process but also the noise, a phenomenon known as overfitting. An overfitted model may not perform well on a new dataset, and such a model is said to have poor generalisation capacity. To prevent this, a third dataset is used, called the cross-validation dataset. The training (optimisation of model parameters) is stopped when the error on the cross-validation dataset starts to increase. The cross-validation dataset must also not contain data points from the training and testing datasets.

A model is thus built using the training data and tested with the testing data, while the cross-validation data is used to determine the extent of the training. It is therefore imperative that these three datasets have identical statistical distributions, ensuring that they come from the same population. To highlight the importance of this, consider an example: if in a hydrological problem our training data come from a wet period and we test the trained model with data from a dry period, we are unable to reach a definite conclusion about the applicability of the model. Strictly speaking, even if we obtain excellent results, the model cannot be seen as validated.

The process of data-driven modelling

The following steps in model building are often distinguished (e.g. Pyle, 1999):

1. Select a clearly defined problem that the model will help to resolve.
2. Specify the type of solution to the problem.
3. Define how the solution delivered is going to be used in practice.
4. Learn the problem, collect the domain knowledge and understand it. Clearly define assumptions and discuss them with the domain knowledge experts.
5. Let the problem drive the selection of modelling techniques.
6. Make the model as simple as possible, but no simpler. This rule is sometimes formulated in different ways, for example as KISS (Keep It Sufficiently Simple, or Keep It Simple, Stupid). More generally, this idea is widely known as the Occam's Razor principle, formulated by William of Occam in 1320 in the following form: shave all unneeded philosophy off the explanation.
7. Build (train) the model.
8. Refine the model iteratively (try different options until the model seems as good as it is going to get).
9. Test the model and evaluate the results.
10. Explore instabilities in the model (critical areas where small changes in inputs lead to large changes in output).
11. Define uncertainties in the model (critical areas and ranges in the dataset where the model produces low-confidence predictions).
12. Put the model into operation. Review it, if necessary, when experience is obtained.
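The three-dataset discipline described above can be sketched as follows. This is an illustrative assumption, not a method from the text: the noisy synthetic data, the linear model and the thresholds are invented for the example. The training set fits the parameters, the cross-validation set decides when to stop, and the testing set is held back for the final check.

```python
# Sketch of training with a train / cross-validation / test split.
import random

random.seed(1)

def make_data(n):
    # noisy observations of y = 2x + 1, mimicking measurement noise
    xs = [random.uniform(0, 4) for _ in range(n)]
    return [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in xs]

train, cval, test = make_data(30), make_data(10), make_data(10)

def mse(w, b, data):
    return sum((y - (w * x + b)) ** 2 for x, y in data) / len(data)

w, b, lr = 0.0, 0.0, 0.01
best = (float("inf"), w, b)
for epoch in range(500):
    for x, y in train:
        err = y - (w * x + b)
        w += lr * err * x
        b += lr * err
    cv_error = mse(w, b, cval)
    if cv_error < best[0]:
        best = (cv_error, w, b)      # remember the best model so far
    elif cv_error > 1.5 * best[0]:
        break                        # cross-validation error rising: stop
_, w, b = best
print(round(mse(w, b, test), 2))     # final check on the unseen testing set
```

Note that all three sets are drawn from the same population here; as the wet/dry-period example shows, the test error would be uninformative otherwise.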
Applications of data-driven modelling

Data-driven modelling has been successfully used in the following areas:

- Rainfall-runoff modelling (Minns and Hall, 1996; Dawson and Wilby, 1998; Abrahart and See, 2000)
- Estimating missing precipitation data (Abebe et al., 2000)
- Controlling polder water levels (Lobbrecht and Solomatine, 1999; Bhattacharya et al., 2003)
- Reconstructing the stage-discharge relationship (Bhattacharya and Solomatine, 2005)
- Classification of hydrologically homogeneous regions (Hall and Minns, 1999)
- Classifying surge water levels in the coastal zone depending on hydrometeorological data (Solomatine et al., 2000)
- Replicating the behaviour of a hydrodynamic/hydrological river model, with the objective of using the ANN in model-based optimal control of a reservoir (Solomatine and Torres, 1996)

References

Abebe, A.J. and Price, R.K. (2003). Managing uncertainty in hydrological models using complementary models. Hydrological Sciences Journal, 48(5), 679-692.
Abebe, A.J., Solomatine, D.P. and Venneker, R.G.W. (2000). Application of adaptive fuzzy rule-based models for reconstruction of missing precipitation events. Hydrological Sciences Journal, 45(3), 425-436.
Abrahart, R.J. and See, L. (2000). Comparing neural network and autoregressive moving average techniques for the provision of continuous river flow forecasts in two contrasting catchments. Hydrological Processes, 14, 2157-2172.
Bhattacharya, B. and Solomatine, D.P. (2005). Neural networks and M5 model trees in modelling water level-discharge relationship. Neurocomputing, 63, 381-396.
Bhattacharya, B., Lobbrecht, A.H. and Solomatine, D.P. (2003). Neural networks and reinforcement learning in control of water systems. Journal of Water Resources Planning and Management, ASCE, 129(6), 458-465.
Dawson, C.W. and Wilby, R. (1998). An artificial neural network approach to rainfall-runoff modelling. Hydrological Sciences Journal, 43(1), 47-66.
Govindaraju, R.S. and Ramachandra Rao, A. (eds.) (2001). Artificial Neural Networks in Hydrology. Kluwer: Dordrecht.
Hall, M.J. and Minns, A.W. (1999). The classification of hydrologically homogeneous regions. Hydrological Sciences Journal, 44, 693-704.
Lobbrecht, A.H. and Solomatine, D.P. (1999). Control of water levels in polder areas using neural networks and fuzzy adaptive systems. In Water Industry Systems: Modelling and Optimization Applications. Research Studies Press Ltd, Baldock, England.
Minns, A.W. and Hall, M.J. (1996). Artificial neural network as rainfall-runoff model. Hydrological Sciences Journal, 41(3), 399-417.
Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann: San Francisco.
Solomatine, D.P. and Torres, L.A. (1996). Neural network approximation of a hydrodynamic model in optimizing reservoir operation. In A. Muller (ed.), Proc. Hydroinformatics conference, pp. 201-206.
Solomatine, D.P., Rojas, C., Velickov, S. and Wust, H. (2000). Chaos theory in predicting surge water levels in the North Sea. Proc. 4th Int. Conf. on Hydroinformatics, Iowa, USA.
