A general methodology for data-based rule building and its application to
natural disaster management
J. Tinguaro Rodríguez, Begoña Vitoriano, Javier Montero
Department of Statistics and Operational Research, Faculty of Mathematics, Complutense University of Madrid, Plaza de Ciencias 3, 28040 Madrid, Spain

Keywords: Decision support systems; Data-based inductive reasoning; Humanitarian logistics; Natural disaster risk management; Emergency management

Abstract: Risks derived from natural disasters have a deeper impact than the sole damage suffered by the affected zone and its population. Because disasters can affect geostrategic stability and international safety, developed countries invest a huge amount of funds to manage these risks. A large portion of these funds is channeled through United Nations agencies and international non-governmental organizations (NGOs), which at the same time are carrying out more and more complex operations. For these reasons, technological support for these actors is required, all the more so because the global economic crisis is placing emphasis on the need for efficiency and transparency in the management of (relatively limited) funds. Nevertheless, currently available sophisticated tools for disaster management do not fit well into these contexts because their infrastructure requirements usually exceed the capabilities of such organizations. In this paper, a general methodology for inductive rule building is described and applied to natural-disaster management. The application is a data-based, two-level knowledge decision support system (DSS) prototype which provides damage assessment for multiple disaster scenarios to support humanitarian NGOs involved in response to natural disasters. A validation process is carried out to measure the accuracy of both the methodology and the DSS. © 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Factors such as climate change and the growing population density of many cities and countries are putting more and more people throughout the world at risk of suffering due to natural disasters [17]. Recent tragic events such as the April 2009 L'Aquila earthquake in Italy or Hurricane Katrina in 2005 remind us that, even inside developed countries, there exist many population groups which are vulnerable to the impact of adverse natural phenomena [19]. Natural disasters have consequences not only for the population which is directly affected. They can also have profound implications for large sectors of the economy and the political system of the affected region, especially in developing countries. As shown in [20], the impact of a disaster in a region, if not managed properly, can produce political and social instability and affect international security and relations. Therefore, the development of models and tools to mitigate the consequences and risks of natural disasters is a key issue in today's global world.

This paper continues the basic discussion initiated in [22], presenting a new version of SEDD (a Spanish acronym for "expert system for disaster diagnosis") which uses fuzzy logic. SEDD is a decision support system (DSS) prototype designed to provide support to non-governmental organizations (NGOs) involved in response to natural disasters. As shall be seen in the next section, NGOs currently play a crucial role in mitigating the consequences of disasters, especially in developing countries. Moreover, they are clamoring for better technological support for their decision processes, a field in which, quite surprisingly, almost nothing has been done.
The SEDD methodology is based on an inductive data-based approach (in the sense described in [29]) in which a large database of historical disaster instances is scanned and analyzed to create rules which can be used to assess the consequences of almost every possible disaster scenario.

The remainder of this paper is organized as follows: the problem addressed and its importance in natural-disaster risk management are discussed in Section 2, where a review of state-of-the-art DSS addressing similar problems is also carried out. The knowledge representation model and the algorithms used by SEDD are described in Section 3. In particular, this section presents a general inductive methodology for building fuzzy rules from data. Rule aggregation and inference are then performed by means of a weighted averaging operator approach. Section 4 describes and discusses the computational experiments carried out to assess the accuracy of the model. Finally, conclusions are presented in Section 5.
2. NGOs and natural-disaster risk management

As stated above, mitigating the effects of disasters is an issue of growing importance in the current globalized world, and not only for humanitarian reasons. It is stated in [20] that developed countries invest a huge amount of funds in development assistance and other disaster-mitigation policies to avoid the risks of geostrategic destabilization produced by natural and anthropogenic disasters. Most of these aid funds are channeled to developing countries through United Nations (UN) agencies and international NGOs. These actors are involved because of their supposed political neutrality and their links with target populations, which enable them to access politically unstable countries which otherwise would not authorize any external interference. In fact, NGOs and UN international relief agencies channel more than 60% of the total funding devoted to humanitarian aid throughout the world, as shown for instance in [26].

In recent years, the budgets and organizations devoted to emergency and humanitarian aid have experienced substantial growth, i.e., this sector of strategic activity is getting bigger over time [26]. As often occurs, this continuous growth has entailed the emergence of management difficulties and efficiency problems as the complexity of the actions and operations developed by the actors has increased. On the other hand, the current global crisis is placing emphasis on the need for management transparency and efficiency in practically all fields of human activity. Lavish, uncontrolled spending is no longer allowed. Consequently, efficiency in disaster management becomes crucial. For these reasons, humanitarian logistics is an emerging field which is becoming more and more relevant [30] and which is also drawing attention to some interesting problems in operations research.

Disaster management and mitigation [10,19,32] is not an easy task and usually requires much analysis and many resources to design the right policy (to reduce vulnerability) at the right place (the vulnerable groups). This complexity suggests the use of the decision support system (DSS) methodology, for example, to assess vulnerability [15] or to develop emergency plans [31]. In fact, major NGOs are calling for specific technology to support such complex decision making. There is a crying need to develop models and tools to address the specific problems of these organizations. In particular, this paper will focus on decision and assessment problems arising in the context of NGOs involved in response to natural disasters.

2.1. Problem formulation

The main objective of SEDD is to support NGO decision-makers involved in international response and relief to people affected by a natural disaster. As soon as the NGO receives the first notice of the occurrence of a potential disaster, it starts a decision process which is intended to reach a conclusion about whether or not suitable conditions exist to initiate a relief operation. It could be that a particular disaster scenario does not fit the NGO's requirements or constraints regarding the nature of an intervention, size of disaster, or logistical capabilities.
This decision process, usually under time pressure, must be based on the NGO's own internal information (funding, stockpiled resources, available personnel, etc.) as well as on knowledge and information available about the disaster case under study. Because information in the first moments just after a disaster occurs tends to be confused, imprecise, or incomplete, uncertainty about what is really happening is a component which must be modeled. Despite all this uncertainty, a quick decision is needed.

It should be recalled at this point that NGOs are normally specialized in one or more basic components of the relief task (such as health care, water and sanitation, or shelter and site management), but they are not specialized in responding to earthquakes, floods, or any other specific disaster scenario. Rather, NGOs are concerned with helping people, regardless of whether these people are suffering because of a drought or a hurricane. Moreover, international relief operations carried out by these NGOs are assumed to follow the quality standards and guidelines established by the Humanitarian Charter of the Sphere Project. These standards are set up by the United Nations (UN) and relevant international NGOs and contain a set of minimum standards for disaster response that sets out, for the first time ever, what people affected by a disaster have the right to expect from humanitarian assistance in the basic components of aid: water and sanitation, nutrition, food security, shelter, and health care [14]. These standards also emphasize a correct and precise assessment of needs as a critical step for success in relief operations.

Therefore, in order to decide on the appropriateness of an intervention which meets these quality standards, NGOs need an initial assessment of disaster consequences which is as precise as possible. It should be stressed that it is in view of this assessment that the decision-maker can determine what kind of intervention is going to be necessary and whether or not the NGO can meet its requirements. Moreover, this decision must be made for every combination of disaster type and place. Furthermore, this initial assessment has to be done in a context of uncertainty and time pressure. Because all these factors normally appear together, expertise is needed to infer what is really going on and thereby support this urgent decision process.

The main objective of the SEDD project is to assist international relief operations, supporting decision-makers by means of an inference tool capable of assessing the consequences of almost every combination of adverse phenomena and location, based on available information and taking into account the reliability of that information. In addition, SEDD should provide an estimate of the percentage of assistance effort which, as a consequence of the previous assessment, needs to be directed towards each of the basic relief components referred to in the Humanitarian Charter of the Sphere Project. It should be pointed out that this problem formulation comes from the authors' encounters with the staff of the International Federation of Red Cross and Red Crescent Societies (IFRC) of Spain. On the other hand, it is obvious that constraints must be imposed on the infrastructure requirements of a decision support tool operating in this context.
As will become apparent in the next section, it is necessary to make realistic assumptions in this area if the DSS is to be used in organizations or countries where the operational infrastructure cannot support a highly sophisticated and precise methodology. In this sense, SEDD is intended to be a web-available (see for instance [22]), low-cost, tailor-made solution that should fit specific NGO constraints such as ease of use, low computational and personnel requirements, and not relying on highly sophisticated and precise data (which in addition might not be available on time).

2.2. DSS for disaster management

This section is devoted to a review of the state of the art in disaster management DSS (DSS-DM). In particular, the focus will be on DSS which address the problem formulated above. In this sense, the main point of discussion is that current DSS-DM do not meet the needs of NGOs for two main reasons: first, they are not designed to address the specific problem of response to any possible natural disaster in any place; second, their sophistication and infrastructure requirements usually exceed those available in NGOs and in countries requiring humanitarian aid.

As stated above, the difficulties of disaster-mitigation tasks led to the introduction of decision support tools in the public sphere of emergency management [31,12,13]. In fact, such tools have become widely used by decision-makers in the public administrations of developed countries because of their ability to integrate multiple sources of information and to provide a representation of disaster scenarios to assess vulnerability and emergency response policies and/or to support decisions in the preparation and relief phases. Because of this dependence on public administrators, in a first stage, the scope of research was restricted to modeling the action of one phenomenon in a specific place (earthquakes in California [11], hurricanes in Florida [28], floods in Italy [27], etc.). However, it has gradually become clear [6,9] that integration of models and methodologies is necessary to develop more useful and flexible systems. Nevertheless, few practical systems based on this perspective are already in operation. One of them [24] is HAZUS, the FEMA solution for hurricanes, floods, and earthquakes. Another proposal is that described in [4], where a modular approach is used to devise a dynamic integrated model for a DSS-DM which also takes environmental variables into account. In this paper, an alternative data-based approach is proposed to assess potential damage arising from various combinations of phenomena and locations.

Together with other complex data structures and models, a distinctive feature of current DSS-DM is the use of geographical information systems (GIS). Because GIS are designed to support spatial decision making, they are very useful tools in the emergency management field, where there is a strong need to address a number of spatial decisions [24]. In this way, a GIS should make possible a more precise assessment of vulnerability under various possible scenarios and a better response implementation. In fact, a GIS is already present in almost every DSS-DM (see [7] for a survey, and also [1]).
This trend makes current DSS-DM sophisticated and powerful tools that have become indispensable in developed countries. Nevertheless, the high sophistication of these systems requires large computational and information resources, such as trained personnel or precise data, which could be unrealistic expectations in some contexts because they clearly exceed the infrastructure available. For instance, the expertise and precision needed to use a GIS could make the system much more complicated for end users. Furthermore, complex data such as a building census or precise meteorological information could be unavailable or unreliable. This is the main reason why HAZUS [24] and other highly sophisticated DSS methodologies have not been implemented in most developing countries or in most NGOs. In this sense, some authors (see [1], for example) are clamoring for flexible, low-cost decision support tools to be developed for such contexts. A key characteristic of future emergency decision support systems should be high adaptability [16].

In conclusion, although a wide variety of DSS-DM are being used by public-sector managers in developed countries, not much can be found which addresses the particular problems experienced by NGOs and described above. Moreover, some methodologies which have been designed for extensive application, like HAZUS, fail to meet these needs because of a lack of adaptability to particular infrastructure requirements. The SEDD project developed in this paper represents an original and specific decision support system that provides a suitable tool for use in an NGO decision-making context and enables the assessment of multiple disaster scenarios, similarly in this sense to HAZUS or to the integrated approach presented in [4].

3. A general methodology for building rules from data

This section first describes the SEDD knowledge representation model, which takes a database as input and produces, by means of a set of classes (crisp or fuzzy), a matrix representing historical knowledge about disasters. Next, the approach used to build up three types of inference rules, using this matrix along with the raw data as input, is described. These two levels of knowledge are stored in SEDD's knowledge base (KB). Finally, the inference process carried out by SEDD's inference engine (IE) is described. It uses the rules in the KB along with a fact base (FB) containing raw data related to a disaster case under study to produce an assessment of the disaster's possible consequences. All this methodology is presented in a rather general way to emphasize that it is general enough to serve as a methodological base for a wide class of DSS which could address problems other than natural-disaster risk management. Another methodology for building fuzzy rules from data is described in [8].

3.1. Knowledge representation

To build up rules from data and to carry out a useful inference process, it is first necessary to define the general framework and mathematical models used to represent the information and knowledge with which SEDD will work. In other words, a mathematical model of knowledge representation is needed to give the data an appropriate shape or structure in agreement with the data structures required as input to the rule-building and inference processes. Following the approach described in [23], the basic raw data used by SEDD are considered as a database, which can be viewed as a real-valued matrix $D = (d_{ki})_{m \times n}$ having $m$ instances and $n$ variables, $X_1, \ldots, X_n$.
Within the semantics of SEDD, each of these $m$ instances represents a historical disaster scenario and each of the $n$ variables a quantity of interest for describing these disaster scenarios. The range of each variable $X_i$ is then partitioned into a set of $c_i$ classes $A_{i1}, \ldots, A_{ic_i}$, which can be fuzzy or crisp. In this paper, these classes are intended to be linearly ordered, i.e., $A_{ij} < A_{ij'}$ iff $j < j'$, but a different structure could also be proposed, as explained in [18]. Here capital letters will be used to denote the values of variables in the database, i.e., $X_i^k = d_{ki}$ for $k = 1, \ldots, m$ and $i = 1, \ldots, n$. Lower-case letters will be used to denote values of categories, i.e., $x_{ij_i}^k = \mu_{A_{ij_i}}(X_i^k)$ for $j_i = 1, \ldots, c_i$, where $\mu_{A_{ij_i}}$ is the membership function of class $A_{ij_i}$. In the crisp case, it is assumed that the value of $X_i^k$ lies in exactly one class $j'$, i.e., $\mu_{A_{ij'}}(X_i^k) = 1$ and $\mu_{A_{ij}}(X_i^k) = 0$ if $j \neq j'$. In the fuzzy case, $\mu_{A_{ij}}(X_i^k) \in [0,1]$ and the classes do not necessarily form a fuzzy partition in the sense of Ruspini, i.e., $\sum_{j=1}^{c_i} \mu_{A_{ij}}(X_i^k)$ does not need to sum exactly to one (see for example [2]). In fact, missing values of any variable are modeled by assigning the value 0 to every class. In this way, the first level of knowledge representation in SEDD is constituted by a matrix $H = (h_{kj})_{m \times l}$, with $l = \sum_{i=1}^{n} c_i$ being the total number of categories or classes, and such that $h_{kj} := x_{ij_i}^k = \mu_{A_{ij_i}}(X_i^k)$ for all $k = 1, \ldots, m$, $i = 1, \ldots, n$, $j_i = 1, \ldots, c_i$, and $j = 1, \ldots, l$. Reference to the $i$th variable is removed in the $h_{kj}$'s because it is intended that the categories in $H$ will be sorted by the variables to which they correspond.
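To make this representation concrete, the following sketch shows one possible way of computing $H$ from a raw data matrix $D$ (Python with NumPy; the triangular class shape and the function names are illustrative assumptions for this paper's description, not SEDD's actual implementation):

    import numpy as np

    def triangular(a, b, c):
        """Membership function of a triangular fuzzy class with
        support [a, c] and peak at b."""
        def mu(x):
            if np.isnan(x):                 # missing value: 0 in every class
                return 0.0
            if a <= x <= b:
                return 1.0 if a == b else (x - a) / (b - a)
            if b < x <= c:
                return (c - x) / (c - b)
            return 0.0
        return mu

    def build_H(D, classes):
        """D: (m, n) array of raw values, with np.nan marking missing data.
        classes: for each variable i, a list of c_i membership functions.
        Returns H: an (m, l) array, l = c_1 + ... + c_n, whose columns are
        grouped by the variable they belong to."""
        m, n = D.shape
        cols = [[mu(D[k, i]) for k in range(m)]
                for i in range(n) for mu in classes[i]]
        return np.array(cols).T

In the crisp case, each membership function simply returns 0 or 1, and a missing value is encoded, as stated above, by a membership of 0 in every class of the corresponding variable.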
An estimated assessment of the consequences of the present disaster can then be generated by the expert. The approach presented in this paper follows the principles just described. Each instance of the database in which the same classes of different variables appear together is considered as evidence for the existence of a relationship between these categories. In this sense, what is going to be measured and translated into rules is the trend of some variables to occur as other variables appear. Rules still need some variables to play the role of premises or independent variables, with the rest serving as consequences or dependent variables. Thus, from the set of n variables X 1 ; . . . ; X n , a subset of p premise variables is extracted, leaving another subset of q =n p consequence variables. Because in this approach the conclusion for each consequence variable is independent of the conclusion for any other, for the sake of simplicity in presentation, it will be assumed without loss of generality that q =1, i.e., that there exists only one consequence or dependent variable for each set of premise variables. In the rest of this paper, this dependent variable will be denoted by Y, with {X 1 ; . . . ; X p ] being the set of premises or independent variables. Moreover, if T is a t-norm (see for instance [25]), let T(H) denote the c 1 c p m multidimensional matrix such that T(H)(j 1 ; . . . ; j p ) =T(H; j 1 ; . . . ; j p ) = (T(x 1 1j1 ; . . . ; x 1 pjp ); . . . ; T(x m 1j1 ; . . . ; x m pjp )) t ; for j i =1; . . . ; c i and i =1; . . . ; n. In this way, for each combination (j 1 ; . . . ; j p ) of the premise indices, T(H) is a vector containing the membership degrees in the conjunction class A 1j 1 4 4A pjp of the values of the premise variables for each of the m instances in the database. Let y j also denote the vector of length m such that y j = (y 1 j ; . . . ; y m j ) t = (m B j (Y 1 ); . . . ; m B j (Y m )) t ; j =1; . . . ; d; where B j is any of the d =c p1 classes dened in the last section for the dependent variable Y, and Y k represents the value of this variable for the kth instance in the database, k =1; . . . ; m. In this paper, three types of rules and the algorithms to compute them are described. Formally, a rule is understood to be an expression of the type R : if X 1 is A 1 and X 2 is A 2 and. . . and X p is A p then Y is B; where each A i is a class of the ith premise variable and B is the conclusion assigned to the dependent variable Y. Thus, the meaning of the three different groups of rules is that three different types of conclusions B will be assigned to the dependent variable Y: + In the rst type of rules, a degree of possibility p j is assigned to each one of the d classes, B j (j =1; . . . ; d). Therefore, B =p = (p 1 ; . . . ; p d ). As shall be seen in the next section, this leads to a class B j or a union of adjacent classes as a prediction of the dependent variable Y. This group of rules is useful to deal with categorical variables and also with numerical variables which have previously been classied, as is the case with linguistically assessed variables. + The second group of rules assigns to Y a mean value y in the range of the dependent variable. The algorithm that computes this value makes use of the possibilities p determined by the previous group of rules to weight the values of Y. Therefore, these rules are dependent on the rules in the rst group, and in this case, B =y. 
• Finally, the last group of rules assigns to $Y$ an interval $[b_1, b_2]$ of possible values of the dependent variable. The lower and upper extremes of this interval are computed by means of fuzzy or crisp order statistics, respectively, and therefore the algorithm to compute these rules is in fact an algorithm to compute those statistics. Thus, for this group of rules, $B = [b_1, b_2]$. Intervals and order statistics work well when dealing with numerical variables which exhibit large variability, outliers, or both.

Although the two are similar (and in fact identical in the crisp case), it is important not to confuse these rules with the forecasts that constitute the outcome of the inference process described in the next subsection. As will be seen shortly, this inference process makes use of various rules along with information about the disaster case under study to construct the forecasts, while the rule-building process described in this section makes use of data and first-level knowledge to extract the second-level knowledge (i.e., to build up the rules).

One more important remark before describing the algorithms for rule extraction: it should be noted that, in most of the cases which pose the problem of building rules from data, the set of rules to create is not previously defined. This amounts to saying that the set of combinations of independent variables and/or classes of these variables that have to be used as premises of the rules is not given. Furthermore, every possible combination of classes could occur in practice and could constitute an important premise for explaining the data. For this reason, regardless of the fact that this could lead to exponentially increasing computational requirements, this research presents algorithms to build every possible rule, leaving for a discussion at the end of this section and for future research the issue of how to develop heuristics to avoid creating all the rules. In any case, this possibility will not constitute a problem for a DSS with small data requirements, which, as stated in the previous section, is the case for SEDD.

Case 1. $B = \pi$. Calculation of dependent class possibilities $\pi$.

Given the matrix $H$, a class $B_j$ of the dependent variable $Y$, and a combination $(j_1, \ldots, j_p)$ of classes of the $p$ premise variables, $j \in \{1, \ldots, d\}$ and $j_i \in \{1, \ldots, c_i\}$ for all $i = 1, \ldots, p$, the possibility in $H$ of the class $B_j$ when $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ is true is defined as a weighted aggregation of its membership degrees over all $m$ instances in the database, i.e.,
\[
\pi_H(j \mid j_1, \ldots, j_p) = \frac{\sum_{k=1}^{m} T(x_{1j_1}^k, \ldots, x_{pj_p}^k)\, y_j^k}{\sum_{k=1}^{m} T(x_{1j_1}^k, \ldots, x_{pj_p}^k)},
\]
where $y_j^k = \mu_{B_j}(Y^k)$ is the membership degree in the $j$th class of the $k$th instance of variable $Y$ and $T$ is the logical operator (usually a t-norm) that models the conjunction "and". In vector notation,
\[
\pi_H(j \mid j_1, \ldots, j_p) = \frac{T(H; j_1, \ldots, j_p)^t\, y_j}{T(H; j_1, \ldots, j_p)^t\, \mathbf{1}},
\]
where $\mathbf{1} = (1, \ldots, 1)^t$ represents a vector of $m$ ones, in such a way that the denominator is equal to the sum of all the elements in $T(H; j_1, \ldots, j_p)$.
Proposition 1. If the variable $Y$ does not have missing values and if the classes $B_j$, $j = 1, \ldots, d$, form a Ruspini partition, then the possibilities of the $d$ classes of $Y$ given any combination of premises $(j_1, \ldots, j_p)$ sum up to one.

Proof.
\[
\sum_{j=1}^{d} \pi_H(j \mid j_1, \ldots, j_p) = \frac{\sum_{k=1}^{m} T(x_{1j_1}^k, \ldots, x_{pj_p}^k) \sum_{j=1}^{d} y_j^k}{\sum_{k=1}^{m} T(x_{1j_1}^k, \ldots, x_{pj_p}^k)} = \frac{\sum_{k=1}^{m} T(x_{1j_1}^k, \ldots, x_{pj_p}^k)}{\sum_{k=1}^{m} T(x_{1j_1}^k, \ldots, x_{pj_p}^k)} = 1. \qquad \square
\]

Thus, if a variable $Y$ has missing values and its classes form a Ruspini partition, the proposition just stated means that the possibilities of $Y$ sum to less than one. In this way, it is possible to define the ignorance associated with variable $Y$ given a combination of premises $(j_1, \ldots, j_p)$ as one minus the sum of its possibilities for that combination of premises, i.e.,
\[
I_Y(j_1, \ldots, j_p) = 1 - \sum_{j=1}^{d} \pi_H(j \mid j_1, \ldots, j_p).
\]
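A minimal sketch of this first rule type under the definitions above (Python with NumPy; the function names are illustrative assumptions, and the default t-norm is the minimum):

    import numpy as np

    def conjunction(premise_cols, tnorm=np.minimum):
        """Combine the p premise membership columns with a t-norm to get
        the length-m vector T(H; j_1, ..., j_p)."""
        out = premise_cols[0]
        for col in premise_cols[1:]:
            out = tnorm(out, col)
        return out

    def possibilities(TH_col, Y_mships):
        """TH_col: vector T(x^k_{1 j_1}, ..., x^k_{p j_p}) of the m instance
        memberships in one conjunction class.
        Y_mships: (m, d) array of memberships y^k_j of Y in its d classes
        (rows of zeros encode missing values of Y).
        Returns (pi, I_Y) for this combination of premises."""
        total = TH_col.sum()
        if total == 0.0:                   # no instance supports this premise
            return np.zeros(Y_mships.shape[1]), 1.0
        pi = (TH_col @ Y_mships) / total   # weighted aggregation over instances
        return pi, 1.0 - pi.sum()          # ignorance I_Y(j_1, ..., j_p)

When $Y$ has no missing values and its classes form a Ruspini partition, the returned ignorance is zero, in agreement with Proposition 1.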
As explained in [18], the ignorance $I_Y$ is a necessary class that should be added to the existing set of classes to better model the underlying learning process.

Case 2. $B = \bar{y}$. Calculation of the dependent-variable mean $\bar{y}$.

Given $H$ and a combination $(j_1, \ldots, j_p)$ of classes of the $p$ premise variables, $j_i \in \{1, \ldots, c_i\}$ for all $i = 1, \ldots, p$, the fuzzy mean of a crisp variable $Y$ when $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ is true can be easily defined as the fraction
\[
\frac{\sum_{k=1}^{m} T(x_{1j_1}^k, \ldots, x_{pj_p}^k)\, Y^k}{\sum_{k=1}^{m} T(x_{1j_1}^k, \ldots, x_{pj_p}^k)}.
\]
However, because the variable $Y$ can have not only missing values, but also outliers in classes with low possibility, it seemed more realistic to define this mean value as
\[
\bar{y}_H(j_1, \ldots, j_p) = \frac{\sum_{k=1}^{m} W_k(j_1, \ldots, j_p)\, T(x_{1j_1}^k, \ldots, x_{pj_p}^k)\, Y^k}{\sum_{k=1}^{m} W_k(j_1, \ldots, j_p)\, T(x_{1j_1}^k, \ldots, x_{pj_p}^k)}.
\]
This expression can alternatively be given in a more compact vector notation, defining the weights $W$ as
\[
W_k(j_1, \ldots, j_p) = \sum_{j=1}^{d} y_j^k\, \pi_H(j \mid j_1, \ldots, j_p),
\]
in such a way that $W(j_1, \ldots, j_p) = (W_1(j_1, \ldots, j_p), \ldots, W_m(j_1, \ldots, j_p))^t$. If $\circ$ stands for the element-wise product of matrices and $Y = (Y^1, \ldots, Y^m)^t$, then
\[
\bar{y}_H(j_1, \ldots, j_p) = \frac{(W(j_1, \ldots, j_p) \circ T(H; j_1, \ldots, j_p))^t\, Y}{W(j_1, \ldots, j_p)^t\, T(H; j_1, \ldots, j_p)}.
\]
This way, missing values of $Y$ do not affect the computation. Moreover, if outliers of $Y$ lie in classes with low or zero possibility, their values will have little effect on the resulting mean value. Such a mean $\bar{y}_H(j_1, \ldots, j_p)$ will be biased towards the classes $B_j$ with higher possibility $\pi_H(j \mid j_1, \ldots, j_p)$, which is a realistic assumption in the authors' opinion.

Case 3. $B = [b_1, b_2]$. Calculation of fuzzy order statistics.

Given $H$ and a combination $(j_1, \ldots, j_p)$ of classes of the $p$ premise variables, $j_i \in \{1, \ldots, c_i\}$ for all $i = 1, \ldots, p$, if the classes of the premise variables are fuzzy, it is not obvious how to calculate the percentiles $\alpha \in \{1, \ldots, 99\}$ of the values of a crisp variable $Y$ for which $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ is true. For each instance $Y^k$ in $H$, the truth value of the conjunction $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ could take on a different value $T(x_{1j_1}^k, \ldots, x_{pj_p}^k)$, so it is not possible simply to take as the $\alpha$ percentile the value of $Y$ below which $\alpha$ percent of the observations for which $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ is true may be found. However, it seems natural to generalize this idea to the fuzzy case by defining the $\alpha$ percentile as the value of $Y$ below which is found $\alpha$ percent of the total amount of membership $w(j_1, \ldots, j_p) = \sum_{k=1}^{m} T(x_{1j_1}^k, \ldots, x_{pj_p}^k)$ in the conjunction class $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$. The algorithm used to find this value is the following:

1. Sort $H$ by the values of $Y$, removing the instances for which the value of $Y$ is missing.
2. Define $k_\alpha(j_1, \ldots, j_p) = \min \{ k : \sum_{s=1}^{k} T(x_{1j_1}^s, \ldots, x_{pj_p}^s) > \frac{\alpha}{100}\, w(j_1, \ldots, j_p) \}$.
3. Define the $\alpha$ percentile as $PC_\alpha(j_1, \ldots, j_p) = Y^{k_\alpha(j_1, \ldots, j_p)}$.

Thus, to build the interval that constitutes the conclusion of the rules in this third group, two values $\alpha_1, \alpha_2 \in \{1, \ldots, 99\}$, $\alpha_1 < \alpha_2$, must be chosen, leading to the interval $[PC_{\alpha_1}(j_1, \ldots, j_p), PC_{\alpha_2}(j_1, \ldots, j_p)]$.

These three groups of rules are then stored as vectors or multidimensional matrices in the SEDD knowledge base. These matrices constitute the second level of knowledge representation in SEDD.
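Under the same assumptions as the previous sketch, Cases 2 and 3 can be written as follows (w is the conjunction vector $T(H; j_1, \ldots, j_p)$ computed earlier, and the percentile routine follows the three steps above literally):

    import numpy as np

    def weighted_mean(w, Y_mships, Y, pi):
        """Case 2: possibility-weighted mean of Y. The weights
        W_k = sum_j y^k_j pi(j | j_1, ..., j_p) damp outliers lying in
        classes with low possibility; missing Y values (np.nan) drop out."""
        W = Y_mships @ pi
        keep = ~np.isnan(Y)
        den = (W[keep] * w[keep]).sum()
        return (W[keep] * w[keep] * Y[keep]).sum() / den if den > 0 else np.nan

    def fuzzy_percentile(w, Y, alpha):
        """Case 3: fuzzy alpha-percentile of Y, i.e. the value of Y below
        which alpha percent of the total conjunction membership lies."""
        keep = ~np.isnan(Y)                 # step 1: drop missing Y, sort by Y
        order = np.argsort(Y[keep])
        Ys, ws = Y[keep][order], w[keep][order]
        cum = np.cumsum(ws)                 # step 2: accumulated membership
        k = np.searchsorted(cum, alpha / 100.0 * ws.sum(), side='right')
        return Ys[min(k, len(Ys) - 1)]      # step 3: PC_alpha = Y^{k_alpha}

An interval rule is then simply the pair (fuzzy_percentile(w, Y, alpha1), fuzzy_percentile(w, Y, alpha2)).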
3.3. Inference process

To make inferences about the unknown variables of a disaster case under study, SEDD's inference engine must combine the known information present in the fact base with the rules stored in the knowledge base. This is done by using these known data as the premises of the rules. Without loss of generality, it can be assumed that the values of variables $X_1, \ldots, X_p$ are known and that it is desired to generate a prediction of the value $y$ of an unknown variable $Y$. The values of the known variables in the fact base are denoted by the vector $F = (F_1, \ldots, F_p)$, and $f_{ij_i}$ denotes the membership degree of value $F_i$ in the class $A_{ij_i}$, i.e., $f_{ij_i} = \mu_{A_{ij_i}}(F_i)$, $j_i \in \{1, \ldots, c_i\}$ for all $i = 1, \ldots, p$.

The way that predictions are built from these inputs is basically the same for each of the three groups of rules. However, because for the first group a prediction class is needed, different ways of aggregating the resulting class possibilities of the dependent variable $Y$ are described. Because in the crisp case there is only one combination $(j_1, \ldots, j_p)$ of classes of the independent variables into which the values $F_1, \ldots, F_p$ are classified, the output of the inference process is simply the conclusion of the rule having the combination of classes $(j_1, \ldots, j_p)$ as its premise. However, in the fuzzy case, the values $F_1, \ldots, F_p$ are usually classified into several combinations $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ of classes of the independent variables, and therefore a way of aggregating the conclusions of the corresponding rules is needed. This aggregation is carried out by means of a weighted aggregation operator, in which the different truth degrees $T(f_{1j_1}, \ldots, f_{pj_p})$ of the values $F_1, \ldots, F_p$ lying in each combination $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ are used as weights. Thus, because this case is the more general one, only the construction of predictions for the fuzzy case is described.

3.3.1. Inference process for rules with $B = \pi$

Let us first denote by $T(F)$, $\pi_H(j)$, $\bar{y}_H$, and $PC_{\alpha_i}$ the $c_1 \times \cdots \times c_p$ multidimensional matrices such that
\[
T(F)(j_1, \ldots, j_p) := T(f_{1j_1}, \ldots, f_{pj_p}), \qquad \pi_H(j)(j_1, \ldots, j_p) := \pi_H(j \mid j_1, \ldots, j_p),
\]
\[
\bar{y}_H(j_1, \ldots, j_p) := \bar{y}_H(j_1, \ldots, j_p), \qquad PC_{\alpha_i}(j_1, \ldots, j_p) := PC_{\alpha_i}(j_1, \ldots, j_p), \quad i = 1, 2.
\]
Let us also denote by $A : B$ the Frobenius inner product of two matrices $A$ and $B$ of the same size $c_1 \times \cdots \times c_p$:
\[
A : B := \sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} A_{j_1 \ldots j_p}\, B_{j_1 \ldots j_p}.
\]
Recall then that the first group of rules described above will provide, for each combination $(j_1, \ldots, j_p)$ of the independent classes, a degree of possibility $\pi_H(j \mid j_1, \ldots, j_p)$ for each class $B_j$ of the dependent variable $Y$, $j = 1, \ldots, d$. Therefore, for each $j = 1, \ldots, d$, the final assessment $\hat{\pi}_H(j)$ of the possibility of these classes can be calculated as
\[
\hat{\pi}_H(j) = \frac{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})\, \pi_H(j \mid j_1, \ldots, j_p)}{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})},
\]
or in more compact matrix notation,
\[
\hat{\pi}_H(j) = \frac{T(F) : \pi_H(j)}{T(F) : \mathbf{1}},
\]
where $\mathbf{1}$ now stands for a $c_1 \times \cdots \times c_p$ matrix of ones. Thus, the denominator is equal to the sum of all the elements of $T(F)$.
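In code, this whole inference step reduces to an element-wise product and two sums over the rule matrices (a sketch under the previous assumptions; the same operator also serves the mean and interval rules of Sections 3.3.2 and 3.3.3):

    import numpy as np

    def aggregate(TF, rule_matrix):
        """Weighted aggregation (T(F) : rule_matrix) / (T(F) : 1).
        TF and rule_matrix are c_1 x ... x c_p arrays; rule_matrix stores
        one rule conclusion per premise combination, e.g. pi_H(j | ...),
        ybar_H(...) or PC_alpha(...)."""
        return float((TF * rule_matrix).sum() / TF.sum())

    # Aggregated possibility of each dependent class j = 1, ..., d:
    # pi_hat = np.array([aggregate(TF, pi_H[j]) for j in range(d)])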
This aggregation produces a vector $\hat{\pi}_H = (\hat{\pi}_H(1), \ldots, \hat{\pi}_H(d))$ containing an assessment of the final possibility of every dependent class. However, as stated above, what is actually needed is a class that constitutes a prediction of the true state of variable $Y$. To construct this prediction, first recall that the classes were said to have a linear order structure. This way, if no class amasses enough evidence to be considered a solid prediction, the joint possibility of adjacent classes (in the sense of the linear order) should be taken into account.

Formally, assume that $\delta \in (0, 1)$ is a threshold value set by the user to indicate the desired level of evidence that a predicted class or set of classes must amass. Define $J = \{1, \ldots, d\}$ and define the set of indices adjacent to a given index $j \in J$ as $\mathrm{adj}(\{j\}) = \{j-1, j+1\} \cap J$. Given a set $E \subseteq J$ of consecutive indices, i.e., such that $E = \bigcup_{j=\min(E)}^{\max(E)} \{j\}$, define $\mathrm{adj}(E) = \{\min(E)-1, \max(E)+1\} \cap J$. The following describes three possible ways of obtaining the final outcome of this inference process, which are referred to as optimistic, pessimistic, and neutral for reasons that become evident by looking at the following algorithm:

1. Prediction $= \emptyset$, $S = \{j \in J : \hat{\pi}_H(j) = \max(\hat{\pi}_H)\}$
2. DO WHILE Prediction $= \emptyset$:
   IF $\sum_{j \in S} \hat{\pi}_H(j) \ge \delta$ THEN Prediction $= \bigcup_{j \in S} B_j$
   ELSE IF optimistic THEN $S = S \cup \{\min(\mathrm{adj}(S))\}$
   ELSE IF pessimistic THEN $S = S \cup \{\max(\mathrm{adj}(S))\}$
   ELSE IF neutral THEN $S = S \cup \{j : \hat{\pi}_H(j) = \max_{i \in \mathrm{adj}(S)} \hat{\pi}_H(i)\}$
   END DO

Therefore, the optimistic method extends the prediction towards the classes below those already included, the pessimistic method towards those above, and the neutral method always adds the adjacent class with the highest possibility, trying to provide a prediction with the minimum number of classes. Obviously, this reasoning assumes that lower classes are better (fewer casualties, injured people, etc.) than higher ones, but the method could easily be adapted to the inverse case.
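The algorithm can be transcribed almost literally (a sketch; pi_hat is the aggregated possibility vector, classes are indexed 0, ..., d-1, and, unlike the loop above, the sketch returns None when the threshold is unreachable, the situation characterized by Proposition 2 below):

    def predict_classes(pi_hat, delta, method='neutral'):
        """Grow a set S of consecutive class indices around the most
        possible class until its joint possibility reaches delta."""
        d = len(pi_hat)
        S = {max(range(d), key=lambda j: pi_hat[j])}
        while True:
            if sum(pi_hat[j] for j in S) >= delta:
                return sorted(S)             # prediction = union of B_j, j in S
            adj = [j for j in (min(S) - 1, max(S) + 1) if 0 <= j < d]
            if not adj:                      # S = J and delta is still unreached
                return None
            if method == 'optimistic':
                S.add(min(adj))              # extend towards lower classes
            elif method == 'pessimistic':
                S.add(max(adj))              # extend towards higher classes
            else:                            # neutral: most possible neighbour
                S.add(max(adj, key=lambda j: pi_hat[j]))

As an illustration, with pi_hat = (0.377, 0.234, 0.177, 0.091, 0.054) (the possibilities later reported in Table 1) and delta = 0.5, the neutral method returns the union of the two lowest classes, whose joint possibility is 0.611.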
Note that because the variable $Y$ can have missing values, not every value of $\delta$ leads to a prediction. In this sense, the following proposition can be formulated:

Proposition 2. If the classes of variable $Y$ form a Ruspini partition, the preceding algorithm always stops if and only if $\delta \le 1 - \max_{j_1, \ldots, j_p} I_Y(j_1, \ldots, j_p)$.

Proof. Using Proposition 1 and the definitions, on the one hand,
\[
\sum_{j=1}^{d} \hat{\pi}_H(j) = \frac{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})\,\bigl(1 - I_Y(j_1, \ldots, j_p)\bigr)}{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})} \ge \frac{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})\,\bigl(1 - \max_{j_1, \ldots, j_p} I_Y(j_1, \ldots, j_p)\bigr)}{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})} = 1 - \max_{j_1, \ldots, j_p} I_Y(j_1, \ldots, j_p).
\]
On the other hand, suppose that the maximum of $I_Y(j_1, \ldots, j_p)$ occurs when $(j_1, \ldots, j_p) = (a_1, \ldots, a_p)$. Because one can easily choose $f_{1j_1}, \ldots, f_{pj_p}$ in such a way that
\[
T(f_{1j_1}, \ldots, f_{pj_p}) = \begin{cases} 1 & \text{if } (j_1, \ldots, j_p) = (a_1, \ldots, a_p), \\ 0 & \text{else}, \end{cases}
\]
the result is a case such that $\sum_{j=1}^{d} \hat{\pi}_H(j) = 1 - \max_{j_1, \ldots, j_p} I_Y(j_1, \ldots, j_p)$, which completes the proof. $\square$
Finally, it is important to note that both the ignorance $I_Y$ and the number of classes included in a prediction can be seen as a measure of the uncertainty associated with that prediction.

3.3.2. Inference process for rules with $B = \bar{y}$

Recall that this group of rules provides, for each combination $(j_1, \ldots, j_p)$ of the independent classes, an average or mean value $\bar{y}_H(j_1, \ldots, j_p)$ of the dependent variable $Y$ when $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ is true. Using the weighted aggregation methodology, the outcome $\hat{y}_H$ of this inference process can be calculated as
\[
\hat{y}_H = \frac{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})\, \bar{y}_H(j_1, \ldots, j_p)}{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})},
\]
or, using the matrix notation already described in the previous case,
\[
\hat{y}_H = \frac{T(F) : \bar{y}_H}{T(F) : \mathbf{1}}.
\]

3.3.3. Inference process for rules with $B = [b_1, b_2]$

This group of rules produces, for each combination $(j_1, \ldots, j_p)$ of independent classes, an interval $[PC_{\alpha_1}(j_1, \ldots, j_p), PC_{\alpha_2}(j_1, \ldots, j_p)]$ of values for the predicted value of the dependent variable $Y$, where $PC_{\alpha_i}(j_1, \ldots, j_p)$ is the fuzzy $\alpha_i$ percentile of the values of $Y$ when $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ is true, $\alpha_i \in \{1, \ldots, 99\}$, $i = 1, 2$. As for the preceding group of rules, the final predicted percentiles are calculated through weighted aggregation as
\[
\widehat{PC}_{\alpha_i} = \frac{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})\, PC_{\alpha_i}(j_1, \ldots, j_p)}{\sum_{j_1=1}^{c_1} \cdots \sum_{j_p=1}^{c_p} T(f_{1j_1}, \ldots, f_{pj_p})},
\]
or in matrix notation,
\[
\widehat{PC}_{\alpha_i} = \frac{T(F) : PC_{\alpha_i}}{T(F) : \mathbf{1}}.
\]
The interval that constitutes the final outcome of this inference process is then given by $[\widehat{PC}_{\alpha_1}, \widehat{PC}_{\alpha_2}]$. The length of this interval can be seen as a measure of the uncertainty associated with variable $Y$ when the various conjunction classes $A_{1j_1} \wedge \cdots \wedge A_{pj_p}$ hold.

3.4. A remark about algorithmic complexity

From the preceding discussion, it follows that an upper bound for the total number of rules of each type to be created is of the order of $u = (\max_{i=1,\ldots,p}(c_i))^p$, which is also an upper bound for the total number of operations to be performed in each of the inference processes. Therefore, this methodology for rule extraction leads to algorithms which are exponential in the number $p$ of premises or independent variables. However, in the crisp case, it has been shown that only one rule of each group has to be created and that no inference process is needed at all. With a few assumptions about the fuzzy classes which are defined for the independent variables, it is possible to reduce $u$ substantially, although it is possible that exponentiality remains. For instance, by using triangular orthogonal classes, at most two of them have a membership degree greater than 0 for each independent variable, and therefore in this case $u = 2^p$. By using trapezoidal orthogonal classes, the worst case also leads to $u = 2^p$, but it is quite probable that for a number of independent variables, say $r$ of them, only one class has a membership degree greater than 0, leading to $u = 2^{p-r}$.
Moreover, it is interesting to remember that it is possible to devise heuristics to achieve a further reduction in the number of rules to be created, for instance, by assessing first which rules are going to have little impact on the predictions. In any case, as already mentioned, the exponential nature of these algorithms does not constitute a problem if the number $p$ of premises is small. In the next section, it will be shown that for a DSS like SEDD, which is intended to work with a small amount of easily accessible data, performance can be good enough to justify this methodology.

4. Computational experiments

4.1. Validation process

This section presents the results of a validation process carried out to measure the accuracy of the methodology just described. As input data for the system, the EM-DAT database (Emergency Events Database, available at www.emdat.be) of CRED (Centre for Research on the Epidemiology of Disasters, www.cred.be) has been used. This database contains data on over 16,000 natural and anthropogenic disasters occurring from 1900 to the present all over the world. Because this database does not provide any explicit estimation of the vulnerability of each location affected, to be consistent with the disaster definition used here (see Section 1), the data have been merged with UN data on the human development index (HDI), taking this index as an estimate of the affected country's vulnerability.

It should be noted that the EM-DAT database contains no variables which enable a good description of the places affected by a disaster, even with the addition of the HDI index. The result is a huge variability of the dependent variables for similar values of the explanatory variables. For instance, a strong earthquake of magnitude 8 on the Richter scale was reported in Indonesia in 1979, but it caused only two casualties. However, a weaker quake of magnitude 7.5 struck the same country in 1992 and produced at least 2500 casualties. In this sense, a variable such as the density of population in the affected place or some kind of index of human activity would be needed, although such variables are not at all easy to obtain. There are also a considerable number of outliers and a significant proportion of missing values in the dependent variables. Last but not least, the mode of the distribution of these variables usually occurs in the lowest classes, because major disasters are much talked about and spectacular, but are relatively uncommon events.

To illustrate the capability of SEDD to evaluate and assess the consequences of multiple disaster types, it was decided to focus on the same adverse phenomena that HAZUS [24] is able to deal with: floods, windstorms (including hurricanes, typhoons, cyclones, etc.), and earthquakes. For each type of disaster, the significant variables considered here are the following: HDI, magnitude of the disaster (inundated area, maximum wind speed, magnitude on the Richter scale), number of casualties (NC), number of injured, homeless, and affected people, and damage in US dollars. The values of the first and last variables are modified to take into account variations of the HDI index and of currency values and thus to provide a normalized scale. Therefore, $n = 7$ for this process. Consistently with the definition of a disaster used here, the first two variables are taken as explanatory and the remaining five as dependent, so $p = 2$ and $q = 5$. Triangular orthogonal fuzzy classes are used for all the variables. The details presented here were calculated with the t-norm $T = \min$.
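To make this setup concrete, the following sketch shows one possible encoding of triangular orthogonal classes and of the chosen t-norm (the breakpoints are illustrative assumptions only; the classes actually used, described next, follow the UN divisions and the magnitude scales):

    import numpy as np

    def orthogonal_triangular(peaks):
        """Triangular orthogonal classes from a sorted list of peaks: class j
        peaks at peaks[j], falls to 0 at the neighbouring peaks, and has a
        shoulder at each end of the range, so memberships sum to 1."""
        def make(j):
            lo = peaks[j - 1] if j > 0 else None
            hi = peaks[j + 1] if j < len(peaks) - 1 else None
            def mu(x):
                if x <= peaks[j]:
                    return 1.0 if lo is None else max(0.0, (x - lo) / (peaks[j] - lo))
                return 1.0 if hi is None else max(0.0, (hi - x) / (hi - peaks[j]))
            return mu
        return [make(j) for j in range(len(peaks))]

    # e.g. three HDI classes (low, medium, high) with assumed peaks:
    hdi_classes = orthogonal_triangular([0.40, 0.65, 0.90])
    tnorm = min   # the experiments reported here use T = min
    # At most two classes fire for any crisp value, so at most
    # u = 2^p = 4 rule conclusions are aggregated per inference (Section 3.4).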
Three classes are defined for the first independent variable, HDI: low, medium, and high. They closely correspond to the divisions already made in the UN data. For the variable magnitude, four classes are constructed, which correspond to very weak, weak, moderate, and high intensities of the adverse natural phenomenon. Five classes are defined for each dependent variable. In particular, those defined for NC are the following: no casualties, very few, few, quite a lot, and a lot. Therefore, there are $l = 32$ categories.

For each type of disaster, 10 training sets were randomly constructed, each containing 80% of the available sample, with the remaining 20% kept aside as a validation set. The training set constitutes the first-level data from which meta-knowledge is extracted. The first two variables in the validation sets are then input into the IE as successive FB vectors, producing a predicted output for the remaining five variables. Errors are measured in the following way:

• For the class prediction generated by the first reasoning method, the rate of correct classifications is calculated, along with the average number of classes that make up the predictions. This last value is useful for measuring the uncertainty associated with the predictions. For wrong classifications, the average distance from the true class to the nearest predicted class is also computed, which gives an idea of how severely wrong predictions deviate from the true class. Only the neutral method is tested in this paper.

• For mean values predicted using the second group of rules, relative and absolute errors are measured in an asymmetrical way. Because the variable NC has a high frequency of zero values, the concept of relative error could be nonsensical. Moreover, if the real value of this variable is small, say 5, a value of 10 is a very good prediction, although its relative error is 100%. On the other hand, if only absolute errors are measured, a prediction of 20,000 units when the real value is 21,000 is also a good approximation (its relative error is 5%). However, it produces an absolute error of 1000 units, which could disturb the average errors when small values are also taken into consideration. For these reasons, a fuzzy set is defined over the range of the dependent variables to assess whether or not a value is large enough to make a relative error measure meaningful. This means that for small values, only absolute errors are computed, while only relative errors are calculated for large values. This fuzzy set depends on two parameters, denoted by $b_1 < b_2$, which correspond to the largest value which is considered not big enough and the smallest value that is viewed as big enough. Membership degrees in this fuzzy set for $y \in (b_1, b_2)$ are given by $\beta(y) = (y - b_1)/(b_2 - b_1)$. Errors are then obtained by
\[
\varepsilon = \frac{\sum_{k=1}^{v} (1 - \beta(y_k))\, |\hat{y}_k - y_k|}{v - \sum_{k=1}^{v} \beta(y_k)} \quad \text{and} \quad \eta = \frac{\sum_{k=1}^{v} \beta(y_k)\, |\hat{y}_k - y_k|}{\sum_{k=1}^{v} y_k\, \beta(y_k)},
\]
where $\varepsilon$ represents the absolute error, $\eta$ the relative error, and $v$ is the size of the validation set (a code sketch follows this list).

• Finally, for intervals derived by the third methodology, the rate of correct classifications is measured, along with the average size of the intervals. If the true value of the dependent variable lies outside the interval, the errors encountered are measured in the same way as in the second case.
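These asymmetric error measures translate directly into code (a sketch based on the reconstruction of the formulas above; y_true holds the validation values, y_pred the predictions, and b1 < b2 are the scale parameters of the fuzzy set "big enough"):

    import numpy as np

    def beta(y, b1, b2):
        """Membership in the fuzzy set of values big enough for a
        relative error to be meaningful."""
        return np.clip((y - b1) / (b2 - b1), 0.0, 1.0)

    def asymmetric_errors(y_true, y_pred, b1, b2):
        """Absolute error over the small values, relative error over the
        large ones, as in the formulas above."""
        b = beta(y_true, b1, b2)
        err = np.abs(y_pred - y_true)
        eps = ((1 - b) * err).sum() / (len(y_true) - b.sum())   # epsilon
        eta = (b * err).sum() / (y_true * b).sum()              # eta
        return eps, eta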
Details of the validation process are presented only for one type of disaster (earthquakes) and for the dependent variable NC. Because the methodology is the same for the other cases, only a summary of the final errors is shown.

Table 1 presents a sample of the possibilities obtained for the classes of NC when HDI = medium and magnitude = moderate. As stated above, the frequencies of the lower classes are much larger than those of the other classes, leading to higher possibilities for the lower classes. Note that these possibilities sum to less than one. Therefore, $1 - \max_{j_1, \ldots, j_p} I_{NC}(j_1, \ldots, j_p) < 1$, meaning that some ignorance is attributed to variable NC because the EM-DAT sample contains missing values for that variable (which should be taken into account when choosing the threshold $\delta$).

Table 2 presents the weighted mean values obtained for the dependent variable NC for various combinations of premises. An increasing trend from HDI = high to HDI = low and from magnitude = very low to magnitude = high can be observed for this variable, as could logically be expected. Table 3 presents the fuzzy 75th percentile of the variable NC obtained for each combination of the independent variables. The same trend as before is observed. Also, it is interesting to note how the presence of outliers makes the mean values of Table 2 greater than the corresponding percentiles.

Error measures for the class predictions of the variable NC are presented in Table 4. For each value of $\delta$, the rate of correct classifications, the average size (measured in classes) of the predictions, and the average distance to the real class when a classification is wrong are displayed. The errors of the mean values predicted for the same variable by the second method are presented in Table 5, where average absolute and relative errors, computed as explained earlier, are summarized. Tables 6-8 present error and accuracy measures for the interval prediction of NC. Table 6 presents, for each combination of percentile parameters $\alpha_1$, $\alpha_2$, the proportion of predicted intervals which correctly classify the values of the dependent variable and the average length of these intervals.

Table 1. A sample of computed possibilities for the classes of number of casualties when HDI = medium and magnitude = moderate.

    No casualties   Very few   Few     Quite a lot   A lot
    0.377           0.234      0.177   0.091         0.054

Table 2. Weighted average values of number of casualties (rows: HDI; columns: magnitude).

              Very low   Low     Moderate   High
    Low       2.31       30.92   217.74     1622.37
    Medium    1.97       8.18    83.88      839.90
    High      0.91       2.96    7.20       62.48

Table 3. Fuzzy 75th percentiles of number of casualties (rows: HDI; columns: magnitude).

              Very low   Low   Moderate   High
    Low       7          30    126        382
    Medium    7          14    46         207
    High      2          10    9          20

Table 4. Error measures for class predictions of NC.

    δ      Correct classif. (%)   Avg. no. of classes   Avg. distance
    0.20   43.48                  1.00                  2.07
    0.35   51.60                  1.46                  1.73
    0.40   56.53                  1.64                  1.66
    0.50   65.72                  2.02                  1.49
    0.60   75.02                  2.50                  1.38
    0.65   81.66                  2.78                  1.24
    0.70   86.72                  3.08                  1.12
    0.75   91.64                  3.37                  1.13
    0.80   95.56                  3.71                  1.90

Table 5. Parameters $b_1$, $b_2$ and error measures for the mean-value prediction of NC.

    Avg. abs. error   Avg. rel. error (%)   b1    b2
    348.6             77.72                 100   1000

Table 6. Correct classification rate and average interval size for different percentile parameters.

    α1   α2   Correct classif. (%)   Avg. interval length
    10   75   79.72                  117.0
    10   80   84.92                  251.4
    10   90   92.85                  871.3
    15   75   79.21                  116.9
    15   85   86.49                  442.8
    20   80   77.95                  250.9
    25   75   66.78                  116.0

Table 7. Below errors for the interval prediction of NC and different values of $\alpha_1$.

    α1   Below err. (%)   Avg. abs. err.   Avg. rel. err. (%)
    10   0.00             0.0              0.0
    15   0.51             0.4              0.0
    20   7.19             1.1              0.0
    25   12.88            1.4              0.0

Table 8. Above errors for the interval prediction of NC and different values of $\alpha_2$.

    α2   Above err. (%)   Avg. abs. err.   Avg. rel. err. (%)
    75   20.28            123.0            91.3
    80   14.86            115.0            84.3
    85   12.83            81.8             78.8
    90   7.15             102.7            73.7
Table 7 (respectively, Table 8) presents the proportion of intervals for which the real value of NC lies below (above) the predicted value, and the absolute and relative errors, in the sense of the second method, associated with the lower (upper) extreme of the interval.

In the authors' opinion, a good choice of class prediction threshold is $\delta = 0.5$, because this value achieves a good compromise between correct classification (65% accuracy) and size or uncertainty of predictions (a size of two classes). Similarly, according to Table 6, good choices for the percentile parameters seem to be $\alpha_1 = 15$ and $\alpha_2 = 75$, because these values achieve an 80% rate of correct classification with an average interval length of 117, which is relatively small. However, as presented in Table 8, a problem with this choice is that for 20% of the instances, the real value of NC lies above the interval, which amounts to underestimating the magnitude of the disaster. Distances from the upper bound of the interval to the real values of NC are of the order of one hundred for relatively small values of that variable (NC < 500). For larger values of NC, the relative errors can increase to as much as 90%. In this sense, a better choice of percentile parameters might be $\alpha_1 = 20$ and $\alpha_2 = 80$, because the lower-bound errors, although larger, are insignificant and the upper-bound errors are somewhat reduced.

As expected, the validation results for the second group of rules are rather poor, mainly because of the huge variability contained in the sample for the dependent variable NC. Obviously, mean values perform poorly as a predictor for a variable with huge variance and a distribution very far from normal. For these reasons, further results do not include this kind of prediction.

Despite some uncertainty that must be assumed due to the absence of basic data and explanatory variables, class and interval predictions are quite satisfactory when predicting the variable NC. The only sense in which these predictions could be considered dubious is that they underestimate the consequences of disasters. However, this is only natural considering that major disasters are very unusual events and that this fact is also reflected in the EM-DAT database. A more complete or balanced database should make it possible to distinguish these major disasters more effectively.

Validation results for the rest of the dependent variables and the disaster types are presented in Tables 9 and 10. Note that, to enable better tuning of the system, the parameters are allowed to be different for each of the dependent variables.
Validation results for the rest of the dependent variables and the disaster types are presented in Tables 9 and 10. Note that, to enable better tuning of the system, the parameters are allowed to differ across the dependent variables. For instance, because the proportion of missing values can vary from one variable to another, a threshold δ that performs well for one of them would not be well adapted to the rest. Furthermore, according to Proposition 2, the algorithm which generates the class prediction would not stop if the condition $\delta \leq 1-\max_{j_1,\ldots,j_p} I_{Y}(j_1,\ldots,j_p)$ were not fulfilled for some of the variables. The same applies to the percentile parameters α₁, α₂. Moreover, because the scale varies from one variable to another, the parameters β₁, β₂ should also be varied to adapt to each variable's sensitivity.

Table 9
Error display for class predictions (δ; correct classifications, %; average size in classes).

  Disaster      Casualties           Injured              Homeless             Affected             Damage ($)
                δ     % C.C.  Size   δ     % C.C.  Size   δ     % C.C.  Size   δ     % C.C.  Size   δ     % C.C.  Size
  Earthquake    0.50  65.72   2.02   0.50  76.90   1.30   0.35  73.83   2.05   0.40  76.56   2.33   0.30  66.95   1.93
  Flood         0.50  59.68   1.97   0.35  70.66   1.73   0.35  54.96   2.52   0.40  61.77   1.64   0.30  63.64   1.59
  Wind storm    0.50  69.75   2.08   0.30  55.03   2.04   0.30  61.39   2.44   0.35  56.93   1.93   0.25  60.48   1.74

Table 10
Error display for interval predictions (α₁; α₂; correct classifications, %; average interval length).

  Disaster      Casualties             Injured                Homeless                Affected                  Damage ($)
                α₁  α₂  % C.C.  Lgth.  α₁  α₂  % C.C.  Lgth.  α₁  α₂  % C.C.  Lgth.   α₁  α₂  % C.C.  Lgth.     α₁  α₂  % C.C.  Lgth.
  Earthquake    20  80  78.0    251    20  80  63.2    1094   35  75  73.3    23402   15  75  62.7    62672     15  75  60.4    926248
  Flood         10  75  74.0    191    25  75  76.3    693    23  75  71.0    106173  10  80  67.9    5116463   10  75  62.2    1130942
  Wind storm    10  75  74.7    88     15  80  77.4    473    30  75  70.3    44992   10  75  71.7    445831    10  80  65.4    849605

Although class forecasting performs more accurately in some cases, these tables show that, in general, interval predictions give better results. Their proportion of correct classifications is never below 60%. Moreover, in many cases the average interval length is small enough to outrank class predictions, which in turn normally have a size of at least two classes. Nevertheless, the most important reason to prefer interval forecasts is the following: they adapt better to samples with sharp modes, high variability, and few independent variables. For instance, if a variable has a sharp mode in one class, say with 50% of the instances, then the class predictions will be irremediably biased towards this mode class. However, 50% of the cases lie outside the mode class, and the only way to capture this variability is to use a large enough value of δ, that is, predictions of relatively large size. Moreover, class predictions must be constructed around the maximum-possibility category, and as a result, classes far away from that category are rarely predicted. On the other hand, if the mode lies at an extreme of the variable's range, one of the percentiles is normally close enough to it, and therefore the mode should lie inside the predicted interval, while the high variability of the remaining 50% of the sample is addressed by the other percentile. If the mode is located in the center of the range, interval prediction faces no problem at all. Finally, one more advantage of interval forecasts is their ability to reflect a sample's variability and therefore to provide a measure of the uncertainty contained in the historical data. Although large interval sizes could be seen as undesirable, they are also the logical consequence of samples with very large variances.
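The mode-bias argument can be checked on a toy example. The data below are invented: half of the instances sit in the lowest class and the rest spread over a long upper tail, mimicking the shape of the EM-DAT samples.

```python
# Toy illustration (invented data) of why intervals adapt better than class
# predictions to a sample with a sharp mode in the lowest class.

import numpy as np

# 50% of instances in the lowest class, long upper tail.
values = np.array([1] * 50 + [12] * 20 + [60] * 15 + [300] * 10 + [2000] * 5)

# Class possibilities estimated from frequencies (no missing values here).
levels = [1, 12, 60, 300, 2000]
poss = [float(np.mean(values == v)) for v in levels]
print(np.cumsum(poss))  # [0.5  0.7  0.85 0.95 1.  ]

# A class prediction built around the mode already satisfies delta = 0.5 with
# the single lowest class; reaching class 4 would require delta > 0.85, i.e.,
# a four-class (almost vacuous) prediction.

# The percentile interval reaches beyond the mode with no such inflation:
print(np.percentile(values, 15), np.percentile(values, 80))  # 1.0 60.0
```

The [15, 80] interval here covers the mode and the mid-tail at once, which is the behaviour exploited by the interval rules of Table 10.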
Thus it has been shown that, among the three inference methods described in the last section, the one based on intervals responds best to the various difficulties present in the EM-DAT sample. On average, the interval-based method successfully classifies more than two out of three instances of the validation set and provides an idea of the level of uncertainty present in the sample. Class predictions, although showing good performance in many cases, do not perform as well as the interval-based method. Finally, forecasts based on mean values perform poorly.

4.2. A case study

This section presents a case study which provides a better picture of SEDD's operation. The recent tragic earthquake in L'Aquila, Italy, will be taken as an example. L'Aquila is a small city (approximately 75,000 inhabitants) which is the capital of the region of Abruzzo, in the center of Italy. On the night of April 6, 2009, the city was struck by an earthquake of magnitude 6.3, resulting in 294 deaths, 1500 injured, and 40,000 homeless. Let us also recall that Italy is a highly developed country, to which the United Nations Development Programme (UNDP) statistics assign an HDI of 0.945 out of 1 (2008 data).

Following the methodology described at the beginning of this section, the HDI and magnitude information is used to forecast the numbers of casualties, injured people, and homeless people. The classes for these variables are the same as before, but in this case the entire EM-DAT sample is used to build the rules. The parameters also have the same values as presented in Table 10. The forecasts obtained are presented in Table 11.

Table 11
Real and forecast values for the case of the L'Aquila earthquake.

  Variable     Real value   Real class   Predicted value   Predicted class   Predicted interval
  Casualties   294          3            10                1                 [0, 51]
  Injured      1500         4            303               1–2               [8, 2060]
  Homeless     40000        4            2683              1–2               [0, 27601]

These results are quite poor: as already happened in the validation process, the predicted mean values are very far from the real ones. Moreover, the class predictions fail for all variables, and the interval forecasts are correct only for the number of injured people, remaining far from the real values for the other variables. In summary, the performance obtained is much poorer than that observed in the validation process. How can this be?

In the authors' opinion, one main problem is the value of the HDI variable. Recall that this variable provides a measure of the vulnerability of a country, but local conditions, and therefore local vulnerability, can of course differ widely from the national average. This is a strong argument for including more independent variables in the EM-DAT dataset, particularly those which would make it possible to assess local vulnerability. In this sense, let us recall that many Italian experts criticized the quality of the buildings destroyed by the earthquake (see for instance [34]). It seems that many of these buildings were built without taking into account the current laws and regulations of the country. This information could be taken to suggest that the local HDI of the affected zone (L'Aquila in this case) is lower than the national average. In any case, it is clear from the outcome that local vulnerability was indeed relatively high. Taking these considerations into account, it seems appropriate to repeat the case study with a corrected HDI value.
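Before re-running the forecasts, it is worth seeing how such a correction enters the system. Since SEDD builds on fuzzy rules with linguistic premises, a corrected HDI changes which rules fire and with what strength. The sketch below is purely illustrative: the piecewise-linear membership functions and their breakpoints are our assumption, not SEDD's actual definitions, and are chosen only so that the two classes cross at the UNDP medium/high boundary of 0.8.

```python
# Hypothetical sketch of how a corrected HDI value enters the fuzzy premises.
# The membership functions below are assumed (piecewise linear, crossing at
# 0.8); they are not SEDD's actual definitions.

def fuzzy_hdi(hdi):
    """Memberships of a numeric HDI in the 'medium' and 'high' classes."""
    high = min(1.0, max(0.0, (hdi - 0.65) / 0.30))
    medium = min(1.0, max(0.0, (0.95 - hdi) / 0.30))
    return {"medium": round(medium, 3), "high": round(high, 3)}

print(fuzzy_hdi(0.945))  # {'medium': 0.017, 'high': 0.983} -> national value
print(fuzzy_hdi(0.8))    # {'medium': 0.5, 'high': 0.5}     -> corrected value
```

With the national value, the "high" rules dominate almost completely; with the corrected value, the "medium" rules contribute equally, pushing the forecasts upward, as seen next.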
It was decided to assign this variable a value of 0.8, which constitutes the limit between what the United Nations calls medium and high levels of human development. That is, the new value is significantly lower than the original one, but still high enough to be considered realistic for a developed country such as Italy. The results obtained using this corrected value are presented in Table 12.

Table 12
Real and forecast values for the case of the L'Aquila earthquake with corrected HDI.

  Variable     Real value   Real class   Predicted value   Predicted class   Predicted interval
  Casualties   294          3            81                1–2               [0, 283]
  Injured      1500         4            368               1–2               [33, 1880]
  Homeless     40000        4            15225             1–3               [0, 65256]

Now the interval predictions have improved, correctly including the real value for two of the three variables and being only slightly off for the remaining one. Although the predicted classes do not include the real ones, the distance to them is smaller than before. Moreover, for the injured variable, the possibilities obtained were $\hat{p}_H = (22.69, 32.98, 15.77, 8.08, 3.80)$, and thus class 4 would have been predicted by using the pessimistic rule (see Section 3.3), because δ = 0.5 for this variable. In any case, as already stated, the class predictions are biased towards the mode class, which is usually the lowest one, and this case study is no exception. Further work will be required to develop methods to correct unbalanced samples such as EM-DAT.

Interval predictions appear best suited to the challenge of modeling the EM-DAT database and using it for learning and inference tasks. Once again, some underestimation is present in the predictions as a consequence of the sample's lack of balance. Finally, more independent variables would be useful to distinguish local features of the place affected by a disaster.

5. Conclusions

Humanitarian NGOs play a key role in current emergency response, especially in many developing countries where the available resources or policies make it impossible for local organizations to attend to disasters properly, and international humanitarian aid is therefore required. Moreover, disaster mitigation and risk management are issues whose importance goes beyond humanitarian action. Therefore, the problem discussed in this paper is crucial in natural-hazard management, addressing a gap that some NGOs claim exists. Standard existing DSS have been criticized because of their unrealistic complexity. In this paper, on the one hand, the authors emphasize the need to develop decision support tools specifically addressing this problem, and on the other hand, they show that it is possible to design such a practical decision support tool so that it can be implemented in contexts such as developing countries or NGOs.

In particular, a DSS-DM data-based rule-building methodology has been presented here, which enables damage assessments for multiple disaster scenarios to be generated from available information about the case under study, taking into account historical disaster data and knowledge. Although the system is still a prototype, validation results suggest the suitability of the approach developed in this work.

Future research within the SEDD project will involve, among other things, heuristics for reducing the number of rules which need to be built (by means of fuzzy categorical clustering, for example; see [5]), the introduction of a fuzzy bipolar approach (in the sense of [21]) to model arguments for and against a certain classification, and an extended reasoning methodology which should take into account relations between dependent variables (see, e.g., [18]).
Additional effort should also be devoted to enlarging the knowledge base by adding more premises and dependent variables, to analyzing the robustness of the system with respect to the class definitions, and to estimating the required resources. Moreover, in order to provide a better treatment of information uncertainty, it will be necessary to introduce, together with fuzzy uncertainty measures, complementary probabilistic measures, and to implement in both frameworks the spatial aggregation operators described in [2,3,33].

Acknowledgments

This research has been partially supported by Grants TIN2009-07901, CCGO7-UCM/ESP-2576, and I-Math Consolider C3-0132.

References

[1] Aleskerov F, Say AI, Toker A, Akin H, Altay G. A cluster-based decision support system for estimating earthquake damage and casualties. Disasters 2005;29:255–276.
[2] Amo A, Montero J, Biging G, Cutello V. Fuzzy classification systems. European Journal of Operational Research 2004;156:495–507.
[3] Amo A, Montero J, Molina E. On the representation of recursive rules. European Journal of Operational Research 2001;130:29–53.
[4] Asghar S, Alahakoon D, Churilov L. A dynamic integrated model for disaster management decision support systems. International Journal of Simulation 2006;6(10–11):95–114.
[5] Benati S. Categorical data fuzzy clustering: an analysis of local search heuristics. Computers & Operations Research 2008;35(3):766–75.
[6] Cagliardi M, Spera C. Towards a formal theory of model integration. Annals of Operations Research 1995;58:405–40.
[7] Cova TJ. GIS in emergency management. In: Longley PA, Goodchild MF, Maguire DJ, Rhind DW, editors. Geographical information systems: principles, techniques, applications and management. New York: Wiley; 1999. p. 845–58.
[8] Destercke S, Guillaume S, Charnomordic B. Building an interpretable fuzzy rule base from data using orthogonal least squares: application to a depollution problem. Fuzzy Sets and Systems 2007;158(18):2078–94.
[9] Dolk D, Kottermann J. Model integration and a theory of models. Decision Support Systems 1993;9:51–63.
[10] Drabek TE, Hoetmer GJ, editors. Emergency management: principles and practice for local government. Washington, DC: International City Management Association; 1991.
[11] Eguchi RT, Goltz JD, Seligson HA, Flores PJ, Blais NC, Heaton TH, et al. Real-time loss estimation as an emergency response decision support system: the early post-earthquake damage assessment tool (EPEDAT). Earthquake Spectra 1997;13(4):815–32.
[12] Eom HB, Lee SM. A survey of decision support system applications (1971–April 1988). Interfaces 1990;20(3):65–79.
[13] Eom SB, Lee SM, Kim EB, Somarajan C.
A survey of decision support system applications (1988–1994). Journal of the Operational Research Society 1998;49:109–20.
[14] Griekspoor A, Collins S. Raising standards in emergency relief: how useful are Sphere minimum standards for humanitarian assistance? British Medical Journal 2001;323:740–2.
[15] Matisziw TC, Murray AT. Modeling s–t path availability to support disaster vulnerability assessment of network infrastructure. Computers & Operations Research 2009;36(1):16–26.
[16] Mendonça D, Beroggi EG, Wallace WA. Decision support for improvisation during emergency response operations. International Journal of Emergency Management 2001;1:30–8.
[17] Milly PC, Wetherald RT, Dunne KA, Delworth TL. Increasing risk of great floods in a changing climate. Nature 2002;415:514–7.
[18] Montero J, Gómez D, Bustince H. On the relevance of some families of fuzzy sets. Fuzzy Sets and Systems 2007;158:2439–42.
[19] Morrow BH. Identifying and mapping community vulnerability. Disasters 1999;23(1):1–18.
[20] Olsen GR, Carstensen N, Hoyen K. Humanitarian crisis: what determines the level of emergency assistance? Media coverage, donor interest and the aid business. Disasters 2003;27(2):109–26.
[21] Öztürk M, Tsoukiàs A. Modelling uncertain positive and negative reasons in decision aiding. Decision Support Systems 2007;43(4):1512–26.
[22] Repoussis PP, Paraskevopoulos DC, Zobolas G, Tarantilis CD, Ioannou G. A web-based decision support system for waste lube oils collection and recycling. European Journal of Operational Research 2009;195:676–700.
[23] Rodríguez JT, Vitoriano B, Montero J, Omaña A. A decision support tool for humanitarian organizations in natural disaster relief. In: Ruan D, editor. Computational intelligence in decision and control. Singapore: World Scientific; 2008. p. 600–5.
[24] Schneider PJ, Schauer BA. HAZUS: its development and its future. Natural Hazards Review 2006;7(2):40–4.
[25] Schweizer B, Sklar A. Probabilistic metric spaces. New York: North-Holland/Elsevier; 1983.
[26] Stoddard A. Humanitarian NGOs: challenges and trends. In: Macrae J, Harmer A, editors. Humanitarian action and the global war on terror: a review of trends and issues. HPG Report 14. London: ODI; 2003. p. 25–36.
[27] Todini E. An operational decision support system for flood risk mapping, forecasting and management. Urban Water 1999;1:131–43.
[28] Tufekci S. An integrated emergency management decision support system for hurricane emergencies. Safety Science 1995;20:39–48.
[29] Turban E, Aronson J. Decision support systems and intelligent systems. Upper Saddle River: Prentice-Hall; 1997.
[30] Van Wassenhove LN. Humanitarian aid logistics: supply chain management in high gear. Journal of the Operational Research Society 2006;57:475–89.
[31] Wallace WA, De Balogh F. Decision support systems for disaster management. Public Administration Review 1985;45:134–47.
[32] Webb P, Harinarayan A. A measure of uncertainty: the nature of vulnerability and its relationship to malnutrition. Disasters 1999;23(4):292–305.
[33] Yager R, Kacprzyk J, editors. The ordered weighted averaging operators: theory and applications. Dordrecht: Kluwer Academic Publishers; 1997.
[34] http://news.bbc.co.uk/2/hi/europe/7992936.stm