
Chapter 1

Results and Discussion

1.1 Descriptive Statistics of Second Visit Data Variables

Table 2 shows the descriptive statistics of the second visit variables of
interest (VIs), TOTEX2 and TOTIN2. These were computed to give a brief idea of
how much a household spends and earns over a period of time, to measure the
differences between the statistics of the two variables, and to provide a basis
for comparison with the results of later tests.

Table 2: Descriptive Statistics of the 1997 FIES Second Visit

Variable   Mean        Std. Dev.    Min        Max          N
TOTEX2     102,389.8   129,866.6    8,926.00   3,903,978    4,130
TOTIN2     134,119.4   216,934.9    9,067.00   4,357,180    4,130

The average total spending of a household in the National Capital Region (NCR)
is about Php 102,389.80, while the average total earnings amount to
Php 134,119.40, a difference of more than thirty thousand pesos. TOTIN2 has a
larger mean and standard deviation than TOTEX2. The dispersion can also be seen
from the minimum and maximum of the two variables.
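The comparison above can be reproduced directly from the figures in Table 2. The short sketch below is only an illustration: the coefficient of variation is not reported in the original analysis and is added here as one possible way to compare the relative dispersion of the two variables.

```python
# Summary figures copied from Table 2 (1997 FIES second visit, NCR).
table2 = {
    "TOTEX2": {"mean": 102_389.8, "std": 129_866.6, "min": 8_926.00, "max": 3_903_978.0},
    "TOTIN2": {"mean": 134_119.4, "std": 216_934.9, "min": 9_067.00, "max": 4_357_180.0},
}


def coefficient_of_variation(stats):
    """Standard deviation relative to the mean (unitless).

    Not part of the original analysis; used here only to compare spread.
    """
    return stats["std"] / stats["mean"]


# Gap between average household income and average household expenditure.
mean_gap = table2["TOTIN2"]["mean"] - table2["TOTEX2"]["mean"]
cv = {name: coefficient_of_variation(s) for name, s in table2.items()}
value_range = {name: s["max"] - s["min"] for name, s in table2.items()}

print(f"Mean income minus mean expenditure: {mean_gap:,.1f}")
for name in table2:
    print(f"{name}: CV = {cv[name]:.2f}, range = {value_range[name]:,.0f}")
```

Both the range and the relative spread come out larger for TOTIN2 than for TOTEX2, in line with the reading of the table given in the text.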

1.2 Formation of Imputation Classes

Table 3 shows the results of the Chi-Square Test of Independence, which was
performed to determine whether the candidate matching variables (MVs) are
associated with the VIs. As stated in the methodology, the MV must be strongly
associated with the variables of interest. The first visit variables of
interest, TOTIN and TOTEX, were grouped into four categories in order to satisfy
the assumptions of the association tests. The first visit VIs, rather than the
second visit VIs, were tested for association because the second visit VIs
already contained missing data.

The candidate MVs tested were the provincial area code (PROV), the recoded
education status (CODES1) and the recoded total employed household members
(CODEP1). PROV has four categories. Code 39 is designated for Manila. Code 74 is
designated for NCR District 2, which comprises Quezon City, Mandaluyong City,
San Juan, Marikina and Pasig City. Code 75 is designated for NCR District 3,
which comprises Caloocan, Malabon, Navotas and Valenzuela. The last category,
code 76, is designated for NCR District 4, which includes Makati, Las Piñas,
Muntinlupa, Parañaque, Pasay, Taguig and Pateros.

The candidate MV CODES1 has three categories. The original Education Status
variable had 99 categories; the researchers therefore collapsed these into a
smaller number of groups to reduce heterogeneity and the bias of the estimates.
CODES1 was coded as 1 for respondents whose educational attainment ranged from
No Grade Completed to High School Graduate; 2 for respondents who were College
Undergraduates or College Graduates; and 3 for respondents whose educational
attainment was higher than a Bachelor’s Degree.

CODEP1 also has four categories. The original Total Employed Household Members
variable had 7 categories and, like the Education Status variable, was collapsed
into fewer groups. CODEP1 was coded as 0 for households with no employed
members, 1 for households with one or two employed members, 2 for households
with three or four employed members and 4 for households with five or more
employed members.
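The CODEP1 recoding described above can be written as a small mapping. This is a minimal sketch that assumes the input is a raw count of employed household members; the function name is hypothetical, and the code values (0, 1, 2, 4) are taken as reported in the text.

```python
def recode_employed_members(n_employed: int) -> int:
    """Map a count of employed household members to the CODEP1 code
    described in the text: 0 -> 0, 1-2 -> 1, 3-4 -> 2, 5 or more -> 4."""
    if n_employed < 0:
        raise ValueError("count of employed members cannot be negative")
    if n_employed == 0:
        return 0
    if n_employed <= 2:
        return 1
    if n_employed <= 4:
        return 2
    return 4  # code value as reported in the text


# Recode a few illustrative household counts (hypothetical, not FIES records).
household_counts = [0, 1, 2, 3, 4, 5, 6]
codes = [recode_employed_members(n) for n in household_counts]
print(codes)
```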

Table 3: Results for the Chi-Square Test of Independence for the Matching
Variables

The Chi-Square test of association between the candidates and the variables of
interest showed that PROV, CODES1 and CODEP1 are associated with CODIN1 and
CODEX1. The p-values for all the candidates were less than 0.0001, indicating
that the associations are highly significant. The results of the succeeding
measures of association will determine which of the three candidates will be
chosen as the MV of the study.
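For readers unfamiliar with the test, the statistic behind a chi-square test of independence can be sketched as follows. The counts below are hypothetical and are not taken from the FIES data; the function name is likewise an assumption made for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a
    two-way contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Hypothetical counts (NOT from the FIES data): rows are categories of a
# candidate MV, columns are categories of a recoded VI.
counts = [
    [30, 20, 10],
    [10, 20, 30],
]
chi2, dof = chi_square_statistic(counts)
print(f"chi-square = {chi2:.2f} on {dof} degrees of freedom")
```

Turning the statistic into a p-value requires the chi-square distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.chi2_contingency`) returns the statistic, p-value, degrees of freedom and expected counts in one call.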

Table 4 shows the other measures of association, namely the Phi coefficient,
Cramér's V and the contingency coefficient. These measures were computed to
assess the degree of association of the candidates with CODIN1 and CODEX1.
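All three measures named above are functions of the chi-square statistic and the sample size. The sketch below uses the standard textbook definitions; the input values are hypothetical and do not come from Table 4, and the function name is an assumption.

```python
import math


def association_measures(chi2, n, n_rows, n_cols):
    """Phi coefficient, Cramér's V and the contingency coefficient
    computed from a chi-square statistic on an n_rows x n_cols table."""
    phi = math.sqrt(chi2 / n)
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    contingency_c = math.sqrt(chi2 / (chi2 + n))
    return phi, cramers_v, contingency_c


# Hypothetical inputs: chi-square of 20 from a 2 x 3 table of 120 cases.
phi, cramers_v, contingency_c = association_measures(20.0, 120, 2, 3)
print(f"phi = {phi:.3f}, Cramér's V = {cramers_v:.3f}, C = {contingency_c:.3f}")
```

Values near 0 indicate weak association. Cramér's V is bounded by 1 for any table size, which makes it the easiest of the three to compare across candidates with different numbers of categories.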

Table 4: Tests of Association for Matching Variable: Degree of Association

All the measures showed a small degree of association with the variables CODIN1
and CODEX1. This kind of result is expected in real, complex data, given the
larger variability among the observations. Table 4 shows that CODES1 is the
candidate that exhibits the largest association with the variables of interest
and is therefore the MV most likely to yield homogeneous imputation classes
(ICs). Thus, CODES1 is the chosen MV for this data.

To describe the CODES1 imputation classes in detail, descriptive statistics
were obtained for each class. Table 5 shows these statistics. They indicate
whether the chosen MV decreases the variability of the observations. To check
the variability of each imputation class, its standard deviation is compared
with the overall standard deviation of the variables of interest.

Table 5: Descriptive Statistics of the Data Grouped into Imputation Classes.

Table 5 indicates that IC1 is the imputation class with the smallest standard
deviation. IC2 and IC3 produced large standard deviations; however, these are
offset by the low value for IC1, which contains the largest proportion of the
data. A possible reason why the standard deviation and the mean of IC3 are large
is that the majority of the extreme values fall in that class.
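The check described here, comparing each class's standard deviation with the overall one, might look like the following sketch. The values and class labels are hypothetical and are not the FIES observations; the sample standard deviation is assumed, since the text does not say which form was used.

```python
from statistics import mean, stdev


def dispersion_by_class(values, classes):
    """Return the overall sample std. dev. and, for each class label,
    a tuple of (count, mean, sample std. dev.)."""
    groups = {}
    for value, label in zip(values, classes):
        groups.setdefault(label, []).append(value)
    per_class = {
        label: (len(vals), mean(vals), stdev(vals))
        for label, vals in sorted(groups.items())
    }
    return stdev(values), per_class


# Hypothetical values: most cases sit in IC1, extremes sit in IC3.
values = [10, 12, 11, 13, 40, 60, 50, 200, 400]
classes = ["IC1"] * 4 + ["IC2"] * 3 + ["IC3"] * 2
overall_sd, per_class = dispersion_by_class(values, classes)

print(f"overall std. dev. = {overall_sd:.2f}")
for label, (count, class_mean, class_sd) in per_class.items():
    print(f"{label}: n = {count}, mean = {class_mean:.2f}, std. dev. = {class_sd:.2f}")
```

In this toy data the class holding the extreme values has a standard deviation above the overall figure, while the largest class falls well below it, which is the same pattern the text reports for IC3 and IC1.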

1.2.1 Mean of the Simulated Data by Nonresponse Rate for Each Variable of Interest

Table 6 shows the means of both VIs under the varying rates of nonresponse.
These were generated to briefly describe the effect of the nonresponse rate on
the population mean when the missing values are ignored. More importantly, the
results were used as input in comparing the estimates from the imputed data for
each imputation method (IM).

Table 6: Means of the Retained and Deleted Observations

The means of the observations set to nonresponse and of the observations
retained showed contrasting results. For both TOTEX2 and TOTIN2, as the
nonresponse rate increases, the mean of the observations set to nonresponse also
increases. Conversely, the mean of the observations retained decreases as the
nonresponse rate increases. A possible explanation is that large values were set
to nonresponse, which raised the means of the observations set to nonresponse
across the varying rates of nonresponse. Hence, as the number of missing values
increases, the deviation between the means of the actual and retained data
slowly increases.
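The relationship between the retained, deleted and full-data means can be illustrated with a few lines of code. The values and the deletion pattern below are hypothetical; the mechanism actually used to set observations to nonresponse is described in the methodology, not here.

```python
from statistics import mean


def split_means(values, is_missing):
    """Mean of the retained observations and mean of the observations
    set to nonresponse, given a parallel list of missingness flags."""
    retained = [v for v, missing in zip(values, is_missing) if not missing]
    deleted = [v for v, missing in zip(values, is_missing) if missing]
    return mean(retained), mean(deleted)


# Hypothetical household values with three of ten set to nonresponse (30% NRR).
values = [20, 35, 50, 80, 120, 150, 300, 500, 900, 1500]
is_missing = [False, False, True, False, False, False, False, True, False, True]

retained_mean, deleted_mean = split_means(values, is_missing)
actual_mean = mean(values)
print(f"actual = {actual_mean:.1f}, retained = {retained_mean:.1f}, "
      f"deleted = {deleted_mean:.1f}")
```

Because the full-data mean is a weighted average of the two group means, the retained mean must fall below the actual mean whenever the deleted observations average higher, and the gap widens as more large values are removed.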

1.2.2 Regression Model Adequacy

Table 7 shows the regression models, for all VIs and nonresponse rates (NRRs),
that were checked for adequacy. The columns are as follows: (a) the VI, (b) the
nonresponse rate (NRR), (c) the IC, (d) the prediction model, (e) the
coefficient of determination (R²) and (f) the F-statistic, with its
corresponding p-value in parentheses.

In Table 7, the codes IC1, IC2 and IC3 represent the first, second and third
imputation classes, respectively. In the regression equations used for the
regression imputation, ŷ_i represents the dependent variable, which is the
predicted second visit value of TOTIN2 or TOTEX2. Logarithmic transformations
were applied to correct for non-linearity in the regression equations. The code
LNFVE1_i is the logarithmic transformation of the first visit observation of
Total Expenditure (TOTEX) under the first imputation class. Similarly, LNFVI1_i
is the logarithmic transformation of the first visit observation of Total Income
(TOTIN) under the first imputation class. The same notation applies to LNFVE2_i
and LNFVE3_i under the second and third imputation classes for TOTEX, and to
LNFVI2_i and LNFVI3_i under the second and third imputation classes for TOTIN.
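A regression of this kind, together with the R² and F-statistic reported in Table 7, can be sketched as follows. This is an illustration under stated assumptions: the data are hypothetical, only the first visit predictor is log-transformed (the text does not say here whether the response was transformed as well), and the function name is invented for the example.

```python
import math


def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x, returning the coefficients,
    residuals, R^2 and the overall F-statistic (1 and n - 2 d.f.)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)        # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)    # total variation
    r_squared = 1 - sse / sst
    f_stat = (sst - sse) / (sse / (n - 2))
    return {"b0": b0, "b1": b1, "residuals": residuals,
            "r_squared": r_squared, "f_stat": f_stat}


# Hypothetical first and second visit values for one imputation class.
first_visit = [50_000, 80_000, 120_000, 200_000, 350_000, 600_000]
second_visit = [55_000, 78_000, 130_000, 190_000, 360_000, 580_000]

# Assumption: the predictor is the natural log of the first visit value.
ln_first_visit = [math.log(v) for v in first_visit]
model = fit_simple_ols(ln_first_visit, second_visit)
print(f"R^2 = {model['r_squared']:.3f}, F = {model['f_stat']:.2f}")
```

In practice one such model would be fitted separately for each combination of VI, nonresponse rate and imputation class, using only the retained observations.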

Table 7: Model Adequacy Results



Table 7 shows the regression models used for the regression imputations under
their respective VIs and ICs. Before these equations were used to impute missing
values, the models underwent diagnostic checking for linearity, normality of
error terms, independence of error terms and constancy of variance.

First, the researchers looked at the coefficient of determination (R²) of each
regression equation to determine how much of the variation in the second visit
VI is explained by the first visit VI. A large R² indicates that the model fits
the data well. The highest R² in Table 7 is 93.2% (the equation for TOTEX2 under
IC3 with a 30% nonresponse rate). The lowest is 70.3%, for the equation for
TOTIN2 under IC1 with a 20% NRR. For all NRRs and VIs, the third IC generated
the highest R² while the first IC produced the lowest.

Second, the models were checked against the assumption of linearity, using the
ANOVA tables presented in Appendix C. The diagnostic checking showed that all
models satisfied the assumption of linearity. The p-values for all the models
were less than 0.0001, indicating that the linear relationship in each model is
highly significant.

Third, the regression models were checked against the assumption of normality.
For this study, the researchers examined the normal probability plot (NPP) of
each regression model. In all models the NPP moderately follows an S-shaped
pattern, which indicates that the residuals are not normal but rather lognormal.
However, the shape of the NPP improved after the ln transformation was applied,
even though the model was not linear before the transformation. Since the data
are complex, the models were used even though the assumption of normal residuals
is not perfectly met.
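The coordinates of a normal probability plot can be computed as sketched below. The residuals are hypothetical, and the plotting positions (i - 0.375) / (n + 0.25) are one common convention chosen for this example; the text does not state which convention the researchers' software used.

```python
from statistics import NormalDist


def normal_probability_points(residuals):
    """Pair each ordered residual with the standard normal quantile at
    plotting position (i - 0.375) / (n + 0.25), for i = 1..n."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [
        std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)
    ]
    return list(zip(theoretical, ordered))


# Hypothetical residuals from a fitted model.
points = normal_probability_points([0.5, -1.2, 0.1, 2.3, -0.4])
for quantile, residual in points:
    print(f"{quantile:+.3f}  {residual:+.3f}")
```

If the residuals were normal, the plotted points would fall close to a straight line; a systematic S-shaped departure is the pattern the text describes.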

Fourth, the Durbin-Watson test was used to test the assumption of independence
of error terms. The results in Appendix C show that all of the models satisfy
the assumption of independence. In any case, because the data in this paper are
not time series data, for which independence of error terms is particularly
important, this assumption was not considered further.
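For reference, the Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares. A minimal sketch with hypothetical residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no first-order
    autocorrelation, values near 0 positive and near 4 negative."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences illustrating the two extremes.
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # strong negative pattern
constant = durbin_watson([1.0, 1.0, 1.0, 1.0])       # strong positive pattern
print(alternating, constant)
```

The statistic depends on the order of the observations, which is why it is most meaningful for time-ordered data, as the text notes.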

Lastly, to check whether the residuals satisfy homoscedasticity, or equality of
variances, a scatter plot of the residuals against the predicted values was
obtained. No distinct patterns were evident in the scatter plot. The logarithmic
transformation resolved the problem of heteroscedasticity.

Hence, given this discussion, the results show that the assumptions checked for
the regression equations used in the regression imputations are adequately
satisfied.
