Variables
Table 2 shows the descriptive statistics of the second visit variables of
interest (VIs), TOTEX2 and TOTIN2. These were computed to give a brief idea of
how much a household spends and earns in a period of time, to measure the
differences between the statistics of the two variables, and to allow comparison
with the results of other tests later on.
The average total spending of a household in the National Capital Region (NCR)
is about Php 102,389.80, while the average total earnings amounted to
Php 134,119.40, a difference of more than thirty thousand pesos. It can be noted
that the observations from TOTIN2 have a larger mean and standard deviation than
those from TOTEX2. The dispersion can also be seen by looking at the minimum and
maximum of the two variables.
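As a rough illustration of how the statistics in Table 2 could be computed, the
sketch below summarizes two lists of household values. The function name
`describe` and the sample figures are hypothetical and are not the survey data.

```python
import statistics

def describe(values):
    """Return the mean, sample standard deviation, minimum and maximum."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }

# Hypothetical second-visit values in pesos, for illustration only.
totex2 = [50000.0, 80000.0, 120000.0, 150000.0]
totin2 = [60000.0, 90000.0, 160000.0, 250000.0]

ex = describe(totex2)
inc = describe(totin2)
gap = inc["mean"] - ex["mean"]  # mean earnings minus mean spending
print(ex, inc, gap)
```

With these made-up values the income series has both the larger mean and the
larger spread, which is the same qualitative pattern the paragraph describes.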
Table 3 shows the results of the Chi-Square Test of Independence, which was
performed to determine whether the candidate matching variables (MVs) are
associated with the VIs. The MV stated in the methodology must be highly
correlated with the variables of interest. The first visit variables of
interest, TOTIN and TOTEX, were grouped into four categories in order to satisfy
the assumptions of the association tests. The first visit VIs, rather than the
second visit VIs, were tested for association because the second visit VIs
already contained missing data.
The candidate MVs that were tested are the provincial area codes (PROV), recoded
education status (CODES1) and recoded total employed household members (CODEP1).
PROV has four categories. Code 39 designates Manila, while code 74 designates
NCR District 2, which comprises Quezon City, Mandaluyong City, San Juan,
Marikina and Pasig City. Code 75 designates NCR District 3, which covers
Caloocan, Malabon, Navotas and Valenzuela. The last category for PROV is 76,
NCR District 4, which includes Makati, Las Piñas, Muntinlupa, Parañaque,
The candidate MV CODES1 has three categories. The original Education Status
variable had 99 categories; hence, the researchers grouped these into fewer
categories to reduce the heterogeneity and the bias of the estimates. CODES1 was
coded as 1 for respondents whose educational attainment ranged from No Grade
Completed to High School Graduate; 2 for respondents who were College
Undergraduates or College Graduates; and 3 for respondents whose educational
attainment was higher than a Bachelor’s Degree.
CODEP1 also has four categories. The original Total Employed Household Members
variable had 7 categories and, like the Education Status variable, was reduced
to fewer groups. CODEP1 was coded as 0 for households with no employed members,
1 for households with one or two employed members, 2 for households with three
or four employed members and 4 for households with five or more employed
members.
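The CODEP1 recoding described above can be sketched as a small mapping function.
The function name `recode_employed` is hypothetical; the cut-offs and the codes
(including the code 4 for the last group) follow the text. The CODES1 recoding
is not sketched because the original 99 education codes are not listed here.

```python
def recode_employed(n_employed):
    """Map a count of employed household members to a CODEP1 code."""
    if n_employed < 0:
        raise ValueError("count of employed members must be non-negative")
    if n_employed == 0:
        return 0  # no employed members
    if n_employed <= 2:
        return 1  # one or two employed members
    if n_employed <= 4:
        return 2  # three or four employed members
    return 4      # five or more employed members, coded as listed in the text

# Hypothetical household counts, for illustration only.
codes = [recode_employed(n) for n in [0, 1, 2, 3, 4, 5, 7]]
print(codes)
```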
Table 3: Results for the Chi-Square Test of Independence for the Matching
Variables
The Chi-Square test of association between the candidates and the variables of
interest showed that PROV, CODES1 and CODEP1 are associated with CODIN1 and
CODEX1. The p-values for all the candidates were less than 0.0001, indicating
that the association is highly significant. The results of the succeeding
measures of association will determine which of the three candidates will be
chosen as the MV of the study.
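For readers unfamiliar with the test, the sketch below computes the Pearson
chi-square statistic and its degrees of freedom from a contingency table of
counts. The function name `chi_square` and the 2 x 2 table are hypothetical and
are not the counts behind Table 3. The p-value is not computed here; it would
come from the chi-square distribution with the returned degrees of freedom, for
example through a routine such as `scipy.stats.chi2_contingency`.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows are MV categories, columns are VI categories.
table = [[30, 10], [10, 30]]
stat, dof = chi_square(table)
print(stat, dof)
```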
All the tests showed only a small degree of association with the variables CODIN
and CODEX. This kind of result is expected in real complex data, given the
larger variability among the observations. Table 4 clearly shows that CODES1 is
the MV that exhibits the largest association among the candidates and is
therefore the MV that can best ensure that the imputation classes (ICs) are
homogeneous. Thus, CODES1 is the chosen MV for this data.
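The text does not name the measures of association reported in Table 4. As one
common choice for categorical variables, and purely as an assumption, the sketch
below computes Cramér's V, which rescales the chi-square statistic to the range
0 to 1. The function name `cramers_v` and the counts are hypothetical.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    k = min(len(table), len(table[0])) - 1  # smaller dimension minus one
    return math.sqrt(chi2 / (n * k))

# Hypothetical counts, for illustration only.
v = cramers_v([[30, 10], [10, 30]])
print(v)
```

Under this assumed measure, the candidate with the largest value would be the
one most strongly associated with the categorized VI.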
with the value from the overall standard deviation of the variables of interest.
The table shown above indicates that IC1 is the imputation class with the
smallest standard deviation. The other two ICs, IC2 and IC3, produced large
standard deviations; however, these are offset by the low value from IC1, which
has the largest proportion of the data. A possible reason why the standard
deviation and the mean of IC3 are large is that the majority of the extreme
values fall in that class.
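A minimal sketch of the per-class summary discussed above is given below. It
groups values by imputation class and reports each class's share of the data,
mean and standard deviation. The function name `class_summary`, the class labels
and the values are hypothetical.

```python
import statistics
from collections import defaultdict

def class_summary(values, classes):
    """Share of the data, mean and sample SD within each imputation class."""
    groups = defaultdict(list)
    for value, label in zip(values, classes):
        groups[label].append(value)
    n = len(values)
    return {
        label: {
            "share": len(group) / n,
            "mean": statistics.mean(group),
            "sd": statistics.stdev(group),  # needs at least two values per class
        }
        for label, group in groups.items()
    }

# Hypothetical data: class 1 is the largest and the most homogeneous.
values = [10.0, 12.0, 11.0, 13.0, 40.0, 60.0, 90.0, 200.0]
classes = [1, 1, 1, 1, 2, 2, 3, 3]
summary = class_summary(values, classes)
print(summary)
```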
Table 6 shows the means of both VIs under the varying rates of nonresponse. This
was generated to give a brief description of the effect of the nonresponse rate
on the population mean when the missing values are ignored. More importantly,
the results below were used as input in the comparison of the estimates from the
imputed data for each imputation method (IM).
The means of the observations set to nonresponse and of the observations
retained showed contrasting results. For both variables, TOTEX2 and TOTIN2, as
the nonresponse rate increases, the mean of the observations set to nonresponse
also increases. Conversely, the mean of the observations retained decreases as
the nonresponse rate increases. Perhaps large values were among those set to
nonresponse, which raised the means of the observations set to nonresponse
across the varying rates. Hence, as the number of missing values increases, the
deviation between the means of the actual and retained data slowly increases.
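The comparison in this paragraph can be sketched as follows: given the complete
values and the positions set to nonresponse, compute the mean of the deleted
observations and the mean of the retained ones. How the nonresponse was
generated is not restated here, so the positions are supplied by hand. The
function name `split_means` and the data are hypothetical.

```python
import statistics

def split_means(values, missing_idx):
    """Return (mean of deleted observations, mean of retained observations)."""
    missing = set(missing_idx)
    deleted = [v for i, v in enumerate(values) if i in missing]
    retained = [v for i, v in enumerate(values) if i not in missing]
    return statistics.mean(deleted), statistics.mean(retained)

# Hypothetical data in which one large value is set to nonresponse.
values = [10.0, 20.0, 30.0, 40.0, 100.0]
full_mean = statistics.mean(values)
deleted_mean, retained_mean = split_means(values, [4])
print(full_mean, deleted_mean, retained_mean)
```

When a large value is among those deleted, the retained mean falls below the
mean of the actual data, which is the direction of deviation described above.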
Table 7 shows the different regression models for all VIs and nonresponse rates
(NRRs) that were checked for adequacy. The columns are as follows: (a) the VI,
(b) the nonresponse rate (NRR), (c) the IC, (d) the prediction model, (e) the
coefficient of determination (R²) and (f) the F-statistic with its corresponding
p-value in parentheses.
For the notation used in Table 7, the codes IC1, IC2 and IC3 represent the
first, second and third imputation class, respectively. In the regression
equations used for the regression imputation, ŷi represents the dependent
variable, which is the predicted second visit value for the variable TOTIN2 or
TOTEX2. Logarithmic transformations were applied to correct the non-linearity of
the regression equations. The code (LNFVE1i) is the logarithmic transformation
of the first visit observation for the variable Total Expenditure (TOTEX) under
the First Imputation Class. Similarly, (LNFVI1i) is the logarithmic
transformation of the first visit observation for the variable Total Income
(TOTIN) under the First Imputation Class. The same notation applies to (LNFVE2i)
and (LNFVE3i) under the Second and Third Imputation Class for the variable
TOTEX, and to (LNFVI2i) and (LNFVI3i) under the Second and Third Imputation
Class for the variable TOTIN.
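To make the notation concrete, the sketch below fits a simple least-squares
model of the second visit value on the natural log of the first visit value
within one imputation class, and reports R² and the F-statistic. It assumes
that only the predictor is log-transformed, as the notation above suggests; the
actual models are those in Table 7. The function name `fit_log_model` and the
data are hypothetical.

```python
import math

def fit_log_model(first_visit, second_visit):
    """Fit y = b0 + b1 * ln(x) by least squares; return (b0, b1, R^2, F)."""
    x = [math.log(v) for v in first_visit]  # plays the role of LNFVE or LNFVI
    y = list(second_visit)
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sse / sst
    # F = MSR / MSE with 1 and n - 2 degrees of freedom (requires sse > 0).
    f_stat = (sst - sse) / (sse / (n - 2))
    return b0, b1, r2, f_stat

# Hypothetical first and second visit values for one imputation class.
first_visit = [40000.0, 60000.0, 90000.0, 150000.0, 300000.0]
second_visit = [45000.0, 58000.0, 100000.0, 140000.0, 310000.0]
b0, b1, r2, f_stat = fit_log_model(first_visit, second_visit)
print(b0, b1, r2, f_stat)
```

A missing second visit value would then be imputed as b0 + b1 * ln(x) using the
household's first visit value x and the coefficients of its imputation class.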
Table 7 showed the regression models used for the regression imputations under
their respective VIs and ICs. Before these equations were used for imputing
missing values, the models underwent diagnostic checking for Linearity,
Normality of Error Terms, Independence of Error Terms and Constancy of Variance.
Second, the models were checked for linearity using the ANOVA tables presented
in Appendix C. The results showed that all models satisfied the assumption of
linearity. The p-values for all the models were less than 0.0001, an indication
that the linear relationship in each model is highly significant.
Third, the regression models were checked for normality of the error terms. For
this study, the researchers examined the Normal Probability Plot (NPP) of each
regression model. The NPP in all models moderately follows an S-shaped pattern,
which indicates that the residuals are not normal but rather lognormal. However,
the shape of the NPP improved after the ln transformation was applied, even
though the model was not linear previously. Since the data used are complex
data, the models were used even though the normality of the residuals was not
perfectly achieved.
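The coordinates of a normal probability plot can be sketched as below: the
ordered residuals are paired with standard normal quantiles, and points that
fall close to a straight line suggest normal residuals. The plotting positions
(i - 0.375) / (n + 0.25) are one common convention and are an assumption here,
as are the function name `npp_points` and the residuals.

```python
from statistics import NormalDist

def npp_points(residuals):
    """Pair standard normal quantiles with the ordered residuals."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [
        std_normal.inv_cdf((i - 0.375) / (n + 0.25))  # assumed plotting position
        for i in range(1, n + 1)
    ]
    return list(zip(quantiles, ordered))

# Hypothetical residuals, for illustration only.
points = npp_points([3.0, -1.0, 0.5, -2.0, 1.5])
for quantile, residual in points:
    print(round(quantile, 3), residual)
```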
Fourth, the Durbin-Watson test was used to test the assumption of independence
of error terms. The results in Appendix C show that all of the models satisfy
this assumption. However, because the data in this paper are not time series
data, for which independence of error terms is particularly important, the
assumption of independence was not considered further.
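For reference, the Durbin-Watson statistic is the sum of squared differences
between successive residuals divided by the sum of squared residuals. Values
near 2 suggest no first-order autocorrelation, while values near 0 or 4 suggest
positive or negative autocorrelation. The sketch below is illustrative only; the
function name `durbin_watson` and the residuals are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals taken in their given order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that alternate in sign (negative autocorrelation).
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Hypothetical residuals that never change (positive autocorrelation).
d_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
print(d_alternating, d_constant)
```

Because the statistic depends on the ordering of the residuals, it is most
meaningful when the observations have a natural sequence such as time.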
Hence, given this discussion, the results show that the assumptions for the di-
agnostic checking of the regression equations used for the regression imputations
are satisfied.