Multiple Linear Regression Using Rank-Based Test of Asymptotic Free Distribution

  


  Kuntoro†

  † Department of Biostatistics and Population Study, Airlangga University School of Public Health, Surabaya 60115, Indonesia (e-mail: kuntoro1@indo.net.id)

  Abstract An experimental design is the classical approach for proving a causal relationship. In public health research, including maternal and child health studies, it is sometimes difficult to control experimental conditions properly, and there may also be ethical objections to conducting an experiment. A multiple regression approach, whose model involves a dependent variable and a number of independent variables, can be an alternative for demonstrating a causal relationship in a nonexperimental study.

  In maternal and child health studies that involve variables measured on ordinal scales, such as knowledge, attitude, and practice, an ordinary regression model is not the best choice for analyzing those variables. A rank-based, asymptotically distribution-free test is a better alternative. The Jaeckel-Hettmansperger-McKean statistic, HM, is used to demonstrate the effect of knowledge about safe water and attitude upon drinking unboiled water on practice of drinking unboiled water. The data were obtained from a sample of mothers with children under five years of age in 14 districts in East Java Province, Indonesia.

  The results show that the Hodges-Lehmann estimate of tau is 0.5329. The Jaeckel dispersion measure is 0.00002721. The HM statistic for testing the null hypothesis Beta1 = Beta2 = 0 is 0.000102. Under the null hypothesis, the HM statistic has a sampling distribution that is approximately chi-square. Since the result is less than the critical point of 5.99 (2 degrees of freedom, level of significance 0.05), the null hypothesis fails to be rejected. That means there is no effect of knowledge and attitude on practice.

  It is concluded that the procedure is quite simple compared to the ordinary regression procedure, and no assumption is made. It is easy to use. It is recommended to use HM


   independent variable and the Y variable as the dependent variable. A regression model as a statistical tool resembles an experimental model as a research methodology tool, in that both connect the independent variable with the dependent variable (Jöreskog and Sörbom, 1988).

  Today many researchers in the social sciences, economics, and the behavioral sciences use the regression model to demonstrate causal relationships under nonexperimental conditions. They use a quantitative approach to collect the data. Most of the data, such as motivation, attitude, knowledge, practice, and performance, are on an ordinal scale. Hence, one of the classical assumptions of the ordinary regression model, the one concerning the scale of the data, is violated.

  Researchers who are not statisticians argue that a statistical method is just a tool for supporting their findings, whether or not it violates the assumptions. They consider that a statistical tool is not an objective of the research process. A statistician should explain to them that the results of the research are most valuable when they are analyzed by means of an appropriate statistical method. Over the years, statisticians have developed statistical methods that are intended to help researchers analyze their data properly. This paper discusses the application of the regression model when the data do not have an interval or a ratio scale. The first section discusses the basic concept of nonparametric multiple linear regression. The second one applies that statistical method to data collected in health research.

2 Basic Concept

  The basic concept to be discussed includes the data to be used, the assumptions of the multiple regression model, the hypothesis to be formulated, the procedure for computing the statistic, and the treatment of ties.

  2.1 Data

  Suppose $\mathbf{x} = [x_1\ x_2\ \ldots\ x_p]$ is a row vector of $p$ independent variables, and $\mathbf{x}_1 = (x_{11}, x_{21}, \ldots, x_{p1}), \ldots, \mathbf{x}_n = (x_{1n}, x_{2n}, \ldots, x_{pn})$ are $n$ fixed values of this vector. For each of the vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, the value of the single response random dependent variable $Y$ is observed. Hence, a set of observations $Y_1, Y_2, \ldots, Y_n$ is obtained, in which $Y_i$ is the value of the dependent variable when $\mathbf{x} = \mathbf{x}_i$.

2.2 Assumptions

  First of all, the following equation represents the multiple regression model:
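  The equation itself is missing at this point. Judging from the symbols used in the rest of this section (an intercept $\xi$, a parameter vector $\boldsymbol{\beta}$, and errors $\epsilon_i$), it is presumably the usual linear model. The form below is an assumption based on those symbols, not a quotation of the original equation:

```latex
Y_i = \xi + \mathbf{x}_i \boldsymbol{\beta} + \epsilon_i
    = \xi + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i,
\qquad i = 1, 2, \ldots, n.
```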

  

  Secondly, the error random variables $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are a random sample from a continuous distribution that is symmetric about its median 0. The distribution has cumulative distribution function $F(\cdot)$ and probability density function $f(\cdot)$ satisfying the mild condition $\int_{-\infty}^{+\infty} f^2(t)\,dt < \infty$.

  2.3 Hypothesis

  In this regression model, the emphasis is on testing the null hypothesis that a specific subset $\boldsymbol{\beta}_q$ of the regression parameters $\boldsymbol{\beta}$ is equal to zero. Without loss of generality (because the ordering of the $(x_1, \beta_1), (x_2, \beta_2), \ldots, (x_p, \beta_p)$ pairs in the equation is arbitrary), this subset is taken to be the first $q$ components of $\boldsymbol{\beta}$, that is, $\boldsymbol{\beta}_q = [\beta_1\ \beta_2\ \ldots\ \beta_q]'$. Hence, the hypothesis to be tested is

  $$H_0: \left[\boldsymbol{\beta}_q = \mathbf{0};\ \boldsymbol{\beta}_{p-q} = (\beta_{q+1}, \beta_{q+2}, \ldots, \beta_p)' \text{ and } \xi \text{ not specified}\right] \qquad (4)$$

  This statement says that, under the null hypothesis, the independent variables $x_1, x_2, \ldots, x_q$ do not play a significant role in determining the value of the dependent variable $Y$. (In many settings, the interest is in assessing the effect of all the independent variables simultaneously, which corresponds to taking $q = p$ in the null hypothesis.)

  2.4 Procedure

  The Jaeckel-Hettmansperger-McKean test statistic HM is computed in several steps.

  The first step is to obtain an unrestricted estimator for the vector of regression parameters $\boldsymbol{\beta}$. Let $R_i(\boldsymbol{\beta})$ be the rank of $Y_i - \mathbf{x}_i\boldsymbol{\beta}$ among $Y_1 - \mathbf{x}_1\boldsymbol{\beta}, Y_2 - \mathbf{x}_2\boldsymbol{\beta}, \ldots, Y_n - \mathbf{x}_n\boldsymbol{\beta}$, as a function of $\boldsymbol{\beta}$, for $i = 1, 2, \ldots, n$. The unrestricted estimator for $\boldsymbol{\beta}$ is a special case of a class of estimators proposed by Jaeckel (1972). The estimator $\hat{\boldsymbol{\beta}}$ is the value of $\boldsymbol{\beta}$ that minimizes the measure of dispersion

  $$D_J(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = (12)^{1/2} (n+1)^{-1} \sum_{i=1}^{n} \left[R_i(\boldsymbol{\beta}) - \tfrac{1}{2}(n+1)\right] (Y_i - \mathbf{x}_i\boldsymbol{\beta}) \qquad (5)$$

  In general, the estimator $\hat{\boldsymbol{\beta}}$ does not have a closed-form expression, and iterative numerical methods are needed to obtain it. In MINITAB, the value can be obtained with the RREG command.
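  As an illustration of the dispersion measure in equation (5), the following minimal Python sketch evaluates it for a given set of residuals, using average ranks for ties. The function names are hypothetical and this is not the routine MINITAB uses. The example takes the practice scores of the 14 districts (eight scored 3 and six scored 2) with all slopes set to zero, so the residuals are simply the scores themselves.

```python
import math

def average_ranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    ordered = sorted(values)
    # First 1-based position of v plus half the span of its tie group.
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def jaeckel_dispersion(residuals):
    """Jaeckel dispersion with Wilcoxon scores, as in equation (5)."""
    n = len(residuals)
    ranks = average_ranks(residuals)
    total = sum((r - (n + 1) / 2) * e for r, e in zip(ranks, residuals))
    return math.sqrt(12) / (n + 1) * total

# Practice scores: eight districts scored 3 ("never"), six scored 2 ("ever").
pract = [3] * 8 + [2] * 6
print(round(jaeckel_dispersion(pract), 4))  # 5.5426
```

  With zero slopes the dispersion is about 5.5426, which is in line with the rank dispersion values reported in the MINITAB printout in Section 4.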

  The second step repeats the first step, except that the Jaeckel measure of dispersion $D_J(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$ is minimized under the condition that the null hypothesis is true, that is, $\boldsymbol{\beta}_q = \mathbf{0}$ with $\boldsymbol{\beta}_{p-q}$ unspecified. Suppose


   By combining the results of the three steps, the Jaeckel-Hettmansperger-McKean test statistic HM is expressed as follows:

  $$HM = \frac{2 D_J}{\hat{\tau}} \qquad (8)$$

  If the null hypothesis is true and $n$ tends to infinity, the HM statistic has an asymptotic chi-square distribution ($\chi^2$) with $q$ degrees of freedom, corresponding to the $q$ constraints placed on $\boldsymbol{\beta}$ under the null hypothesis.

  To test the null hypothesis

  $$H_0: \left[\boldsymbol{\beta}_q = \mathbf{0};\ \boldsymbol{\beta}_{p-q} = (\beta_{q+1}, \beta_{q+2}, \ldots, \beta_p)' \text{ and } \xi \text{ not specified}\right]$$

  against the alternative hypothesis

  $$H_1: \left[\boldsymbol{\beta}_q \neq \mathbf{0};\ \boldsymbol{\beta}_{p-q} = (\beta_{q+1}, \beta_{q+2}, \ldots, \beta_p)' \text{ and } \xi \text{ not specified}\right]$$

  at the level of significance $\alpha$:

  Reject the null hypothesis if $HM \geq \chi^2_{q,\alpha}$; accept the null hypothesis if $HM < \chi^2_{q,\alpha}$, $\qquad (9)$

  where $\chi^2_{q,\alpha}$ is the upper $\alpha$ percentile point of the chi-square distribution with $q$ degrees of freedom. The value of $\chi^2_{q,\alpha}$ can be obtained from the statistical tables available in statistics textbooks.
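  The decision rule can be sketched in a few lines of Python, using the dispersion measure (0.00002721) and the estimate of tau (0.5329) reported in the abstract. The function names are hypothetical. The critical value uses the closed form that holds only for 2 degrees of freedom, where the upper tail probability of the chi-square distribution is exp(-c/2).

```python
import math

def hm_statistic(dispersion_measure, tau_hat):
    """HM = 2 * D_J / tau_hat, as in equation (8)."""
    return 2.0 * dispersion_measure / tau_hat

def chi2_upper_point_df2(alpha):
    """Upper-alpha point of chi-square with 2 df: P(X > c) = exp(-c / 2)."""
    return -2.0 * math.log(alpha)

hm = hm_statistic(0.00002721, 0.5329)   # values reported in the abstract
critical = chi2_upper_point_df2(0.05)   # q = 2 constraints
reject_null = hm >= critical
print(f"HM = {hm:.6f}, critical = {critical:.2f}, reject H0: {reject_null}")
# HM = 0.000102, critical = 5.99, reject H0: False
```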

  Hettmansperger and McKean (1977) and McKean and Sheather (1991) caution that in applications with small to moderate sample sizes, the chi-square distribution is often too light-tailed. They suggest replacing the chi-square percentile $\chi^2_{q,\alpha}$ by $q F_{q,n-p-1;\alpha}$, where $F_{q,n-p-1;\alpha}$ is the upper $\alpha$ percentile of the F distribution with $q$ numerator degrees of freedom and $n - p - 1$ denominator degrees of freedom.
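  For the setting of this paper (n = 14 districts, p = 2 independent variables, q = 2 constraints), the adjusted critical value can be sketched as follows. The function name is hypothetical, and the closed form is valid only when the numerator degrees of freedom equal 2, in which case P(F > x) = (1 + 2x/m)^(-m/2) for m denominator degrees of freedom. For other values of q, a statistical table or library is needed.

```python
def f_upper_point_num_df2(alpha, denom_df):
    """Upper-alpha point of F(2, m), from P(F > x) = (1 + 2x/m) ** (-m/2)."""
    return (denom_df / 2.0) * (alpha ** (-2.0 / denom_df) - 1.0)

n, p, q, alpha = 14, 2, 2, 0.05
f_point = f_upper_point_num_df2(alpha, n - p - 1)   # F with 2 and 11 df
adjusted_critical = q * f_point
print(round(f_point, 2), round(adjusted_critical, 2))  # 3.98 7.96
```

  The adjusted critical value is larger than the chi-square value of 5.99, so the correction makes the test more conservative in a sample of this size.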

  TIES: when ties exist among $Y_1 - \mathbf{x}_1\boldsymbol{\beta}, Y_2 - \mathbf{x}_2\boldsymbol{\beta}, \ldots, Y_n - \mathbf{x}_n\boldsymbol{\beta}$, use average ranks to break the ties in computing the minimum of $D_J(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$. Similarly, when ties exist among these differences in the minimization under the null hypothesis, use average ranks to break the ties in computing that minimum of $D_J(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$.

  


3 Material And Method

3.1 Material

  To show the computation of the Jaeckel-Hettmansperger-McKean test statistic HM, the secondary data collected by Kuntoro (2001) are used in this paper. The data were collected from 2804 students of elementary schools who lived in 14 districts in East Java Province, Indonesia. The variables knowledge about safe water, attitude upon drinking unboiled water, and practice of drinking unboiled water are selected. The level of knowledge about safe water is scored 2 for good knowledge and 1 for bad knowledge. The level of attitude upon drinking unboiled water is scored 5 for strongly disagree, 4 for disagree, 3 for doubtful, 2 for agree, and 1 for strongly agree. The level of practice of drinking unboiled water is scored 3 for never, 2 for ever, and 1 for always. The unit of analysis is the district. For each unit of analysis, the score of a variable is the score of the level that has the highest percentage. For example, in the district of Ponorogo, the level of knowledge with the highest percentage is bad, so the score for knowledge is 1. The level of attitude with the highest percentage is strongly disagree, so the score for attitude is 5. The level of practice with the highest percentage is never, so the score for practice is 3.
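  The scoring rule can be sketched as a small Python helper. The score maps follow the text above. The function name and the example percentages are hypothetical and serve only as an illustration; they are not survey data.

```python
# Score maps taken from the scoring rules described in the text.
KNOWLEDGE_SCORES = {"Good": 2, "Bad": 1}
ATTITUDE_SCORES = {
    "Strongly Disagree": 5, "Disagree": 4, "Doubtful": 3,
    "Agree": 2, "Strongly Agree": 1,
}
PRACTICE_SCORES = {"Never": 3, "Ever": 2, "Always": 1}

def modal_level_score(percentages, score_map):
    """Return (level, score) for the level with the highest percentage.

    If two levels share the highest percentage, the first one listed wins.
    """
    level = max(percentages, key=percentages.get)
    return level, score_map[level]

# Hypothetical percentages for one district (illustration only).
example_knowledge = {"Good": 40.0, "Bad": 60.0}
print(modal_level_score(example_knowledge, KNOWLEDGE_SCORES))  # ('Bad', 1)
```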

  The following table shows the highest percentage of level of knowledge, attitude, and practice and their scores.

  Table 1. The Highest Percentage of Level of Knowledge, Attitude, and Practice

  Columns: District | Knowledge About Safe Water (%, Level and Score) | Attitude Upon Drinking Unboiled Water (%, Level and Score) | Practice of Drinking Unboiled Water (%, Level and Score)

  Districts: Ponorogo, Blitar, Kediri, Malang, Lumajang, Jember, Bondowoso, Probolinggo, Mojokerto, Bojonegoro, Tuban, Lamongan, Sampang, Sumenep

  The row alignment of the table is not recoverable here, so the surviving entries are listed column by column, in no particular order. The scores follow from the levels as defined in Section 3.1.

  Knowledge About Safe Water (% and level): 0.2 Good; 74.1 Good; 65.3 Good; 52.0 Bad; 68.6 Bad; 50.9 Bad; 64.0 Good; 81.3 Bad; 80.0 Good; 79.9 Bad; 69.1 Good; 74.0 Good; 61.7 Good; 65.6 Good

  Attitude Upon Drinking Unboiled Water (% and level): 55.7 Disagree; 48.6 Disagree; 73.7 Disagree; 37.9 Agree; 45.7 Disagree; 51.0 Disagree; 49.6 Strongly Disagree; 49.7 Strongly Disagree; 50.0 Strongly Disagree; 69.4 Strongly Disagree; 46.1 Disagree; 53.4 Strongly Disagree; 48.2 Disagree; 42.5 Disagree

  Practice of Drinking Unboiled Water (% and level): 63.6 Never; 58.6 Never; 50.7 Never; 51.7 Ever; 49.1 Ever; 58.8 Ever; 63.3 Ever; 64.5 Never; 53.5 Never; 64.9 Never; 52.5 Never; 54.8 Ever; 59.0 Never; 42.4 Ever

3.2 Method

  Following the Secondary Data Analysis Method (Nachmias, 1987), the scores of the three variables are analyzed by means of the MINITAB program in order to compute the HM statistic.

  First: enter the scores of the variables knowledge (Knowl), attitude (Attit), and practice (Pract) into the MINITAB worksheet as follows.

  Row  Knowl  Attit  Pract
    1      1      4      3
    2      2      5      3
    3      2      4      3
    4      1      5      3
    5      2      5      3
    6      2      4      3
    7      2      4      2
    8      2      4      3
    9      1      4      2
   10      2      5      2
   11      2      4      3
   12      2      5      2
   13      1      2      2
   14      1      4      2

  Second:

  Create the matrices M1, M2, and M3 that state null hypothesis 1, null hypothesis 2, and null hypothesis 3, respectively.

  The null hypothesis 1: $H_{01}$: [$\beta_1 = \beta_2 = 0$; $\xi$ unspecified]

  MTB > READ C4-C5
  DATA> 1 0
  DATA> 0 1
  DATA> END
  2 rows read.
  MTB > COPY C4-C5 M1
  MTB > PRINT M1

  Data Display
  Matrix M1
  1 0
  0 1
  MTB >

  Then $M1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

  The null hypothesis 2: $H_{02}$: [$\beta_1 = 0$; $\xi$ unspecified]

  MTB > READ C6-C7

  Matrix M3
  0 1
  MTB >

  Then $M3 = \begin{bmatrix} 0 & 1 \end{bmatrix}$

  Third: run the Rank Regression command (RREG) to obtain the values needed to compute the Jaeckel measure of dispersion and the HM statistic, and to obtain the rank regression equation.

  To test the null hypothesis 1:
  MTB > RREG 'Pract' 2 'Attit' 'Knowl';
  SUBC> HYPOTHESIS M1.

  To test the null hypothesis 2:
  MTB > RREG 'Pract' 2 'Attit' 'Knowl';
  SUBC> HYPOTHESIS M2.

  To test the null hypothesis 3:
  MTB > RREG 'Pract' 2 'Attit' 'Knowl';
  SUBC> HYPOTHESIS M3.

4 Result And Discussion

4.1 To test the null hypothesis: $\beta_1 = \beta_2 = 0$

  The null hypothesis states that the independent variable knowledge about safe water and the independent variable attitude upon drinking unboiled water do not affect the dependent variable practice of drinking unboiled water. This is the printout of MINITAB:

  MTB > RREG 'Pract' 2 'Attit' 'Knowl';
  SUBC> HYPOTHESIS M1.

  The regression equation is
  Pract = 2.50 + 0.000 Attit + 0.000 Knowl

               Coefficient           Coefficient
  Predictor    Rank      Least-sq    Rank      Least-sq
  Constant     2.4999    1.8038      0.9790    0.8132
  Attit        0.0000    0.1044      0.2421    0.2011
  Knowl        0.0000    0.1994      0.3904    0.3242

  Hodges-Lehmann estimate of tau = 0.5329
  Least-squares S = %2

  ANOVA for hypothesis matrix M1
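  As a rough cross-check of the Least-sq coefficient column in the printout, the ordinary least-squares fit can be recomputed by hand from the worksheet scores. This is only a sketch: the pairing of Knowl, Attit, and Pract values across columns is an assumption based on the order in which they appear in the worksheet listing of Section 3.2, and the helper name is hypothetical.

```python
# (Knowl, Attit, Pract) per worksheet row; pairing across columns is assumed.
ROWS = [
    (1, 4, 3), (2, 5, 3), (2, 4, 3), (1, 5, 3), (2, 5, 3), (2, 4, 3), (2, 4, 2),
    (2, 4, 3), (1, 4, 2), (2, 5, 2), (2, 4, 3), (2, 5, 2), (1, 2, 2), (1, 4, 2),
]

def centered_cross(u, v):
    """Sum of products of deviations from the means of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

knowl = [r[0] for r in ROWS]
attit = [r[1] for r in ROWS]
pract = [r[2] for r in ROWS]

# Normal equations for two predictors, solved with Cramer's rule.
s_kk, s_aa, s_ka = centered_cross(knowl, knowl), centered_cross(attit, attit), centered_cross(knowl, attit)
s_kp, s_ap = centered_cross(knowl, pract), centered_cross(attit, pract)
det = s_aa * s_kk - s_ka ** 2
b_attit = (s_ap * s_kk - s_ka * s_kp) / det
b_knowl = (s_aa * s_kp - s_ka * s_ap) / det
n = len(ROWS)
const = sum(pract) / n - b_attit * sum(attit) / n - b_knowl * sum(knowl) / n
print(round(const, 4), round(b_attit, 4), round(b_knowl, 4))  # 1.8038 0.1044 0.1994
```

  Under that pairing, the three values agree with the Least-sq coefficients in the printout.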

  


  4.2 To test the null hypothesis: $\beta_1 = 0$

  The null hypothesis states that the independent variable knowledge about safe water does not affect practice of drinking unboiled water. This is the printout of MINITAB:

  MTB > RREG 'Pract' 2 'Attit' 'Knowl';
  SUBC> HYPOTHESIS M2.

  The regression equation is
  Pract = 2.50 + 0.000 Knowl + 0.000 Attit

               Coefficient           Coefficient
  Predictor    Rank      Least-sq    Rank      Least-sq
  Constant     2.4999    1.8038      0.9790    0.8132
  Knowl        0.0000    0.1994      0.3904    0.3242
  Attit        0.0000    0.1044      0.2421    0.2011

  Hodges-Lehmann estimate of tau = 0.5329
  Least-squares S = %2

  ANOVA for hypothesis matrix M2
              Dispersion
              Reduced model   Full model    DF   F Denom   DF   Approx F
  Rank        5.54256347      5.54258979     1    0.3208   11      -0.00
  Least-sq    3.23076923      3.12341772     1    0.2839   11       0.38

  Unusual observations
  Observation   Knowl   Pract   Pseudo   Fit     SE Fit   Residual
  14            1.00    2.000   2.261    2.500   0.522    -0.500 X

  X denotes an observation whose X value gives it large influence.
  MTB >

  4.3 To test the null hypothesis: $\beta_2 = 0$

  The null hypothesis states that the independent variable attitude upon drinking unboiled water does not affect practice of drinking unboiled water. This is the printout of MINITAB:

  MTB > RREG 'Pract' 2 'Attit' 'Knowl';
  SUBC> HYPOTHESIS M3.

  The regression equation is
  PRAKT = 2.50 + 0.000 Knowl + 0.000 Attit

               Coefficient           Coefficient
  Predictor    Rank      Least-sq    Rank      Least-sq
  Constant     2.4999    1.8038      0.9790    0.8132


   about safe water does not affect the dependent variable of practice of drinking unboiled water, and also the independent variable of attitude upon drinking unboiled water does not affect the dependent variable of practice of drinking unboiled water.

  Like the parametric multiple regression model, the rank regression model also requires the assumption that there is no collinearity among the independent variables. MINITAB drops an independent variable that is highly correlated with another independent variable, in which case the corresponding hypothesis can no longer be tested. Before running the RREG command in MINITAB, collinearity among the independent variables can be detected by computing a correlation coefficient for ordinal scales, such as the Spearman rank correlation coefficient.
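  A minimal Python sketch of such a check is given below. It computes the Spearman coefficient as the Pearson correlation of average ranks, which handles the many ties in these scores. The Knowl and Attit values are read from the worksheet listing in Section 3.2; their pairing across columns is an assumption, and the function names are hypothetical.

```python
import math

# Knowl and Attit scores per worksheet row (pairing across columns assumed).
KNOWL = [1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1]
ATTIT = [4, 5, 4, 5, 5, 4, 4, 4, 4, 5, 4, 5, 2, 4]

def average_ranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

rho = spearman(KNOWL, ATTIT)
print(round(rho, 2))  # 0.34
```

  A coefficient of this size would not suggest strong collinearity between the two independent variables, although what counts as too high is a judgment call.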

5 Conclusion And Recommendation

  It is concluded that knowledge about safe water and attitude upon drinking unboiled water, taken simultaneously, do not affect practice of drinking unboiled water. Neither independent variable individually affects practice of drinking unboiled water.

  The procedure is quite simple compared to the ordinary regression procedure. The assumption made is that there is no collinearity among the independent variables. It is easy to use. The HM statistic is recommended for analyzing ordinal-scale data obtained from public health studies as well as social studies.

  References

  [1] Campbell, D.T., and Stanley, J.C. (1966). Experimental and Quasi-Experimental Designs for Research. Rand McNally College Publishing Company. Chicago.

  [2] Fowler, Jr., F.J. (1984). Survey Research Methods. Sage Publications. Beverly Hills.

  [3] Hollander, M., and Wolfe, D.A. (1999). Nonparametric Statistical Methods. John Wiley & Sons, Inc. New York.

  [4] Jöreskog, K.G., and Sörbom, D. (1988). LISREL 7: A Guide to the Program and Applications, 2nd Edit. SPSS, Inc. Chicago.

  [5] Kuntoro, Sulisyorini, L., Mahmudah, Soenarnatalina, Puspitasari, N., Indawati, R., Qomaruddin, M.B. and Wibowo, A. (2001). Baseline Survey About Knowledge, Practice of Hygiene and Sanitation in East Java. Cooperation Between Airlangga University and Regional Development Planning Board of East Java Province. Surabaya.