Student Name:

Institutional Affiliation:

Introduction

In statistical modelling, the choice of model depends on the type of response variable that the researcher is using. For example, if the outcome variable is continuous, ordinary linear regression may be used. In this data set, however, the dependent variable is binary: it takes two outcomes, smoker and non-smoker. For the analysis in STATA, 0 was coded for smoker and 1 for non-smoker. When the response variable is binary, the logistic regression model is usually employed.

1. Consider a binary dependent regression model with 1 independent variable

Binary dependent variable regression model with one independent variable

a. Select an appropriate independent variable. Justify your selection.

There are four independent variables under consideration: the age of the individual, years of education, income and the price of cigarettes in 1979. For the binary dependent variable regression with one independent variable, I chose years of education, since this variable has the smallest standard deviation of the four. The standard deviations of age, years of education, income and the price of cigarettes in 1979 are 17.05694, 3.275847, 9083.511 and 4.848667 respectively, as shown in the table below:

Summary statistics for the independent variables

Variable |    Obs        Mean    Std. Dev.     Min      Max
---------+-------------------------------------------------
educ     |  1,196    12.22115    3.275847        0       18
income   |  1,196    19304.77    9083.511      500    30000
pcigs79  |  1,196    60.98495    4.848667     46.3     69.8
age      |  1,196    41.80686    17.05694       17       88

b. Use three independent different models (in one regressor) to estimate the probability of smoking and formulate the population regression function for each of them. Report and interpret the results.

i) Logistic regression model

The first model to consider is the Logistic regression model. After modelling using STATA, these were the results:

Number of iterations in the model

Number of iterations Log likelihood statistic

0 -794.47478

1 -789.20987

2 -789.20747

3 -789.20747

smoker   | Coefficient   Std. Err.      z    p-value   95% confidence interval
educ     |  -.0591058    .0183098   -3.23    0.001     [-.0949924 , -.0232191]
constant |   .2304882    .2292564    1.01    0.315     [-.218846 , .6798224]

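The population regression function is: ln[P/(1 - P)] = β0 + β1·educ, where P is the probability of being a smoker. The estimated function is ln[P/(1 - P)] = 0.2304882 - 0.0591058·educ.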

The chi square likelihood ratio test value is 10.53 and the p-value for the chi square test is 0.0012. The log likelihood value is -789.20747.

In this model, when years of education is zero, the logarithm of the odds that one is a smoker is 0.2304882. Moreover, the coefficient of the education variable shows that there is a negative relationship between being a smoker and years of education. For a unit increase in years of education, the logarithm of the odds of being a smoker decreases by 0.0591058.

The p-value of the years of education variable is 0.001, which is less than 0.05, hence the years of education variable is statistically significant. The p-value of the chi-square test is 0.0012, which is also less than 0.05, therefore the overall model is statistically significant.

Odds ratio of the coefficients of the logistic model

smoker   | Odds Ratio   Std. Err.      z    P>|z|    95% confidence interval
educ     |  .9426071     .017259   -3.23   0.001    [.9093799 , .9770484]
constant |  1.259215     .288683    1.01   0.315    [.8034454 , 1.973527]

In this model, the odds ratio for the years of education variable is 0.9426071, so each additional year of education multiplies the odds of being a smoker by about 0.94. The odds ratio is statistically significant because its 95% confidence interval, [.9093799 , .9770484], does not contain 1. However, the intercept is not statistically significant: its value is 1.259215 and its 95% confidence interval, [.8034454 , 1.973527], contains 1.
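The link between the coefficient table and the odds ratio table can be checked by hand. The following is a minimal sketch in Python (not STATA output): it exponentiates each logit coefficient and the ends of its Wald interval. The function name and the critical value 1.959964 are choices made for this illustration.

```python
import math

def odds_ratio_with_ci(coef, std_err, z_crit=1.959964):
    """Exponentiate a logit coefficient and its Wald 95% interval."""
    low = coef - z_crit * std_err
    high = coef + z_crit * std_err
    return math.exp(coef), math.exp(low), math.exp(high)

# (coefficient, standard error) pairs taken from the logit coefficient table.
educ_or, educ_low, educ_high = odds_ratio_with_ci(-0.0591058, 0.0183098)
cons_or, cons_low, cons_high = odds_ratio_with_ci(0.2304882, 0.2292564)

# An odds ratio differs from 1 at the 5% level when its interval excludes 1.
educ_significant = not (educ_low <= 1.0 <= educ_high)
cons_significant = not (cons_low <= 1.0 <= cons_high)

print(f"educ:     OR={educ_or:.7f}  CI=[{educ_low:.7f}, {educ_high:.7f}]  significant={educ_significant}")
print(f"constant: OR={cons_or:.6f}  CI=[{cons_low:.7f}, {cons_high:.6f}]  significant={cons_significant}")
```

The test for significance on the odds ratio scale is whether the interval excludes 1, which is equivalent to the coefficient interval excluding 0.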

ii) probit regression model

The second model to consider is the probit regression model. After modelling using STATA, these were the results that were obtained:

Number of iterations in the model

Number of iterations Log likelihood statistic

0 -794.47478

1 -789.09871

2 -789.09791

3 -789.09791

smoker Coefficient Std. Err. z p>z 95% confidence interval

educ -.0372712 .0113767 -3.28 0.001 [-.0595691, -.0149732]
constant .1484785 .1426455 1.04 0.298 [-.1311016 , .4280586]

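The population regression function is: P = Φ(β0 + β1·educ), where P is the probability of being a smoker and Φ is the standard normal cumulative distribution function. The estimated function is P = Φ(0.1484785 - 0.0372712·educ).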

When years of education is zero, the probit index (z-score) is 0.1484785; the constant is not statistically significant (p = 0.298). The coefficient of the years of education variable shows that there is a negative relationship between the probability of being a smoker and years of education. For a unit increase in years of education, the z-score decreases by 0.0372712. Because the probit link is non-linear, the corresponding change in the probability depends on the starting level of education. The p-value of the education variable is 0.001, which is less than 0.05, therefore the years of education variable is statistically significant at the 5% significance level.
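A probit coefficient is a change in the z-score, not in the probability itself. The sketch below, written in Python under the assumption that the estimates are those in the probit table above, converts the index into a probability and computes the marginal effect of one more year of education. The education levels 8, 12 and 16 are hypothetical values chosen for illustration.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Probit estimates from the table above: constant and educ coefficient.
B0, B1 = 0.1484785, -0.0372712

def probit_probability(educ):
    """Predicted probability of being a smoker at a given education level."""
    return normal_cdf(B0 + B1 * educ)

def probit_marginal_effect(educ):
    """Derivative of the predicted probability with respect to educ."""
    return normal_pdf(B0 + B1 * educ) * B1

for educ in (8, 12, 16):  # hypothetical education levels
    print(f"educ={educ:2d}  P(smoker)={probit_probability(educ):.4f}  "
          f"marginal effect={probit_marginal_effect(educ):.4f}")

p12 = probit_probability(12)
me12 = probit_marginal_effect(12)
```

At 12 years of education the marginal effect is roughly -0.014, so one more year lowers the predicted probability by about 1.4 percentage points, which is much smaller in size than the coefficient itself.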

iii. Complementary Log-log regression

The third model to be considered is the complementary log-log regression. After modelling the data using STATA these were the results:

Number of iterations in the model

Number of iterations Log likelihood statistic

0 -806.57863

1 -789.65283

2 -789.63896

3 -789.63896

smoker Coefficient Std. Err. z p>z 95% confidence interval

educ -.0424448 .0134565 -3.15 0.002 [-.0688189 , -.0160706]
constant -.2234006 .1662607 -1.34 0.179 [-.5492656 , .1024644]

The predicted probability of being a smoker under this model is 0.37879942, so the predicted probability of not being a smoker is 0.62120058.

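The population regression function is: P = 1 - exp[-exp(β0 + β1·educ)], equivalently ln[-ln(1 - P)] = β0 + β1·educ, where P is the probability of being a smoker. The estimated function is ln[-ln(1 - P)] = -0.2234006 - 0.0424448·educ.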

When years of education is zero, the complementary log-log index, ln[-ln(1 - P)], is -0.2234006. Moreover, in this model there is a negative relationship between the dependent variable and the independent variable that has been chosen: a unit increase in years of education lowers the index by 0.0424448.

The years of education variable is statistically significant, since its p-value (0.002) is less than 0.05 and its 95% confidence interval, [-.0688189 , -.0160706], does not contain zero.

The model as a whole is also statistically significant. It has a chi-square likelihood ratio test value of 9.67 with a p-value of 0.0019, and since this p-value is less than 0.05 the model is said to be statistically significant.

c. Comment on the measures of fit used in each of the three models.

In statistics, there are different methods for testing the goodness of fit of a model. These include the chi-square test, the Kolmogorov-Smirnov test and the likelihood ratio test; the choice depends on the model whose fit is being assessed.

I)Logistic regression model

The measure of fit that was used in the logistic regression model was the chi-square likelihood ratio test and its p-value. The likelihood ratio statistic can be compared with the critical values in chi-square tables, but since the p-value was provided by STATA, the p-value was used to assess fit.

If the p-value exceeds the 5% level of significance, the model is not statistically significant; if it is less than 5%, the model fits significantly better than a model with no regressors.

The likelihood ratio test value was 10.53 and its p-value was 0.0012. Since 0.0012 is less than the 5% significance level, the model is statistically significant.

ii) Probit regression model

In this model, the measure of fit that was used was the chi-square likelihood ratio test and its p-value. The likelihood ratio statistic can be compared with the critical values in chi-square tables, but since the p-value was provided by STATA, the p-value was used to assess fit.

If the p-value exceeds the 5% level of significance, the model is not statistically significant; if it is less than 5%, the model fits significantly better than a model with no regressors.

The likelihood ratio test value was 10.75 and its p-value was 0.0010. Since 0.0010 is less than the 5% significance level, the model is statistically significant (Durante, 2018).

iii) complementary log- log regression model

In this model, the measure of fit that was used was the chi-square likelihood ratio test and its p-value. The likelihood ratio statistic can be compared with the critical values in chi-square tables, but since the p-value was provided by STATA, the p-value was used to assess fit.

If the p-value exceeds the 5% level of significance, the model is not statistically significant; if it is less than 5%, the model fits significantly better than a model with no regressors.

The likelihood ratio test value was 9.67 and its p-value was 0.0019. Since 0.0019 is less than the 5% significance level, the model is statistically significant.
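The likelihood ratio statistics quoted for the three models can be reproduced from the reported log likelihoods. The sketch below is in Python and assumes that the constant-only log likelihood is -794.47478 (iteration 0 of the logit and probit logs) and that each test has one degree of freedom because there is one regressor. The function name is hypothetical.

```python
import math

LL_CONSTANT_ONLY = -794.47478  # log likelihood with no regressors

# Final log likelihoods of the three one-regressor models.
FITTED = {
    "logit": -789.20747,
    "probit": -789.09791,
    "cloglog": -789.63896,
}

def lr_test_one_df(ll_null, ll_model):
    """Likelihood ratio statistic and its chi-square(1) p-value."""
    stat = 2.0 * (ll_model - ll_null)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

results = {name: lr_test_one_df(LL_CONSTANT_ONLY, ll) for name, ll in FITTED.items()}
for name, (stat, p_value) in results.items():
    print(f"{name:8s} LR chi2(1) = {stat:.2f}   p-value = {p_value:.4f}")
```

All three p-values are below 0.05, which is the comparison used above to judge each model statistically significant.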

d. Conduct a sensitivity analysis to determine the sensitivity of the predicted probability of smoking to change in the independent variable. Analyze using numerical examples.

Sensitivity analysis has been defined as the study of how the uncertainty in the output of a model can be divided and allocated to different sources of uncertainty in its inputs (Saltelli, 2002). The type of sensitivity analysis that will be used is the one-at-a-time approach.

i) Logistic regression

Since the model is based on only one independent variable, years of education, only this variable is considered in the sensitivity analysis.

This approach to sensitivity analysis involves holding other factors constant and checking the effect that the independent variable has on the response variable. For a unit increase in years of education, the logarithm of the odds of being a smoker decreases by 0.0591058 (Gasso, 2019).

ii) Probit regression

Since the model is based on only one independent variable, years of education, only this variable is considered in the sensitivity analysis.

This approach to sensitivity analysis involves holding other factors constant and checking the effect that the independent variable has on the response variable. For a unit increase in years of education, the probit index (z-score) decreases by 0.0372712; the resulting fall in the probability of being a smoker depends on the starting level of education.

iii) Complementary log-log regression models

Since the model is based on only one independent variable, years of education, only this variable is considered in the sensitivity analysis.

This approach to sensitivity analysis involves holding other factors constant and checking the effect that the independent variable has on the response variable. The predicted probability of being a smoker under this model is 0.37879942, so the predicted probability of not being a smoker is 0.62120058. Per unit increase in years of education, the complementary log-log index decreases by 0.0424448.
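The one-at-a-time analysis can be made numerical by evaluating each fitted model at several education levels. The following is a sketch in Python using the coefficients reported in part (b). The education levels 8, 12 and 16 are hypothetical values chosen for illustration, and the function and variable names are not part of the STATA output.

```python
import math

def logit_prob(index):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-index))

def probit_prob(index):
    """Inverse probit link (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

def cloglog_prob(index):
    """Inverse complementary log-log link."""
    return 1.0 - math.exp(-math.exp(index))

# (constant, educ coefficient, inverse link) for each one-regressor model.
MODELS = {
    "logit": (0.2304882, -0.0591058, logit_prob),
    "probit": (0.1484785, -0.0372712, probit_prob),
    "cloglog": (-0.2234006, -0.0424448, cloglog_prob),
}

def predicted_probability(model, educ):
    """Predicted probability of being a smoker for a model and education level."""
    b0, b1, inverse_link = MODELS[model]
    return inverse_link(b0 + b1 * educ)

EDUC_LEVELS = (8, 12, 16)  # hypothetical years of education
table = {name: [predicted_probability(name, e) for e in EDUC_LEVELS] for name in MODELS}

for name, probs in table.items():
    cells = "  ".join(f"educ={e}: {p:.4f}" for e, p in zip(EDUC_LEVELS, probs))
    print(f"{name:8s} {cells}")
```

In all three models the predicted probability falls as education rises, and at 12 years it is roughly 0.38, close to the overall predicted probability quoted above.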

e. Discuss, in brief, the strengths and limitations of each model. Give examples whenever possible.

I) Logistic regression model

This regression model has several strengths. Firstly, it can be used for data that have a binary response variable: unlike simple linear regression, which requires a continuous outcome, the logistic regression model accounts for binary outcomes. Secondly, its output is relatively easy to interpret, because the coefficients are changes in log odds, their exponentials are odds ratios, and the fitted values are probabilities. For example, the odds ratio of 0.9426071 for years of education means that each extra year multiplies the odds of being a smoker by about 0.94. Thirdly, the model can be regularised to limit overfitting. In addition, the model is easy to re-estimate when new data are added. However, this model does have limitations. The binary logistic model cannot accommodate response variables that are not binary, and it cannot capture more complex relationships unless extra terms are added; applying it to such data can lead to spurious relationships.

ii) Probit

This regression model has several strengths: it can accommodate not only binary response outcomes but also a wider range of response outcomes (for example, ordered outcomes). The model can also be applied to data with temporally correlated errors. However, the probit model has limitations too. Its main limitation is that it requires all the relevant unobserved components to be normally distributed (Filippini et al., 2018).

iii) Complementary log-log regression models.

These models have several strengths. As seen in their use with our data, they can accommodate binary outcomes. Moreover, they directly give the probability that either of the binary outcomes occurs. The limitation of this model is that it cannot accommodate other types of response variables from other distributions (Williams, 2019).

f. Use STATA to sketch a graph representing each of the three models.

2) Now, consider a binary dependent variable regression model with multiple independent variables. Answer the following:

Binary dependent variable regression with multiple independent variables

a) Select some appropriate independent variables (at least two). Justify your selection.

There are four independent variables under consideration: the age of the individual, years of education, income and the price of cigarettes in 1979. For the binary dependent variable regression with multiple independent variables, I chose years of education and the price of cigarettes in 1979, since these variables have the smallest standard deviations of the four. The standard deviations of age, years of education, income and the price of cigarettes in 1979 are 17.05694, 3.275847, 9083.511 and 4.848667 respectively, as shown in the table below:

Summary statistics for the independent variables

Variable |    Obs        Mean    Std. Dev.     Min      Max
---------+-------------------------------------------------
educ     |  1,196    12.22115    3.275847        0       18
income   |  1,196    19304.77    9083.511      500    30000
pcigs79  |  1,196    60.98495    4.848667     46.3     69.8
age      |  1,196    41.80686    17.05694       17       88

b) Use three different models (using the same regressors) to estimate the probability of smoking and formulate the population regression function for each of them. Report and interpret the results.

i)Logistic regression

The first model to consider is the Logistic regression model. After modelling using STATA, these were the results:

smoker   | Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+---------------------------------------------------------------
pcigs79  |  .9759585    .0119749   -1.98   0.047    .9527681    .9997133
educ     |  .9407557    .0172839   -3.32   0.001    .9074825    .9752488
_cons    |  5.680365    4.509777    2.19   0.029    1.198359     26.9256

smoker   |      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+---------------------------------------------------------------
pcigs79  | -.0243352    .0122699   -1.98   0.047   -.0483837   -.0002867
educ     | -.0610718    .0183723   -3.32   0.001    -.097081   -.0250627
_cons    |  1.737015    .7939239    2.19   0.029    .1809532    3.293078

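The population regression function is: ln[P/(1 - P)] = β0 + β1·pcigs79 + β2·educ, where P is the probability of being a smoker. The estimated function is ln[P/(1 - P)] = 1.737015 - 0.0243352·pcigs79 - 0.0610718·educ.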

The chi square likelihood ratio test value is 14.46 and the p-value for the chi square test is 0.0007. The log likelihood value is -787.24475.

In this model, when both years of education and the price of cigarettes in 1979 are zero, the logarithm of the odds that one is a smoker is 1.737015. Moreover, the coefficients show that there is a negative relationship between being a smoker and both years of education and the price of cigarettes in 1979. For a unit increase in years of education, holding price constant, the logarithm of the odds of being a smoker decreases by 0.0610718. For a unit increase in the price of cigarettes in 1979, holding education constant, the logarithm of the odds that someone is a smoker decreases by 0.0243352.

The p-value of the years of education variable is 0.001, which is less than 0.05, hence the years of education variable is statistically significant. Moreover, the p-value of the price of cigarettes in 1979 variable is 0.047, which is less than 0.05, hence the price of cigarettes in 1979 variable is statistically significant.

The p-value of the chi square test is 0.0007 and thus less than 0.05, therefore the overall model is statistically significant.

ii) The probit regression

The second model to consider is the probit regression model. After modelling using STATA, these were the results:

smoker   |      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+---------------------------------------------------------------
pcigs79  | -.0151279    .0076246   -1.98   0.047   -.0300718   -.0001839
educ     |  -.038428    .0113989   -3.37   0.001   -.0607695   -.0160866
_cons    |  1.084363    .4928821    2.20   0.028    .1183318    2.050394

The chi square likelihood ratio test value is 14.69 and the p-value for the chi square test is 0.0006. The log likelihood value is -787.12968.

In this model, when both years of education and the price of cigarettes in 1979 are zero, the probit index (z-score) is 1.084363. Moreover, the coefficients show that there is a negative relationship between being a smoker and both years of education and the price of cigarettes in 1979. For a unit increase in years of education, holding price constant, the z-score decreases by 0.038428. For a unit increase in the price of cigarettes in 1979, holding education constant, the z-score decreases by 0.0151279. The change in the probability of being a smoker that corresponds to these changes in the z-score depends on the starting values of the regressors.

The p-value of the years of education variable is 0.001, which is less than 0.05, hence the years of education variable is statistically significant. Moreover, the p-value of the price of cigarettes in 1979 variable is 0.047, which is less than 0.05, hence the price of cigarettes in 1979 variable is statistically significant.

The p-value of the chi square test is 0.0006 and thus less than 0.05, therefore the overall model is statistically significant.

iii) Complementary log-log regression model

The third model to be considered is the complementary log-log regression. After modelling the data using STATA these were the results:

Iteration 0: log likelihood = -805.23795

Iteration 1: log likelihood = -787.74345

Iteration 2: log likelihood = -787.7258

Iteration 3: log likelihood = -787.7258

Complementary log-log regression        Number of obs     =  1,196
                                        Zero outcomes     =    741
                                        Nonzero outcomes  =    455
                                        LR chi2(2)        =  13.50
Log likelihood = -787.7258              Prob > chi2       = 0.0012

smoker   |      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+---------------------------------------------------------------
pcigs79  | -.0186268    .0093917   -1.98   0.047   -.0370341   -.0002194
educ     | -.0438438    .0134848   -3.25   0.001   -.0702735   -.0174141
_cons    |  .9274987    .6017598    1.54   0.123   -.2519288    2.106926

The chi square likelihood ratio test value is 13.50 and the p-value for the chi square test is 0.0012. The log likelihood value is -787.7258.

In this model, when both years of education and the price of cigarettes in 1979 are zero, the complementary log-log index, ln[-ln(1 - P)], is 0.9274987; this constant is not statistically significant (p = 0.123). Moreover, the coefficients show that there is a negative relationship between being a smoker and both years of education and the price of cigarettes in 1979. For a unit increase in years of education, holding price constant, the index decreases by 0.0438438. For a unit increase in the price of cigarettes in 1979, holding education constant, the index decreases by 0.0186268.

The p-value of the years of education variable is 0.001, which is less than 0.05, hence the years of education variable is statistically significant. Moreover, the p-value of the price of cigarettes in 1979 variable is 0.047, which is less than 0.05, hence the price of cigarettes in 1979 variable is statistically significant.

The p-value of the chi square test is 0.0012 and thus less than 0.05, therefore the overall model is statistically significant.

c) Discuss the method of estimation used for each model.

d) Conduct a sensitivity analysis to determine the sensitivity of the predicted probability of smoking to changes in the independent variables (individually and collectively). Analyse using numerical examples.

i) Logistic regression

Since the model is based on only two independent variables, years of education and the price of cigarettes in 1979, only these two variables are considered in the sensitivity analysis. This approach involves holding the other factor constant and checking the effect that each independent variable has on the response variable.

For a unit increase in years of education, the logarithm of the odds of being a smoker decreases by 0.0610718.

For a unit increase in the price of cigarettes in 1979, the logarithm of the odds that someone is a smoker decreases by 0.0243352.

ii) probit regression model

Since the model is based on only two independent variables, years of education and the price of cigarettes in 1979, only these two variables are considered in the sensitivity analysis. This approach involves holding the other factor constant and checking the effect that each independent variable has on the response variable.

For a unit increase in years of education, the probit index (z-score) decreases by 0.038428.

For a unit increase in the price of cigarettes in 1979, the probit index (z-score) decreases by 0.0151279.

iii) Complementary log-log regression model

Since the model is based on only two independent variables, years of education and the price of cigarettes in 1979, only these two variables are considered in the sensitivity analysis. This approach involves holding the other factor constant and checking the effect that each independent variable has on the response variable.

For a unit increase in years of education, the complementary log-log index decreases by 0.0438438.

For a unit increase in the price of cigarettes in 1979, the complementary log-log index decreases by 0.0186268.
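To express these sensitivities as changes in the predicted probability, individually and collectively, each fitted model can be evaluated at a baseline and then with one or both regressors raised by one unit. The sketch below is in Python and uses the Task 2 coefficients. The baseline of pcigs79 = 61 and educ = 12 is a hypothetical choice close to the sample means, and all names are made up for this illustration.

```python
import math

def logit_prob(index):
    return 1.0 / (1.0 + math.exp(-index))

def probit_prob(index):
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

def cloglog_prob(index):
    return 1.0 - math.exp(-math.exp(index))

# (constant, pcigs79 coefficient, educ coefficient, inverse link) per model.
MODELS = {
    "logit": (1.737015, -0.0243352, -0.0610718, logit_prob),
    "probit": (1.084363, -0.0151279, -0.038428, probit_prob),
    "cloglog": (0.9274987, -0.0186268, -0.0438438, cloglog_prob),
}

BASE_PRICE, BASE_EDUC = 61.0, 12.0  # hypothetical baseline near the sample means

def predicted_probability(model, pcigs79, educ):
    b0, b_price, b_educ, inverse_link = MODELS[model]
    return inverse_link(b0 + b_price * pcigs79 + b_educ * educ)

def sensitivity(model):
    """Change in predicted probability for one-unit increases from the baseline."""
    base = predicted_probability(model, BASE_PRICE, BASE_EDUC)
    return {
        "base": base,
        "educ_plus_1": predicted_probability(model, BASE_PRICE, BASE_EDUC + 1) - base,
        "price_plus_1": predicted_probability(model, BASE_PRICE + 1, BASE_EDUC) - base,
        "both_plus_1": predicted_probability(model, BASE_PRICE + 1, BASE_EDUC + 1) - base,
    }

results = {name: sensitivity(name) for name in MODELS}
for name, r in results.items():
    print(f"{name:8s} base={r['base']:.4f}  educ+1: {r['educ_plus_1']:+.4f}  "
          f"price+1: {r['price_plus_1']:+.4f}  both+1: {r['both_plus_1']:+.4f}")
```

In each model an extra year of education lowers the predicted probability by more than an extra unit of price does, and raising both together lowers it further than either change alone.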

3) Now, suggest a different model specification (using multiple regressors) by adding or dropping one or more variable(s) in the models used in Task 2. Estimate the models using the new specification, then compare between the two specifications.

Different model specification

I) null model

In this model, all the regressors that were used in the previous model are dropped, in order to see how well the model fits without them.

These are the results from the modelling done by STATA:

Logistic regression                     Number of obs     =  1,196
                                        LR chi2(0)        =   0.00
                                        Prob > chi2       =      .
Log likelihood = -794.47478             Pseudo R2         = 0.0000

smoker   | Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+---------------------------------------------------------------
_cons    |  .6140351    .0365716   -8.19   0.000    .5463816    .6900655

From the results shown above, the constant is statistically significant: its p-value is 0.000 and the 95% confidence interval for the odds, [.5463816 , .6900655], does not contain 1.

However, with no regressors the chi-square likelihood ratio test statistic is 0.00, hence the model is not good for modelling the data. The pseudo R-squared is 0.0000, hence the model explains none of the variation.
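The null model results can be checked against the outcome counts in the complementary log-log output (741 zero outcomes and 455 nonzero outcomes). The sketch below is in Python and assumes that the nonzero outcomes are the category whose odds the logistic model reports. The variable names are chosen for illustration.

```python
import math

N_ZERO, N_NONZERO = 741, 455  # outcome counts reported by STATA
N_OBS = N_ZERO + N_NONZERO

# With no regressors, the fitted odds equal the sample odds.
null_odds = N_NONZERO / N_ZERO
null_log_odds = math.log(null_odds)

# Log likelihood of a model that predicts the sample proportion for everyone.
p_hat = N_NONZERO / N_OBS
null_log_likelihood = N_NONZERO * math.log(p_hat) + N_ZERO * math.log(1.0 - p_hat)

print(f"odds           = {null_odds:.7f}")
print(f"log odds       = {null_log_odds:.6f}")
print(f"log likelihood = {null_log_likelihood:.5f}")
```

The odds and the log likelihood agree with the _cons value and the log likelihood in the null model output, which is why this log likelihood serves as the baseline for the likelihood ratio tests and the pseudo R-squared.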

ii) full model

This is the model in which income is added to the regressors used in Task 2, so that years of education, the price of cigarettes in 1979 and income are all incorporated (age does not appear in the reported output). After modelling with the STATA software, these were the results that were obtained:

Iteration 0: log likelihood = -794.47478

Iteration 1: log likelihood = -786.98338

Iteration 2: log likelihood = -786.97838

Iteration 3: log likelihood = -786.97838

Logistic regression                     Number of obs     =  1,196
                                        LR chi2(3)        =  14.99
                                        Prob > chi2       = 0.0018
Log likelihood = -786.97838             Pseudo R2         = 0.0094

smoker   |      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+---------------------------------------------------------------
educ     | -.0663463    .0197625   -3.36   0.001   -.1050801   -.0276125
pcigs79  | -.0245635    .0122778   -2.00   0.045   -.0486276   -.0004994
income   |  5.17e-06    7.09e-06    0.73   0.466   -8.73e-06    .0000191
_cons    |  1.715296    .7946142    2.16   0.031    .1578807    3.272711

The chi-square likelihood ratio test value is 14.99 and the p-value for the chi-square test is 0.0018. The log likelihood value is -786.97838.

In this model, when all the regressors are zero, the logarithm of the odds that one is a smoker is 1.715296. Moreover, the coefficients show that there is a negative relationship between being a smoker and both years of education and the price of cigarettes in 1979. For a unit increase in years of education, the logarithm of the odds of being a smoker decreases by 0.0663463. For a unit increase in the price of cigarettes in 1979, the logarithm of the odds that someone is a smoker decreases by 0.0245635. For a unit increase in income, when other factors are held constant, the log odds of being a smoker increase by 5.17e-06.

The probability value of the years of education variable is 0.001 which is less than 0.05, hence the years of education variable is statistically significant. Moreover, the probability value of the price of cigarette in 1979 variable is 0.045 which is less than 0.05, hence the price per cigarette in 1979 variable is statistically significant. The probability value of income variable is 0.466 which is greater than 0.05 hence the income variable at 5% significance level is not statistically significant.

The p-value of the chi-square test is 0.0018 and thus less than 0.05, therefore the overall model is statistically significant.

The pseudo R-squared value of the full model (0.0094) is greater than that of the null model (0.0000), hence the full model is a better model than the null model. The logistic regression in Task 2 also has a lower pseudo R-squared than the full model, so by this measure the full model fits marginally better. However, the added income variable is not statistically significant (p = 0.466), so the gain from the new specification is small.
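The comparison between the two specifications can also be put in numbers from the reported log likelihoods. The sketch below is in Python and assumes McFadden's pseudo R-squared (one minus the ratio of the model log likelihood to the constant-only log likelihood) and a likelihood ratio test with one degree of freedom, since the full model adds one regressor (income) to the Task 2 logistic model. The function names are hypothetical.

```python
import math

LL_NULL = -794.47478   # constant-only logistic model
LL_TASK2 = -787.24475  # logistic model with pcigs79 and educ
LL_FULL = -786.97838   # logistic model with educ, pcigs79 and income

def mcfadden_r2(ll_model, ll_null=LL_NULL):
    """McFadden's pseudo R-squared."""
    return 1.0 - ll_model / ll_null

def lr_test_one_restriction(ll_restricted, ll_unrestricted):
    """LR statistic and chi-square(1) p-value for one added regressor."""
    stat = 2.0 * (ll_unrestricted - ll_restricted)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

r2_task2 = mcfadden_r2(LL_TASK2)
r2_full = mcfadden_r2(LL_FULL)
lr_stat, lr_p = lr_test_one_restriction(LL_TASK2, LL_FULL)

print(f"pseudo R2, Task 2 model: {r2_task2:.4f}")
print(f"pseudo R2, full model:   {r2_full:.4f}")
print(f"LR test for adding income: chi2(1) = {lr_stat:.2f}, p-value = {lr_p:.3f}")
```

The pseudo R-squared rises only slightly when income is added, and the likelihood ratio p-value is well above 0.05, which is consistent with the insignificant z-test on the income coefficient.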

Bibliography

Saltelli, A. (2002). Sensitivity Analysis for Importance Assessment. 22.

Durante, D. (2018). Conjugate Bayes for probit regression via unified skew-normals. arXiv preprint arXiv:1802.09565.

Gasso, G. (2019). Logistic regression.

Filippini, M., Greene, W. H., Kumar, N., & Martinez-Cruz, A. L. (2018). A note on the different interpretation of the correlation parameters in the Bivariate Probit and the Recursive Bivariate Probit. Economics Letters, 167, 104-107.

Williams, R. (2019). GOLOGIT2: Stata module to estimate generalized logistic regression models for ordinal dependent variables.

Jamal, A., Phillips, E., Gentzke, A. S., Homa, D. M., Babb, S. D., King, B. A., & Neff, L. J. (2018). Current cigarette smoking among adults—United States, 2016. Morbidity and Mortality Weekly Report, 67(2), 53.