## Logistic Regression
Logistic regression is the preferred method for examining this type of data. These methods are different from what we have seen so far. At the same time, you will recognize many similar features. Let's begin with an example of the type of data that lends itself to this analysis.

Consider the experimental data summarized in Table. There were six large jars, each containing a number of beetles and a carefully measured small amount of insecticide. After a specified amount of time, the experimenters examined the number of beetles that were still alive. We can calculate the empirical death rate for each jar's level of exposure to the insecticide. These are given in the last row of Table using

Mortality rate = Number died / Number exposed

How can we develop statistical models to describe these data? We should immediately recognize that within each jar the outcomes are binary valued: alive or dead. We can probably assume that these events are independent of each other. The counts of alive or dead should then follow the binomial distribution. (See for a quick review of this important statistical model.) There are six separate and independent binomial experiments in this example. The *n* parameters for the binomial models are the numbers of insects in each jar. Similarly, the *p* parameters represent the mortality probabilities in each jar. The aim is to model the *p* parameters of these six binomial experiments. Any models we develop for these data will need to incorporate the various insecticide dose levels to describe the mortality probability *p* in each jar. The alert reader will notice that the empirical mortality rates given in the last row of Table are not monotonically increasing with increasing exposure levels of the insecticide. Despite this remark, there is no reason for us to fit a nonmonotonic model to these data.
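The table itself is not reproduced here, so the sketch below uses made-up counts for six hypothetical jars to show how the empirical mortality rates, and the binomial variance of each rate, would be computed:

```python
# Made-up counts for six jars (the chapter's table is not reproduced here);
# each jar exposes n beetles to one dose and records how many died.
doses   = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]   # hypothetical dose in each jar
exposed = [50, 48, 52, 49, 51, 50]         # binomial n for each jar
died    = [6, 16, 24, 42, 43, 50]          # hypothetical number died per jar

# Empirical mortality rate per jar: number died / number exposed.
rates = [d / n for d, n in zip(died, exposed)]

# Each rate estimates that jar's binomial p; the variance of the estimate,
# p(1 - p)/n, changes with p rather than staying constant across jars.
variances = [p * (1 - p) / n for p, n in zip(rates, exposed)]

for dose, p, v in zip(doses, rates, variances):
    print(f"dose {dose:.1f}: mortality rate {p:.3f}, variance {v:.5f}")
```

Note that in these invented counts, as in the chapter's data, the empirical rates need not rise monotonically with dose even when the underlying probabilities do.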
The aim of this chapter is to explain models for the different values of the *p* parameters using the values of the doses of the insecticide in each jar. The first idea that comes to mind is to treat the outcomes as normally distributed. All of the *n*s are large, so a normal approximation to the binomial distributions should work. We also already know a lot about linear regression, so we are tempted to fit the linear model

p = β₀ + β₁ × Dose

What is wrong with this approach? It seems simple enough, but remember that *p* must always be between 0 and 1. There is no guarantee that, at extreme values of the dose, this straight-line model would keep the estimates of *p* from falling below 0 or rising above 1. We would expect a good model for these data to always give an estimate of *p* between 0 and 1. A second but less striking problem is that the variance of the binomial distribution is not the same for different values of the *p* parameter, so the assumption of constant variance made by the usual linear regression is not valid.
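The first failure mode can be seen numerically. The sketch below, again with made-up jar data, fits an ordinary least-squares line to the empirical mortality rates by hand and then extrapolates it; at extreme doses the fitted line leaves the interval [0, 1] that a probability must stay within:

```python
# A sketch (with made-up jar data) of why a straight-line fit to mortality
# probabilities misbehaves: extrapolating the line can leave [0, 1].
doses = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
rates = [0.12, 0.33, 0.46, 0.86, 0.84, 1.00]   # hypothetical p-hats per jar

# Ordinary least squares by hand: slope and intercept of rate ~ dose.
n = len(doses)
mx = sum(doses) / n
my = sum(rates) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(doses, rates))
         / sum((x - mx) ** 2 for x in doses))
intercept = my - slope * mx

# Predictions at extreme doses fall outside the legal range for a probability.
print(intercept + slope * 0.0)   # negative "probability" for these data
print(intercept + slope * 5.0)   # "probability" above 1 for these data
```

The logistic model studied in the rest of the chapter avoids this by construction, since its fitted values always lie strictly between 0 and 1.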