Do the following problems from Rosner, *Fundamentals of Biostatistics, 5th Edition.*

- Problems 3.108 - 3.111 on p. 74
- Problems 3.112 - 3.115 on p. 75
- Hint for 3.113: use the binomial distribution to compute the probability of observing 100 or more cases in 1000 men if the probability of a case is 8%.
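
The upper-tail probability named in the hint can be sketched in R as follows. This is only a minimal illustration of the calculation; the variable names are arbitrary, and interpreting the result is left to the problem.

```r
# P(X >= 100) when X ~ Binomial(1000, 0.08).
# pbinom() returns P(X <= q), so the upper tail P(X >= 100) is P(X > 99).
n_men  <- 1000
p_case <- 0.08
p_upper <- pbinom(99, size = n_men, prob = p_case, lower.tail = FALSE)
p_upper

# Equivalent check: sum the individual probabilities for 100, 101, ..., 1000
p_check <- sum(dbinom(100:n_men, size = n_men, prob = p_case))
p_check
```

Note that the first argument is 99, not 100, because **pbinom()** with `lower.tail = FALSE` gives the probability of strictly more than its first argument.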

- Problems 4.34 - 4.35 on p. 110
- Problems 5.41 - 5.45 on p. 149

*Your response to each question should be in the form of a report. Prepare the report in a word processor, integrating graphics and discussion.*

The data set BONEDEN on the data disk accompanying Rosner, *Fundamentals of Biostatistics, 5th Edition,* is described in the file BONEDEN.DOC. An easy way to import it into R is to open the Excel file BONEDEN.XLS and save the worksheet as a tab-delimited text file boneden.txt. Edit the file with NotePad to make sure there are no extraneous lines and that there is a single <CR> after the last line, then import it with **read.table()**, specifying that there is a header line and that the separator is a tab character. After creating the data frame boneden, check that it is complete (**dim(boneden)** shows the number of rows and columns, for example) and save your workspace (frequently!) to protect against a system crash.

    > boneden <- read.table("boneden.txt", header = TRUE, sep = "\t")
    > dim(boneden)
    [1] 41 25
    > save.image()

Follow the instructions in Problems 2.37 - 2.45 on p. 43 of Rosner, but with the following modification. Rosner asks for a scatter plot of % difference in bone density, grouped by difference in tobacco use, where the difference has been categorized into 5 levels. Instead, give scatter plots like those in Figure 2.12 on p. 38 (using different plotting symbols to distinguish monozygotic and dizygotic twins), and then give comparative box plots to compare the 5 levels. You can use **cut()** to categorize the continuous variable.
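
As a rough sketch of the plotting steps described above, the following uses a made-up data frame rather than BONEDEN. The column names `pyrdiff`, `lsc` and `zyg`, the zygosity codes, and the cut points are all assumptions chosen for illustration; check BONEDEN.DOC and the problem text for the actual variable names and categories.

```r
# Made-up stand-in for the twin data; every column name here is hypothetical.
set.seed(1)
twins <- data.frame(
  pyrdiff = runif(41, min = 0, max = 60),         # difference in pack-years
  lsc     = rnorm(41, mean = -3, sd = 8),         # % difference in bone density
  zyg     = sample(c("MZ", "DZ"), 41, replace = TRUE)
)

# Scatter plot with a different plotting symbol for each zygosity
plot(twins$pyrdiff, twins$lsc,
     pch = ifelse(twins$zyg == "MZ", 1, 17),
     xlab = "Difference in pack-years",
     ylab = "% difference in bone density")
legend("topright", legend = c("MZ", "DZ"), pch = c(1, 17))

# Categorize the continuous difference into 5 levels (assumed cut points);
# right = FALSE makes the intervals [0,10), [10,20), ..., [40,Inf)
twins$pyrcat <- cut(twins$pyrdiff, breaks = c(0, 10, 20, 30, 40, Inf),
                    right = FALSE)
table(twins$pyrcat)

# Comparative box plots, one box per level
boxplot(lsc ~ pyrcat, data = twins,
        xlab = "Difference in pack-years",
        ylab = "% difference in bone density")
```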

Note that the heavier-smoking twin is defined as the one with more pack-years, and that this is always Twin 2; you can quickly verify this by plotting **pyr2** against **pyr1** and checking that every point lies on or above the line of equality. The calculation of C is illustrated below. If you **attach(boneden)** you can refer to variables in the data frame without the **boneden$** prefix, but you must use the prefix when you create a new variable in the data frame. Since the new variable is added after the data frame was attached, you have to detach the data frame and attach it again before you can use the new variable without the **boneden$** prefix.

    > attach(boneden)
    > plot(pyr1,pyr2)
    > abline(0,1)
    > boneden$lsc <- 100*(ls2 - ls1)/((ls2 + ls1)/2)
    > detach(boneden)
    > attach(boneden)
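
If the attach/detach cycle feels error-prone, one possible alternative is to use **transform()** and **with()**, which evaluate expressions inside the data frame without attaching it. The sketch below uses a tiny made-up data frame `bd` (not the real BONEDEN values) purely to show the pattern.

```r
# Small made-up data frame standing in for boneden (illustrative values only)
bd <- data.frame(ls1 = c(0.80, 0.95, 1.10),
                 ls2 = c(0.76, 0.97, 1.00))

# transform() evaluates the expression inside the data frame and returns a
# copy with the new column, so no attach()/detach() cycle is needed
bd <- transform(bd, lsc = 100 * (ls2 - ls1) / ((ls2 + ls1) / 2))

# with() gives prefix-free access to the columns for a single expression
with(bd, summary(lsc))
```

Either approach produces the same column; the choice is a matter of taste, and the attach-based version above matches the style used in the rest of these notes.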

Explore the data set graphically and report anything else interesting you find.

Find the Niagara River Pollution Case Study archived at http://www.ssc.ca/Documents/Case%20Studies/1999/E-niagara.html. Extract the "Dieldrin in water" readings at Fort Erie and Niagara-on-the-Lake. Study the two time sequences graphically, looking for the following features: trend, cyclic effects, change-points, autocorrelation. Are the "detection limits" a problem? Does a log transformation make the data easier to interpret? Are there differences between the two stations?
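
The kinds of plots that help with these questions can be sketched on simulated data. Nothing below is taken from the case study: the series, the detection limit `dl`, the weekly spacing and all variable names are invented so that the code runs on its own. Substitute the readings you extract for each station.

```r
# Simulated stand-in for one station's readings; this is NOT real data.
set.seed(2)
n     <- 150
week  <- seq_len(n)
noise <- as.numeric(arima.sim(model = list(ar = 0.5), n = n, sd = 0.3))
log_conc <- -1 - 0.004 * week + 0.3 * sin(2 * pi * week / 52) + noise
conc  <- exp(log_conc)

dl       <- 0.15               # hypothetical detection limit
censored <- conc < dl          # flags readings below the limit
conc_obs <- pmax(conc, dl)     # here, censored readings are recorded at the limit

# Raw scale, with the detection limit marked
plot(week, conc_obs, type = "l", ylab = "Concentration")
abline(h = dl, lty = 2)

# Log scale, with a lowess smooth to suggest the trend
plot(week, log(conc_obs), type = "l", ylab = "log(concentration)")
lines(lowess(week, log(conc_obs)), lwd = 2)

# Sample autocorrelation function of the log readings
acf(log(conc_obs), main = "Autocorrelation of log readings")

# How many readings sit at the detection limit?
sum(censored)
```

Plotting both stations on a common time axis, with the same vertical scale, is one simple way to look for differences between them.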

*Your response to each question should be in the form of a report. Prepare the report in a word processor, integrating graphics and discussion.*

In Exercise #2 you drew histograms and density estimates for samples of different sizes from the standard normal distribution and the chi-square distribution on 1 degree of freedom. How large does the sample size have to be for the data to give a reliable indication of the shape of the underlying distribution?
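
One way to revisit this question is to redraw the histograms and density estimates for several sample sizes side by side. The sketch below does this for the chi-square distribution on 1 degree of freedom; the sample sizes and the seed are arbitrary choices, and the dashed curve is the true density for comparison.

```r
set.seed(3)
sizes <- c(10, 100, 1000)      # arbitrary sample sizes for illustration
for (n in sizes) {
  x <- rchisq(n, df = 1)
  hist(x, freq = FALSE, main = paste("Chi-square(1), n =", n))
  lines(density(x))            # kernel density estimate from the sample
  # True density, started just above 0 because it is unbounded at 0
  curve(dchisq(x, df = 1), from = 0.05, to = max(x), add = TRUE, lty = 2)
}
```

Replacing `rchisq(n, df = 1)` with `rnorm(n)` and `dchisq(x, df = 1)` with `dnorm(x)` (and widening the range of the curve) gives the corresponding plots for the standard normal distribution.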

Generate 200 samples, each with n = 4 independent observations, from the standard normal distribution, and compute the 200 sample means. Display the 200 sample means on a histogram and compute the mean, variance and standard deviation of this distribution. Repeat for another 200 samples, this time each with n = 100 independent observations. Explain how this illustrates Equation 5.10 on p. 135 of the text.

A clever way to do this in R is to generate 800 independent standard normal observations and arrange them in a matrix with 4 rows and 200 columns, then use **apply()** to compute the 200 column means. (The modification of the code to generate samples of size n = 100 is obvious.)

    > simdat <- matrix(rnorm(800), ncol = 200)
    > xbars <- apply(simdat, 2, mean)
    > hist(xbars)
    > mean(xbars)
    > var(xbars)
    > sqrt(var(xbars))
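
For reference, the n = 100 version might look like the sketch below. The names `simdat100` and `xbars100` and the seed are arbitrary; **colMeans()** is used as a shortcut that gives the same result as **apply()** with **mean**.

```r
set.seed(4)
n <- 100                                  # observations per sample
simdat100 <- matrix(rnorm(n * 200), nrow = n, ncol = 200)
xbars100  <- colMeans(simdat100)          # same as apply(simdat100, 2, mean)

hist(xbars100)
mean(xbars100)
var(xbars100)     # compare with 1/n, the variance of a mean of n N(0,1) values
sd(xbars100)      # compare with 1/sqrt(n)
```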

Generate 200 samples, each with n = 4 independent observations, from the chi-square distribution on 3 degrees of freedom, and compute the 200 sample means. Display the 200 sample means on a histogram and compute the mean, variance and standard deviation of this distribution. Repeat for another 200 samples, this time each with n = 100 independent observations. Explain how this illustrates Equation 6.3 on p. 174 of the text.
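
The same matrix approach carries over to the chi-square case by swapping **rnorm()** for **rchisq()**. The sketch below wraps it in a small helper, `sim_means()`, which is a made-up name and not part of R. For comparison, a chi-square variable on 3 degrees of freedom has mean 3 and variance 6, so the mean of n independent observations has variance 6/n.

```r
set.seed(5)

# Hypothetical helper: nsamp samples of size n from chi-square(df),
# one sample per column, returning the nsamp sample means
sim_means <- function(n, nsamp = 200, df = 3) {
  simdat <- matrix(rchisq(n * nsamp, df = df), nrow = n, ncol = nsamp)
  colMeans(simdat)
}

xbars4   <- sim_means(4)
xbars100 <- sim_means(100)

hist(xbars4)
hist(xbars100)

# Mean, variance and standard deviation of each set of 200 sample means
c(mean = mean(xbars4),   var = var(xbars4),   sd = sd(xbars4))
c(mean = mean(xbars100), var = var(xbars100), sd = sd(xbars100))
```

Comparing the shapes of the two histograms, as well as the summary statistics, is the point of the exercise.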

Statistics 2MA3 2001-2002