Course description: Measures of central tendency and dispersion, sampling, probability, testing of hypothesis, correlation and regression, time series, and index numbers.
A three-part assignment covering a comparison of data sets, a scatterplot, a box plot, least squares regression, standard deviation, and a confidence interval.
| Attachment | Size |
|---|---|
| MATH3353_David_Norman_Project1.pdf | 861.77 KB |
Problem 1: Consider the following data set: (20,525) (17,57) (10,19) (9,18) (7,14) (16,41) (3,5) (5,10) (10,23) (12,24) (9,15) (20,571) (18,102) (16,56)
Obtain the least squares line for this data set. Now compute the residuals and make a residual plot. Explain what you see. Now obtain the least squares line for the original x-values, and the log of the y-values. Again obtain the residuals and make a residual plot. Predict the response y at the input level
Problem 2: Now add 20 to the y-value of each point in the original data set. Make the best prediction for the new response at x = 13, and compare your result to that obtained in problem one. How do you think in general adding a constant to the data set will affect the outcome of your predictions?
Problem 3: Have Excel try to calculate Log(-6) and tell me what happens. Now consider the following data set:
(2,4) (1,-3) (17,55) (9,2) (20,283) (16,12) (10,5) (19,175) (11,-5) (2,5) (8,3) (10,4) (15,18) (7,1)
Make your best prediction of the response at the input level of 13.
The least squares line of this data set is y = 23.1x - 178.2. The slope is commonly described as rise over run: for every 1 unit moved along the X-axis (run), the predicted Y value changes by 23.1 units (rise). The y-intercept is the point where the least squares line crosses the Y-axis. The least squares line is the line of best fit for the data set, so it can be used to make predictions. The residuals are found by subtracting the predicted values of Y from the observed values of Y.
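The fit and the residuals described above can be reproduced with a short script. This is a minimal sketch that assumes NumPy is available; the variable names are illustrative and are not taken from the original Excel workbook.

```python
import numpy as np

# Problem 1 data, copied from the assignment statement.
x = np.array([20, 17, 10, 9, 7, 16, 3, 5, 10, 12, 9, 20, 18, 16], dtype=float)
y = np.array([525, 57, 19, 18, 14, 41, 5, 10, 23, 24, 15, 571, 102, 56], dtype=float)

# A degree-1 polynomial fit returns [slope, intercept] of the least squares line.
slope, intercept = np.polyfit(x, y, 1)

# Residual = observed Y minus predicted Y.
predicted = slope * x + intercept
residuals = y - predicted

print(f"y = {slope:.1f}x + ({intercept:.1f})")
for xi, ri in zip(x, residuals):
    print(f"x = {xi:4.0f}   residual = {ri:9.2f}")
```

Plotting `residuals` against `x` gives the residual plot; a curved pattern rather than a random scatter suggests that a straight line is not the right model.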
In the residual plot for problem 1, there is a definite pattern across X values from zero through twenty. However, the data points at X = 20 appear to be outliers and in turn could be influential observations on the least squares line.
Because the origin of the data is unknown, two explanations of the data are possible. When the original data is plotted on a graph, the points at X = 20 do appear to be influential observations. A lurking variable likely exists that explains their appearance.
On the other hand, the residual plot indicates that the data is better explained as exponential rather than linear. For this reason, we need to use the predictions from the log of Y to create a new residual plot.
By calculating r² for each fit, you can see that the line derived using the log of Y better explains the data, which appears to follow an exponential curve. The value of r² shows what percentage of the variability in y is explained by the linear relationship with x. The r² for the first data set is 0.45. The r² for the first data set using the log transformation is 0.86. Therefore, you can conclude that the least squares regression line using the log transformation better represents the data set of problem one.
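The comparison of the two fits can be sketched as follows. The helper `r_squared` is a hypothetical name introduced here, and the base-10 logarithm is an assumption (it matches the default of Excel's LOG function); the choice of base does not change r².

```python
import numpy as np

# Problem 1 data, copied from the assignment statement.
x = np.array([20, 17, 10, 9, 7, 16, 3, 5, 10, 12, 9, 20, 18, 16], dtype=float)
y = np.array([525, 57, 19, 18, 14, 41, 5, 10, 23, 24, 15, 571, 102, 56], dtype=float)


def r_squared(x_vals, y_vals):
    """Return r^2 for the least squares line of y_vals on x_vals."""
    slope, intercept = np.polyfit(x_vals, y_vals, 1)
    ss_res = np.sum((y_vals - (slope * x_vals + intercept)) ** 2)  # unexplained variation
    ss_tot = np.sum((y_vals - y_vals.mean()) ** 2)                 # total variation
    return 1.0 - ss_res / ss_tot


r2_linear = r_squared(x, y)         # straight line fitted to the raw Y values
r2_log = r_squared(x, np.log10(y))  # straight line fitted to log(Y)

print(f"r^2, raw Y: {r2_linear:.3f}")
print(f"r^2, log Y: {r2_log:.3f}")
```

Note that the second r² measures variability explained on the log scale, so the two numbers describe fits to different response variables.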
At X = 13 in the first data plot, we predict a Y value of 122.22. In the second plot, using the log of Y, the prediction of 43.25 is closer to the actual trend of the points.
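Both predictions can be checked with the sketch below. It assumes a base-10 log, so the log-scale prediction is converted back to the original scale by raising 10 to that power; the variable names are illustrative.

```python
import numpy as np

# Problem 1 data, copied from the assignment statement.
x = np.array([20, 17, 10, 9, 7, 16, 3, 5, 10, 12, 9, 20, 18, 16], dtype=float)
y = np.array([525, 57, 19, 18, 14, 41, 5, 10, 23, 24, 15, 571, 102, 56], dtype=float)
x_new = 13.0

# Prediction from the line fitted to the raw Y values.
slope, intercept = np.polyfit(x, y, 1)
pred_linear = slope * x_new + intercept

# Prediction from the line fitted to log10(Y), converted back to the Y scale.
log_slope, log_intercept = np.polyfit(x, np.log10(y), 1)
pred_log = 10 ** (log_slope * x_new + log_intercept)

print(f"linear fit prediction at x = 13: {pred_linear:.2f}")
print(f"log fit prediction at x = 13:    {pred_log:.2f}")
```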
Because the same constant was added to the Y value of each point, the graph simply moved up the Y-axis, changing the linear prediction at X = 13 to 142.22. However, when you base the least squares regression line on the log transformation, the slope and intercept change somewhat, because log(Y + 20) is not simply log(Y) plus a constant. This explains why the prediction at X = 13 (using the log transformation) is 74.6, not 63.25. Although this prediction is not exactly 20 units higher than the prediction in problem one, it seems acceptable because the difference is only 11.35 units. Knowing what the numbers in the data set represent could help you decide how significant this difference is.
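The effect of the shift on each model can be seen side by side in the sketch below. The helpers `predict_linear` and `predict_log` are hypothetical names introduced for illustration, and a base-10 log is assumed.

```python
import numpy as np

# Problem 1 data, copied from the assignment statement.
x = np.array([20, 17, 10, 9, 7, 16, 3, 5, 10, 12, 9, 20, 18, 16], dtype=float)
y = np.array([525, 57, 19, 18, 14, 41, 5, 10, 23, 24, 15, 571, 102, 56], dtype=float)
shift = 20.0
x_new = 13.0


def predict_linear(x_vals, y_vals, at):
    """Predict Y at `at` from the least squares line through (x, y)."""
    slope, intercept = np.polyfit(x_vals, y_vals, 1)
    return slope * at + intercept


def predict_log(x_vals, y_vals, at):
    """Predict Y at `at` from the least squares line through (x, log10 y)."""
    slope, intercept = np.polyfit(x_vals, np.log10(y_vals), 1)
    return 10 ** (slope * at + intercept)


lin_orig = predict_linear(x, y, x_new)
lin_shift = predict_linear(x, y + shift, x_new)
log_orig = predict_log(x, y, x_new)
log_shift = predict_log(x, y + shift, x_new)

# The linear prediction moves by exactly the shift; the log prediction does not.
print(f"linear: {lin_orig:.2f} -> {lin_shift:.2f} (change {lin_shift - lin_orig:.2f})")
print(f"log:    {log_orig:.2f} -> {log_shift:.2f} (change {log_shift - log_orig:.2f})")
```

For the linear model only the intercept changes, so every prediction rises by the constant. The log model is refitted on a differently shaped set of values, so the change in its prediction depends on the data.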
You cannot take the log of a negative number; therefore, you cannot take the log of the negative Y coordinates in the data set as a way of finding a least squares regression line for exponential data. Instead, we must use a method similar to that of problem 2 and add an integer (n) to the Y values to make every value greater than 0. Adding an integer (n) to all the Y values moves the graph up along the Y-axis. Once the slope and intercept have been calculated from x and log(Y+n), you can make predictions for log(Y+n) from the least squares regression line, which in turn lead to predictions for Y+n. Then subtract the integer (n) from the Y+n predictions to obtain the answer. The prediction at X = 13 is 21.27.
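The shift, fit, and back-transform steps can be sketched as follows. The write-up does not state which integer n was used, so `n = 6` below is purely an assumption chosen to make the smallest Y value (-5) positive. The result depends on n and need not match the 21.27 reported above. A base-10 log is also assumed.

```python
import numpy as np

# Problem 3 data, copied from the assignment statement.
x = np.array([2, 1, 17, 9, 20, 16, 10, 19, 11, 2, 8, 10, 15, 7], dtype=float)
y = np.array([4, -3, 55, 2, 283, 12, 5, 175, -5, 5, 3, 4, 18, 1], dtype=float)

# The log of a negative number is undefined; NumPy returns nan for it.
with np.errstate(invalid="ignore"):
    log_of_negative = np.log10(-6.0)

n = 6.0  # ASSUMPTION: the integer actually used in the project is not given
shifted = y + n
if np.any(shifted <= 0):
    raise ValueError("n is too small: every shifted Y value must be positive")

# Fit the least squares line to (x, log10(Y + n)).
slope, intercept = np.polyfit(x, np.log10(shifted), 1)

# Predict log10(Y + n) at x = 13, convert to Y + n, then subtract n.
x_new = 13.0
pred_shifted = 10 ** (slope * x_new + intercept)
pred = pred_shifted - n

print(f"log10(-6) evaluates to {log_of_negative}")
print(f"prediction at x = 13 with n = {n:.0f}: {pred:.2f}")
```

Trying several values of n shows how sensitive the prediction is to the size of the shift.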
However, you might also notice that neither least squares regression line we obtained fits the data set well according to r squared. Treating the data set as linear, r squared showed that 45.2% of the variability in y is explained by the linear relationship with x. After treating the data set as exponential by using the log of y, r squared rose only to 58.7%, which is a little better but may indicate that something is hidden in the data set. Therefore, it would be helpful to know what the numbers represent in order to look for what might be causing the problem.
| Attachment | Size |
|---|---|
| MATH3353_David_Norman_Project2.pdf | 765.9 KB |
An individual assignment involving a least squares line calculation, slope description, residual plot, r squared, standard deviation of survey data, z score, box plot, and interquartile range.
| Attachment | Size |
|---|---|
| MATH3353_David_Norman_Project3.pdf | 492.84 KB |
A group assignment experimenting with skewing parameters to arrive at a predetermined outcome. It also includes calculations using t-scores and z-scores, a confidence interval, and an analysis of the validity of statistical results. A page is missing from the explanation.
| Attachment | Size |
|---|---|
| MATH3353_David_Norman_Project4.pdf | 204.83 KB |
A group project involving the analysis of null and alternative hypotheses based on p-values and the F distribution.
| Attachment | Size |
|---|---|
| MATH3353_David_Norman_Project5.pdf | 262.27 KB |
A group project including the analysis of p-values, confidence intervals, null hypothesis, intercept, and slope.
| Attachment | Size |
|---|---|
| MATH3353_David_Norman_Project6.pdf | 383.32 KB |
The final exam was an individual project. It brought together least squares regression, slope, residual plots, box plots, testing of the null hypothesis, confidence intervals, standard deviation, z-scores, p-values, t distributions, and data analysis.
| Attachment | Size |
|---|---|
| MATH3353_David_Norman_Final_Project.pdf | 1.52 MB |