This is the Top 10 topic for linear regression, one of the most important ideas in statistics and one that you'll use quite often in your other subjects and later on after you graduate. You should have the statistics review PowerPoint presentation loaded, and we will be skipping pretty quickly to slide number 52, which is the start of Top 10 concept number four.

Slide 52. Okay, linear regression, and I'm going to start here on page 52. Remember, linear is from "line," and regression means the tendency for what you see to go towards the mean over time. I won't get into how that word was generated, but remember: linear means it is a line, and regression means somewhere near the middle. What we're trying to do is find a line of best fit through data. Usually we generate a scatter plot, and we try to generate the line that best fits that data. Sometimes there isn't a very good line at all. Sometimes the line is a perfect fit, which is pretty rare. And sometimes the line is just what we call a best fit, the optimal line, where we can use that line to predict future y values, or what we call y-hat values, from the observed x's, which are the independent variables. We use those relationships between the independent variables and the dependent variable to make a formula so that we can predict later on, because in research work, including decision making by managers, we're trying to predict or explain, and linear regression helps us predict future situations from the data that we already have. The way we predict is by making an equation. Linear regression is about least squares: we take the distance between the line and each of the points, square it, and sum all those up.
We use the line that has the least squares, the smallest sum of squares between the line and the points. I'll go through all of that in the presentation, okay.

The next slide, slide 53. Remember, I said that our purpose here in linear regression is to create an equation from the existing data that we can use to predict future data. That's our purpose here. So y-hat means the predicted value of y, and y without the hat on it just means the observed values in the sample you have. y-hat, the predicted value in the future, is equal to b sub zero, which is called the intercept, or the y-intercept, plus b sub one times x: y-hat = b0 + b1 x. b sub one is the slope, and I'll talk about that in a moment, and x is the independent value that's observed or that you'll provide in the equation.

Going through the bullets, this just repeats what I just said. In the regression equation, y-hat is the dependent variable. That's the variable we're trying to predict or explain. Okay, the dependent variable is usually on the left-hand side of the equation, and you can think of it this way: it's the variable that your job depends on. It's the most important thing that you're trying to do. y is the dependent variable, and the x's are the independent variables, on the right-hand side of the equation. If there's just one x variable, it's called simple linear regression. If there's more than one x variable, it's called multiple linear regression, or just multiple regression. b sub zero is equal to the y-intercept. That means, what is y when x is zero?
That's what that means: what is y if x is zero. That's the fourth bullet point. The last bullet on slide 53 is b sub one. That's called the slope, or the regression coefficient, okay, and that's an important number because that number indicates the relationship between x and y. Here's a better definition at the end of slide 53: it's the change in y that relates to a one-unit change in x, okay. That's what that means.

Slide 54, slope and correlation. First bullet: positive slope is the situation where b1, the regression coefficient, is greater than zero, so there's a positive correlation between x and y. A positive slope means that if one increases, the other increases: if x, the independent variable, increases, then y, the dependent variable, increases. The opposite is negative slope. That's where the regression coefficient b1 is less than zero. It's a negative number, say -.231 or something like that. Okay, there's a negative correlation. That means if x increases, y decreases. So they move in opposite directions, if you will. The last option is a zero slope. The regression coefficient is 0.0. That means that there's no correlation, that there's no association between x and y, so the predicted value for y is simply the mean of y. There's no linear relationship between x and y. In general, you won't find that. Hopefully the slope won't be close to zero for the data you are studying, because if it's close to zero you can't say anything about the relationship.
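To make the three slope cases concrete, here is a minimal Python sketch. The data sets are made up for illustration and are not from the slides, and the helper name `fit_line` is hypothetical. It uses the standard textbook least-squares formulas, which the lecture itself does not state.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Standard textbook formula: slope = S_xy / S_xx.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up illustration data (an assumption, not from the slides).
xs = [1, 2, 3, 4]
rising = [1, 2, 3, 4]    # y goes up as x goes up
falling = [4, 3, 2, 1]   # y goes down as x goes up
flat = [5, 7, 7, 5]      # no linear trend in either direction

rising_b0, rising_b1 = fit_line(xs, rising)
falling_b0, falling_b1 = fit_line(xs, falling)
flat_b0, flat_b1 = fit_line(xs, flat)
flat_mean = sum(flat) / len(flat)

print("rising slope:", rising_b1)
print("falling slope:", falling_b1)
print("flat slope:", flat_b1, "intercept:", flat_b0, "mean of y:", flat_mean)
```

In the flat case the slope comes out as zero and the intercept equals the mean of y, which matches the point that with zero slope the prediction is simply the mean of y.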
You want there to be either a positive slope or a negative slope so that you can see an association, or a potential association, between x, or more than one x, and the y, okay.

Slide 55, simple linear regression. As I talked about before, simple just means single: one independent variable and one dependent variable, okay, one independent variable x and one dependent variable y. And the definition of linear is that the graph of the regression equation is a straight line. That's what linear regression is. As you're probably guessing, is there nonlinear regression? Yes, there is, where it's a curve, but that's not on the exam. So linear regression means the graph of the regression equation is a straight line. Like you learned in algebra, okay, y is equal to mx plus b. That's the equation for a line. It's the exact same equation we had before on a previous slide. It's a straight line. Today's example: y is equal to the salary, that's the dependent variable, say for example of a female manager, in thousands of dollars. x is, say, the number of children that the female manager has, and n is the number of observations. So you could get a huge table of all the salaries and all of the children for each individual person.
On slide 57, here's the given data, and this is the way you work through regression. Remember, on the exam you don't necessarily have to memorize all of the formulas in each particular case, because it's mostly about the concepts, not necessarily about the arithmetic. And remember, there just isn't that much time to answer 16 questions in 35 minutes. But this is a review of how linear regression works, so that you can understand the arithmetic and the concept well, so that you can repeat the concept on the exam, okay, and use it in your other classes as well. So the first column on slide 57 is x. Remember, x is the independent variable. So one of the managers has two children, one of the managers has one child, and another of the managers has four children. y is the salary. This is 48 units, in thousands, so that's $48,000, then $52,000, then $33,000. That's the data that's given in the problem, okay.

Slide 58. We sum up all of the x's, we sum up all of the y's, and we also count the number of observations, which is three. That's what's going on in slide 58: 2 plus 1 plus 4 is 7, and 48 plus 52 plus 33 is 133, okay. And if you're wondering how the computer, like Excel or another program, goes from a list of data to the regression equation, it uses the method of least squares. You don't need to remember that formula, but it's using a formula that finds the best line through the data. Probably not straight up, probably not straight down, maybe somewhere in the middle, or a little bit off to the side one way or the other. You usually have to use a computer to do this, with a lot of calculations, and it comes up with that equation, y-hat is equal to b0 plus b1 x, okay. The method of least squares, from slide 59.
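As a rough sketch of what the computer is doing with the slide 57 data, the Python below computes the sums from slide 58 and then the least-squares line. The formula used (slope equals the sum of cross-deviations divided by the sum of squared x deviations) is the standard textbook one and is an assumption here, since the lecture does not state it. The variable names are made up for this example.

```python
# Data from slide 57: number of children (x) and salary in thousands (y).
children = [2, 1, 4]
salary = [48, 52, 33]

# Slide 58: the sums and the number of observations.
n = len(children)        # 3 observations
sum_x = sum(children)    # 2 + 1 + 4
sum_y = sum(salary)      # 48 + 52 + 33

x_bar = sum_x / n
y_bar = sum_y / n

# Standard least-squares formulas (assumed, not shown in the lecture).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(children, salary))
s_xx = sum((x - x_bar) ** 2 for x in children)

b1 = s_xy / s_xx          # slope, the regression coefficient
b0 = y_bar - b1 * x_bar   # intercept

print("n =", n, "sum of x =", sum_x, "sum of y =", sum_y)
print("slope b1 is about", round(b1, 2))
print("intercept b0 is about", round(b0, 2))
```

With this data the slope and intercept should land on the values the lecture reports as computer output, up to floating-point rounding.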
With the method of least squares, in this particular case, the slope b1 is -6.5. The computer comes up with that number. That's the slope, the regression coefficient. Again, the method of least squares is not on the exam, and you don't have to memorize the formula. b1 is equal to -6.5, and that's given in the problem. In other words, the computer has determined that, and the problem is just giving you the output. The interpretation: if one female manager has one more child than another, then her salary is $6,500 lower. That's 6.5 times a thousand, $6,500, and it's lower because the slope is negative. That is, the salary of female managers is expected to decrease by 6.5, in thousands, for each additional child. So one additional unit, which here would be having one additional child, is related to a reduction in the salary of the manager of 6.5 in thousands, $6,500, okay, at least according to this data, this sample of three female managers.

Slide 60, next page: the intercept, b sub zero. We're just doing a little bit of algebra. If you put b zero on the left-hand side and put y-bar on the right-hand side, the intercept is equal to y-bar minus b1 times x-bar. In the formula here, x-bar is equal to the sum of x divided by n, and the sum of x was 7, so 7 divided by 3 equals 2.33. y-bar is equal to the sum of y divided by n, which is 133 divided by 3, which is 44.33. Substituting in here, the intercept is equal to 44.33 minus -6.5 times 2.33, which is 59.5. So that's just the calculation for the intercept, okay. So if the number of children is zero, remember, the expected salary is 59.5 times a thousand, which is $59,500, okay. So that's how you calculate the intercept, which means, what happens when x is zero?
Where does the regression line cross the y-axis? And those are the calculations for those values, right: b sub zero, on page 60, is equal to y-bar minus the regression coefficient times the sample mean of x.

Slide 61. Here's the full regression equation for this female manager problem: y-hat is equal to b sub zero, which is 59.5, minus 6.5 x. Okay, the slope is a negative number, it's -6.5 x. And that's a formula: if we know x, that means we can calculate y-hat, the predicted value, okay.

Slide 62. Here's the title of slide 62: forecast the salary. If a particular female manager has three children, the predicted amount from this regression equation, based on the data we have, is 59.5 minus 6.5 times 3, which is equal to 40. 40 times a thousand, $40,000, is the expected salary.

The calculations get a little bit more detailed. There's also something known as the standard error of the estimate. y-hat is equal to the forecast that we just made, b sub zero for the intercept plus b sub one, the regression coefficient, times x. And the error in this particular case is y, which is the observed value of y, minus y-hat, which is the predicted value of y. Remember, there's no way to make the regression line go through every single point on the scatter plot perfectly. We just have to pick the line of best fit. Okay, so the line is the optimal one. It gives the least squares, the smallest sum of squares, but it's still not perfect. So there's always going to be some error. That's why the term error is in the middle of slide 63. The error is the observed y, that's y with nothing above it, minus y-hat, the predicted value. It would be nice if that number was zero, but it generally will not be zero, so we need to know something about the standard error of the estimate.
And that is abbreviated. The symbol for that is usually s sub e, for standard error. It is equal to the square root of the sum of the squares divided by n minus 2, which is equal to the square root of the sum of y minus y-hat, that's the observed values minus the predicted values, squared, divided by n minus 2: s_e = sqrt( sum of (y - y-hat)^2 / (n - 2) ). It looks a lot like a variance calculation, doesn't it? Or rather a standard deviation calculation, okay: the sum of the observed y minus the predicted y, squared, divided by n minus 2, and then the square root. That's the standard error of the estimate, okay.

Slide 64 goes through some of the calculations to arrive at the standard error of the estimate. I won't go through all of that right now, but it's in tabular form so you can see how that particular value works. We get the sum of the squares, and in this particular example the sum of squares is 3.5. So it's 3.5 divided by 3 minus 2, which is 1, so the standard error is equal to the square root of 3.5, which is about 1.9. That's the standard error of the estimate. That means that the actual salary is typically 1.9 units away from the expected salary. 1.9 times a thousand is 1,900, so the actual salary is typically, or approximately, $1,900 away from the expected salary.
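The forecast from slide 62 and the standard error calculation from slides 63 and 64 can be sketched in a few lines of Python. This assumes the coefficients reported in the lecture (intercept 59.5, slope -6.5) and the three-manager data set. The function name `predict` and the variable names are made up for illustration.

```python
import math

# Coefficients as reported in the lecture (in thousands of dollars).
b0 = 59.5   # intercept
b1 = -6.5   # slope, the regression coefficient

# Data from slide 57.
children = [2, 1, 4]
salary = [48, 52, 33]


def predict(x):
    """Predicted salary (y-hat, in thousands) for x children."""
    return b0 + b1 * x


# Slide 62: forecast for a manager with three children.
forecast_three = predict(3)

# Error for each observation: observed y minus predicted y-hat.
residuals = [y - predict(x) for x, y in zip(children, salary)]

# Sum of squared errors, then the standard error of the estimate.
sse = sum(e ** 2 for e in residuals)
n = len(children)
s_e = math.sqrt(sse / (n - 2))

print("forecast for 3 children:", forecast_three)
print("residuals:", residuals)
print("sum of squared errors:", sse)
print("standard error of the estimate:", round(s_e, 2))
```

Dividing by n minus 2 rather than n follows the formula given in the lecture. With only three observations that divisor is 1, which is why the standard error here is just the square root of the sum of squared errors.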
We wish the expected and the actuals were the same, but there's no way to do perfect prediction, so we have to estimate it, and this is the way you calculate how far away.

Slide 66. Another important concept in regression is the coefficient of determination. The coefficient of determination, in the first bullet point here, is r squared. r squared is the percentage of the total variation in y that can be explained by the variation in x. There's always going to be some variation in y, and there's always going to be some variation in x. r squared is a percent of the total, of 100%, so you would like a high r squared, okay. It's the percent of the total variation in y, the observed values, that can be explained by the variation in x. A nice r squared is over 70, 80, 90%, those kinds of numbers, but it depends on the problem. Bullet point number two: the coefficient of determination, r squared, measures how closely the linear regression line fits the points in the scatter diagram, or scatter plot. Remember, I told you the line is a best-fit line. It's not a perfect line, so we're trying to measure how good that line is, and r squared helps measure that. Number three: if r squared is equal to one, or 100%, that's the maximum possible value, and that means that there's a perfect linear relationship between x and y, okay. Last bullet point: if r squared is zero, which means 0%, that's the minimum value, and that means that there's no linear relationship. r squared is the square of r, which is the correlation between two variables, if there's only one x and one y. So r squared, again, on slide 66, is the percentage of the total variation in y that can be explained by the variation in x, and we would prefer that that number be high, that it be close to 100 percent. Otherwise the data doesn't explain what's going on.

Slide 67, sources of variation. There are five bullets on the slide. Total variation has two parts. There's the explained variation.
Plus there's the unexplained variation: total variation is equal to the explained variation plus the unexplained variation. We would prefer that the total variation equal the explained variation, so that we can account for all of it, but nothing is ever that simple, so there's going to be some explained variation and some leftover unexplained variation. SS is the sum of squares, which is also the variation. Total SS is equal to the regression sum of squares plus the error sum of squares, okay. Sometimes it's called the regression sum of squares plus the residual sum of squares. SST, which is sum of squares total, is equal to the sum of squares regression plus the sum of squares error: SST = SSR + SSE. SSR, the sum of squares regression, is equal to the explained variation, and SSE, the sum of squares error, is equal to the unexplained variation. Those are the definitions on slide 67.

Slide 68, coefficient of determination. r squared is basically the sum of squares of the regression divided by the sum of squares of the total. Remember, r squared is the amount that's explained, the explained variation, so, this is the first bullet, it's the sum of squares of the regression divided by the sum of squares of the total, and we would prefer that number be high. Here's an example, second bullet: r squared is equal to 197 divided by 200.5. So r squared, the sum of squares of the regression divided by the sum of squares of the total, here is .98. Third bullet, interpretation: that means 98% of the total variation in salary can be explained by the variation in number of children, at least to the extent of this particular data, which was completely made up, okay.

Slide 69. r squared falls somewhere between 0 and 1, or if you like, 0% and 100%, okay. Zero means that there's no linear relationship, since the sum of squares of the regression is zero. That means the explained variation is zero, which means there's no linear relationship. Then there's r squared equal to one, or 100 percent.
That means that there's a perfect relationship, since the sum of squares of the regression equals the sum of squares of the total, but life isn't that simple. Also, if the sum of squares of the regression equals the sum of squares of the total, that means that there's nothing left over. That means that the unexplained variation, the sum of squares error, SSE, is equal to zero, okay. But just because there is a perfect relationship, or a good relationship, or an association between two variables, it does not prove cause and effect. Correlation, that is, a mathematical relationship between two variables x and y, is not causality. It would be nice if we could prove it that easily, but it's not that simple. It's one more step in that direction. You have to use some more advanced techniques to be able to demonstrate causality, that one variable causes another, okay.

Next is correlation and regression, slide 70. r is equal to the correlation coefficient, okay. Case number one, that's the first bullet here: the slope, which is b sub one, is less than zero, which means that r is less than zero. r is the negative square root of the coefficient of determination, so r is equal to minus the square root of r squared, okay. In our example with the female manager's salary, the slope b1 is -6.5 and r squared is equal to .98, okay. We take the negative square root of .98, and that means r is negative .99. Case number two: the slope is greater than or equal to zero, and r is the positive square root of the coefficient of determination. Example: r squared is equal to .49.
We take the positive square root of .49 and we end up with .7, so r is equal to .7. The fourth bullet says r has no interpretation in this particular case, and r overstates the relationship, which is why we usually compute r but report r squared, because that's easier to understand. If r squared is, say, .49, that means 49% of the variation in y is explained by the variation in x, okay.

Slide 73. We're almost done here. The title of the slide is caution: nonlinear relationships. For example, a parabola, a hyperbola, and other kinds of curvilinear relationships cannot be measured by r squared, okay, because remember we're doing linear regression. So if there's a relationship between two variables but it's nonlinear, linear regression isn't going to be able to determine that. There are some more sophisticated relationships that you might see in some other classes that are nonlinear, but not for SOM 120 and certainly not for the LDC exam, okay. The second bullet here is that you could get an r squared of zero with a nonlinear graph on a scatter diagram. The reason for that is that the scatter diagram is displaying the points and we're trying to find a single line, the best-fit line. If there's a nonlinear relationship, it's still there, but the linear regression process, the process of least squares, won't detect it, and you'd end up with an r squared of zero, okay.

So in summary, slide 74. Case number one: if b1, the regression coefficient, is greater than zero, r is the positive square root of the coefficient of determination. Example number one: y is equal to 4 plus 3x and r squared is equal to .36, so r is equal to the positive square root of r squared, which is .6, a positive .6. Case number two: if b1 is less than zero, r is the negative square root of the coefficient of determination, so you change the sign. Here's the example.
Example number two: y is equal to 80 minus 10x and r squared is equal to .49, so r is equal to negative .7. Bullet point number three, the last bullet point: note that example 2 has a stronger relationship, as measured by the coefficient of determination. And why? Because .7 is larger than .6. Yes, I know the number is negative, but that just means that when x increases, y decreases. .7 in absolute value is larger than .6 in absolute value, and .6 is closer to zero than .7. So in example 2 there's a stronger relationship than in example 1, okay.

This is a summary of what we've seen before, on slide 75: extreme values. If r is plus one, that's a perfect positive correlation. You won't see that very often, but it might occur, maybe in a hypothetical example. If r is negative one, that's a perfect negative correlation. Again, you probably won't see that in your work, but it might occur. And if r is zero, it's the same issue as if r squared is zero, or 0%. This is zero correlation: there is no correlation between those two variables, which is not what you're looking for. Generally you collected the sample on the basis that you thought that there was a relationship between those variables, so you'd like to see an r that's away from zero on either side.

Slide 76 shows some examples of what happens when you put some data into Excel and use the regression output, which is under Tools, Data Analysis, I believe, to do regression. It gives you all of these values on the screen: it gives you the correlation coefficient, the coefficient of determination, and the standard error of the estimate, so you don't have to calculate them, which is nice, and, most importantly, the regression coefficients: -6.5 at the bottom of the screen here, and the y-intercept, 59.5, okay. And that's it for linear regression.
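As a recap, the numbers in that kind of regression output can be reproduced by hand. The Python below is a minimal sketch, not the way Excel itself computes them: the function name `regression_summary` and the dictionary keys are made up for this example, and the formulas are the standard textbook ones. It also runs the same function on a small made-up parabola to illustrate the slide 73 caution about nonlinear relationships.

```python
import math


def regression_summary(xs, ys):
    """Simple linear regression summary using textbook least-squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b1 = s_xy / s_xx            # slope, the regression coefficient
    b0 = y_bar - b1 * x_bar     # y-intercept
    predicted = [b0 + b1 * x for x in xs]

    sst = sum((y - y_bar) ** 2 for y in ys)                  # total variation
    ssr = sum((p - y_bar) ** 2 for p in predicted)           # explained variation
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))   # unexplained variation

    r_squared = ssr / sst
    # r takes the sign of the slope: negative root if b1 < 0, positive otherwise.
    r = math.copysign(math.sqrt(r_squared), b1)
    s_e = math.sqrt(sse / (n - 2))

    return {"b0": b0, "b1": b1, "sst": sst, "ssr": ssr, "sse": sse,
            "r_squared": r_squared, "r": r, "s_e": s_e}


# The three female managers from slide 57.
managers = regression_summary([2, 1, 4], [48, 52, 33])
for name, value in managers.items():
    print(name, "=", round(value, 4))

# Made-up parabola (y = x squared): a clear relationship, but not a linear one.
parabola = regression_summary([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print("parabola r squared =", parabola["r_squared"])
```

Because this works from the unrounded sums, the regression and total sums of squares come out slightly different from the rounded 197 and 200.5 quoted on the slide, but r squared still rounds to .98 and r to negative .99. For the parabola, the best-fit line is flat, so the explained variation is zero even though y depends entirely on x.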