
Module 6 Discussion-Stats
250-word initial post. Replies will be sent later.

Describe an example of a research question where a regression model could be used. Describe the characteristics of the independent variables and their likelihood of explaining the variance in the dependent variable. Discuss what you would expect the correlation coefficient to be between each independent variable and the dependent variable.
Discuss how r^2 is valuable in determining the effectiveness of the regression model.


In your two replies to classmates, provide insights into the similarities and dissimilarities between their example and yours.


Module 6 Discussion

Student's Name
Professor's Name
Course Title
Date

1. Describe an example of a research question where a regression model could be used. Describe the characteristics of the independent variables and their likelihood of explaining the variance in the dependent variable. Discuss what you would expect the correlation coefficient to be between each independent variable and the dependent variable.
A company that runs call centers can use regression analysis to determine the effect of a decline in calls on sales. Once the relationship is understood, the company can predict whether the pattern will continue and decide how to address it. Generally, sales tend to go down when the number of calls declines, and vice versa; the two variables move together, reflecting a positive relationship, so I would expect them to be strongly correlated. Regression helps identify the strengths as well as the weaknesses so the company can improve on them, reverse the drop in calls, and increase sales. In this scenario, the number of calls is the independent variable and sales is the dependent variable. The regression equation can be stated as y = bx + a, where y is the dependent variable, x is the independent variable, b is the slope, and a is the constant or y-intercept. The values of x can be varied to observe what happens to y. Hence, independent variables are the variables that are manipulated; their effects can be measured and compared. They are also referred to as predictors because they forecast the values of the dependent variable. Dependent variables measure the effect of the independent variable on the test units; they depend on the independent variables and are the variables being predicted. To be precise, regression is a method for relating variables to one another (Black, 2017). The researcher observes the data, graphs it, and establishes whether there is a pattern; an equation is then created that best matches the pattern. Deriving the line of best fit is the most common form of linear regression.
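To make this concrete, here is a minimal R sketch of the call-center regression. The monthly call counts and sales figures below are invented for illustration, not taken from any real company; only base R functions (lm, coef, cor, predict) are used.

# Hypothetical monthly data: calls handled and sales (thousands of dollars)
calls <- c(320, 290, 350, 410, 380, 300, 270, 330, 360, 400, 250, 310)
sales <- c(41, 38, 45, 52, 49, 40, 35, 43, 47, 51, 33, 42)

# Fit the simple linear regression sales = a + b*calls
fit <- lm(sales ~ calls)
coef(fit)          # a (intercept) and b (slope)
cor(calls, sales)  # correlation coefficient r between calls and sales

# Predicted sales if calls recover to 420 per month
predict(fit, newdata = data.frame(calls = 420))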
2. Discuss how r^2 is valuable in determining the effectiveness of the regression model.
R2, or the coefficient of determination, measures the strength of the linear relationship among the variables. It ranges between 0 and 1.0: a value of 1.0 is a perfect fit and is most reliable for forecasting the future, while a value of 0 shows that the model does not describe the data at all. An R2 of 0.60 reveals that 60% of the variation in the dependent variable is explained by the independent variable. A higher R2 means the model has stronger predictive power; it indicates smaller differences between the fitted values and the observed data. After designing a linear model, the researcher wants to know how well the regression model matches the data (Frost, 2020), and R2 is one way of determining this. Goodness of fit examines the distance between the data points scattered in the diagram and the fitted line: a tight dataset will have a line close to the points and a high level of fit, reflecting that the distance between the data and the line is small. Notably, a regression model best aligns with the data when the differences between the observations and the predicted values are small and unbiased. Before evaluating goodness-of-fit measures such as R2, assess the residual plots: unlike a single numeric output, these plots reveal whether the model is biased by exposing patterns in the residuals. The results are untrustworthy if the model is biased; if the residual plots look good, proceed to evaluate R2. Adjusted R2 is recommended when assessing a model's fit (Black, 2017) because it accounts for the number of independent variables as well as the sample size. It may increase or decrease when an independent variable is added or removed, and a rise in adjusted R2 shows that the model has genuinely improved.
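As a concrete illustration of where R2 and adjusted R2 live in a fitted model, the following R sketch uses simulated data; the data-generating line y = 1 + 2x plus noise is an assumption chosen purely for demonstration.

set.seed(1)
x <- runif(100)*10            # simulated predictor
y <- 1 + 2*x + rnorm(100)*3   # simulated response with noise

model <- lm(y ~ x)
summary(model)$r.squared      # R-squared: share of variance in y explained by x
summary(model)$adj.r.squared  # adjusted R-squared: penalized for model size

# Residual diagnostics: look for patterns that would signal a biased model
plot(model, which = 1)        # residuals vs. fitted values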

References
Black, K. (2017). Business statistics: For contemporary decision making (9th ed.). Hoboken, NJ: Wiley.
Frost, J. (2020). How to interpret R-squared in regression analysis. Retrieved from https://statisticsbyjim.com/regression/interpret-r-squared-regression/

Jtrain2

a. The fraction of men receiving job training is 0.4157, or 41.57%.

b. The average of re78 for the sample of men who received job training is 6.35 thousand dollars, whereas the average of re78 for the men who did not receive job training is 4.55 thousand dollars. The difference of 1.8 thousand dollars is economically significant.

c. The fraction of men who received job training and are unemployed is 0.2432, or 24.32%, whereas the fraction of men who did not receive job training and are unemployed is 0.3538, or 35.38%. This 11.06 percentage-point difference is very large and tells us that job training decreases the chances of unemployment.

d. From both b) and c) we conclude that job training significantly increases income and reduces unemployment.

#Q1

data(jtrain2, package='wooldridge')
View(jtrain2)
#a
# Fraction of men who received job training
sum(jtrain2$train==1)/nrow(jtrain2)

## [1] 0.4157303

#b
s1 <- subset(jtrain2, train==1)
mean(s1$re78)
## [1] 6.349145
s2 <- subset(jtrain2, train==0)
mean(s2$re78)
## [1] 4.554802

#c
s3 <- subset(jtrain2, train==1)
sum(s3$unem78==1)/sum(jtrain2$train==1)
## [1] 0.2432432
s4 <- subset(jtrain2, train==0)
sum(s4$unem78==1)/sum(jtrain2$train==0)
## [1] 0.3538462

Q2

The mean, standard deviation, intercept, and slope will differ on each run because this is a simulation.

a. The sample mean is approximately 5, which matches the mean of the Uniform(0, 10) distribution. The sample standard deviation is 3.00, which is also very close to the standard deviation of the Uniform(0, 10) distribution (10/sqrt(12) ≈ 2.89).

b. No, the average of ui is not exactly 0, but it is very close to 0, which is expected because these are random draws and the sample mean can differ from the population mean of 0. The sample standard deviation is 5.81, close to the population standard deviation of 6.

c. The intercept is 2.087 and the slope is 1.779. They are not equal to the population values (1 and 2), but they are close, especially the slope.

d. Both the sum of the residuals and the sum of the products of xi and the residuals are essentially 0. This is expected: the OLS first-order conditions force these sums to be exactly 0, so any deviation is just floating-point error.

e. The sum of the errors and the sum of the products of xi and the errors are not 0, because the true errors, unlike the residuals, are not constrained by the estimation.

f. The new intercept is 0.6041 and the new slope is 2.0126. They differ from the first run because each simulation draws new random numbers, but the slope is close to 2 in both cases.

#Q2
#a. Generate 500 Uniform observations on (0,10)
xi <- runif(500)*10
mean(xi)
## [1] 5.007284
sd(xi)
## [1] 3.008299

#b. Generate 500 Normal errors with mean 0 and variance 36 (sd 6)
ui <- rnorm(500)*6
mean(ui)
## [1] -0.0205087
sd(ui)
## [1] 5.809095

#c
yi <- 1 + 2*xi + ui
resul <- lm(yi ~ xi)
resul
##
## Call:
## lm(formula = yi ~ xi)
##
## Coefficients:
## (Intercept)           xi
##       2.087        1.779

#d
resid <- residuals(resul)
options(scipen = 999)
sum(resid)
## [1] -0.00000000000005384582
Prod <- xi*resid
sum(Prod)
## [1] -0.000000000001600942

#e
sum(ui)
## [1] -10.25435
Prod2 <- xi*ui
sum(Prod2)
## [1] -1050.401

#f
#a. Generate 500 new Uniform observations on (0,10)
xi_new <- runif(500)*10
mean(xi_new)
## [1] 4.966936
sd(xi_new)
## [1] 2.899189
#b. Generate 500 new Normal errors with sd 6
ui_new <- rnorm(500)*6
mean(ui_new)
## [1] -0.3332974
sd(ui_new)
## [1] 6.400612
#c
yi_new <- 1 + 2*xi_new + ui_new
resul_new <- lm(yi_new ~ xi_new)
resul_new
##
## Call:
## lm(formula = yi_new ~ xi_new)
##
## Coefficients:
## (Intercept)      xi_new
##      0.6041      2.0126

Q3 Discrim

a. The averages for prpblck and income are 0.113 and 47,053.78, respectively; the standard deviations are 0.1824 and 13,179.29, respectively. prpblck is the proportion of the population that is black, and income is measured in dollars.

b. The resulting regression is psoda = 0.956 + 0.115*prpblck + 0.0000016*income. The sample size is 401 and R-squared is 0.0642. The coefficient on prpblck indicates that if the proportion of African-Americans increases by 10 percentage points, the price of soda is estimated to increase by about 1.15 cents. It is not economically large.

c. The estimate of the coefficient on prpblck in the simple regression is 0.065, which is lower than the prior estimate. This tells us that the estimated discrimination effect gets smaller when the income variable is excluded.

d. If prpblck increases by 20 percentage points, the estimated psoda increases by about 2.43%.

e. Adding the prppov variable causes the estimate of the coefficient on prpblck to fall to 0.0728.
f. The correlation between log(income) and prppov is approximately -0.8385, which tells us the relationship is strongly negative. This makes sense, because one would expect declines in income to result in higher poverty rates.

g. Yes, they are highly correlated, and including both may create a multicollinearity problem. However, our main purpose is to study the discrimination effect, and we need to control for as many measures of income as we can, so including both makes sense.

#Q3
data(discrim, package='wooldridge')
View(discrim)
#a
mean(discrim$prpblck, na.rm=TRUE)
## [1] 0.1134864
sd(discrim$prpblck, na.rm=TRUE)
## [1] 0.1824165
mean(discrim$income, na.rm=TRUE)
## [1] 47053.78
sd(discrim$income, na.rm=TRUE)
## [1] 13179.29

#b
results <- lm(discrim$psoda ~ discrim$prpblck + discrim$income)
options(scipen = 999)
results
##
## Call:
## lm(formula = discrim$psoda ~ discrim$prpblck + discrim$income)
##
## Coefficients:
## (Intercept) discrim$prpblck discrim$income
## 0.956319626 0.114988191 0.000001603
# Number of observations
nobs(results)
## [1] 401
# R-squared
summary(results)$r.squared
## [1] 0.06422039

#c
results2 <- lm(discrim$psoda ~ discrim$prpblck)
options(scipen = 999)
results2
##
## Call:
## lm(formula = discrim$psoda ~ discrim$prpblck)
##
## Coefficients:
## (Intercept) discrim$prpblck
## 1.03740 0.06493

#d
results3 <- lm(log(discrim$psoda) ~ discrim$prpblck + log(discrim$income))
options(scipen = 999)
results3
##
## Call:
## lm(formula = log(discrim$psoda) ~ discrim$prpblck + log(discrim$income))
##
## Coefficients:
## (Intercept) discrim$prpblck log(discrim$income)
## -0.79377 0.12158 0.07651
# Percentage change in psoda for a 0.20 increase in prpblck
Coefficients <- coef(results3)
0.2*100*Coefficients["discrim$prpblck"]
## discrim$prpblck
## 2.431605

#e
results4 <- lm(log(discrim$psoda) ~ discrim$prpblck + log(discrim$income) + discrim$prppov)
results4
##
## Call:
## lm(formula = log(discrim$psoda) ~ discrim$prpblck + log(discrim$income) +
##     discrim$prppov)
##
## Coefficients:
## (Intercept) discrim$prpblck log(discrim$income)
## -1.46333 0.07281 0.13696
## discrim$prppov
## 0.38036

#f
cor(log(discrim$income), discrim$prppov, use = "complete.obs")
## [1] -0.838467

Q4 K401Subs

a. There are 2,017 single-person households in the data set.

b. The resulting regression is nettfa = -43.0398 + 0.7993*inc + 0.8427*age. The sample size is 2,017 and R-squared is 0.1193. The coefficient on inc is 0.7993, indicating that if annual family income increases by $1,000, net financial wealth increases by $799.30, holding age constant. The coefficient on age is 0.8427, indicating that if the age of the survey respondent increases by one year, net financial wealth increases by $842.70, holding annual family income constant. I do not see any surprises in these results: as family income grows, net wealth grows too, though not by the same amount because of expenses; likewise, as age increases, a family accumulates more funds and its wealth grows.

c. The intercept is -43.0398, which tells us that for a family with zero income and age zero, net financial wealth would be -43.04 thousand dollars. I do not think it has a meaningful interpretation, since the survey has no respondents with an age of 0.

d. The p-value for the left-tailed test is 0.0437. Since the p-value is greater than 0.01, we fail to reject H0 and conclude that there is insufficient evidence that the age coefficient is significantly less than 1.
e. For the simple regression of nettfa on inc, the estimated coefficient on inc is 0.8207, which does not differ much from the estimate in part b. The correlation between age and income is 0.1056, which is very low; this explains why omitting the age variable does not affect the income coefficient much, because the variables are only weakly correlated. So there is no serious omitted-variable bias.

#Q4
data(k401ksubs, package='wooldridge')
View(k401ksubs)
#a
sum(k401ksubs$fsize==1)
## [1] 2017

#b
res <- lm(k401ksubs$nettfa ~ k401ksubs$inc + k401ksubs$age, subset = (k401ksubs$fsize==1))
res
##
## Call:
## lm(formula = k401ksubs$nettfa ~ k401ksubs$inc + k401ksubs$age,
##     subset = (k401ksubs$fsize == 1))
##
## Coefficients:
## (Intercept) k401ksubs$inc k401ksubs$age
## -43.0398 0.7993 0.8427
# Number of observations
nobs(res)
## [1] 2017
# R-squared
summary(res)$r.squared
## [1] 0.1193432

#d
# Age coefficient estimate
age_coeff <- res$coefficients[3]
# Age coefficient standard error
age_Sder <- summary(res)$coefficients[3, "Std. Error"]
# t statistic for H0: Beta2 = 1
t_slope <- (age_coeff - 1) / age_Sder
t_slope
## k401ksubs$age
## -1.709944
# Left-sided p-value, degrees of freedom = n - 2
pvalue <- pt(t_slope, df = length(k401ksubs$age) - 2)
pvalue
## k401ksubs$age
## 0.04365485

#e
res2 <- lm(k401ksubs$nettfa ~ k401ksubs$inc, subset = (k401ksubs$fsize==1))
res2
##
## Call:
## lm(formula = k401ksubs$nettfa ~ k401ksubs$inc, subset = (k401ksubs$fsize == 1))
##
## Coefficients:
## (Intercept) k401ksubs$inc
## -10.5710 0.8207
cov(k401ksubs$age, k401ksubs$inc)
## [1] 26.21035

Devon Conway
Conway M6

Picture this: my favorite food is peanut butter. I just started my own business creating organic, vegan peanut butter. I want to grow my business, so I turn to a vegan wellness influencer on Instagram and pay her to post about my peanut butter in hopes of growing my business and acquiring more customers. A simple example of a real-life application of a regression model addresses the question: what is the relationship between the money spent on paying this influencer and the revenue generated? The independent variable is the money allocated to the influencer to promote my peanut butter. The dependent variable is the revenue generated that is directly attributable to the influencer promoting my peanut butter. The formula to answer this question is: revenue = β0 + β1(ad spending).
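A minimal R sketch of the regression Devon describes, with invented ad-spending and revenue figures standing in for real campaign data:

# Hypothetical monthly influencer spending and revenue (dollars)
ad_spending <- c(200, 350, 500, 650, 800, 950, 1100)
revenue     <- c(1500, 2100, 2900, 3400, 4200, 4800, 5600)

# Estimate revenue = b0 + b1*(ad spending)
fit <- lm(revenue ~ ad_spending)
coef(fit)               # b0 (intercept) and b1 (slope)
summary(fit)$r.squared  # R-squared for the fit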
The R^2 variable is beneficial in determining the effectiveness of the regression model because it measures the strength of the linear relationship between the independent and dependent variables in the problem one is solving. For most statistical cases, the higher R^2 is, the better the model fits. As noted in the Minitab Blog, R^2 can be between 0% and 100%: zero represents a model that explains none of the variability of the response data around its mean, and 100% represents a model that explains all of the variability of the response data around its mean.

References:
Editor, M. B. Regression analysis: How do I interpret R-squared and assess the goodness-of-fit? Retrieved from https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit

Shannon Song
Module 6 Discussion

I have used regression models in the past to examine whether there was any correlation between certain variables and the likelihood of purchasing insurance. Some examples of these variables might include age, income, job title, other types of insurance already owned, etc. For simplicity, I will discuss the variable age, which would be our independent variable, and its connection to the likelihood of obtaining insurance, our dependent variable. I would expect that as age increases, the likelihood of obtaining insurance would also increase. Typically, an |r| that is greater than 0.5 and less than 1 is considered strong. However, since I was testing multiple variables, I only wanted to use variables with |r| between 0.75 and 1, as they were the most strongly correlated. I would expect this variable to fall in the range 0.5 < |r| < 1, but that depends on the specific data provided. If the correlation is above 0.75, I would say that age can be a strong predictor of the likelihood of obtaining insurance. I would also expect this relationship to be positively correlated: as age increases, so too does the chance of purchasing insurance. However, it could also be that age is positively correlated up to a certain point, after which purchase becomes less likely the older one gets. Another independent variable that might be closely associated is income level; we might expect a positive correlation between income and the chance of purchasing insurance.

One thing we need to check for when using multiple independent variables to predict a single dependent variable is multicollinearity. This occurs when the independent variables are correlated with each other, which can produce inaccurate regression results for the dependent variable. One way to check for this is to plot all the independent variables on a heat map using a correlation matrix; this is an easy way to visualize which variables might be too highly correlated with each other. Typically, I would drop any variable whose r value with other independent variables is above 0.5, as in the sketch below.
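Here is a small R sketch of the correlation-matrix heat-map check Shannon describes, using made-up age, income, and job-tenure columns as the hypothetical predictors; any real analysis would substitute the actual variables:

# Hypothetical predictor data
set.seed(42)
age    <- round(runif(200, 20, 70))
income <- 20000 + 900*age + rnorm(200, 0, 8000)  # deliberately correlated with age
tenure <- round(runif(200, 0, 20))               # independent of the others

predictors <- data.frame(age, income, tenure)

# Correlation matrix of the independent variables
cormat <- cor(predictors)
cormat

# Visualize as a heat map; strong off-diagonal values flag multicollinearity
heatmap(cormat, symm = TRUE, Rowv = NA, Colv = NA, scale = "none")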
An r^2 value measures the individual data points' variability from the regression line. This measurement is also called the goodness of fit, specifically for a linear regression model. The r^2 ranges between 0% and 100%. An r^2 value closer to 100% is typically considered to indicate a better fit of the regression line to the data points. It can be calculated by dividing the variance explained by the model by the total variance. When an r^2 is interpreted, it is in terms of the amount of variation in the response variable explained by the model. An important thing to note is that obtaining an r^2 of 100%, while theoretically possible, is almost impossible with real data: there will always be some unaccounted-for variability that cannot be explained by the regression line. An r^2 with a low value is not always indicative of a poor model choice; some data simply have more variability than others and so naturally yield a lower r^2 value. Conversely, if an r^2 is too high, it may mean the model has been overfitted. This means the model is excellent at predicting only the specific data set used to create it, but it would likely be very poor at predicting other data. It is hard to say when a model has gone from a good fit to overfitted; however, in the past I have been told that percentages reaching 95% and over are considered overfitted.

References:
Frost, J. (2020, July 16). How to interpret R-squared in regression analysis. Retrieved August 14, 2020, from https://statisticsbyjim.com/regression/interpret-r-squared-regression/
Frost, J. (2020, July 16). Multicollinearity in regression analysis: Problems, detection, and solutions. Retrieved August 14, 2020, from https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
Wilson, L. (2009, May 02). Statistical correlation. Retrieved August 14, 2020, from https://explorable.com/statistical-correlation