COMP517 – Data Analysis
Lab 9 – Introduction to One-way ANOVA and Simple Linear Regression
Introduction
In this lab, we introduce the fundamental concepts of hypothesis testing using one-way analysis of variance (ANOVA) and simple linear regression. We first use ANOVA to compare exam scores across several streams of students, and then apply Tukey's Honestly Significant Difference (HSD) post hoc test to identify which pairs of streams differ. In the second part of the lab, we use the same dataset to conduct a simple linear regression analysis. Simple linear regression serves various purposes, but in this lab we concentrate on specific objectives. It enables prediction of the dependent variable from the independent variable, so that outcomes can be estimated or forecast for new or unobserved data points. In addition, it provides a valuable means of understanding the relationship between variables: the slope coefficient (β1) within the regression model quantifies how variations in the independent variable are associated with changes in the dependent variable while holding all other factors constant, offering insights into the cause-and-effect dynamics between the variables of interest.
A preliminary version of this lab is made available under the name "Lab 09_Preload.ipynb". This notebook comes with explanatory comments and some existing code designed to guide you through the tasks. Download the file and save it in your Lab 09 directory.
Lab Objectives
✓ One-Way ANOVA: One-way analysis of variance (ANOVA) is a statistical method for comparing the means of three or more groups. We will extend our analysis to multiple streams and assess whether there is a significant difference in mean exam scores across these streams.
✓ Tukey's Honestly Significant Difference (HSD) Test: Conduct Tukey's HSD post hoc test following an analysis of variance (ANOVA) to assess and compare test scores across various stream categories.
✓ Simple Linear Regression: One of the objectives of simple linear regression is to establish and understand the relationship between two variables: one is the dependent variable (the variable you want to predict or explain), and the other is the independent variable (the variable used to make predictions or provide explanations). This analysis then aims to predict a student's test score based on their performance in the lab test. Following that, we will provide a concise discussion on the findings of the model.
Data Source
Our dataset comprises exam scores from students in different streams of a given course. By applying these hypothesis testing techniques, we aim to uncover potential disparities in academic performance among these streams. Download the ‘student_streamscores.csv’ file and save it in your Lab 09 directory.
Assumptions
For this lab, we assume that our data follows a normal distribution, and we also assume that the key assumptions for linear regression validity are met. We will cover these assumptions next week.
Grading of the Lab
To receive credit for this lab, you must complete all assigned tasks and answer all questions. Submit your responses to the questions as a PDF file via Canvas.
1. One-way Analysis of Variance (ANOVA)
ANOVA is a statistical technique used to analyze and compare the means of two or more groups or populations to determine if there are statistically significant differences among them. It is a parametric test that extends the principles of the t-test to situations involving more than two groups.
- Null Hypothesis (H0): The null hypothesis in ANOVA states that there are no significant differences among the group means. In other words, all group means are equal.
- Alternative Hypothesis (Ha): The alternative hypothesis suggests that at least one group mean is different from the others. It implies that there is a significant difference among the groups.
- Test Statistic (F-statistic): The test statistic in ANOVA is the F-statistic, which is the ratio of two variances: the variance between groups and the variance within groups. It measures how much the group means differ relative to the variability within each group.
- P-Value: The p-value associated with the F-statistic is used to make a decision about the null hypothesis. If the p-value is less than a chosen significance level (often denoted as alpha, typically 0.05), the null hypothesis is rejected in favour of the alternative hypothesis.
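To make the F-statistic concrete, here is a small sketch (with made-up scores for three illustrative groups) that computes the between-group and within-group variances by hand and checks that their ratio matches SciPy's stats.f_oneway:

```python
import numpy as np
from scipy import stats

# Three small illustrative groups of test scores (made-up numbers)
groups = [np.array([70.0, 72, 68, 75]),
          np.array([65.0, 60, 63, 66]),
          np.array([80.0, 78, 82, 79])]

k = len(groups)                       # number of groups
N = sum(len(g) for g in groups)       # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)     # variance between groups
ms_within = ss_within / (N - k)       # variance within groups
F = ms_between / ms_within            # the F-statistic is their ratio

# SciPy's one-way ANOVA computes the same value
f_scipy, p_value = stats.f_oneway(*groups)
print(F, f_scipy, p_value)
```

The hand-computed `F` and SciPy's `f_scipy` agree, which is exactly the "ratio of two variances" described above.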
Example: Let's conduct a one-way ANOVA analysis where we have multiple streams (more than two), and we want to determine if there is a statistically significant difference in the mean test scores among these streams.
Research Question: Is there a statistically significant difference in the mean test scores among different streams?
Step 1.1: Formulate Hypotheses
Question 1.1: Formulate the hypotheses (H0 and Ha) for the above research question.
Step 1.2: Collect Data: Read data from the CSV file, load it into a DataFrame, extract the unique values from the "Stream" column, and print these unique stream values to the console. It helps you understand the different categories or groups within the "Stream" column of your dataset.
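This step can be sketched as follows. For a self-contained example, a few illustrative rows stand in for the real 'student_streamscores.csv'; the stream labels st50–st53 and the column names are assumed from the rest of the lab:

```python
import io
import pandas as pd

# Stand-in for 'student_streamscores.csv' (illustrative rows only;
# the real file contains many more students per stream)
csv_text = """Stream,Test_Score,Lab_Grade
st50,72,8
st51,65,6
st52,80,9
st53,70,7
"""
data = pd.read_csv(io.StringIO(csv_text))
# In the lab itself you would load the downloaded file instead:
# data = pd.read_csv('student_streamscores.csv')

# Extract and print the unique stream values (the group labels)
stream_values = data['Stream'].unique()
print(stream_values)
```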
Step 1.3: Create and visualise histogram plots with Kernel Density Estimation (KDE): This step is essential in the context of a one-way analysis of variance (ANOVA) because it serves multiple crucial purposes. Firstly, it allows for the examination of the distribution of test scores within each group, helping to assess whether the assumption of normality is met. Additionally, it aids in identifying potential outliers and understanding variations in score distributions among different groups, all of which are fundamental considerations in ANOVA. Moreover, these visualizations facilitate data exploration and can reveal any violations of ANOVA assumptions, guiding decisions on appropriate statistical tests or data transformations. Lastly, they provide an intuitive means of communicating the findings to both technical and non-technical audiences, enhancing the interpretability of the results.
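A minimal sketch of such a plot for one stream, using matplotlib together with SciPy's gaussian_kde (the preload notebook may use a plotting library such as seaborn instead; the scores here are randomly generated stand-ins):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in scores for one stream; in the lab, use
# data.loc[data['Stream'] == s, 'Test_Score'] for each stream s
scores = rng.normal(loc=70, scale=8, size=50)

fig, ax = plt.subplots()
# Density-scaled histogram with a KDE curve overlaid
ax.hist(scores, bins=10, density=True, alpha=0.6)
kde = stats.gaussian_kde(scores)
xs = np.linspace(scores.min(), scores.max(), 200)
ax.plot(xs, kde(xs))
ax.set_xlabel('Test_Score')
ax.set_title('Distribution of test scores (one stream)')
fig.savefig('hist_kde.png')
```

Repeating this per stream (or using subplots) lets you eyeball normality and spot outliers before running the ANOVA.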
Question 1.2: Analyse your findings and provide a brief summary of the distributional characteristics of both the overall test scores and the test scores for each individual stream.
Step 1.4: Perform a one-way ANOVA test: use stats.f_oneway to analyze whether there are significant differences in the mean test scores among the different streams. To do so, we take the steps below:
- Grouping data: the code creates a list called grouped_data that contains the test scores for each stream. It uses a list comprehension to iterate through each unique stream value in stream_values and filters the data DataFrame to select only the test scores corresponding to that stream.
- Performing ANOVA: the stats.f_oneway function from SciPy is used to perform the one-way ANOVA. It takes the individual test score arrays from grouped_data as arguments and calculates the F-statistic and the associated p-value.
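The two sub-steps above can be sketched as follows, using randomly generated stand-in data (in the lab, the DataFrame comes from the CSV file):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
# Illustrative DataFrame standing in for the lab's dataset:
# four streams, one with a shifted mean
data = pd.DataFrame({
    'Stream': np.repeat(['st50', 'st51', 'st52', 'st53'], 30),
    'Test_Score': np.concatenate([
        rng.normal(70, 8, 30),   # st50
        rng.normal(71, 8, 30),   # st51
        rng.normal(78, 8, 30),   # st52 (higher mean)
        rng.normal(70, 8, 30),   # st53
    ]),
})

stream_values = data['Stream'].unique()
# Grouping: one array of test scores per stream
grouped_data = [data.loc[data['Stream'] == s, 'Test_Score'] for s in stream_values]

# Performing ANOVA: unpack the per-stream arrays into f_oneway
f_statistic, p_value = stats.f_oneway(*grouped_data)
print(f"F = {f_statistic:.2f}, p = {p_value:.4f}")
```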
Step 1.5: Report the results and make a decision based on the significance level (alpha = 0.05): The code in this step prints out the results of the one-way ANOVA. It displays the F-statistic, which measures the ratio of variation between group means to variation within groups, and the p-value, which indicates the significance of the ANOVA test. We set the significance level (alpha) to 0.05. Compare the p-value to the chosen significance level (α).
Question 1.3: Fill in the table below with the results of your code.

Table 1.1: Results of one-way ANOVA

Dataset statistics:
  Stream 50 – Sample Size: ______
  Stream 51 – Sample Size: ______
  Stream 52 – Sample Size: ______
  Stream 53 – Sample Size: ______

One-way ANOVA statistical outputs:
  F-statistic: ______
  p-Value: ______
Step 1.6: Draw a conclusion.
Question 1.4: Based on your results, draw a conclusion about whether the exam scores of different streams are significantly different or not.
Step 1.7: Use the critical F-value to interpret the results and draw a conclusion.
In this step, we will be performing a statistical hypothesis test using an F-test to determine if there is a significant difference in test scores among different streams. The code provided in this section is part of a hypothesis test commonly used in ANOVA to compare the means of multiple groups (streams in this case). The F-test helps determine if the variance between group means is statistically significant or if it can be explained by random chance.
1. Calculate the degrees of freedom for the numerator/between (dfn) and denominator/within (dfd) in the F-distribution where k is the number of groups or streams (in this case, 4 streams) and N is the total number of observations, which is the sum of the lengths of scores from each stream.
- df_between is the degrees of freedom between groups. It represents the degrees of freedom for the numerator and is calculated as k − 1, where k is the number of unique streams.
  o This value is associated with the variability between group means. It quantifies the number of group means that are free to vary independently when comparing them in the ANOVA.
- df_within is the degrees of freedom within groups. It represents the degrees of freedom for the denominator and is calculated as N − k, where k is the number of unique streams and N is the total number of dataset rows.
  o This value is associated with the variability within each group. It quantifies the number of data points within the sample that are free to vary independently when calculating the within-group variability.
- "Free to vary independently" means that a parameter or value is not constrained by other factors in a way that limits its range of possible values. It can vary or change on its own without being overly influenced by the other quantities in the calculation; this independence is a fundamental concept in statistical analysis and hypothesis testing.
2. Calculate the critical F-value for a given significance level (alpha) and degree of freedom:
• stats.f.ppf(1 - alpha, df_between, df_within) uses the ppf (percent point function) method from scipy.stats to calculate the critical F-value. The critical F-value represents the value from the F-distribution that corresponds to a specific probability (1 - alpha) and the degrees of freedom for the numerator (df_between) and denominator (df_within).
3. Print the critical F-value rounded to two decimal places.
4. Test the hypothesis: f_statistic represents the computed F-statistic from your data, which is used to test the null hypothesis.
- If f_statistic is greater than the critical F-value, the test statistic falls in the tail of the F-distribution, and you reject the null hypothesis.
- If f_statistic is less than or equal to the critical F-value, you fail to reject the null hypothesis.
The null hypothesis in this context likely states that there is no significant difference in test scores among the different streams. If you reject the null hypothesis, it suggests that there is a significant difference in test scores among at least some of the streams.
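Steps 1–4 above can be sketched as follows. The total sample size N and the F-statistic used in the comparison are hypothetical placeholders; in the lab, use the values from your own data and from Step 1.4:

```python
from scipy import stats

alpha = 0.05
k = 4          # number of streams (groups)
N = 200        # hypothetical total number of observations

# Step 1: degrees of freedom for numerator (between) and denominator (within)
df_between = k - 1
df_within = N - k

# Step 2: critical F-value for this alpha and these degrees of freedom
critical_f = stats.f.ppf(1 - alpha, df_between, df_within)

# Step 3: print it rounded to two decimal places
print(f"Critical F-value: {critical_f:.2f}")

# Step 4: compare against the computed F-statistic (placeholder value here)
f_statistic = 5.31
if f_statistic > critical_f:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```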
Question 1.5: What was the calculated critical F-value? Based on your results and the general rule above, draw a conclusion about whether the exam scores of these streams are significantly different.
Step 1.8: Let’s identify which specific stream pairs had significant differences in means after conducting an ANOVA. To do so, we can use Tukey's Honestly Significant Difference (HSD) post hoc test to determine where those differences lie when you have three or more groups. This is particularly useful when ANOVA indicates that there are significant overall differences among groups but doesn't specify which groups are different from each other. Below explains the steps we took to perform the test:
• First, we create a MultiComparison object called multicomp. This object is used to perform multiple comparison tests, such as Tukey's HSD, to assess the differences between groups after an ANOVA.
  o data['Test_Score'] represents the dependent variable (test scores), and
  o data['Stream'] represents the independent variable (the stream or group labels) that the Tukey's HSD test will compare.
• The multicomp.tukeyhsd() method is called on the multicomp object to compute the Tukey's HSD test results. This test compares the means of all pairs of groups (streams) and determines if there are statistically significant differences between them.
• Displaying the results: the output typically includes a table that provides information about the mean differences between groups, confidence intervals, and whether those differences are statistically significant.
  o "meandiff" indicates the mean difference between the two groups being compared.
  o "p-adj" represents the adjusted p-value, which is corrected for multiple comparisons.
  o "lower" and "upper" refer to the lower and upper bounds of the confidence interval for the mean difference.
  o "reject" is a binary indicator that tells us whether to reject the null hypothesis of no significant difference between the groups.
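A minimal sketch of this workflow with StatsModels, using randomly generated stand-in data (in the lab, use the DataFrame loaded from the CSV file):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import MultiComparison

rng = np.random.default_rng(2)
# Illustrative data: four streams, one with a shifted mean
data = pd.DataFrame({
    'Stream': np.repeat(['st50', 'st51', 'st52', 'st53'], 30),
    'Test_Score': np.concatenate([
        rng.normal(70, 5, 30),
        rng.normal(70, 5, 30),
        rng.normal(78, 5, 30),   # st52 differs from the others
        rng.normal(70, 5, 30),
    ]),
})

# Dependent variable first, then the group labels
multicomp = MultiComparison(data['Test_Score'], data['Stream'])
tukey_result = multicomp.tukeyhsd()

# The summary table has columns: group1, group2, meandiff, p-adj,
# lower, upper, reject
print(tukey_result.summary())
```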
Example on how to interpret the results: Comparing "st50" and "st51" groups, the mean difference is 0.86, but the p-value (p-adj) is 0.9, indicating that there's no statistically significant difference between these two groups (reject = False).
Question 1.6: Provide the results generated by Tukey's Honestly Significant Difference (HSD) test and interpret them, following the example above.
2. Simple Linear Regression
Simple linear regression is a statistical technique used to model and quantify the linear relationship between two variables: one is considered the dependent variable, which is the outcome of interest, and the other is the independent variable, which is used to predict or explain changes in the dependent variable. It seeks to fit a straight line to the data that best represents the association between the two variables, typically by estimating two coefficients: the slope (indicating the direction and steepness of the relationship) and the intercept (representing the value of the dependent variable when the independent variable is zero). Simple linear regression is widely used for prediction, understanding cause-and-effect relationships, and assessing the strength and significance of the linear association between the variables.
Research Question:
Is there a statistically significant relationship between students' performance in their lab tasks (Lab_Grade) and their final test scores (Test_Score)? Specifically, do students who excel in their lab tasks tend to achieve higher scores on their final tests?
Objective:
The objective is to investigate and quantify the relationship between students' performance in their lab tasks (independent variable) and their test scores (dependent variable). By performing a simple linear regression analysis, the goal is to determine whether there exists a statistically significant linear association between these two variables. Additionally, the objective is to assess the direction and strength of this relationship, allowing us to understand if students who perform well in their labs tend to score higher on their final tests or vice versa.
The objective of performing a simple linear regression analysis, in the context of above research question, is to:
a) Model the Relationship: Build a statistical model that represents the linear relationship between the independent variable (Lab_Grade) and the dependent variable (Test_Score).
b) Quantify the Relationship: Determine the strength and direction of the relationship by estimating the regression coefficients (slope and intercept).
c) Assess Significance: Evaluate whether the observed relationship is statistically significant, which helps us determine if the relationship is unlikely to have occurred by random chance.
d) Predict Outcomes: Utilize the regression model to predict or explain how changes in the Lab_Grade are associated with changes in the Test_Score.
e) Provide Insights: Interpret the findings to answer the research question and draw conclusions about the extent to which students' lab performance relates to their final test scores.
Step 2.1: Define the independent variable X (grade score) and the dependent variable y (test score) based on the data in the 'Lab_Grade' and 'Test_Score' columns of the dataset, respectively.
Step 2.2: Add a constant term (intercept) to the independent variable X. This is necessary for estimating the intercept of the linear regression model.
Step 2.3: Create a linear regression model using the Ordinary Least Squares (OLS) method from StatsModels. It fits the model to the data, estimating the coefficients (slope and intercept) that best describe the linear relationship between the independent and dependent variables.
Step 2.4: Print a summary of the linear regression model. The summary includes several important statistics. It helps interpret the relationship between the lab grade and the test score, including the strength and significance of the association, along with other diagnostic statistics for assessing the goodness of fit of the model. We will go through the details of each output in a future lab. See Appendix A for more detail.
❖ Interpretation of a few metrics provided in the above model summary:
a) F-statistic: The F-statistic (522.6) is a measure of the overall significance of the model. A high F-statistic suggests that at least one independent variable significantly contributes to explaining the dependent variable. In this case, the F-statistic is high, indicating the model's overall significance.
b) Prob (F-statistic): The p-value associated with the F-statistic (1.92e-57) is extremely low, effectively zero. This indicates strong evidence against the null hypothesis that all coefficients are zero. In other words, the model as a whole is highly significant.
c) Coefficients: The coefficients provide information about the intercept (const) and the coefficient of the independent variable (Lab_Grade).
   i. The intercept (const) is estimated to be 11.3232, suggesting that when Lab_Grade is zero, the predicted Test_Score is approximately 11.32.
d) P-values for Coefficients: The p-values associated with the coefficients are both very low (close to zero). This indicates that both the intercept and the Lab_Grade coefficient are statistically significant. In this context, Lab_Grade significantly contributes to predicting Test_Score.
Question 2.1: Provide the coefficient associated with Lab_Grade and provide an explanation of its meaning and significance.
Appendix A: Simple Linear Regression Model Summary
➢ Coefficients: Provides information about the estimated coefficients (slope and intercept).
➢ R-squared: The coefficient of determination, which measures the proportion of variance in the dependent variable explained by the independent variable.
➢ p-values: Indicate the significance of the coefficients. Low p-values suggest that the coefficients are statistically significant.
➢ Confidence intervals: Provide a range within which the true population coefficients are likely to fall.
➢ Standard errors: Estimates of the variability of the coefficients.
➢ F-statistic and p-value: Assess the overall significance of the model.
➢ Residuals: Information about the residuals (differences between observed and predicted values).
➢ Durbin-Watson statistic: Assesses the presence of autocorrelation in the residuals.