MTH522 Diabetes Report
I am excited to submit our project report, which examines 2018 data from the Centers for Disease Control and Prevention (CDC). Our primary focus has been investigating the prevalence of diabetes, obesity, and inactivity across counties in the United States. Through careful investigation and analysis, we uncovered valuable insights into these critical health issues, and this report is the result of our work and dedication to understanding and addressing them.
Month: October 2023
October 6
Hi,
Today I worked on setting up LaTeX and installing its dependencies on my system. I went through the punch-line report and began writing my own report, formatting the Issues and Findings sections. I also tried Huber regression, and we discussed whether to choose Huber regression or multiple linear regression; a comparison sketch follows below.
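For reference, here is a minimal sketch of the kind of comparison we discussed, fitting both models on the same predictors. The file name and column names (%INACTIVE, %OBESE, %DIABETIC) are assumptions rather than the actual dataset layout.

```python
# Minimal sketch: Huber regression vs. ordinary multiple linear regression.
# File and column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.metrics import r2_score

data = pd.read_excel("cdc_2018_merged.xlsx")     # hypothetical merged file
X = data[["%INACTIVE", "%OBESE"]]
y = data["%DIABETIC"]

ols = LinearRegression().fit(X, y)               # ordinary least squares
huber = HuberRegressor().fit(X, y)               # robust to outliers

print("OLS R^2:  ", r2_score(y, ols.predict(X)))
print("Huber R^2:", r2_score(y, huber.predict(X)))
```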
October 5
Hi, today I worked on multiple linear regression and polynomial regression. I used the percentages of inactivity and obesity as predictors, with the percentage of diabetes as the dependent variable, and adjusted the polynomial degree to find the best R-squared result (see the degree-search sketch below). I then used the Breusch-Pagan test to check the hypothesis that the residuals of my model have constant variance across all levels of the independent variables.
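A hedged sketch of the degree search described above, assuming the same hypothetical file and column names as before:

```python
# Fit polynomial models of increasing degree and compare their R-squared.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

data = pd.read_excel("cdc_2018_merged.xlsx")     # hypothetical merged file
X = data[["%INACTIVE", "%OBESE"]]
y = data["%DIABETIC"]

for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(f"degree {degree}: R^2 = {model.score(X, y):.4f}")
```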
I created an auxiliary regression model with the squared residuals as the dependent variable and the independent variables from my prior model as predictors; the purpose was to see whether the independent variables could predict the squared residuals. After calculating the test statistic, I obtained a critical value from the chi-squared distribution with one degree of freedom (df = 1). Because the test statistic was larger than the critical value, I concluded that heteroskedasticity was present.
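The following is a minimal sketch of that auxiliary-regression version of the Breusch-Pagan test, under the same assumed file and column names; statsmodels also provides het_breuschpagan, which automates the same steps.

```python
# Breusch-Pagan via an auxiliary regression of squared residuals on the predictors.
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

data = pd.read_excel("cdc_2018_merged.xlsx")      # hypothetical merged file
X = sm.add_constant(data[["%INACTIVE", "%OBESE"]])
y = data["%DIABETIC"]

model = sm.OLS(y, X).fit()                        # original regression
aux = sm.OLS(model.resid ** 2, X).fit()           # squared residuals on the predictors

lm_stat = len(y) * aux.rsquared                   # LM statistic: n * R^2 of the auxiliary fit
dof = X.shape[1] - 1                              # one per predictor (the entry above used df = 1)
critical = chi2.ppf(0.95, dof)
print("LM statistic:", lm_stat, "critical value:", critical)
print("Evidence of heteroskedasticity" if lm_stat > critical else "No evidence of heteroskedasticity")
```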
October 3
Last week I used the R-squared metric for cross-validation, which estimates the fraction of variation in the dependent variable that is predictable from the predictors. Today I evaluated my models with several scoring measures and read about their differences. Notably, when no scoring metric is passed, cross_val_score falls back to the estimator's default score (R-squared for regressors); requesting the negative Mean Squared Error (MSE) instead gives a per-fold metric that is very sensitive to outliers, since errors are squared. I also learned about the Mean Absolute Error (MAE), which is preferable when all errors should be weighted equally.
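As an illustration, here is a small sketch comparing these scoring options with cross_val_score, again assuming the hypothetical merged file and column names used in the earlier sketches.

```python
# Compare cross-validation scores under different scoring metrics.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

data = pd.read_excel("cdc_2018_merged.xlsx")      # hypothetical merged file
X = data[["%INACTIVE", "%OBESE"]]
y = data["%DIABETIC"]

model = LinearRegression()
for scoring in ["r2", "neg_mean_squared_error", "neg_mean_absolute_error"]:
    scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
    print(f"{scoring:>26}: mean = {scores.mean():.4f}")
```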
September 29
This blog post is mainly about heteroskedasticity in statistical analysis. Using FIPS as the common key, I analyzed the data across the three variables of diabetes, obesity, and inactivity. From my statistical analysis of the provided data frames, I came to the conclusion that there is no meaningful evidence of heteroskedasticity.
To do this, I built a linear regression model, made predictions, and displayed those predictions as a scatter plot. This observation relates to my last blog, where I plotted the residuals (the errors) of the regression model against the predicted (fitted) values.
For this task I used the ‘statsmodels’ package, which offers classes and functions for estimating and evaluating different statistical models; it is frequently used for statistical modeling, regression modeling, and statistical analysis. The “least squares” in Ordinary Least Squares (OLS) refers to the fact that the model minimizes the sum of the squared differences (residuals) between the observed dependent variable (y) and the values predicted by the linear model. A sketch of this workflow appears below.
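Here is a minimal sketch of that statsmodels OLS workflow, ending with the residuals-versus-fitted scatter plot mentioned above; the file and column names are assumptions.

```python
# Fit an OLS model and plot residuals against fitted values.
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

data = pd.read_excel("cdc_2018_merged.xlsx")      # hypothetical merged file
X = sm.add_constant(data[["%INACTIVE", "%OBESE"]])
y = data["%DIABETIC"]

ols = sm.OLS(y, X).fit()                          # ordinary least squares fit
fitted = ols.fittedvalues
residuals = ols.resid

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```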
September 27
In today’s blog, I improved my code and fixed the problems from earlier iterations of the analysis. I had previously created a linear regression model; in the current investigation I assessed it more thoroughly under three different scenarios, computing the relevant statistical metrics for each.
Later on, I studied the code for the logistic regression technique. It is a statistical method for modelling the relationship between a binary dependent variable and one or more independent variables. This kind of regression works well when the dependent variable is categorical with two possible outcomes, typically coded as 0 and 1 (or “yes” and “no,” “success” and “failure,” etc.).
For the implementation of logistic regression in my analysis, I considered %Diabetic as the dependent variable and %Obesity and %Inactivity as the independent variables, and attempted to forecast values from one of the datasets. The logistic regression formula is logit(p) = ln(p / (1 − p)) = β0 + β1X1 + β2X2 + … + βkXk, where p is the predicted probability, the βs are the regression coefficients estimated from the data, and the Xs are the predictors. Using the code, I was able to obtain the two regression coefficients. A hedged sketch of such a fit follows below.
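Below is a hedged sketch of such a fit. Because logistic regression requires a binary outcome, the sketch thresholds %Diabetic at its median to create a 0/1 target; that binarization, the file name, and the column names are assumptions rather than the exact approach used in my code.

```python
# Logistic regression sketch: high-diabetes (1) vs. low-diabetes (0) counties.
import pandas as pd
import statsmodels.api as sm

data = pd.read_excel("cdc_2018_merged.xlsx")      # hypothetical merged file
X = sm.add_constant(data[["%OBESE", "%INACTIVE"]])
# Assumed binarization: 1 if the county is above the median %DIABETIC, else 0.
y = (data["%DIABETIC"] > data["%DIABETIC"].median()).astype(int)

logit = sm.Logit(y, X).fit()
print(logit.params)    # beta_0 (intercept), beta_1 (%OBESE), beta_2 (%INACTIVE)
```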
September 25
In today’s research, we focused on building a linear regression model for the “% Obesity” data using the “% Diabetes” and “% Inactivity” data as predictors. We also built a linear regression model for the “% Diabetes” data, taking into account the influence of the “% Obesity” and “% Inactivity” data.
Linear regression is a statistical technique for modeling the relationship between a dependent variable and one or more independent variables. The goal is to find the line that best fits the data points; this line can then be used to make predictions about the dependent variable from the values of the independent variables.
The equation for a linear regression model is as follows:
y = mx + b
where:
y is the dependent variable
x is the independent variable
m is the slope of the line
b is the y-intercept of the line
The slope of the line tells us how much the dependent variable changes for a one-unit change in the independent variable. The y-intercept gives the value of the dependent variable when the independent variable is zero. A small single-predictor sketch of this fit is shown below.
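As an illustration of the y = mx + b form, here is a single-predictor sketch regressing %Diabetic on %Inactivity; the file and column names are assumptions.

```python
# One-predictor least-squares fit: %DIABETIC = m * %INACTIVE + b.
import pandas as pd
import numpy as np

data = pd.read_excel("cdc_2018_merged.xlsx")      # hypothetical merged file
x = data["%INACTIVE"]
y = data["%DIABETIC"]

m, b = np.polyfit(x, y, deg=1)                    # least-squares slope and intercept
print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
print(f"predicted %DIABETIC at 20% inactivity: {m * 20 + b:.2f}")
```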
September 22
I conducted a correlation study on the three datasets related to diabetes, obesity, and inactivity. The analysis showed a significant relationship among them, with the FIPS code serving as the common key. Realizing the need for a single source, I combined the three datasets into one Excel spreadsheet so I could conduct a more thorough analysis.
I wrote code to merge the three datasets, which revealed 356 shared data points (a sketch of the merge appears below). I then cleaned up the Excel file, which involved handling duplicate columns containing county, state, and year information; I removed these columns to make the data more readable. I also improved the dataset’s readability by renaming certain columns and adjusting column widths to aid data visualization.
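A hedged sketch of that merge step, with assumed workbook, sheet, and column names:

```python
# Merge the diabetes, obesity, and inactivity sheets on the shared FIPS code.
import pandas as pd

diabetes = pd.read_excel("cdc_2018.xlsx", sheet_name="Diabetes")
obesity = pd.read_excel("cdc_2018.xlsx", sheet_name="Obesity")
inactivity = pd.read_excel("cdc_2018.xlsx", sheet_name="Inactivity")

merged = (diabetes.merge(obesity, on="FIPS")       # inner joins keep only counties
                  .merge(inactivity, on="FIPS"))   # present in all three sheets

# Drop duplicate county/state/year columns created by the joins (assumed names).
merged = merged.drop(columns=["COUNTY_y", "STATE_y", "YEAR_y"], errors="ignore")

print(len(merged), "shared counties")              # the entry above reports 356
merged.to_excel("cdc_2018_merged.xlsx", index=False)
```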
Additionally, I am eager to validate the data and ensure that the information is accurate. To do this, I am examining several tests, such as the t-test and the Breusch-Pagan test. Although I have preliminary t-test code, I have not yet run it on the dataset.