Week 6,Monday

Today I learned about geospatial plotting and the library used for it, Altair. Matplotlib can also be used for geographic plots; I have worked with both while plotting a map of the USA.

I also learned a little more about clustering, specifically the “Elbow Method.” To determine how many clusters we need, I plotted the WCSS curve, which shows how close the points in each group are to their centre. It helped me find the optimal number of clusters: the point where adding another cluster no longer significantly improves the fit is called the “elbow.”

WCSS (Within-Cluster Sum of Squares) is essentially a measure of how tight our clusters are: it adds up the squared distances between each point and its cluster’s centre. A lower WCSS means more compact clusters. WCSS always decreases as the number of clusters increases, which is why we look for the elbow, rather than the minimum, to pick the right number of clusters.
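The elbow procedure above can be sketched with scikit-learn, whose KMeans exposes the WCSS as `inertia_`. The data here is a synthetic stand-in, not the course dataset:

```python
# Sketch of the elbow method: compute WCSS (KMeans.inertia_) for a range
# of cluster counts. Synthetic blobs stand in for the real data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# WCSS always decreases as k grows; the "elbow" is where the drop levels off.
wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

for k, w in zip(range(1, 9), wcss):
    print(k, round(w, 1))
```

Plotting `wcss` against `k` (e.g. with Matplotlib) gives the elbow curve; with four true blobs, the bend appears around k = 4.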

Week 5, Friday

In my previous post I mentioned that the dataset has 12 columns. Today I merged the CSVs on the race column and plotted the result. The graph clearly indicated that far more Black people were shot than white people. But I cannot conclude from this graph alone that a Black person’s chance of being shot is higher, since other factors, such as the population size of each group, could also explain the raw counts.
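A minimal pandas sketch of this kind of merge-and-compare step. The frames, column names, and population figures below are illustrative stand-ins, not the actual CSVs; the point is that counts only become comparable once normalised per capita:

```python
# Hypothetical sketch: aggregate shootings by race, merge with (illustrative)
# population figures, and compute a per-capita rate. Not the real data.
import pandas as pd

shootings = pd.DataFrame({"race": ["B", "W", "B", "H"], "victim": [1, 1, 1, 1]})
population = pd.DataFrame({"race": ["B", "W", "H"],
                           "pop_millions": [41, 197, 62]})  # illustrative only

counts = shootings.groupby("race", as_index=False)["victim"].sum()
merged = counts.merge(population, on="race")

# Raw counts alone can mislead; a per-capita rate controls for group size.
merged["per_million"] = merged["victim"] / merged["pop_millions"]
print(merged)
```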

I will research this further to see whether police shootings are related to race or driven by other factors we do not yet know about.


Week 5, Wednesday

I’d like to highlight the police-shootings dataset published by The Washington Post. It clearly shows that every year, police in the United States shoot and kill more than 1,000 people.

Going through the dataset, I found that the data runs from 2015 to the present and is updated weekly. The dataset has 12 columns: date, name, age, gender, armed, race, city, state, escape, body camera, sign of mental illness, and police department involved.

While examining the dataset for the key factors behind police shootings, my initial impression is that body-camera use and ethnicity are the major factors, more so than the others. This is only a raw reading of the data; I will dig into all the factors to build a better analysis.

THE DAY!!

I am excited to submit our project report, which digs into 2018 data from the Centers for Disease Control and Prevention. Our primary focus has been the prevalence of diabetes, obesity, and inactivity across counties in the United States. Through thorough investigation and analysis, we uncovered valuable insights into these critical health issues. This report is the result of our work and dedication to understanding and addressing them. MTH522 Diabetes Report

October 6

Hi,

Today I worked on LaTeX and installed its dependencies on my system. I went through the punch-line report format and started writing my report, formatting the Issues and Findings sections. I also tried Huber regression, and we discussed whether to choose Huber regression or multiple linear regression.
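The Huber-versus-OLS choice can be sketched as follows. The data below is synthetic (not the CDC dataset), with a few deliberate outliers so the difference is visible: Huber loss down-weights large residuals, so its slope stays closer to the true value:

```python
# Sketch comparing HuberRegressor with ordinary LinearRegression on
# synthetic data containing outliers. True slope is 2.0.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 100)

# Corrupt the five points with the largest x: these pull the OLS slope up.
idx = np.argsort(X.ravel())[-5:]
y[idx] += 50

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

# Huber's robust loss resists the outliers; OLS does not.
print(ols.coef_[0], huber.coef_[0])
```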

October 5

Hi, today I worked on multiple linear regression and polynomial regression. I used the percentages of inactivity and obesity as predictors, with the percentage of diabetes as the dependent variable. I tried both models, adjusting the polynomial degree to find the best R-squared result. I then used the Breusch-Pagan test to test the hypothesis that the residual variance in my model is constant across all levels of the independent variables.
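The degree search can be sketched like this; the two predictors below are synthetic stand-ins for %inactivity and %obesity, generated from a quadratic relationship so that degree 2 clearly wins:

```python
# Sketch: fit polynomial regressions of increasing degree and compare R^2.
# Synthetic stand-ins for the real predictors and response.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(200, 2))               # e.g. %inactivity, %obesity
y = 1.5 * X[:, 0] + 0.3 * X[:, 1] ** 2 + rng.normal(0, 0.2, 200)

r2 = {}
for degree in (1, 2, 3):
    Xp = PolynomialFeatures(degree).fit_transform(X)
    model = LinearRegression().fit(Xp, y)
    r2[degree] = r2_score(y, model.predict(Xp))
    print(degree, round(r2[degree], 3))
```

In practice the in-sample R² keeps creeping up with degree, so cross-validated scores are the safer basis for choosing.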

I created a new regression model with the squared residuals as the dependent variable and the independent variables from my prior model as predictors. The purpose was to see whether the independent variables could predict the squared residuals. After calculating the test statistic, I obtained a critical value from the chi-squared distribution with one degree of freedom (df = 1). Because the test statistic was larger than the critical value, I concluded that heteroskedasticity was present.

October 3

I used the R-squared metric for cross-validation last week, which estimates the fraction of variation in the dependent variable that the predictors explain. Today I tried evaluating my models with several scoring metrics and read about their differences. Notably, when no scoring metric is passed, the cross_val_score function falls back to the estimator’s default score (R-squared for regressors); the Mean Squared Error (MSE) must be requested explicitly as the negated ‘neg_mean_squared_error’, and it is a metric that is extremely sensitive to outliers. I also learned about the Mean Absolute Error (MAE), which should be used when all errors are to be given equal weight.
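A minimal sketch of the comparison on synthetic data; the error-based scorers are negated so that “higher is better” holds uniformly across metrics:

```python
# Comparing scoring metrics in cross_val_score. Default for a regressor
# is its own .score (R^2); MSE and MAE use "neg_" scoring names.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 4.0 * X.ravel() + rng.normal(0, 1, 100)

model = LinearRegression()
r2 = cross_val_score(model, X, y, cv=5)  # default scoring: R^2
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
neg_mae = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")

# Negate the error scores to recover the usual MSE / MAE values.
print(r2.mean(), -neg_mse.mean(), -neg_mae.mean())
```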

September 29

This post is mainly about heteroskedasticity in statistical analysis. I joined the data on the common key, FIPS, to analyze the three variables: diabetes, obesity, and inactivity. In my statistical analysis of the merged data frames, I came to the conclusion that there is no meaningful evidence of heteroskedasticity.


To do this, I built a linear regression model, made predictions, and displayed those predictions as a scatter plot. This connects to my last post, where I plotted the residuals (the model’s errors) against the predicted (fitted) values.


I used the ‘statsmodels’ package, which offers classes and functions for estimating and evaluating different statistical models; it is widely used for regression modelling and statistical analysis. The “least squares” in Ordinary Least Squares (OLS) refers to the fact that the model minimizes the sum of the squared differences (residuals) between the observed dependent variable (y) and the values predicted by the linear model.

September 27

In today’s post, I improved my code and fixed all the problems from earlier iterations of the analysis. I had previously built a linear regression model; in the current investigation I assessed it thoroughly under three different scenarios, computing the important statistical metrics for each.


Later on, I studied the code for the logistic regression technique. It is a statistical method for modelling the relationship between a binary dependent variable and one or more independent variables. This kind of regression analysis works well when the dependent variable is categorical with two possible outcomes, frequently coded as 0 and 1 (or “yes” and “no,” “success” and “failure,” etc.).


For the implementation of logistic regression in my analysis, I used %Diabetic as the dependent variable and %Obesity and %Inactivity as the independent variables, attempting to predict values from one of the datasets. The formula for logistic regression is: logit(p) = ln(p / (1 − p)) = β0 + β1X1 + β2X2 + … + βkXk, where the β’s are the regression coefficients and the X’s are the predictors. Using the code, I was able to obtain the regression coefficients.
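A hedged sketch of this setup with scikit-learn. Because logistic regression requires a binary target, the continuous %Diabetic stand-in is binarized at its median here; that threshold, and the synthetic data, are my assumptions, not the original analysis:

```python
# Logistic regression sketch: binary "high diabetes" indicator regressed on
# two predictors (stand-ins for %Obesity and %Inactivity). The median
# threshold used to binarize the target is an illustrative assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.uniform(10, 40, size=(300, 2))
latent = 0.2 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 1, 300)
y = (latent > np.median(latent)).astype(int)   # 1 = "high diabetes"

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.intercept_, clf.coef_)  # beta_0 and [beta_1, beta_2] in logit(p)
```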

September 25

In today’s research, we focused on creating a linear regression model to look at the data on “% Obesity” while using information on “% Diabetes” and “% Inactivity.” Additionally, we created a linear regression model to investigate the “% Diabetes” data while taking into account the impact of the “% Obesity” and “% Inactivity” data.


A statistical technique for simulating the relationship between a dependent variable and one or more independent variables is linear regression. Finding the line that fits the data points the best is the goal of linear regression. Predictions regarding the dependent variable based on the values of the independent variables can then be made using this line.


The equation for a linear regression model is as follows:

y = mx + b

where;

y is the dependent variable

x is the independent variable

m is the slope of the line

b is the y-intercept of the line


The slope of the line tells us how much the dependent variable changes for each one-unit change in the independent variable. The y-intercept tells us the value of the dependent variable when the independent variable equals zero.
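The equation and its two parameters can be illustrated with a short fit on synthetic data (true slope 3, true intercept 2), recovering m and b with NumPy’s degree-1 polynomial fit:

```python
# Fit y = m*x + b with numpy.polyfit and recover slope m and intercept b.
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.3, 50)   # true m = 3, true b = 2

m, b = np.polyfit(x, y, 1)   # degree-1 fit returns [slope, intercept]
print(round(m, 2), round(b, 2))
```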