Last week I talked about how I plotted the binary RDEES values against the TotalUse values, and I hypothesized that there was no correlation, even with a logistic regression, because the points seemed to sit mostly on top of each other. I was wrong about this. Running the glm() function in R on Professor Fugate’s data showed us that there is very likely a correlation between how much somebody uses electronics per day and their emotional granularity.
This past week, Professor Gary Davis and I discussed whether or not a cluster analysis would work on this data. We are both doubtful about how useful it would be. However, after we talked, I found that the within-cluster sum of squares from a k-means cluster analysis exceeded any useful pairwise correlation by almost 0.15. Running k-means with 3 clusters on the data, with no errors and no outliers, gave a within-cluster sum of squares of 0.4, or 40%, while the highest pairwise correlation I could find among the desired variables was 0.25, or 25%.
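For the pairwise side of that comparison, the strongest correlation can be pulled out of the matrix that cor() returns. This is only a sketch: the data frame and column names below are simulated placeholders, not the actual study variables.

```r
# Sketch: finding the strongest pairwise correlation with cor().
# `survey` and its columns are simulated stand-ins for the real data.
set.seed(1)
survey <- data.frame(TotalUse = rnorm(100),
                     DERS     = rnorm(100),
                     RDEES    = rnorm(100))
cors <- cor(survey)            # full pairwise correlation matrix
diag(cors) <- NA               # ignore the trivial self-correlations
max(abs(cors), na.rm = TRUE)   # strongest pairwise correlation
```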
The code I used to find the within-cluster sum of squares was the simplest form of the k-means algorithm:
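A minimal sketch of that call looks like the following; here `cleaned` is simulated stand-in data, not the real cleaned survey responses.

```r
# Sketch of the simplest kmeans() call on stand-in data.
set.seed(42)
cleaned <- scale(matrix(rnorm(300), ncol = 3))  # placeholder for the real data
fit <- kmeans(cleaned, centers = 3)
fit$withinss                 # within-cluster sum of squares, one per cluster
fit$betweenss / fit$totss    # the ratio the print method reports as between_SS / total_SS
```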
Run on the cleaned data, this code also gave me the following visualization:
I have not been able to pursue this path any further to determine whether a within-cluster sum of squares of 0.4 is any good, because Professor Davis found a stronger correlation with his own method and with the glm() function that I thought would lead nowhere last week. Run on the cleaned data, the glm() of all the variables against the binary RDEES displayed the following information:
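A call of roughly this shape produces such a summary. This is a hedged sketch: the response name `RDEES_binary` and the simulated data frame are assumptions standing in for the actual cleaned data set.

```r
# Sketch of a logistic regression of a 0/1 response on all other columns.
# `cleaned`, its columns, and `RDEES_binary` are simulated placeholders.
set.seed(3)
cleaned <- data.frame(TotalUse = rnorm(150), DERS = rnorm(150))
cleaned$RDEES_binary <- rbinom(150, 1, plogis(0.8 * cleaned$TotalUse))
model <- glm(RDEES_binary ~ ., data = cleaned, family = binomial)
summary(model)   # coefficient table: estimates, std. errors, p-values
```

The `~ .` formula regresses the binary response on every remaining column, and the p-values in the coefficient table are what tell you which predictors are significant.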
This information tells us that both TotalUse and DERS (though we care more about TotalUse) have a significant effect on whether somebody is above or below average in emotional granularity. This is very good news, because that is exactly what we were asked to predict.
Working from the logistic regression on the binary RDEES, Professor Davis used the fitted probabilities to predict whether each participant was above or below average. He then checked each prediction against the data and marked the ones the model got wrong. Next, he found that if you ignore the predictions near the middle in probability, where the model is essentially flipping a coin, the relationship gets stronger. However, you don’t want to exclude a band that is too wide, because then the accuracy is artificially inflated. On this basis, an exclusion width of 0.1 was established, cutting off predicted probabilities between 0.45 and 0.55. Ignoring those values and dividing the number of correct predictions by the total number of predictions gave an accuracy of 0.77. A graph was made to show the accuracy at a given width of exclusion above and below 0.5:
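The exclusion-band calculation can be sketched as follows. Here `p` and `y` are simulated stand-ins for the fitted probabilities and the true binary RDEES labels, so the numbers it prints are illustrative only, not the 0.77 reported above.

```r
# Sketch of the exclusion-band accuracy check on simulated stand-in data.
set.seed(7)
y <- rbinom(200, 1, 0.5)                 # stand-in true 0/1 labels
p <- plogis(2 * (y - 0.5) + rnorm(200))  # stand-in fitted probabilities
accuracy_outside <- function(width) {
  keep <- abs(p - 0.5) >= width / 2      # drop the coin-flip band around 0.5
  mean((p[keep] > 0.5) == (y[keep] == 1))
}
accuracy_outside(0.1)                              # ignores p in (0.45, 0.55)
sapply(seq(0, 0.4, by = 0.05), accuracy_outside)   # accuracy vs. band width
```

The vector from the last line is what gets plotted: accuracy on the y-axis against the exclusion width on the x-axis.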
Seeing as this method achieved an accuracy of 0.77 while excluding only 10% of the data, I conclude that it is far better suited to Dr. Fugate’s data set than a k-means clustering algorithm. Professor Davis’s predictions were the most accurate, and they came from the logistic regression. Therefore, the logistic regression of binary RDEES against the rest of the variables gives the most accurate predictions and should, so far, be our first and main conclusion when we present this data to Dr. Jennifer Fugate at the end of week 7.