# Week 4: The Slump

I have not received a response on whether cluster analysis, or any other kind of analysis by grouping, could be useful. I am starting to think it could be, seeing as the class is now grouping participants by whether their emotional granularity (RDEES) score is above or below average. Even so, I don’t yet know in what sense it would be useful.

As a class we’ve discussed other possible solutions. One was to make the RDEES values binary, based on whether the participant has an above- or below-average RDEES score. Using Excel, I created a new column for this and plotted it against the TotalUse variable to see whether somebody’s total electronic usage per day can predict their emotional granularity. Professor Davis will be going over logistic regression with the glm() function in R today so that we can predict how likely somebody is to have an above-average RDEES score. However, from what the graph shows me, there is little to no relationship, even with a logistic regression on the binary variable. In a logistic regression we hope the fitted curve bends into a clear S shape, as far from a straight diagonal line as possible. As you can see from the following figure, the curve here will stay quite close to straight, as far as logistic regressions go, because the TotalUse values of the two groups overlap so heavily.

[Figure: Binary RDEES plotted against the TotalUse variable, showing little to no correlation.]
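A minimal sketch of the binarize-then-fit idea might look like the following. It is written in plain Python rather than the R the class uses (in R the fit itself would be a call to `glm()` with `family = binomial`), it fits the curve with a simple gradient descent instead of whatever `glm()` does internally, and the `total_use` and `rdees` numbers are made up for illustration, not taken from the study data. The function names are hypothetical.

```python
import math

def binarize_by_mean(scores):
    """Return 1 for scores above the mean, 0 otherwise."""
    mean = sum(scores) / len(scores)
    return [1 if s > mean else 0 for s in scores]

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient descent on the log-loss."""
    n = len(x)
    # Standardize x so a single learning rate works whatever the units are.
    mx = sum(x) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n) or 1.0
    z = [(v - mx) / sx for v in x]
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for zi, yi in zip(z, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * zi)))
            g0 += p - yi
            g1 += (p - yi) * zi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    # Convert the coefficients back to the original units of x.
    return b0 - b1 * mx / sx, b1 / sx

def predict_prob(x_value, b0, b1):
    """Predicted probability of an above-average score at x_value."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x_value)))

# Made-up example data (NOT the study data): heavy overlap between groups.
total_use = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rdees = [2.1, 3.4, 2.0, 3.6, 2.2, 3.5, 2.3, 3.3]

above_avg = binarize_by_mean(rdees)
b0, b1 = fit_logistic(total_use, above_avg)
low, high = predict_prob(1.0, b0, b1), predict_prob(8.0, b0, b1)
print(f"slope={b1:.3f}, P(above avg) at 1h={low:.2f}, at 8h={high:.2f}")
```

When the two groups overlap as heavily as they do in this toy data, the predicted probability barely moves across the whole range of usage, which is the "close to straight" curve described above.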

After this, I went back to exploring the data, this time using a 3D scatter plot to see whether I could find any correlation among three variables. I did not find what I was looking for, but I did find that the frequency count had the largest effect on my summations, which I thought was interesting. Comparing those two variables, I found an r-squared value of 0.408, a moderately strong relationship. This doesn’t help us solve the problem, but it was an interesting observation.

[Figure: 3D scatterplot (x = frequency, y = DERS, z = sum) that helped me find a relation between frequency and sum.]

[Figure: 2D scatterplot showing how much frequency affected the summations.]

[Figure: R-squared value for the above graph between Frequency and Summations.]
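For a straight-line fit between two variables, r-squared is the square of the Pearson correlation, so it can be computed directly from the two columns. The sketch below assumes that is how the value was obtained (the post does not say which tool produced it), and the numbers are invented for illustration rather than being the actual frequency and summation data.

```python
def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a simple linear regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Made-up example values (NOT the study data).
frequency = [1, 2, 3, 4]
summation = [1, 3, 2, 4]
print(round(r_squared(frequency, summation), 2))
```

An r-squared of 0.408 would mean that about 41% of the variance in one variable is accounted for by a linear fit on the other.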

Something we discussed doing in the future is validating our model on held-out data to guard against overfitting. We will train our model on 50% of the participants and then test it on the other 50% to determine its accuracy. I have not done this yet and am interested in trying it. I don’t see the logistic regression solution working, so I hope we start on another one, possibly even the cluster analysis idea I asked about on the class Q&A page, so that we can be more confident in our response to Dr. Fugate than the 0.25 r-squared value I found comparing the Summations to the RDEES with outliers removed.
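The 50/50 split described above could be sketched roughly as follows. This is an illustration only: the helper names `split_half` and `accuracy` are hypothetical, the participant IDs are placeholders, and the class may well do the split in R instead.

```python
import random

def split_half(participants, seed=0):
    """Shuffle the participants and return (train, test) halves."""
    shuffled = list(participants)
    # A fixed seed makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def accuracy(predicted, actual):
    """Fraction of held-out participants whose label was predicted correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Placeholder participant IDs (NOT the study data).
train_ids, test_ids = split_half(range(10))
print(len(train_ids), len(test_ids))
```

The model would be fitted using only `train_ids`, and `accuracy` would then compare its predictions for `test_ids` against their true above- or below-average labels. A large gap between training and test accuracy would be a sign of overfitting.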