Blue Cup rating and the people's choice: a statistical analysis (Warning: long!)

Started by bicilotti, Thu 17/07/2008 20:17:44


bicilotti

What follows is a small statistical analysis I did using SSH's PHP data and SPSS software. More info about it here.

Assumptions: before starting, a few words on the assumptions I've made. There are basically two: the "overall" rating reflects the views of the people of the community, and the "cups" reflect the views of the expert panel.
Neither assumption is 100% correct (e.g. the ratings can be vandalised; the "blue cups" rating is given by just one person and not by the entire panel, etc.), but overall the distortions on the final results should be minimal.

Let's Go!

First of all, some simple descriptive graphs. A pie chart with the percentage of blue cups:

[pie chart: percentage of games at each cup rating]
Next is a histogram with the distribution of the "people's rating" (overall):

[histogram: distribution of the overall people's rating]
Take note of the standard deviation: it is a simple measure of the dispersion of the data. Here it is about 19.

More in depth!

Ok, now that we know a little bit more about our data, let's run an experiment.
I have split the data file into 5 pieces (one for every cup value). We'll plot 5 histograms, one for every segment. The SD of each of them should be significantly lower than that of the full data (19.3).

[histograms: distribution of the overall rating for cup = 1 to cup = 4]
The plot for "cup = 5" is not displayed since it has too few entries (7).
What can be seen from the histograms? Splitting the data by cup rating reduces the standard deviation (albeit slightly), except for "cup = 1", where the plot is very ragged (SD = 22!).
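(For anyone who wants to redo this check without SPSS, here is a rough Python/pandas sketch of the same split. The file name and column names are made up for illustration; the real data came from SSH's PHP export.)

# Compare the full-sample SD of the overall rating with the SD inside each cup group.
# "ratings.csv", "overall" and "cups" are hypothetical names, not the actual export.
import pandas as pd

df = pd.read_csv("ratings.csv")                              # one row per game

print("full-sample SD:", round(df["overall"].std(), 1))      # ~19.3 in the post
print(df.groupby("cups")["overall"].agg(["count", "std"]))   # SD within each cup rating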

Linear Regression

In this section I'll try to "discover" the formula (if such an equation exists, of course) which "ties" the "people's rating" to the "panel rating".

First of all, let's analyse the correlation between the variables VISUAL, IMMERSION, PUZZLES and OVERALL.
Next is a scatter/dot graph. The more the "clouds" resemble a line, the more the variables are correlated.

[scatter/dot matrix: VISUAL, IMMERSION, PUZZLES, OVERALL]
Ok, every one of them is strongly and positively correlated with the others. That's reasonable if you think about it a bit: usually (not always, but usually) good graphics are a sign of an overall serious effort; the same goes for puzzles and immersion.
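(A minimal sketch of the same correlation check, reusing the hypothetical ratings.csv and column names from the snippet above; pandas' corr() gives the Pearson correlations and scatter_matrix approximates the scatter/dot graph.)

# Pairwise correlations between the four rating columns (names assumed).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ratings.csv")
cols = ["visual", "immersion", "puzzles", "overall"]

print(df[cols].corr())                      # values near +1 mean the "clouds" line up
pd.plotting.scatter_matrix(df[cols], alpha=0.5)
plt.show()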

Having said that, we can safely take just one parameter for our regression, OVERALL (i.e. we will use one known variable to try to explain the other, namely the "cup rating").

Here are the results; I've highlighted in yellow the important things to notice:

[SPSS regression output tables]
There are two things in the tables which need to be noticed:

Adjusted R²: this states how "well" the OVERALL variable explains the CUPS one (i.e. how well the people's views are reflected by the panel, and vice versa). Here it is 0.382, which is fine albeit not extremely good.

Coefficients: the most important part of the model. They mean that, overall, to "forecast" a cup rating, you should use the following formula:

CUPS RATING = OVERALL * 0.029 + 0.467

Example: 4 of Clubs got an overall score of 52%.
52 * 0.029 + 0.467 = 1.975. Its cup rating is 2, so in this case the "panel" agrees with the "people".
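(Again as an illustration only: the same one-variable fit can be reproduced roughly with numpy's polyfit on the hypothetical ratings.csv; the coefficients should land near the 0.029 and 0.467 above, and the last two lines redo the 4 of Clubs example.)

# Fit cups = a * overall + b, then apply the published formula to one game.
import numpy as np
import pandas as pd

df = pd.read_csv("ratings.csv")
slope, intercept = np.polyfit(df["overall"], df["cups"], deg=1)
print(f"cups ~= overall * {slope:.3f} + {intercept:.3f}")

predicted = 52 * 0.029 + 0.467             # 4 of Clubs: overall score 52%
print(predicted, "->", round(predicted))   # 1.975 -> 2 cups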


Conclusions

Well, the data pretty much speaks for itself. In some ways the panel shares views with the general public. This is true especially for higher-rated cup games.
The closer we get to the lower end, the less this holds: we see unexplainable outliers, the one-cup conditional distribution is quizzical to look at, and so on.
One- and two-cup games account for more than 50% of the total, so this divergence is quite a relevant matter.
Let's call it a "hidden gem bias": many AGSers see the beauty or the potential of a game where the panel does not.

To the reader: thanks for having read this long and dull post :P



SSH

I understood that up to "Let's Go"...  :=

But thanks for this insight, bicilotti!

MashPotato

Quote from: bicilotti on Thu 17/07/2008 20:17:44
To the reader: thanks for having read this long and dull post :P
I honestly found it interesting :)  Thanks for all that effort!

LimpingFish

Quote from: bicilotti on Thu 17/07/2008 20:17:44
One- and two-cup games account for more than 50% of the total, so this divergence is quite a relevant matter.
Let's call it a "hidden gem bias": many AGSers see the beauty or the potential of a game where the panel does not.

Or it simply means that 50 percent of the database is rubbish. I could've told you that without the pie charts.

The reason why user ratings are so disproportionate to panel ratings for the lower-rated games is that users vote for a multitude of reasons. Take my own game "Unbound". Somebody purposely went out of their way to rate it "poor" in all areas, and added "Fuck Fuck Fuuuuuuck" as their reason for doing so. This greatly, and falsely, reduced its overall user rating average.

On the other hand, (Name Withheld) has 35 positive, but anonymous, votes, artificially inflating its user rating into the high 90s. I've played it. It's a one-cup game if ever I saw one.

But let's ignore all of this for a moment, and address the underlying truth about Panel vs User ratings:

It's...all...subjective.

Regarding "Hidden Gems": a game's quality is not determined in a critical vacuum. With the exception of commercial games, the panel rates all AGS games equally and fairly. Very occasionally exceptions are made regarding MAGS or OROW games, simply because of the time frame involved.

The reasons why a panel member awards two-cup or one-cup ratings can vary, sure, but you can be assured of the following facts: it's never "personal", it's never done on a whim, and it's always open to objection from the other panel members.

Conversely, higher ratings are never awarded because the panel member is a friend of the developer, or because the developer is thought to deserve a higher rating based on their popularity within the community.

If Bicilotti's fine analysis teaches us anything, it's that the unreliability of the user ratings is clearly responsible for the spikes we see when correlating panel vs. user opinions.

Vince Twelve

One other reason for the high variance among one-cup games is the number of people voting on them. Without many votes, it's much harder to get an accurate picture of a game's quality. I would wager that an examination of the database would reveal that the one-cup games, on average, have far fewer votes than the generally more popular three- or four-cup games. Clicking on some random ones now, many of them have been around for years and still have never gotten enough votes to even show a percentage. This lower number of votes could be contributing greatly to the high variance in the average scores.
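(To put a rough number on this: the average of a handful of votes jumps around far more than the average of many. A tiny simulation with made-up numbers shows the spread of the average score shrinking as the vote count grows, roughly as 1/sqrt(votes).)

# Simulate many games, each receiving n noisy votes around the same "true" score,
# and look at how much the resulting averages spread out. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
for n_votes in (3, 10, 30, 100):
    averages = rng.normal(60, 20, size=(10_000, n_votes)).mean(axis=1)
    print(f"{n_votes:>3} votes -> spread (SD) of the average: {averages.std():.1f}")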

Hell, you'll probably even see a huge drop in the number of votes between the average 5-cup game and the average 4-cup game.

This, coupled with the vote vandalism (Unbound) and inflation (is there a reason we're not pointing directly at Rapstar?) that Limpy mentioned, makes the user votes almost meaningless to me. The panel cup rating seems far more useful in general.

Great work here, though.  A very interesting study!  Thanks!


LimpingFish

That's a good point, Vince. Not every game has votes, and not every game has enough variation in its voting to determine a valid total.

This, coupled with the fact that user ratings are dynamic, whereas cup ratings are static, largely undermines any attempts to correlate the two.

But that was an impressively detailed attempt by Bicilotti. :)

The Ivy

Oh boy, SPSS...are you in the social sciences too?
blueskirt

It doesn't really change my opinion on the Cup Rating Vs User Rating debate, but I can't help but applaud your efforts, bicilotti, that was certainly an interesting read! :)

bicilotti

Quote from: The Ivy on Fri 18/07/2008 03:00:11
Oh boy, SPSS...are you in the social sciences too?

An economics student. I thought Ivy was a... chemist!  ;D


alkis21

Thank you for that excellent work, bicilotti. Very interesting read.

Ali

Quote from: LimpingFish on Thu 17/07/2008 21:15:12
Take my own game "Unbound". Somebody purposely went out of their way to rate it "poor" in all areas, and added "Fuck Fuck Fuuuuuuck" as their reason for doing so. This greatly, and falsely, reduced its overall user rating average.

Nelly Cootalot has an oddly similar (almost word-for-word) review. When will people learn... democracy doesn't work!

Dualnames

Well, it's mostly about rating games. I really don't think that any of the games is over- or under-rated. For example, Lone Case 3 Showdown has good looks and medium puzzles, but it has a terrible-looking font and some bugs that frustrate people, and it lacks a gamma slider, all of which I'm fixing in the new version with the talkie. It took 2 blue cups. People who told me it should be rated higher overlooked the font and the bad stuff (not much of it) and considered the game quite superb. However, we have to accept that not all people do. People fall into many categories: some like adventure games, some don't. The rating panel members were chosen to represent as many categories of people as possible, in order to maintain an objective opinion on each game.

LimpingFish

Quote from: Dualnames on Sat 19/07/2008 13:30:04
It took 2 blue cups. People who told me it should be rated higher overlooked the font and the bad stuff (not much of it) and considered the game quite superb. However, we have to accept that not all people do. People fall into many categories: some like adventure games, some don't. The rating panel members were chosen to represent as many categories of people as possible, in order to maintain an objective opinion on each game.

You see, this is what I mean about how people expect games to be rated. Sometimes a bad game is simply that. Bad. Making allowances for this or that, or awarding a rating based on what the developer intended to make, is not only pointless, but detracts from those games that do earn higher awards. I'm not speaking directly about Lone Case, but about something which plagues a number of games, and which usually results in a low cup rating.

To expand on what I was saying in an earlier post, games submitted to the AGS database are rated on both their strengths and their weaknesses. If a game is submitted to the database, unless it's a demo, it is regarded as a finished game, and rated as such. One of the reasons we have so many one-cup games is that people simply want to make something, so as to have their name appear in the database.

What I'm trying to say is, if you feel the game still has some rough edges, then don't release it until you are happy that the finished product is made to the best of your abilities. There is no deadline, so there's no need to feel pressured into an early release. Of course the truth of the matter is, and I'm guilty of this myself, that we get bored and distracted and we just want the damn thing to be finished! But if that's the case, then we really can't complain when somebody finds what we've created lacking.

Technical faults can be a factor in deciding an overall rating, but will fixing them automatically result in a higher cup rating? Only time will tell.
