Firstly, bμg's tool can only cope with replays from 2021 onwards. This immediately means that the approach can't shed light on the identity of Misery from RAGL S08 or of Archangel (which happy claimed a long time ago).
Another reason I can't identify Archangel is that I'm limited to replays that I already have or have downloaded from the ladder (and I don't have enough free disk space or patience to download everything from the ladder!).
The classification algorithm I've put together is pretty crude. It computes a set of metrics for each replay, averages them per player account, and then compares accounts by the absolute difference between their averaged metric scores. This could definitely be improved by filtering out metrics that only add noise, or by computing an optimal weighting for the different metrics. There are probably other metrics that could be included to improve the results too. Having said all this, using a 2:1 train/test split, the algorithm does seem to match accounts correctly.
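To make the idea concrete, here is a rough Python sketch of that kind of comparison. It is not the actual script: the metric names (`apm`, `first_build_secs`), the account names, the numbers and the optional `weights` argument are all made up for illustration, and I'm assuming the per-metric differences are simply summed.

```python
from collections import defaultdict
from statistics import mean


def account_profiles(replays):
    """Average each metric over all replays of an account.

    `replays` is a list of (account, {metric_name: value}) pairs.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for account, metrics in replays:
        for name, value in metrics.items():
            grouped[account][name].append(value)
    return {
        account: {name: mean(values) for name, values in per_metric.items()}
        for account, per_metric in grouped.items()
    }


def distance(profile_a, profile_b, weights=None):
    """Sum of absolute differences over the metrics both profiles share.

    `weights` is a hypothetical hook for per-metric weighting; by default
    every metric counts equally.
    """
    shared = profile_a.keys() & profile_b.keys()
    weights = weights or {}
    return sum(
        weights.get(name, 1.0) * abs(profile_a[name] - profile_b[name])
        for name in shared
    )


def closest_account(unknown_profile, known_profiles):
    """Return the known account whose profile is nearest to the unknown one."""
    return min(
        known_profiles,
        key=lambda account: distance(unknown_profile, known_profiles[account]),
    )


# Made-up numbers, purely for illustration.
replays = [
    ("alice", {"apm": 100.0, "first_build_secs": 8.0}),
    ("alice", {"apm": 110.0, "first_build_secs": 10.0}),
    ("bob", {"apm": 60.0, "first_build_secs": 20.0}),
    ("bob", {"apm": 70.0, "first_build_secs": 22.0}),
]
known = account_profiles(replays)
mystery = account_profiles(
    [("mystery", {"apm": 104.0, "first_build_secs": 9.5})]
)["mystery"]
best_guess = closest_account(mystery, known)
```

Note that `closest_account` always returns one of the known accounts, however poor the match, and that with equal weights a metric measured on a large scale will dominate the sum. That is exactly where filtering or weighting the metrics would help.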
Since my data set is limited, the script can only guess at players who are in the data set. For example, there are no LorryDriver replays (because he played before 2021), so it will never guess that an account is a Lorry smurf.
There are a number of factors which I deliberately did NOT use:
- Player chat
- Player names
- Skill level
- IP address
- Time of day/day of week
- Game count (some players play lots more games than others)
- List of opponents (since it's hard for a smurf to play against themself)
A final note before we get on to some results in the next post: if you start a witch hunt then you're going to find witches. The script simply points out accounts that play in similar ways - therefore it will definitely find similar accounts. This does not mean that the players are smurfs of each other (and in many cases there is plenty of evidence that they are not smurfs of each other).
