Type: Improvement
Resolution: Unresolved
Priority: Normal
The weights used internally by Picard's similarity algorithm are like dark magic. Over the years, different people have tweaked these numbers until they just work. There is no way to test these values or to find out how well they actually perform, and adjusting them by hand and then re-testing against a batch of files is tiring and cumbersome.
However, determining the best set of weights for a number of parameters to reach a desired result is a classic machine learning problem. My suggestion is to build a machine learning pipeline that outputs these weights. We could run it every once in a while, say before a release, and update the weights.
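As a minimal sketch of the idea: assuming we had a corpus of hand-verified matches, each expressed as per-field similarity scores (the field names below are hypothetical, and the data here is synthetic), simple logistic regression fitted by gradient descent would already yield a defensible set of weights. This is not Picard's actual code, just an illustration of the technique.

```python
import numpy as np

# Synthetic stand-in for a labeled corpus: each row holds hypothetical
# per-field similarity scores (title, artist, album, length) for one
# candidate match, and y says whether it was the correct match.
rng = np.random.default_rng(0)
n = 400
X = rng.random((n, 4))
# Fabricated ground truth: correct matches tend to have high
# title/artist similarity.
true_w = np.array([3.0, 2.0, 1.0, 0.5])
y = (X @ true_w + rng.normal(0, 0.3, n) > true_w.sum() / 2).astype(float)

def fit_weights(X, y, lr=0.5, steps=2000):
    """Learn field weights with logistic regression via gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        # Predicted probability that each candidate is a correct match.
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = fit_weights(X, y)
# Normalise so the result drops into a weighted-average similarity score.
weights = w / w.sum()
print(weights)
```

The learned `weights` recover the relative importance of each field from the labels alone, which is exactly the part that is currently tuned by hand.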
I am not an expert in machine learning, but text processing and text matching are long-solved problems. Fundamentally, this is just like detecting plagiarism or code similarity. The problem has been solved over and over again; we just need to apply it in a different context.