Picard / PICARD-1885

Use Machine Learning to determine the set of weights used by the similarity algorithm


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Lookup & Match
    • Labels:
      None

      Description

      The weights used internally by Picard's similarity algorithm are like dark magic. Over the years, different people have tweaked these numbers until they just worked. There is no way to test these values or to measure how well they work. Adjusting the numbers and testing them manually against a set of files is tiring and cumbersome.

      However, determining the best set of weights for different parameters to arrive at a result is a classic machine learning problem. My suggestion is to create a machine learning algorithm that outputs these weights. We could run it every once in a while, say before a release, and update the weights.
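
      As a rough illustration of the idea (not Picard's actual code), the sketch below fits one weight per metadata field with a small logistic regression. Everything in it is an assumption made for the example: the field names, the synthetic training data, and the helper names `train_weights`, `make_synthetic_data` and `accuracy`. Real training data would have to be per-field similarity scores for file/candidate pairs that someone has labelled as correct or incorrect matches.

```python
import math
import random

# Hypothetical field names; the real algorithm may compare other fields.
FIELDS = ["title", "artist", "album", "length"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Weighted sum of per-field similarities, squashed to a probability."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_weights(samples, labels, epochs=1500, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent."""
    n = len(FIELDS)
    weights = [0.0] * n
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = score(weights, bias, x) - y
            for i in range(n):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def make_synthetic_data(count, seed=0):
    """Fake labelled pairs: each row holds per-field similarities in [0, 1]."""
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(count):
        is_match = rng.random() < 0.5
        # Assumption for the demo: title is very informative, length is noise.
        title = rng.uniform(0.7, 1.0) if is_match else rng.uniform(0.0, 0.6)
        artist = rng.uniform(0.5, 1.0) if is_match else rng.uniform(0.0, 0.8)
        album = rng.uniform(0.3, 1.0) if is_match else rng.uniform(0.0, 0.9)
        length = rng.uniform(0.0, 1.0)
        samples.append([title, artist, album, length])
        labels.append(1 if is_match else 0)
    return samples, labels


def accuracy(weights, bias, samples, labels):
    """Fraction of pairs classified correctly at a 0.5 threshold."""
    correct = 0
    for x, y in zip(samples, labels):
        correct += int((score(weights, bias, x) >= 0.5) == bool(y))
    return correct / len(samples)


samples, labels = make_synthetic_data(300)
weights, bias = train_weights(samples, labels)
learned = dict(zip(FIELDS, weights))
print(learned, accuracy(weights, bias, samples, labels))
```

      The `accuracy` helper also gives a repeatable way to check a candidate set of weights against labelled data, which would address the lack of testing described above. Learned coefficients can be negative or differ in scale from hand-tuned weights, so they would probably need rescaling before being used in place of the current values.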

      I am not an expert in machine learning, but text processing and text matching are long-solved problems. Fundamentally, this is just like detecting plagiarism or code similarity. The problem has been solved over and over again; we just need to apply it in a different context.

            People

            Assignee:
            Unassigned
            Reporter:
            kartik1712 amCap1712
            Votes:
            0
            Watchers:
            3
