MusicBrainz Search Server / SEARCH-386

Use bigrams to improve relevancy of CJK results

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Normal
    • 2014-10-08

      Right now, it seems that Han characters are split into unigrams, i.e. every single character becomes a separate token, which tends to produce a large number of mostly poor matches. We could improve this by using bigrams, i.e. ABCD turns into "AB BC CD" instead of "A B C D".
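
      To make the difference concrete, here is a minimal Python sketch of the two tokenizations. The `ngrams` helper is a hypothetical name used for illustration only; it is not part of the search server or of Lucene.

```python
from typing import List


def ngrams(text: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Text no longer than n cannot form more than one n-gram;
    # keep it whole so that short input still yields a token.
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(" ".join(ngrams("ABCD", 1)))  # unigram tokens
print(" ".join(ngrams("ABCD", 2)))  # bigram tokens
```

      With bigrams, a match has to contain two adjacent characters in the right order, which is a much stronger signal than containing a single character anywhere.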

      It seems Lucene and Solr have the ability to do that already:
      http://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation
      https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/cjk/CJKBigramFilter.html
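
      As a rough illustration of what such a filter does with mixed-script input, here is a Python sketch loosely modeled on the behavior described in the CJKBigramFilter documentation: runs of CJK characters become bigrams, a lone CJK character stays a unigram, and non-CJK words pass through unchanged. This is not Lucene code. The character ranges and the `analyze` function are simplifying assumptions made for this example.

```python
import re
from typing import List

# Assumed, simplified ranges: Hiragana and Katakana, common Han ideographs, Hangul syllables.
CJK = r"\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"
CJK_CHAR = re.compile(rf"[{CJK}]")
# A token is either a run of CJK characters or a run of other word characters.
TOKEN = re.compile(rf"[{CJK}]+|(?:(?![{CJK}])\w)+")


def analyze(text: str) -> List[str]:
    """Tokenize text, turning each run of CJK characters into overlapping bigrams."""
    tokens: List[str] = []
    for match in TOKEN.finditer(text):
        run = match.group()
        if not CJK_CHAR.match(run):
            tokens.append(run)  # non-CJK word: passed through unmodified
        elif len(run) == 1:
            tokens.append(run)  # lone CJK character: no neighbor to pair with
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


print(analyze("田中一郎 Band 月"))
```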

      For example:

      An artist search for "田中一郎" is currently the same as an artist search for "田 中 一 郎". It produces 3907 results, and the artist with that name isn't even the top result:
      https://beta.musicbrainz.org/search?query=%E7%94%B0%E4%B8%AD%E4%B8%80%E9%83%8E&type=artist&limit=100&method=advanced

      Using bigrams instead brings the number down to 259 results, roughly 7% of the original set, and also makes the artist with that name the top result:
      https://beta.musicbrainz.org/search?query=%22%E7%94%B0%E4%B8%AD%22+%22%E4%B8%AD%E4%B8%80%22+%22%E4%B8%80%E9%83%8E%22&type=artist&limit=100&method=advanced
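
      The bigram URL above was simulated by hand, by quoting each adjacent pair of characters as a phrase. A small Python sketch of that construction follows. The function names are hypothetical; the base URL and parameters are copied from the links in this issue.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://beta.musicbrainz.org/search"


def bigram_phrase_query(text: str) -> str:
    """Quote every pair of adjacent characters as a phrase, e.g. ABC -> "AB" "BC"."""
    if not text:
        return ""
    # A single character has no neighbor, so fall back to the character itself.
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
    return " ".join(f'"{bigram}"' for bigram in bigrams)


def search_url(query: str, entity_type: str, limit: int = 100) -> str:
    """Build an advanced-search URL with the same parameters as the links in this issue."""
    params = {"query": query, "type": entity_type, "limit": limit, "method": "advanced"}
    return f"{SEARCH_URL}?{urlencode(params)}"


print(search_url(bigram_phrase_query("田中一郎"), "artist"))
```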

      It's still far from ideal, but it would certainly be a step in the right direction. The only results I would actually consider relevant matches for the input are the ones returned by this query:
      https://beta.musicbrainz.org/search?query=%28%22%E7%94%B0%E4%B8%AD%E4%B8%80%22+%22%E4%B8%AD%E4%B8%80%E9%83%8E%22%29+%28%E7%94%B0+AND+%E4%B8%AD+AND+%E4%B8%80+AND+%E9%83%8E%29&type=artist&limit=100&method=advanced
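
      The stricter query in the link above combines trigram phrases with a clause that requires every character of the input. The following Python sketch rebuilds that query string for an arbitrary input. `strict_query` is a hypothetical helper, and the sketch assumes the Lucene-style syntax visible in the decoded URL.

```python
from urllib.parse import quote_plus


def strict_query(text: str) -> str:
    """Combine trigram phrases with an AND over all characters of text."""
    if len(text) < 3:
        raise ValueError("need at least three characters to form a trigram")
    trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
    phrases = " ".join(f'"{trigram}"' for trigram in trigrams)
    # Joining a string with " AND " places the operator between its characters.
    all_characters = " AND ".join(text)
    return f"({phrases}) ({all_characters})"


query = strict_query("田中一郎")
print(query)
print(quote_plus(query))  # URL-encoded form, as used in the query parameter
```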

      Some more randomly selected examples:

      Release search for "陪著我的時候想著她" goes from 1382 to 57:
      https://beta.musicbrainz.org/search?query=%E9%99%AA%E8%91%97%E6%88%91%E7%9A%84%E6%99%82%E5%80%99%E6%83%B3%E8%91%97%E5%A5%B9&type=release&limit=25&method=advanced
      https://beta.musicbrainz.org/search?query=%22%E9%99%AA%E8%91%97%22+%22%E8%91%97%E6%88%91%22+%22%E6%88%91%E7%9A%84%22+%22%E7%9A%84%E6%99%82%22+%22%E6%99%82%E5%80%99%22+%22%E5%80%99%E6%83%B3%22+%22%E6%83%B3%E8%91%97%22+%22%E8%91%97%E5%A5%B9%22&type=release&limit=25&method=advanced

      Release search for "恋の天気予報" goes from 4614 to 81:
      https://beta.musicbrainz.org/search?query=%E6%81%8B%E3%81%AE%E5%A4%A9%E6%B0%97%E4%BA%88%E5%A0%B1&type=release&limit=25&method=advanced
      https://beta.musicbrainz.org/search?query=%22%E6%81%8B%E3%81%AE%22+%22%E3%81%AE%E5%A4%A9%22+%22%E5%A4%A9%E6%B0%97%22+%22%E6%B0%97%E4%BA%88%22+%22%E4%BA%88%E5%A0%B1%22&type=release&limit=25&method=advanced

      Artist search for "陳奕迅" goes from 185 to 2:
      https://beta.musicbrainz.org/search?query=%E9%99%B3%E5%A5%95%E8%BF%85&type=artist&limit=25&method=advanced
      https://beta.musicbrainz.org/search?query=%22%E9%99%B3%E5%A5%95%22+%22%E5%A5%95%E8%BF%85%22&type=artist&limit=25&method=advanced

      Artist search for "松任谷由実" goes from 1261 to 8:
      https://beta.musicbrainz.org/search?query=%E6%9D%BE%E4%BB%BB%E8%B0%B7%E7%94%B1%E5%AE%9F&type=artist&limit=25&method=advanced
      https://beta.musicbrainz.org/search?query=%22%E6%9D%BE%E4%BB%BB%22+%22%E4%BB%BB%E8%B0%B7%22+%22%E8%B0%B7%E7%94%B1%22+%22%E7%94%B1%E5%AE%9F%22&type=artist&limit=25&method=advanced

      Artist search for "森野文子" goes from 3412 to 11:
      https://beta.musicbrainz.org/search?query=%E6%A3%AE%E9%87%8E%E6%96%87%E5%AD%90&type=artist&limit=25&method=indexed
      https://beta.musicbrainz.org/search?query=%22%E6%A3%AE%E9%87%8E%22+%22%E9%87%8E%E6%96%87%22+%22%E6%96%87%E5%AD%90%22&type=artist&limit=25&method=advanced

      Artist search for "五月天" goes from 456 to 12:
      https://beta.musicbrainz.org/search?query=%E4%BA%94%E6%9C%88%E5%A4%A9&type=artist&limit=25&method=advanced
      https://beta.musicbrainz.org/search?query=%22%E4%BA%94%E6%9C%88%22+%22%E6%9C%88%E5%A4%A9%22&type=artist&limit=25&method=advanced
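
      To compare the examples at a glance, the result counts quoted in this issue can be turned into the share of results that remains after switching to bigrams. This is a small Python sketch; the counts are copied from the text above and `remaining_percent` is a hypothetical helper.

```python
# (query, entity type, results with unigrams, results with bigrams), as quoted in this issue
EXAMPLES = [
    ("田中一郎", "artist", 3907, 259),
    ("陪著我的時候想著她", "release", 1382, 57),
    ("恋の天気予報", "release", 4614, 81),
    ("陳奕迅", "artist", 185, 2),
    ("松任谷由実", "artist", 1261, 8),
    ("森野文子", "artist", 3412, 11),
    ("五月天", "artist", 456, 12),
]


def remaining_percent(before: int, after: int) -> float:
    """Percentage of the original result set that is still returned."""
    return 100.0 * after / before


for query, entity, before, after in EXAMPLES:
    share = remaining_percent(before, after)
    print(f"{entity:8} {query}: {before} -> {after} ({share:.1f}% remain)")
```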

            ijabz Paul Taylor
            nikki nikki

                Version Package
                2014-10-08