-
Improvement
-
Resolution: Fixed
-
Normal
Right now, it seems that Han characters are split into unigrams, i.e. every single character becomes a separate token, which tends to produce a large number of mostly very poor matches. We could improve this by using bigrams, i.e. ABCD turns into "AB BC CD" instead of "A B C D".
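To make the difference concrete, here is a minimal sketch of the two tokenizations in plain Python. It is only an illustration of the idea, not how Lucene or the search server implements it; the function name `han_bigrams` is made up for this example, and keeping a lone character as its own token is an assumption of the sketch.

```python
def han_bigrams(text):
    """Return overlapping two-character tokens for a run of characters.

    Hypothetical helper for illustration only.
    """
    if len(text) < 2:
        # Assumption: a lone character is kept as a single token.
        return [text] if text else []
    # Slide a window of width 2 over the string, one character at a time.
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Unigrams: every character is its own token.
print(list("ABCD"))          # ['A', 'B', 'C', 'D']
# Bigrams: every pair of adjacent characters is a token.
print(han_bigrams("ABCD"))   # ['AB', 'BC', 'CD']
```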
It seems Lucene and Solr have the ability to do that already:
http://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation
https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/cjk/CJKBigramFilter.html
For example:
An artist search for "田中一郎" is treated the same as an artist search for "田 中 一 郎". It currently produces 3907 results, and the artist with that name isn't even the top result:
https://beta.musicbrainz.org/search?query=%E7%94%B0%E4%B8%AD%E4%B8%80%E9%83%8E&type=artist&limit=100&method=advanced
Using bigrams instead brings the number down to 259 results, a fraction of the original set, and also makes the artist with that name the top result:
https://beta.musicbrainz.org/search?query=%22%E7%94%B0%E4%B8%AD%22+%22%E4%B8%AD%E4%B8%80%22+%22%E4%B8%80%E9%83%8E%22&type=artist&limit=100&method=advanced
It's still far from ideal (the results on https://beta.musicbrainz.org/search?query=%28%22%E7%94%B0%E4%B8%AD%E4%B8%80%22+%22%E4%B8%AD%E4%B8%80%E9%83%8E%22%29+%28%E7%94%B0+AND+%E4%B8%AD+AND+%E4%B8%80+AND+%E9%83%8E%29&type=artist&limit=100&method=advanced are the only things I would actually consider relevant matches for the input), but it would certainly be a step in the right direction.
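For anyone who wants to try other inputs, the hand-built bigram query used in the comparison above can be generated with a short script. This is a rough sketch that only emulates bigrams on the client side by quoting each adjacent character pair as a phrase; the helper names `bigram_phrase_query` and `search_url` are made up here, and it assumes the whole input is a single run of CJK characters with no spaces or Latin text.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://beta.musicbrainz.org/search"


def bigram_phrase_query(text):
    """Quote every pair of adjacent characters as a phrase (hypothetical helper)."""
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return " ".join('"%s"' % b for b in bigrams)


def search_url(text, entity_type, limit=25):
    """Build an advanced-search URL for the bigram query (hypothetical helper)."""
    params = {
        "query": bigram_phrase_query(text),
        "type": entity_type,
        "limit": limit,
        "method": "advanced",
    }
    # urlencode percent-encodes the UTF-8 bytes and turns spaces into "+".
    return SEARCH_URL + "?" + urlencode(params)


print(bigram_phrase_query("田中一郎"))
print(search_url("田中一郎", "artist", limit=100))
```

This only approximates what an index-side bigram filter would do, since the index itself is still tokenized into unigrams and each quoted pair is matched as a two-token phrase.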
Some more randomly selected examples:
Release search for "陪著我的時候想著她" goes from 1382 to 57:
https://beta.musicbrainz.org/search?query=%E9%99%AA%E8%91%97%E6%88%91%E7%9A%84%E6%99%82%E5%80%99%E6%83%B3%E8%91%97%E5%A5%B9&type=release&limit=25&method=advanced
https://beta.musicbrainz.org/search?query=%22%E9%99%AA%E8%91%97%22+%22%E8%91%97%E6%88%91%22+%22%E6%88%91%E7%9A%84%22+%22%E7%9A%84%E6%99%82%22+%22%E6%99%82%E5%80%99%22+%22%E5%80%99%E6%83%B3%22+%22%E6%83%B3%E8%91%97%22+%22%E8%91%97%E5%A5%B9%22&type=release&limit=25&method=advanced
Release search for "恋の天気予報" goes from 4614 to 81:
https://beta.musicbrainz.org/search?query=%E6%81%8B%E3%81%AE%E5%A4%A9%E6%B0%97%E4%BA%88%E5%A0%B1&type=release&limit=25&method=advanced
https://beta.musicbrainz.org/search?query=%22%E6%81%8B%E3%81%AE%22+%22%E3%81%AE%E5%A4%A9%22+%22%E5%A4%A9%E6%B0%97%22+%22%E6%B0%97%E4%BA%88%22+%22%E4%BA%88%E5%A0%B1%22&type=release&limit=25&method=advanced
Artist search for "陳奕迅" goes from 185 to 2:
https://beta.musicbrainz.org/search?query=%E9%99%B3%E5%A5%95%E8%BF%85&type=artist&limit=25&method=advanced
https://beta.musicbrainz.org/search?query=%22%E9%99%B3%E5%A5%95%22+%22%E5%A5%95%E8%BF%85%22&type=artist&limit=25&method=advanced
Artist search for "松任谷由実" goes from 1261 to 8:
https://beta.musicbrainz.org/search?query=%E6%9D%BE%E4%BB%BB%E8%B0%B7%E7%94%B1%E5%AE%9F&type=artist&limit=25&method=advanced
https://beta.musicbrainz.org/search?query=%22%E6%9D%BE%E4%BB%BB%22+%22%E4%BB%BB%E8%B0%B7%22+%22%E8%B0%B7%E7%94%B1%22+%22%E7%94%B1%E5%AE%9F%22&type=artist&limit=25&method=advanced
Artist search for "森野文子" goes from 3412 to 11:
https://beta.musicbrainz.org/search?query=%E6%A3%AE%E9%87%8E%E6%96%87%E5%AD%90&type=artist&limit=25&method=indexed
https://beta.musicbrainz.org/search?query=%22%E6%A3%AE%E9%87%8E%22+%22%E9%87%8E%E6%96%87%22+%22%E6%96%87%E5%AD%90%22&type=artist&limit=25&method=advanced
Artist search for "五月天" goes from 456 to 12:
https://beta.musicbrainz.org/search?query=%E4%BA%94%E6%9C%88%E5%A4%A9&type=artist&limit=25&method=advanced
https://beta.musicbrainz.org/search?query=%22%E4%BA%94%E6%9C%88%22+%22%E6%9C%88%E5%A4%A9%22&type=artist&limit=25&method=advanced