MusicBrainz Search Server / SEARCH-167

Artist search should deal better with artists being entered misspelt into basic artist search

    • Improvement
    • Resolution: Fixed
    • Normal
    • 2012-03-23

      Several search queries seem to return results that are not nearly as good as they could be. It seems to me that some of the techniques described below for improving search results should be applied by default.

      In particular:
      The results of searching for "rudy wiedoeft" do not include "rudy wiedoeft's Californians" or "rudy wiedoeft's palace trio" on the first page; instead they appear on page five. I would expect a single-character difference to appear much higher in the search results.

      Similarly, a search for "rudy wied" should return results for all the above, much closer to the top than page 5.

      And, "rudy green" does not return "rudy greene" anywhere near the top of the results, despite the one-character difference.

      These results can be improved by the following techniques:
      (advanced search):
      (rudy green) OR (rudy* green) OR (rudy green*)

      (rudy wied) OR (rudy* wied) OR (rudy wied*)

      "rudy* green*" gives great results compared to "rudy green"

      A simple spelling mistake can turn a great search result into a terrible one:
      Search for "go-cart mozart", expect to find go-kart mozart.

      The results contain nothing useful.

      Change it to
      (go cart mozart) OR (go* cart mozart) OR (go cart* mozart) OR (go cart mozart*)
      and the results are great, suggesting that hyphens need to be considered word breaks.

      Even simply switching to advanced search on a total misspelling can improve things:

      (simple search): aaron lebedeef does not include the desired artist[1] anywhere near the top of the results (not even on the first page)

      An advanced search for the same returns it as the ninth result.

      Using the previously-described technique improves it even further: (aaron lebedeef) OR (aaron* lebedeef) OR (aaron lebedeef*) returns it as the seventh result.

      Using a fuzzy search on all the above improves many of the above results even further:
      "rudy~ wiedoeft~" is great.
      "rudy~ wied~" is not so great, but no worse.
      "rudy~ green~" is great.
      "go-cart~ mozart~" is not so great, but no worse.
      "aaron~ lebedeef~" is great.

      Combining all techniques works out the best, though:
      (rudy~ wiedoeft~) OR (rudy wiedoeft*) OR (rudy* wiedoeft): great
      (rudy~ wied~) OR (rudy wied*) OR (rudy* wied): great
      (rudy~ green~) OR (rudy green*) OR (rudy* green): great.
      (go~ cart~ mozart~) OR (go*~ cart mozart) OR (go cart*~ mozart) OR (go cart mozart*~): great
      (aaron~ lebedeef~) OR (aaron lebedeef*) OR (aaron* lebedeef): great

      So it seems to me that:
      1. Advanced search should be on by default.
      2. Fuzzy matching of search terms should be on by default.
      3. Hyphens should break words.
      4. Combinations of fuzzy and non-fuzzy matching (appending a wildcard and fuzziness to each of the words separately and ORing that with the fuzzy search on all words) should be performed by default.
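
As a rough illustration of proposals 3 and 4, the combined queries listed above can be generated mechanically from what the user typed. This is only a sketch: the function names `split_terms` and `combined_query` are hypothetical, special characters are not escaped, and it follows the pattern of the two-word examples (fuzzy on all words, ORed with one wildcard variant per word).

```python
import re


def split_terms(query: str) -> list[str]:
    """Split on whitespace and hyphens, so "go-cart" becomes two words."""
    return [t for t in re.split(r"[\s\-]+", query.strip()) if t]


def combined_query(query: str) -> str:
    """Build "(all words fuzzy) OR (one wildcard variant per word)"."""
    terms = split_terms(query)
    if not terms:
        return ""
    # First clause: every word gets the fuzzy operator.
    clauses = ["(" + " ".join(t + "~" for t in terms) + ")"]
    # One further clause per word, with a trailing wildcard on that word only.
    for i in range(len(terms)):
        variant = [t + "*" if j == i else t for j, t in enumerate(terms)]
        clauses.append("(" + " ".join(variant) + ")")
    return " OR ".join(clauses)


print(combined_query("rudy green"))
print(combined_query("go-cart mozart"))
```

For "rudy green" this yields the same three clauses as the hand-written query above, with the wildcard clauses in word order.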

      Reference http://chatlogs.musicbrainz.org/musicbrainz-devel/2011/2011-03/2011-03-09.html#T20-15-07-468551 for conversation where the problems were found and discussed.

      1. http://test.musicbrainz.org/artist/911d1b3b-e93a-4896-9fda-42013b2c8a7e


          Paul Taylor added a comment (edited)

          Fixed and released. The performance problem was caused by my not setting a minimum length that fuzzy matches should be matched to. I also implemented a better rewrite method.

          Paul Taylor added a comment -

          We should also look into using shingles for phrase searches; this would improve the speed of all phrase searches. I don't know if there are any downsides to using phrase searches.

          Paul Taylor added a comment -

          Does seem a bit slow, due to the combination of matching multiple fields and constructing fuzzy and wildcard searches. To improve this:

          1. Don't fuzzy search all fields, e.g. there is no need to fuzzy search a barcode.
          2. Take another look at the multiterm search rewrite method to see if we can get one that performs better but also scores comparably to standard and phrase search.
          3. Consider re-indexing some fields with another analyzer, such as a metaphone analyzer; then maybe we wouldn't need to fuzzy match the field.
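
Point 1 could look something like the sketch below, which appends the fuzzy operator only for fields where a misspelling is plausible. The field names come from this thread (artist, sortname, alias, barcode); whether they match the real index field names, and the helper names `field_clause` and `multi_field_query`, are assumptions made for illustration.

```python
# Assumed set of free-text fields worth fuzzy matching; identifier-like
# fields such as a barcode are searched exactly.
FUZZY_FIELDS = {"artist", "sortname", "alias"}


def field_clause(field: str, term: str) -> str:
    """Return "field:term~" for fuzzy fields, plain "field:term" otherwise."""
    suffix = "~" if field in FUZZY_FIELDS else ""
    return f"{field}:{term}{suffix}"


def multi_field_query(term: str, fields: list[str]) -> str:
    """OR together one clause per field for a single search term."""
    return " OR ".join(field_clause(f, term) for f in fields)


print(multi_field_query("pixies", ["artist", "alias", "barcode"]))
```

Each fuzzy clause expands into many candidate terms, so leaving it off fields like barcode reduces the work done per query.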

          Paul Taylor added a comment -

          Fixed now on test.

          Paul Taylor added a comment -

          Changed title to better reflect the issue.

          Paul Taylor added a comment -

          Actually, I think it would make sense to add another parameter to the search server to let it know whether the query came from standard or advanced search. If it came from standard search, the server would just pass on the user-entered query and let the search server deal with the rewriting; then all the searching logic is kept in one place.
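
A minimal sketch of that arrangement, assuming a boolean flag named `advanced` (borrowed from the `advanced=1` URL parameter seen elsewhere in this ticket). `handle_search` and `rewrite_basic_query` are hypothetical names, and the rewrite shown is only a stand-in for whatever rewriting the search server ends up doing.

```python
def rewrite_basic_query(user_input: str) -> str:
    """Stand-in rewrite: fuzzy-match every word and AND the words together."""
    return " AND ".join(word + "~" for word in user_input.split())


def handle_search(query: str, advanced: bool) -> str:
    """Return the query string that would be run against the index."""
    if advanced:
        # Advanced search: the query is used untouched.
        return query
    # Standard search: all rewriting happens here, in one place.
    return rewrite_basic_query(query)


print(handle_search("rudy green", advanced=False))
print(handle_search("rudy* green", advanced=True))
```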

          Paul Taylor added a comment -

          Moved because the fuzzy search stuff requires work on my part even if the actual implementation will reside on the MusicBrainz server.

          Alex Mauer added a comment -

          1. It is entirely likely that I misunderstand this option. However, I don't see why it would be impossible for the basic search to recognize the various "advanced" operators and pass it as an advanced search in that case. In addition, whatever it is that makes the search better when you do an advanced search with nothing "advanced" in it should happen by default. (a basic search for 'aaron lebedeef' does not include the desired artist anywhere near the top of the results while an advanced search for the same returns it as the ninth result. ) Whether that means "advanced search by default" or "stop doing whatever basic search does to screw it up"...doesn't much matter.

          2. Any improvement is better than nothing, I guess.

          3-4. These techniques provide results that are significantly better than the status quo. In my opinion, MusicBrainz's current search results are pretty awful, and I think it is a mistake to ignore ways to make them better.

          Paul Taylor added a comment -

          1. I think you misunderstand the Advanced Search option. Advanced Search means that the query is passed to the search server untouched. So if this were enabled by default, the MusicBrainz Server wouldn't be able to construct a query from user entry as it does now, and as you are requesting in 2. Secondly, it would mean casual users need an understanding of query syntax, so it doesn't make sense. However, there should be a way to enable Advanced Search from the first page instead of having to do a query and then select the advanced search checkbox in order to do the query you really wanted to do.

          2. I think there is a case for using the fuzzy search operator. Currently, when you enter an artist search it is rewritten to search artist, sortname and alias, but to give a higher score to a perfect match on artist than on sortname. It also prefers phrase matches to matches on single terms, but generally speaking a phrase search will only match if it is an exact match (excepting punctuation and case differences). This could be improved by replacing it with a fuzzy search, ANDing the terms, so instead of
          "rudy green"
          http://musicbrainz.org/search?query=%22rudy+green%22&type=artist&limit=25&advanced=1
          you would get
          rudy~ AND green~
          http://musicbrainz.org/search?query=rudy~++AND+green~&type=artist&limit=25&advanced=1

          However, fuzzy search has problems if the term contains escaped characters (http://tickets.musicbrainz.org/browse/SEARCH-72), and it doesn't make sense to fuzzy search stop words, i.e. short common words like 'the' and 'at'. So there is a bit of extra processing to do.

          Also, just adding ~ at the end allows a match with a similarity of 0.5; I think this returns too many matches. Compare these two queries:
          http://musicbrainz.org/search?query=pixies~&type=artist&limit=25&advanced=1
          and
          http://musicbrainz.org/search?query=pixies~0.7&type=artist&limit=25&advanced=1

          Fuzzy search makes more sense than wildcards, because the matches must sound similar.

          3. Undecided about this, but gut feeling is no.

          4. This is addressed in 2, I think that is as far as it should go.
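
The rewrite described in point 2 could be sketched roughly as follows. The function name `fuzzy_and_query`, the stop-word list (just the two examples given above) and the set of special characters are assumptions for illustration, not the search server's actual code. Terms that need escaping are escaped and left non-fuzzy, stop words are left non-fuzzy, and everything else gets `~0.7`.

```python
# Illustrative stop words only; a real list would be longer.
STOP_WORDS = {"the", "at"}
# Characters assumed to need a backslash escape in Lucene query syntax.
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\')


def fuzzy_and_query(user_input: str, min_similarity: float = 0.7) -> str:
    """Rewrite plain user input into fuzzy terms joined with AND."""
    clauses = []
    # Lower-casing also keeps a literal AND/OR/NOT from acting as an operator.
    for term in user_input.lower().split():
        if term in STOP_WORDS:
            clauses.append(term)  # no point fuzzy-matching 'the' or 'at'
        elif any(ch in LUCENE_SPECIAL for ch in term):
            # Escape the special characters and skip the fuzzy operator.
            clauses.append(
                "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)
            )
        else:
            clauses.append(f"{term}~{min_similarity}")
    return " AND ".join(clauses)


print(fuzzy_and_query("rudy green"))
print(fuzzy_and_query("The Pixies"))
```

Raising `min_similarity` narrows the set of matching terms, which is the difference between the `pixies~` and `pixies~0.7` queries linked above.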


            ijabz Paul Taylor
            hawke Alex Mauer

                Version Package
                2012-03-23