MusicBrainz Search Server / SEARCH-314

Combining diacritics are not handled correctly


    • Type: Bug
    • Resolution: Fixed
    • Priority: Normal
    • 2014-10-08

      Compare https://beta.musicbrainz.org/search?query=b%C3%A4r&type=artist&method=indexed and https://beta.musicbrainz.org/search?query=ba%CC%88r&type=artist&limit=25&method=indexed
      The first one uses U+00E4 LATIN SMALL LETTER A WITH DIAERESIS; the second uses U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS. Unicode considers the two canonically equivalent, i.e. they should look and behave the same, so searching for either should return exactly the same results.
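The equivalence can be checked outside the search server. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper name `canonical_key` is made up for illustration and is not how the search server itself handles input.

```python
import unicodedata

precomposed = "b\u00e4r"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "ba\u0308r"   # U+0061 followed by U+0308 COMBINING DIAERESIS


def canonical_key(text: str) -> str:
    """Hypothetical helper: map canonically equivalent strings to one form (NFC)."""
    return unicodedata.normalize("NFC", text)


# The raw strings differ code point by code point...
print(precomposed == decomposed)
# ...but compare equal once both are normalized to the same form.
print(canonical_key(precomposed) == canonical_key(decomposed))
```

Normalizing both the indexed text and the query to the same form (NFC here, though NFD would work equally well for comparison) is one way to make the two queries behave identically.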

      Also compare https://beta.musicbrainz.org/search?query=%D0%A8%D0%BE%D1%81%D1%82%D0%B0%D0%BA%D0%BE%CC%81%D0%B2%D0%B8%D1%87&type=artist&method=indexed and https://beta.musicbrainz.org/search?query=%D0%A8%D0%BE%D1%81%D1%82%D0%B0%D0%BA%D0%BE%D0%B2%D0%B8%D1%87&type=artist&limit=25&method=indexed
      The first one has a combining diacritic (U+0301 COMBINING ACUTE ACCENT) which is not present at all in the second, but we ignore accents so those should also give the same results.
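"Ignoring accents" could look roughly like the sketch below: decompose the text, drop the combining marks, and recompose. This is an assumption about one possible approach, not the search server's actual filter, and `fold_accents` is a hypothetical name.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Hypothetical helper: decompose, drop combining marks, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


with_accent = "Шостако\u0301вич"   # contains U+0301 COMBINING ACUTE ACCENT
without_accent = "Шостакович"

print(fold_accents(with_accent) == fold_accents(without_accent))
print(fold_accents("b\u00e4r"), fold_accents("ba\u0308r"))
```

Note that folding every combining mark is quite aggressive: it would, for instance, also turn Cyrillic "й" into "и", so a real index may want to be more selective about which marks it drops.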

      What actually happens is that the combining diacritic is treated like a space, so the search matches "ba" and "r" in the first example and "Шостако" and "вич" in the second. A combining diacritic is not a word separator, though, and shouldn't be treated as one.
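The symptom is easy to reproduce with any tokenizer that only accepts letters and digits as token characters. The sketch below uses Python's `re` module purely as a stand-in (it is not the search server's tokenizer); `naive_tokens` and `mark_aware_tokens` are hypothetical names. The second function keeps characters in the Unicode "Mark" categories inside the current token instead of splitting on them.

```python
import re
import unicodedata


def naive_tokens(text: str) -> list[str]:
    """Split on anything that is not a word character.

    Python's \\w does not match combining marks, so they act as separators.
    """
    return re.findall(r"\w+", text)


def mark_aware_tokens(text: str) -> list[str]:
    """Treat combining marks (categories Mn, Mc, Me) as part of the token."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum() or unicodedata.category(ch).startswith("M"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(naive_tokens("ba\u0308r"))             # splits at the combining diaeresis
print(naive_tokens("Шостако\u0301вич"))      # splits at the combining acute
print(mark_aware_tokens("ba\u0308r"))        # one token
print(mark_aware_tokens("Шостако\u0301вич")) # one token
```

With the mark-aware version, each query stays a single token, which can then be normalized or accent-folded before matching.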

            ijabz Paul Taylor
            nikki nikki

                Version Package
                2014-10-08