• Type: Task
    • Resolution: Unresolved
    • Priority: Normal

      Since commit c129087eb0 for MBS-3208 in 2012, non-printable characters have been removed or replaced with plain whitespace, de facto forbidding the use of whitespace variants in newly edited data.

      However, some of these characters were already entered, either before 2012 or in fields that are not trimmed:

      • U+3000 IDEOGRAPHIC SPACE for Japanese, see MBS-5555,
      • U+00A0 NO-BREAK SPACE for French, see MBS-9576.

      There are even more variants: see the whitespace characters in Unicode, excluding line breaks. Most of these characters are used on real releases and, as such, should probably be allowed in the database.
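A minimal sketch of how such variants could be enumerated and located in a string. It assumes that the Unicode general category Zs (space separators) is a reasonable stand-in for "whitespace but line-breaks"; the function names are hypothetical and this is not the MusicBrainz Server code.

```python
import sys
import unicodedata


def whitespace_variants():
    """Return the code points in category Zs other than U+0020 SPACE.

    Assumption: Zs approximates "whitespace characters except line breaks".
    It leaves out tabs and line/paragraph separators, which have other categories.
    """
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if cp != 0x20 and unicodedata.category(chr(cp)) == "Zs"
    ]


def find_variants(text):
    """Return (index, 'U+XXXX', name) for each whitespace variant found in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch != " " and unicodedata.category(ch) == "Zs"
    ]


if __name__ == "__main__":
    for index, code, name in find_variants("Nº\u00a01 \u3000 test"):
        print(index, code, name)
```

The exact set returned by `whitespace_variants()` depends on the Unicode version bundled with the Python interpreter, so it is better treated as a starting point than as a fixed list.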

          [STYLE-1402] Welcome back whitespace variants

          Calvin Walton added a comment - - edited

          Note that in addition to allowing different types of spaces, multiple consecutive spaces should also be permitted (showing a warning or requiring confirmation would be fine). An example where both of these problems are hit is track 3 of https://musicbrainz.org/release/948a66d5-6993-446a-9deb-cc732c5326f4 which is correctly written as

          -       -
          

          (i.e. a dash, 7 ideographic spaces, then another dash).

          I entered this release back before the normalization changes, which is why it's still showing up correctly for the time being - but it's in kind of a fragile state due to the fact that some unrelated tracklist edit might apply normalization.
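To illustrate why such a title is fragile, here is a small sketch. It assumes a hypothetical normalizer that collapses every run of whitespace into one ASCII space and trims the ends; the actual server-side normalization may differ in its details.

```python
import re

IDEOGRAPHIC_SPACE = "\u3000"

# The title as described above: a dash, 7 ideographic spaces, then another dash.
title = "-" + IDEOGRAPHIC_SPACE * 7 + "-"


def naive_normalize(text):
    """Hypothetical normalizer: collapse whitespace runs to one space, then trim.

    For str patterns, \\s also matches Unicode whitespace such as U+3000.
    """
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(len(title), repr(title))
    print(repr(naive_normalize(title)))
```

Under that assumption, both the space variant and the repetition are lost in a single pass, so the 9-character title comes back as a 3-character one.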


          yvanzo added a comment - - edited

          For the record, here is a list of space characters that was recommended for sacrifice on the altar of normalization; text copied from the wiki page Internationalization (imported from MoinMoin in 2005):
           

          There are also ranges of Unicode characters that should be avoided as they do not provide increased range of expression but merely create interoperability issues for those without complete Unicode fonts. In particular, the following should be explicitly prohibited by style guidelines (and perhaps enforced by the database):

          1. Non-breaking space U+00A0
          2. Fullwidth latin and halfwidth kana/hangul U+FF00-FFEF
          3. Narrow non-breaking space U+202F
          4. Ideographic space U+3000
          5. Medium Mathematical space U+205F
          6. Typographic spaces U+2000-200B

          The first [is] not specifically Unicode issue as [it] occur[s] in Latin-1, but [this] plus the [second] are among the ones where database enforcement is most desirable as they can lead to artist names that are visually identical in appearance but which are, in fact, different.
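As a rough illustration of the concern in that quoted text, the sketch below flags characters from the listed ranges and checks whether two different strings become equal after compatibility folding. Using NFKC as a proxy for "visually identical" is an assumption made here, not something the wiki page prescribes, and the helper names are hypothetical.

```python
import unicodedata

# Inclusive code point ranges copied from the quoted wiki list.
DISCOURAGED_RANGES = [
    (0x00A0, 0x00A0),  # non-breaking space
    (0xFF00, 0xFFEF),  # fullwidth latin and halfwidth kana/hangul
    (0x202F, 0x202F),  # narrow non-breaking space
    (0x3000, 0x3000),  # ideographic space
    (0x205F, 0x205F),  # medium mathematical space
    (0x2000, 0x200B),  # typographic spaces
]


def discouraged_chars(text):
    """Return 'U+XXXX' for every character of text that falls in a listed range."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if any(lo <= ord(ch) <= hi for lo, hi in DISCOURAGED_RANGES)
    ]


def look_alike(a, b):
    """True when two different strings are equal after NFKC folding (a proxy only)."""
    nfkc_a = unicodedata.normalize("NFKC", a)
    nfkc_b = unicodedata.normalize("NFKC", b)
    return a != b and nfkc_a == nfkc_b


if __name__ == "__main__":
    print(discouraged_chars("Daft\u00a0Punk"))
    print(look_alike("Daft Punk", "Daft\u00a0Punk"))
```

A check like this could back a warning or a duplicate hint rather than an outright ban, which would keep the original spelling in the database.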


          yvanzo added a comment -

          I don’t know exactly why whitespace variants were disabled in the first place, but I guess that was the easiest way to fix some bugs related to these characters. Since most of the current code assumes these are not allowed, technical issues can be expected if these characters get entered again. Below are the few difficulties that have been reported in various discussions:

          1. Whitespace variants are difficult to type in.
            Allowing variants is not the same as making them mandatory. If you don’t need them, don’t use them. Editors who need them generally know how to enter them easily. This is the same logic as letting others use an alphabet you cannot even read.
          2. Whitespace changes are difficult to detect.
            This can be fixed by showing edit previews for all entity types (MBS-8815) and improving the diff display to render whitespace variants with visible substitutes.
          3. Whitespace variants make search difficult in the database.
            Not with a proper search implementation and proper search queries. If there are still bugs with that, these should be fixed first.
          4. Whitespace variants make search difficult in tagged files.
            Not with a tagger like Picard that has an option to standardize characters (source).
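For point 2, rendering whitespace variants with substitutes could look roughly like the sketch below. The `<U+XXXX>` tag format and the function name are assumptions chosen for illustration; an actual diff display might prefer symbols or highlighting.

```python
import unicodedata


def reveal_whitespace(text):
    """Replace each non-ASCII space separator with a visible <U+XXXX> tag."""
    out = []
    for ch in text:
        if ch != " " and unicodedata.category(ch) == "Zs":
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    old = "Nº\u00a01"  # with U+00A0 NO-BREAK SPACE
    new = "Nº 1"       # with a plain U+0020 SPACE
    print("old:", reveal_whitespace(old))
    print("new:", reveal_whitespace(new))
```

With a substitute like this, an edit that only swaps one kind of space for another no longer looks like a no-op to the reviewer.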

          To sum up, I think we should first document that whitespace variants cannot currently be entered, then create tickets to overcome the above-mentioned difficulties, and finally recommend using whitespace variants again once these tickets have been resolved.

          Many related bugs have been reported or fixed recently, so it would be good to make a decision now that covers all further code changes.


            reosarevok Nicolás Tamargo
            yvanzo yvanzo