Uploaded image for project: 'Picard'
  1. Picard
  2. PICARD-803

tagging "8½ Minutes" with "replace with non-ascii characters" results in a directory being created

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Normal Normal
    • 2.0.4
    • None
    • None
    • None
    • OS X, Picard 1.3.2

      track: http://musicbrainz.org/recording/ada94805-940e-44ca-b3a9-3e4602626c43

      Tagging this track resulted in Picard attempting to name the file "11 81/2 Minutes.flac", which turned into a directory named "11 81" with a file named "2 Minutes.flac" inside it.

      Looks like Picard can catch a "/" and replace it with "_" when it's actually in the string, but the replacement of "1/2" for "½" is slipping past that filter.

      Expected behavior: Picard replaces "½" with "1/2" and then replaces "1/2" with "1_2" when it's written as a filename. If the "½" to "1/2" replacement never happens outside of the filename context (I'm not sure), then just change that substitution to "1_2".

          [PICARD-803] tagging "8½ Minutes" with "replace with non-ascii characters" results in a directory being created

          Philipp Wolfer added a comment - https://github.com/metabrainz/picard/pull/929

          Philipp Wolfer added a comment - - edited

          I cannot reproduce this at all, at least not in Picard 2.0.

          Not sure if something changed or not, but actually slashes in metadata should never cause folder to being created (and that statement was true for 1.3.2). It's the reason why you cannot convert "/" and "\" to something else in the naming script, because before the metadata gets passed to the script it gets sanitized by replacing / and \ with _, see https://github.com/metabrainz/picard/blob/master/picard/file.py#L310

          --The only thing that I could think about is that the replacement is done actually with tagger script inside the renamig script, e.g. by using $replace(8½,8 1/2). But in that case it would be expected behavior, because it is totally valid to generate path separators with script.--

          Should we close this?

           

          Scratch this, this is reproducible if you disable Metadata > Convert Unicode punctuation characters to ASCII, but enable File Naming > Replace non-ASCII characters. I will submit a PR

          Philipp Wolfer added a comment - - edited I cannot reproduce this at all, at least not in Picard 2.0. Not sure if something changed or not, but actually slashes in metadata should never cause folder to being created (and that statement was true for 1.3.2). It's the reason why you cannot convert "/" and "\" to something else in the naming script, because before the metadata gets passed to the script it gets sanitized by replacing / and \ with _, see https://github.com/metabrainz/picard/blob/master/picard/file.py#L310 --The only thing that I could think about is that the replacement is done actually with tagger script inside the renamig script, e.g. by using $replace(8½,8 1/2). But in that case it would be expected behavior, because it is totally valid to generate path separators with script.-- Should we close this?   Scratch this, this is reproducible if you disable Metadata > Convert Unicode punctuation characters to ASCII, but enable File Naming > Replace non-ASCII characters. I will submit a PR

          Zas added a comment -

          I think Philipp is working on it.

          Zas added a comment - I think Philipp is working on it.

          Philipp Wolfer added a comment - - edited

          We have to escape "/" (or "\" depending on platform) before writing any file, and take care of full paths, ie. "/a/b/8½.ogg" -> "/a/b/81\/2.ogg". To be safe full path conversion has to split path elements and convert individual elements, taking care of escaping the separator character, i didn't review the code in details (which is rather complex, especially with all those options) but this bug report shows we don't escape "/" as it should.

          There is something fishy about this. The line you point to just changes the metadata, and yes, of course there ½ will also be changed. But actually we safe guard against slashes in the metadata before generating the filename:

          https://github.com/metabrainz/picard/blob/master/picard/file.py#L254

          So if the slash is in the metadata before that it should actually not survive. But replace_non_ascii also changes ½ to 1/2, but now this would mean that the metadata was actually not changed before. I have not yet debugged this inside Picard, but trying to fix the slash problem inside track.py seems to be wrong. There should be no problem having slashes in the tags.

          Philipp Wolfer added a comment - - edited We have to escape "/" (or "\" depending on platform) before writing any file, and take care of full paths, ie. "/a/b/8½.ogg" -> "/a/b/81\/2.ogg". To be safe full path conversion has to split path elements and convert individual elements, taking care of escaping the separator character, i didn't review the code in details (which is rather complex, especially with all those options) but this bug report shows we don't escape "/" as it should. There is something fishy about this. The line you point to just changes the metadata, and yes, of course there ½ will also be changed. But actually we safe guard against slashes in the metadata before generating the filename: https://github.com/metabrainz/picard/blob/master/picard/file.py#L254 So if the slash is in the metadata before that it should actually not survive. But replace_non_ascii also changes ½ to 1/2, but now this would mean that the metadata was actually not changed before. I have not yet debugged this inside Picard, but trying to fix the slash problem inside track.py seems to be wrong. There should be no problem having slashes in the tags.

          Zas added a comment -

          @Philipp Wolfer : during my tests it was pretty obvious the change occured at https://github.com/metabrainz/picard/blob/master/picard/track.py#L170 printing debug output before and after this line

          `replace_non_ascii()` is calling `unicode_simplify_punctuation()`, see https://github.com/metabrainz/picard/blob/master/picard/util/textencoding.py#L431

          8½ is replaced by 81/2 because of https://github.com/metabrainz/picard/blob/master/picard/util/textencoding.py#L384 but the same goes for anything replaced by something with "/" in it.

          We have to escape "/" (or "\" depending on platform) before writing any file, and take care of full paths, ie. "/a/b/8½.ogg" -> "/a/b/81\/2.ogg".
          To be safe full path conversion has to split path elements and convert individual elements, taking care of escaping the separator character, i didn't review the code in details (which is rather complex, especially with all those options) but this bug report shows we don't escape "/" as it should.

          Feel free to propose a fix

          Zas added a comment - @Philipp Wolfer : during my tests it was pretty obvious the change occured at https://github.com/metabrainz/picard/blob/master/picard/track.py#L170 printing debug output before and after this line `replace_non_ascii()` is calling `unicode_simplify_punctuation()`, see https://github.com/metabrainz/picard/blob/master/picard/util/textencoding.py#L431 8½ is replaced by 81/2 because of https://github.com/metabrainz/picard/blob/master/picard/util/textencoding.py#L384 but the same goes for anything replaced by something with "/" in it. We have to escape "/" (or "\" depending on platform) before writing any file, and take care of full paths, ie. "/a/b/8½.ogg" -> "/a/b/81\/2.ogg". To be safe full path conversion has to split path elements and convert individual elements, taking care of escaping the separator character, i didn't review the code in details (which is rather complex, especially with all those options) but this bug report shows we don't escape "/" as it should. Feel free to propose a fix

          @Zas: Are you sure this is caused by asciipunct? Isn't that applied to the metadata directly, which in theory should be made safe for use in filenames at https://github.com/metabrainz/picard/blob/master/picard/file.py#L254 ?

          From looking at the code I think it is actually replace_non_ascii called at https://github.com/metabrainz/picard/blob/master/picard/file.py#L262 that is causing the issues. The proper solution then is maybe calling this on the metadata in the loop just before sanitize_filename. Or actually splitting the filename at \ and / and then performing replace_non_ascii() followed by sanitize_filename() on the individial path components.

          Something like this (but probably nicer than this one liner):

          [code]
          filename = "/".[sanitize_filename(replace_non_ascii) for x in re.split("[/\\\\]", filename)]
          [/code]

          Philipp Wolfer added a comment - @Zas: Are you sure this is caused by asciipunct? Isn't that applied to the metadata directly, which in theory should be made safe for use in filenames at https://github.com/metabrainz/picard/blob/master/picard/file.py#L254 ? From looking at the code I think it is actually replace_non_ascii called at https://github.com/metabrainz/picard/blob/master/picard/file.py#L262 that is causing the issues. The proper solution then is maybe calling this on the metadata in the loop just before sanitize_filename. Or actually splitting the filename at \ and / and then performing replace_non_ascii() followed by sanitize_filename() on the individial path components. Something like this (but probably nicer than this one liner): [code] filename = "/".[sanitize_filename(replace_non_ascii ) for x in re.split(" [/\\\\]", filename)] [/code]

          Stephen added a comment -

          Yeah, my use case is that I have my music on one system which is sharing it over NFS to another. Both ends support Unicode filenames, but I've had some issues in the past anyway (I think related to different Unicode normalization schemes across the three components there). It's safer for me to make sure everything is ascii on disc and keep Unicode in the tags.

          ASCII on disc also makes it nicer while working from the command line, since I don't end up with any characters I can't type easily. Substitutions make it more meaningful, although of course music named in different character sets are just going to end up "01 __________.flac" and there's nothing to be done about it.

          Stephen added a comment - Yeah, my use case is that I have my music on one system which is sharing it over NFS to another. Both ends support Unicode filenames, but I've had some issues in the past anyway (I think related to different Unicode normalization schemes across the three components there). It's safer for me to make sure everything is ascii on disc and keep Unicode in the tags. ASCII on disc also makes it nicer while working from the command line, since I don't end up with any characters I can't type easily. Substitutions make it more meaningful, although of course music named in different character sets are just going to end up "01 __________.flac" and there's nothing to be done about it.

          Zas added a comment -

          I don't think about removing it for now, but instead disabling it by default (it is for now).
          I think about post processing the converted string, search and replace "/" and "\", by "_". It would not be very fancy, but at least safer.

          Zas added a comment - I don't think about removing it for now, but instead disabling it by default (it is for now). I think about post processing the converted string, search and replace "/" and "\", by "_". It would not be very fancy, but at least safer.

          Well, I agree it's impossible to intelligently convert all unicode chars (and some of the replacements there might be dubious), but I think a lot are sensible. Personally I use the conversion so that the metadata in my files matches what Spotify has (they use ASCII, except where there isn't a substitute) and doesn't break my playcounts on last.fm. Otherwise I'd be glad to use no replacements at all. The conversion was an oft-requested feature when it was added (but it mainly just did quotes when it first came out), so I wouldn't remove it outright.

          Michael Wiencek added a comment - Well, I agree it's impossible to intelligently convert all unicode chars (and some of the replacements there might be dubious), but I think a lot are sensible. Personally I use the conversion so that the metadata in my files matches what Spotify has (they use ASCII, except where there isn't a substitute) and doesn't break my playcounts on last.fm. Otherwise I'd be glad to use no replacements at all. The conversion was an oft-requested feature when it was added (but it mainly just did quotes when it first came out), so I wouldn't remove it outright.

          Zas added a comment -

          I found the source of the issue, but i'm not sure how to fix it.

          While testing, i noticed 8½ is replaced by 81/2 even before saving, the change comes from the call to `asciipunct()`, at
          https://github.com/metabrainz/picard/blob/master/picard/track.py#L170

          When looking at https://github.com/metabrainz/picard/blob/master/picard/util/textencoding.py#L421 , one can noticed we have many replacements using "/", check for example https://github.com/metabrainz/picard/blob/master/picard/util/textencoding.py#L378-L402

          So the issue concerns all unicode characters which are converted to anything with a slash (or even anti-slash on certain plaforms).
          Of course, this doesn't happen when the option "convert_punctuation" is disabled.

          During my test, i used the whole album, https://musicbrainz.org/release/433caf55-ee9d-4ee4-a33d-fce4ba3355c6
          and track 6 is titled "I ♥ a Magician" which is left untouched by asciipunct(), while track 11's title "8½ Minutes" is converted to "81/2 Minutes"...

          I wonder why we even try to convert unicode characters this way, either we should convert all, or none, partial conversion doesn't make sense to me.

          So, i don't know how to fix this issue, we should ensure we never end with "/" in filename to start with, but it looks to me the whole conversion attempt is meant to fail, unicode to ascii is just something we should avoid, especially nowadays, unicode is now widely supported.

          What do you think ?

          Zas added a comment - I found the source of the issue, but i'm not sure how to fix it. While testing, i noticed 8½ is replaced by 81/2 even before saving, the change comes from the call to `asciipunct()`, at https://github.com/metabrainz/picard/blob/master/picard/track.py#L170 When looking at https://github.com/metabrainz/picard/blob/master/picard/util/textencoding.py#L421 , one can noticed we have many replacements using "/", check for example https://github.com/metabrainz/picard/blob/master/picard/util/textencoding.py#L378-L402 So the issue concerns all unicode characters which are converted to anything with a slash (or even anti-slash on certain plaforms). Of course, this doesn't happen when the option "convert_punctuation" is disabled. During my test, i used the whole album, https://musicbrainz.org/release/433caf55-ee9d-4ee4-a33d-fce4ba3355c6 and track 6 is titled "I ♥ a Magician" which is left untouched by asciipunct(), while track 11's title "8½ Minutes" is converted to "81/2 Minutes"... I wonder why we even try to convert unicode characters this way, either we should convert all, or none, partial conversion doesn't make sense to me. So, i don't know how to fix this issue, we should ensure we never end with "/" in filename to start with, but it looks to me the whole conversion attempt is meant to fail, unicode to ascii is just something we should avoid, especially nowadays, unicode is now widely supported. What do you think ?

            outsidecontext Philipp Wolfer
            tungolcraft Stephen
            Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved:

                Version Package
                2.0.4