  2. PICARD-2716

Accept encodings other than UTF-8 when opening CD extraction logs

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Normal
    • Fix Version: 2.12
    • Affects Version: 2.8.5
    • Component: Lookup & Match
    • Labels: None
    • Environment: Debian GNU/Linux

      When opening CD extraction logs, Picard seems to expect the file to be UTF-8 encoded. If the file uses an ISO encoding, Picard fails to read it with an error such as "utf8 codec can't decode byte 0xce in position...".

      While I agree that it's 2023 and everyone should be using UTF-8, some people still don't. It would be great if the file encoding were detected instead of assumed to be UTF-8, because having to convert log files to UTF-8 before opening them is a bit annoying.
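
For illustration, the failure described above can be reproduced with the standard library alone. This is only a sketch: the sample word is a made-up stand-in for the contents of a real log, and windows-1251 is used because it happens to encode the first letter as byte 0xce.

```python
# Hypothetical sample text, encoded the way a non-UTF-8 ripper log might be.
data = "Ошибка".encode("windows-1251")

error = None
try:
    # Strict UTF-8 decoding, as happens when the encoding is hardcoded.
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the bytes were written in works as expected.
text = data.decode("windows-1251")
```

The first byte (0xce) is a valid UTF-8 lead byte, but the byte after it is not a valid continuation byte, so strict decoding stops at position 0.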

          GitHub Bot added a comment -

          See code changes in pull request #2372 submitted by ShubhamBhut.

          Zas added a comment -

          kailorston: Yes, I think you understood the issue correctly. The solution is likely to use the return value of `detect_unicode_encoding()` instead of the hardcoded UTF-8 encoding, as in the other source files you pointed to.
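
The general pattern suggested here (ask a detector for the encoding, then pass the result to `open()`) might look roughly like the sketch below. It does not reproduce Picard's code: `read_log_lines` is a hypothetical helper, and the detector is injected as a plain callable so that no assumption is made about the actual signature of `detect_unicode_encoding()`.

```python
import os
import tempfile
from typing import Callable, List


def read_log_lines(path: str, detect: Callable[[str], str]) -> List[str]:
    """Read a log file using whatever encoding `detect(path)` reports."""
    encoding = detect(path)  # replaces a hardcoded "utf-8"
    with open(path, encoding=encoding) as f:
        return f.read().splitlines()


# Usage with a made-up windows-1251 log and a stand-in detector.
with tempfile.NamedTemporaryFile(suffix=".log", delete=False) as tmp:
    tmp.write("Дорожка 01\n".encode("windows-1251"))
lines = read_log_lines(tmp.name, detect=lambda _path: "windows-1251")
os.remove(tmp.name)
print(lines)
```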

          kailorston added a comment -

          Hey, I would like to work on this issue.

          Actually, it is shown here that Python supports a large number of encodings by default, including windows-1251.

          I am quite new to the Picard codebase, so I have some uncertainties regarding the issue at hand. Here's my interpretation and proposed solution:

          • Currently, the detect_unicode_encoding function relies on the Byte Order Mark (BOM) for encoding detection. However, many older encodings, such as Windows-1251 and Windows-1252, don't use a BOM. I am considering using the chardet library for encoding detection instead.
          • I believe only the toc_from_file function in picard/disc/whipperlog.py needs to be updated to support other encodings (similar to toc_from_file in eaclog.py and dbpoweramplog.py). Additional test cases (including the attached log file) should also be added for the updated function.

          Please let me know whether I have understood the problem correctly or have missed any part of the solution.
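
To make the BOM limitation mentioned above concrete, here is a minimal, standard-library sketch of BOM sniffing. It is not Picard's `detect_unicode_encoding`; `sniff_bom` and its default value are assumptions made for illustration only.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM starts with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(raw: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on a leading BOM, or `default` if none is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


utf16_guess = sniff_bom(codecs.BOM_UTF16_LE + "log".encode("utf-16-le"))
legacy_guess = sniff_bom("Ошибка".encode("windows-1251"))
print(utf16_guess, legacy_guess)
```

Because single-byte legacy encodings carry no BOM, the windows-1251 sample falls through to the default, which is why BOM detection alone cannot handle such files.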

          Zas added a comment -

          Using `chardet`:

          `Vapor Trails.log`

          `{'encoding': 'windows-1251', 'confidence': 0.9632708397017065, 'language': 'Russian'}`
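
A result like the one above can be obtained with `chardet.detect()`, which takes raw bytes and returns a dict with `encoding`, `confidence` and `language` keys. The sketch below shows one way such a guess could feed into reading a log; `guess_encoding` and `read_log_text` are hypothetical helpers, chardet is treated as optional, and the fallback to UTF-8 is an assumption rather than Picard's behaviour.

```python
import os
import tempfile

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None


def guess_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return chardet's best guess, or `default` if chardet is missing or unsure."""
    if chardet is None:
        return default
    result = chardet.detect(raw)
    return result["encoding"] or default  # 'encoding' may be None


def read_log_text(path: str) -> str:
    """Read a file as text using the guessed encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode(guess_encoding(raw), errors="replace")
    except LookupError:
        # chardet named a codec that this Python build does not provide.
        return raw.decode("utf-8", errors="replace")


# Usage with a made-up ASCII-only log line.
with tempfile.NamedTemporaryFile(suffix=".log", delete=False) as tmp:
    tmp.write(b"Track 01 | 0:00.00\n")
content = read_log_text(tmp.name)
os.remove(tmp.name)
print(content)
```

Statistical detection is a guess, not a guarantee, so short or ambiguous files may still be misidentified; that is why the sketch decodes with `errors="replace"` instead of failing.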

          Philipp Wolfer added a comment -

          Which encoding does this file actually have?

          csaavedra added a comment -

          Attached.

          Zas added a comment -

          Can you attach one of the files you got that triggered the issue?

            Assignee: Unassigned
            Reporter: csaavedra