
Reimplement the threading code in picard

    • Type: Epic
    • Resolution: Unresolved
    • Priority: Normal
    • Version: 1.4

      According to the Qt docs, only UI tasks should be processed in the main thread; everything else should be done in the background.

      The current threading code in Picard is a bit hacky in this area, and since we are shifting to Qt 5 with 2.0, I think this is one of the major areas Picard needs to improve. We can leverage the Qt 5 threading, timer, and concurrent modules to allow for better multitasking.
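      As a rough illustration of that direction (not Picard's actual code), a background job can be pushed onto Qt's global thread pool and report back to the main thread via signals. The Worker/WorkerSignals classes and the load_file/on_file_loaded callables below are hypothetical:

```python
# Minimal sketch of off-main-thread work with PyQt5's QThreadPool, assuming
# the surrounding application already runs a Qt event loop.
from PyQt5.QtCore import QObject, QRunnable, QThreadPool, pyqtSignal, pyqtSlot


class WorkerSignals(QObject):
    finished = pyqtSignal(object)   # carries the result of the background work
    error = pyqtSignal(object)      # carries the raised exception


class Worker(QRunnable):
    def __init__(self, func, *args, **kwargs):
        super().__init__()
        self.func = func
        self.args = args
        self.kwargs = kwargs
        self.signals = WorkerSignals()

    @pyqtSlot()
    def run(self):
        try:
            result = self.func(*self.args, **self.kwargs)
        except Exception as e:
            self.signals.error.emit(e)
        else:
            self.signals.finished.emit(result)


# Usage sketch: schedule work off the main thread; the connected slot runs in
# the main thread thanks to Qt's queued cross-thread connections.
# worker = Worker(load_file, "/path/to/file.mp3")   # load_file is hypothetical
# worker.signals.finished.connect(on_file_loaded)   # on_file_loaded is hypothetical
# QThreadPool.globalInstance().start(worker)
```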

          [PICARD-975] Reimplement the threading code in picard

          Sophist added a comment -

          @outsidecontext Philipp

          Actually, it may be that we should stop passing (synchronous) callbacks at all, and instead switch to (asynchronous) queues (and possibly signals).

          The main issue with this is that it would introduce a significant backwards incompatibility unless we continue to allow plugins to use callbacks in the same way they do now.


          Philipp Wolfer added a comment -

          Thanks for the comment. Not much to add here, as I actually agree with that proposal, especially the splitting into smaller, independent tasks that then run in the proper thread pool based on whether they are I/O or CPU bound.

          Only one note about implementation: with the callback-based approach we currently have in Qt, splitting actions up into smaller tasks that are supposed to run separately in threads is quite cumbersome. I especially came across this when reverting the change to run File.update in a separate thread (https://github.com/metabrainz/picard/pull/2253).

          Generally the original change, which got reverted in the PR above, improved responsiveness quite significantly. It's one of those CPU-intensive things that greatly benefit from running off the main thread (it ran on the single-thread priority thread pool). But in the Picard code we currently need, in some cases, to run code after several parallel File.update calls have all finished, and synchronising callback-based async tasks this way is a PITA.

          It would be great if we refactored Picard to base all those async operations on futures. Ideally we would have some kind of generic task class that can be run and queued up, and it would then not even matter whether the task runs on a separate thread or as some kind of async I/O on the main thread (like e.g. the network requests).
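          A minimal sketch of what such a futures-based helper could look like, assuming Python's standard concurrent.futures; the pool sizes, the submit_task helper, and the files/refresh_album_view names are made up for illustration:

```python
# Generic "submit a task, get a Future back" sketch. Synchronising several
# parallel tasks then reduces to waiting on their futures.
from concurrent.futures import ThreadPoolExecutor, wait

IO_POOL = ThreadPoolExecutor(max_workers=8, thread_name_prefix="io")
CPU_POOL = ThreadPoolExecutor(max_workers=1, thread_name_prefix="cpu")


def submit_task(func, *args, cpu_bound=False, **kwargs):
    """Queue func on the pool matching its workload and return a Future."""
    pool = CPU_POOL if cpu_bound else IO_POOL
    return pool.submit(func, *args, **kwargs)


# "Run code after several parallel File.update calls have all finished"
# becomes a plain wait on the collected futures:
# futures = [submit_task(f.update, cpu_bound=True) for f in files]  # files is hypothetical
# wait(futures)
# refresh_album_view()  # hypothetical follow-up step
```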

           


          Sophist added a comment -

          I think it might be helpful for Philipp to summarize the #2172 discussion here. But to take this a little further, perhaps we should be looking at micro-threads, i.e. splitting a single file load into several parts that run in different thread pools according to whether they are CPU or I/O intensive.

          (Note: Apologies for this being a bit of an unorganised brain dump.)

          In essence, in current Python with a GIL, you can have as many parallel I/O threads as you need to optimise I/O, but only a single CPU-bound thread can execute at any one time. And because of the poor way Python decides which thread runs next (competitive/random rather than deterministic), attempting to run more than one CPU-bound thread can create major lags on the main thread.

          That is why we have separate threadpools in Picard for I/O and CPU workloads.

          At present I have a feeling that we e.g. queue a file load onto an I/O worker thread, and this first does the actual I/O (via Mutagen, which includes some processing too), but we then also do e.g. the old vs. new comparison of metadata on that I/O thread, whereas we should probably do it on a CPU thread. Thus we should perhaps look at splitting e.g. a single file load into multiple smaller pieces that are individually scheduled onto the appropriate thread.

          The #2172 discussion also talks about doing the directory scans on an I/O thread, and in general terms we should probably give these priority (because they load up the file queue). So the priority order for the I/O queue would be: directory scans, then the Mutagen parts of file saves, then the Mutagen parts of file loads.

          The priority order for the CPU queue would be e.g.: first the CPU-intensive preparation (if any) for file saves, then the processing of MB web service API responses, then the CPU-intensive processing for file loads, then the CPU-intensive parts of e.g. matching the files in a release to the tracks in that release, etc.

          As you can see, we would need to maintain several queues of micro-tasks and service them in priority sequence.
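          One possible shape for such a priority-serviced queue, sketched with Python's standard queue module; the PRIO_* constants and the enqueue_io/io_worker names are hypothetical and only mirror the ordering suggested above:

```python
# One priority queue per workload class, with worker threads that always
# take the highest-priority micro-task next (lower number = higher priority).
import itertools
import queue
import threading

PRIO_DIR_SCAN, PRIO_SAVE_IO, PRIO_LOAD_IO = 0, 1, 2

io_queue = queue.PriorityQueue()
_counter = itertools.count()  # tie-breaker so equal priorities stay FIFO


def enqueue_io(priority, func, *args):
    io_queue.put((priority, next(_counter), func, args))


def io_worker():
    while True:
        priority, _, func, args = io_queue.get()
        try:
            func(*args)
        finally:
            io_queue.task_done()


threading.Thread(target=io_worker, daemon=True).start()
```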

          The `sleep` option put forward by Gabriel Carver would seem to need to apply only to CPU threads, but it would be exactly what is needed to allow the main thread to get some CPU time to update the GUI and keep it lively. We should perhaps also consider sleeping on a CPU-intensive thread every so many iterations of a loop, so as to give the GUI more update slots, or even break such a task into several micro-tasks at the outset so that other, higher-priority micro-tasks get a look-in. (Or you simply start a single CPU task, do a bunch of iterations, then queue yourself onto the head of the appropriate CPU-intensive queue to do the next chunk, allowing other, higher-priority types of CPU task to get a look-in.)
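          The "do a bunch of iterations, then queue yourself for the next chunk" idea could look roughly like this; enqueue_cpu mirrors the hypothetical I/O helper above, and do_cpu_intensive_work stands in for the real per-item processing:

```python
# Process a bounded chunk, then re-queue the remainder so higher-priority
# CPU micro-tasks (and GUI updates) get a chance to run in between.
import itertools
import queue

PRIO_LOAD_CPU = 2
CHUNK_SIZE = 200

cpu_queue = queue.PriorityQueue()
_cpu_counter = itertools.count()


def enqueue_cpu(priority, func, *args):
    cpu_queue.put((priority, next(_cpu_counter), func, args))


def do_cpu_intensive_work(item):
    pass  # placeholder for e.g. metadata comparison


def process_in_chunks(items, start=0):
    end = min(start + CHUNK_SIZE, len(items))
    for item in items[start:end]:
        do_cpu_intensive_work(item)
    if end < len(items):
        # Re-queue the remainder instead of looping to completion.
        enqueue_cpu(PRIO_LOAD_CPU, process_in_chunks, items, end)
```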

          We might also need some mechanism for making exceptions to the strict priority sequencing on the I/O threads when the highest-priority items are I/O-only, with no CPU-intensive workload, because these will leave the CPU-intensive thread idle. So if e.g. we have one or more directory-read micro-tasks AND one or more file-load tasks, then before dispatching the first of the directory-read tasks we might want to check whether the CPU-intensive thread is idle and, if so, dispatch a file-load item instead.

          Finally, as an entirely separate brainstorm idea, we might want to think about whether any of the CPU-intensive tasks (like metadata summarisation or old/new metadata comparison) could be done with NumPy, because NumPy runs without the GIL and so can actually utilise multi-core processors (and it uses compiled rather than interpreted Python code, so it runs faster anyway). My initial gut reaction, without having done the detailed analysis, is that some of them probably can, in which case this might be worth a detailed look. P.S. To avoid the overhead of converting e.g. `metadata` objects into NumPy objects and then converting them back again, we might also need to consider holding such data in a NumPy-ready form (in addition to, or instead of, the current form).
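          A tiny, untested illustration of what a vectorized old-vs-new comparison might look like if tag values were already held in NumPy-ready form (whether this actually pays off, given the conversion overhead mentioned above, is an open question):

```python
# Element-wise old/new metadata diff as a single vectorized operation
# instead of a Python-level loop over tags.
import numpy as np

tags = np.array(["artist", "album", "date"])
old_values = np.array(["Artist A", "Album X", "2001"])
new_values = np.array(["Artist A", "Album Y", "2001"])

changed = old_values != new_values   # boolean mask of modified tags
print(tags[changed])                 # -> ['album']
```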

          And apologies again for the braindump.


          Philipp Wolfer added a comment -

          > Is this ticket related to https://github.com/metabrainz/picard/pull/2172 maybe?

          Yes, it is related. This ticket is in itself a bit odd: it exists more for discussion of the general issue of optimizing threading. It is very likely that improvements will then result in separate, smaller tickets addressing specific issues.

          But this ticket is not really related to your fingerprinting issue in PICARD-1945, as the fingerprinting uses fpcalc and is separate from the internal threading code used by Picard.


          Sophist added a comment -
          1. You can't really "stop multi-threading" in Picard without major consequences, like the GUI being completely unresponsive during file activities!
          2. However, AcoustID (I think) works by running a separate executable in the background, and there are various ways to stop this from grinding the CPU to a halt (general comments; I haven't checked what happens inside Picard at the moment; see the sketch below):
            1. Limit the number of copies of AcoustID that run simultaneously;
            2. Run these at a lower CPU priority than normal.
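          For illustration only (this is not what Picard's AcoustID code does today): both ideas can be combined by gating fpcalc launches behind a semaphore and starting the processes at reduced CPU priority. The MAX_FPCALC value and the bare fpcalc invocation are placeholders:

```python
# Cap concurrent fpcalc processes and lower their CPU priority.
import os
import subprocess
import sys
import threading

MAX_FPCALC = 2
_slots = threading.BoundedSemaphore(MAX_FPCALC)


def run_fpcalc(path):
    with _slots:  # at most MAX_FPCALC copies run at once
        kwargs = {}
        if sys.platform == "win32":
            kwargs["creationflags"] = subprocess.BELOW_NORMAL_PRIORITY_CLASS
        else:
            kwargs["preexec_fn"] = lambda: os.nice(10)  # lower priority on POSIX
        return subprocess.run(["fpcalc", path], capture_output=True, **kwargs)
```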

           


          jesus2099 added a comment -

          Sorry, I don't understand everything, but are you going to stop multi-threading the file reading and the AcoustID scanning?
          It would be nice when reading a CD drive directly (PICARD-1945).

          Is this ticket related to https://github.com/metabrainz/picard/pull/2172 maybe?


          Gabriel Ferreira added a comment - edited

          Started working on a PR for PICARD-840, but ended up touching on this one and PICARD-936. I replaced the QThreads of the loading pipeline with Python threads, which are way better in terms of instrumentation for debugging/fine tuning. The UI feels very responsive during load, but I still need to figure out how to measure that objectively.

          The threads and the worker process remain blocked waiting on their input queues, which get fed either by user interaction or during exit (just to wake them up). Threads further along the pipeline get slowed down until the queues from the previous threads have been emptied.
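          A stripped-down version of that pattern (this is not the PR's actual code; read_file and the queue names are stand-ins): each stage blocks on its input queue, and a sentinel pushed at exit wakes it so it can shut down cleanly:

```python
# Pipeline stage as a plain Python thread blocked on its input queue.
import queue
import threading

_SENTINEL = object()
load_queue = queue.Queue()
parse_queue = queue.Queue()


def read_file(path):
    return path  # stand-in for the actual Mutagen read


def load_stage():
    while True:
        item = load_queue.get()          # blocks until fed by the UI or by exit
        if item is _SENTINEL:
            parse_queue.put(_SENTINEL)   # propagate shutdown down the pipeline
            break
        parse_queue.put(read_file(item))


threading.Thread(target=load_stage, daemon=True).start()
# On exit: load_queue.put(_SENTINEL) wakes the blocked thread so it can stop.
```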

          Tested on both Windows 10 and Ubuntu. In terms of load times, things are basically unchanged (compared to the Picard version with the first commit of the same PR, which speeds up load times by ~3x). I'm having issues with pyinstaller --onefile, but that is it.
          The PR components relevant to this issue currently look like this:


          Sophist added a comment -

          I agree that GUI update calls need to be reduced. But in the end, that is a symptom of the wider architectural issue.

          Having been here before, and having put a lot of effort into fine tuning, I genuinely believe that effort is best put into re-architecting for multi-processing. That could be done on an incremental basis:

          • Add multi-processing capability to Picard as additional functionality. 
          • Migrate functionality bit by bit to sub-processes.

          I don't underestimate the difficulties of adding multi-processing. It has to work running from source and from the executable, and on Linux and Windows. But it will avoid most of the fine-tuning effort you are currently putting into #1531, and if we can achieve it, it will put us on a whole new trajectory for Picard performance and stability.
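          A hedged sketch of the first incremental step, using the standard multiprocessing module; compare_metadata is just a placeholder for some CPU-bound piece of work, and the freeze_support() guard is what frozen Windows builds would additionally require:

```python
# Offload one CPU-bound function to a process pool while everything else
# stays on the main process.
from multiprocessing import Pool, freeze_support


def compare_metadata(pair):            # placeholder CPU-bound worker
    old, new = pair
    return sum(1 for k in old if old.get(k) != new.get(k))


def main():
    pairs = [({"album": "X"}, {"album": "Y"})]
    with Pool(processes=2) as pool:
        print(pool.map(compare_metadata, pairs))


if __name__ == "__main__":
    freeze_support()                   # needed for frozen Windows executables
    main()
```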


          Gabriel Ferreira added a comment -

          I tried a dispatcher to reduce the rate of tasks run in the worker threads, by modifying run_task so that it enqueues the runnable and priority instead of forwarding the task to the thread pool. Results were meh.

          After profiling and testing, everything points in the same direction: the problem is the UI, not the threading. Everything is linked to the UI and calls for updates way too frequently. During load alone, there are at least 3 UI update calls per song. Assuming each update triggers a redraw of the UI, 10-20 songs per second means 30-60 redraws per second. If each redraw takes 100 ms, then it takes 3-6 seconds to update 1 second of processing.
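          One way to attack that particular symptom, sketched under the assumption that it runs inside Picard's Qt event loop (request_update and refresh_view are hypothetical names), is to coalesce update requests and flush them on a timer, so many per-file changes cost a single repaint per tick:

```python
# Batch UI update requests and repaint at most a few times per second.
from PyQt5.QtCore import QTimer

_pending = set()
_flush_timer = QTimer()
_flush_timer.setInterval(250)            # flush roughly 4 times per second


def refresh_view(items):
    pass  # stand-in for whatever actually repaints the affected rows


def request_update(item):
    _pending.add(item)
    if not _flush_timer.isActive():
        _flush_timer.start()


def _flush():
    items = list(_pending)
    _pending.clear()
    _flush_timer.stop()
    refresh_view(items)                  # one batched repaint per tick


_flush_timer.timeout.connect(_flush)
```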

          I want to rewire the entire UI when I have the time for that (e.g. entries in UI lists could be simple list entries that just schedule tasks, which are then forwarded to the worker process, rather than file entities that execute tasks on the main process).


          Sophist added a comment -

          P.S. One other thing we should perhaps think about is some sort of cancel function. For I/O activities, which may number in the hundreds but which individually finish within a second or two, simply emptying the request queue and waiting for the worker threads/sub-processes to finish their current tasks should suffice. If you have a long-running, CPU-intensive activity on a worker thread/sub-process, then we might want to find a way for the GUI to flag the worker code to abandon the task.
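          A minimal sketch of such a cooperative cancel, using a threading.Event that the GUI sets and that long-running work checks between iterations (long_cpu_task and crunch are illustrative names only):

```python
# Cooperative cancellation: the worker abandons the task at the next check.
import threading

cancel_requested = threading.Event()


def crunch(item):
    pass  # stand-in for the per-item CPU-intensive work


def long_cpu_task(items):
    for item in items:
        if cancel_requested.is_set():
            return None                  # abandon the task cleanly
        crunch(item)
    return True

# GUI side, e.g. from a Cancel button handler:
# cancel_requested.set()
```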


            Assignee: Unassigned
            Reporter: Sambhav Kothari (samj1912)
            Votes: 3
            Watchers: 7