Uploaded image for project: 'MusicBrainz Search Server'
  1. MusicBrainz Search Server
  2. SEARCH-609

SIR triggers do not work as intended when running a MusicBrainz Slave instance

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Normal Normal
    • sir-3.0.0
    • None
    • Indexer
    • None

      After struggling for several months with very slow live indexing from a MusicBrainz slave instance's database, I've identified what I think is the root cause.

      SIR triggers effectively filter database changes to ensure that only data changes relevant to search indexing result in a indexing message being published to the SIR message queue. For example, an artist_credit update should only trigger an index message when the name field is updated.

      Unfortunately, MusicBrainz's ProcessReplicationChanges script generates updates that include every field in the record. For example, in replication packet #130964, the SIR trigger fires for the following changes on artist credit ID 1:

      UPDATE "musicbrainz"."artist_credit" SET "artist_count" = ?, "created" = ?, "edits_pending" = ?, "id" = ?, "name" = ?, "ref_count" = ? WHERE "id" = ? (1 2011-05-16 16:32:11.963929+00 0 1 Various Artists 320340 1)
      UPDATE "musicbrainz"."artist_credit" SET "artist_count" = ?, "created" = ?, "edits_pending" = ?, "id" = ?, "name" = ?, "ref_count" = ? WHERE "id" = ? (1 2011-05-16 16:32:11.963929+00 0 1 Various Artists 320341 1)
      UPDATE "musicbrainz"."artist_credit" SET "artist_count" = ?, "created" = ?, "edits_pending" = ?, "id" = ?, "name" = ?, "ref_count" = ? WHERE "id" = ? (1 2011-05-16 16:32:11.963929+00 0 1 Various Artists 320342 1)
      UPDATE "musicbrainz"."artist_credit" SET "artist_count" = ?, "created" = ?, "edits_pending" = ?, "id" = ?, "name" = ?, "ref_count" = ? WHERE "id" = ? (1 2011-05-16 16:32:11.963929+00 0 1 Various Artists 320343 1)
      UPDATE "musicbrainz"."artist_credit" SET "artist_count" = ?, "created" = ?, "edits_pending" = ?, "id" = ?, "name" = ?, "ref_count" = ? WHERE "id" = ? (1 2011-05-16 16:32:11.963929+00 0 1 Various Artists 320342 1)
      

      It's clear that the only material change to the artist_credit record is that the ref_count has been incremented with each change.

      Understandably this isn't a problem on the MusicBrainz master as the trigger receives the original update that only updates the ref_count field. However on a slave, it generates an extremely large numbers of spurious indexing jobs that result in an unmanageable queue depth effectively rendering live indexing useless without throwing lots of compute and database resource at the problem.

      I believe that a simple solution would be to generate a set of stored procedures specifically for Slave instances which would ensure that only changes to the fields that matter result in indexing jobs. For example by refactoring the following to fetch the record and compare which fields have actually been modified.

      Another possible solution could leverage the work I did on removing the dependency on the Postgres AMQP extension. If the master were to persist SIR messages in the sir.message table, this could be included in hourly replication packets. A slave could either implement a triggered function on sir.message inserts to publish messages via Postgres' AMQP extension or it could pull the messages from the sir.message table and publish them to AMQP using the amqp_publish Python process also implemented in the pull request. This approach is more invasive as it necessitates a schema change on the Master which will impact all replicating slaves. However it does ensure that indexing slaves will always be receiving the correct indexing jobs, even when the criteria for an indexing change is modifed.

            yvanzo yvanzo
            mglubb Matthew Glubb
            Votes:
            1 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated:

                Version Package
                sir-3.0.0