Zapped: AcousticBrainz / AB-135

Store ID Type in acousticbrainz.low-level table (Schema Change)

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Normal
    • Component: Server

      This improvement permits storing the recording type in the low-level table: we need a flag to identify whether the UUID is an MBID or was generated on the AB server.

      It is connected with the possibility of submissions without an MBID.
      The details are visible in the documentation of the GSoC project: https://docs.google.com/document/d/1wVkJFvGzzINMSORVcGYZEI27kXAgKcAKwsRE3yyxHeE/edit#bookmark=id.9m7qf9s439lx

      Actions:
      • Add a field to the low-level table
      • Create an index for this field (not mandatory)
      • Add an upgrade SQL script
      • Modify the schema-creation SQL script
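A rough sketch of what the schema change could look like, assuming the column ends up named `gid_type` with an enum as proposed in the comments; the names and values here are assumptions based on the discussion, not the exact merged migration (that lives in the linked commit/PR):

```sql
-- Sketch only: 'gid_type', 'mbid' and 'msid' are assumed names, not
-- necessarily what the merged migration uses.
CREATE TYPE gid_type AS ENUM ('mbid', 'msid');

-- Upgrade script for existing databases; MBID is the default since all
-- existing rows were submitted with a MusicBrainz recording ID.
ALTER TABLE lowlevel ADD COLUMN gid_type gid_type NOT NULL DEFAULT 'mbid';
```

The same column definition would also be added to the schema-creation script so fresh installs match upgraded ones.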

          Daniele Scarano added a comment:

          Merged with metabrainz/acousticbrainz-server master.
          Details of the merged code:
          https://github.com/metabrainz/acousticbrainz-server/commit/a05763593cbecb802ace895e0a26024220ef8a51

          Daniele Scarano added a comment:

          PR 195 has been merged and the schema change is now on the master branch of acousticbrainz-server.

          Daniele Scarano added a comment:

          Yesterday we (Alastair, Dmitry, Daniele) discussed how to implement this behaviour:

          To accept submissions we defined a minimum requirement: each item must have at least artist and song/title in its metadata.

          • We are aware that datasets without this information exist, but initially we will support only datasets that contain this metadata.
          • The idea behind that is to encourage people to build good datasets with meaningful metadata and to avoid 'bad' datasets.
          • For particular cases that we are interested in, ad-hoc solutions can be implemented.

          The infrastructure we are building is ready to accept other kinds of datasets (e.g. the Ballroom dataset is useful but does not have this data), even if at this stage we will not accept them.

          • For dataset item submission we will ask messybrainz for an msid.
          • The gid_type should be an enum so we can add support for more gid types in the future.

          Alastair Porter added a comment:

          We talked about this last week and came up with the following ideas:

          • `gid` is a good name for the column.
          • For now we can't think of a reason why we would want to filter by the type of identifier. Because of this, let's not try to over-optimise the database through normalisation. Instead, let's keep the type column in low_level, but call it `gid_type` and make it an enum, not a boolean.

          We also thought that perhaps now is the right time to go directly in and use messybrainz to generate the IDs instead of doing it ourselves. This is more work than we expected for this part of the project, but it will give us a good base for accepting generic files from the submitter tool in addition to the specific dataset submission tool. The advantage of doing this now is that we won't have to rewrite or delete UUIDs that are already in place when we do add messybrainz into the picture.

          To do this we'd need to do a few things:

          • Make sure messybrainz is correctly deployed and working; see if we should clear its database.
          • Make sure messybrainz has authentication for the submit API; it should only be called from ListenBrainz and AcousticBrainz.
          • For the submission of non-MBID items, make sure that we have at least an artist and title in the metadata.
          • Send the metadata to messybrainz to get an msid instead of using str(uuid.uuid4()); this is an HTTP query, so we can use requests.
          • Use the returned msid in the gid field.
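The steps above could be sketched roughly as follows. This is not the real AcousticBrainz code: the messybrainz URL, payload shape, and the `msid` response field are assumptions; only `str(uuid.uuid4())` and the use of requests come from the comment.

```python
# Sketch of the submission flow: use the MBID when we have one, otherwise
# validate the minimum metadata and ask messybrainz for an msid.

MESSYBRAINZ_SUBMIT_URL = "https://messybrainz.org/submit"  # hypothetical URL


def validate_metadata(metadata):
    """Enforce the agreed minimum: every non-MBID item needs artist and title."""
    for field in ("artist", "title"):
        if not metadata.get(field):
            raise ValueError("submission must include artist and title metadata")


def request_msid(metadata):
    """Ask messybrainz for an msid over HTTP (instead of str(uuid.uuid4()))."""
    import requests  # as suggested in the discussion
    response = requests.post(MESSYBRAINZ_SUBMIT_URL, json=metadata, timeout=10)
    response.raise_for_status()
    return response.json()["msid"]  # assumed response field


def ensure_gid(metadata, get_msid=request_msid):
    """Return (gid, gid_type): the MBID if one is present, otherwise an
    msid fetched from messybrainz via `get_msid`."""
    mbid = metadata.get("musicbrainz_trackid")
    if mbid:
        return mbid, "mbid"
    validate_metadata(metadata)
    return get_msid(metadata), "msid"
```

`get_msid` is injectable so the HTTP call can be stubbed out in tests without a deployed messybrainz instance.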

          Roman added a comment:

          I would say it's an ID type, not a recording type.

          I agree with your point about keeping the same data together. What I propose is to avoid any significant changes from the user's perspective. That means assuming that all requests for data with a specific ID are just for those submissions that have an MBID as their ID type. That way you'll just need to modify all affected SQL queries and keep working on the dataset creation tools (an extension of the existing implementation). After that's all done we can think about how exactly we want to display that data and use it further.

          Daniele Scarano added a comment:

          Once we decided to accept submissions without an MBID, we have to keep the distinction between these types of submissions clear in the code as well; if we don't, we may have trouble understanding things later. With that premise, here are my thoughts:

          • I do not think that a separate table is a good idea. It keeps things simpler, for sure, but we could end up with a lot of duplicates stored in two different tables. From a conceptual perspective a dataset item is a recording; even if it has no corresponding MBID it is basically the same object with the same properties. The only distinction I see is whether it is related to an MBID/MB recording or not.
          • I think we should compute the high-level data for those types of submissions. That can be useful for the user who submitted the dataset, and can also be the reason for submitting a dataset to AB, for example if it's a dataset for genre classification.
          • The data should be accessible using the API endpoints for low-level and high-level data. Imagine a user who submitted a dataset and wants to download it from the server: access to that data is important in this case, and even if we could design a special behaviour for dataset downloads, I think it's better to have a single endpoint that serves a specific type of information.
          • Once a user queries the API using a UUID, we can decide whether to get the metadata from MusicBrainz or from the lowlevel_json/highlevel_json tables according to the value of the lowlevel.is_mbid field.
          • Using views would certainly simplify the code, but it depends on how many queries are affected by the new field. This requires further investigation and is related to the other decisions we make.
          • If we consider dataset items as recordings, which is what I'm suggesting, we have to compute statistics for them and keep them in the general numbers. I think it's a good idea to add a 'Dataset Items' indicator that counts the number of submissions that do not have a corresponding MBID.

          Anyway, I think we should divide those changes into different PRs once we have a clear idea of how to proceed. If we start working on views/statistics/queries and so on, it will end up being a new GSoC project, but I'd like to hear from you!

          If we decide to rename the "mbid" column to "gid", we have to reflect this (purely conceptual) change in all the tables that have a field with the same name. Here is the list:

          dataset_class_member -> "dataset_class_member_pkey" PRIMARY KEY, btree (class, mbid)
          highlevel
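The "which metadata source do we use" decision above could be sketched as a small dispatch. The names here are illustrative only; the comment itself talks about an is_mbid flag, while later discussion settled on a gid_type enum, which is what this sketch assumes.

```python
# Sketch: decide where metadata for a submission should come from,
# based on its identifier type. Enum values and source names are assumed.

def metadata_source(gid_type):
    """Map an identifier type to the place we read metadata from."""
    sources = {
        "mbid": "musicbrainz",     # resolve metadata via the MusicBrainz database
        "msid": "lowlevel_json",   # use the metadata embedded in the submission
    }
    if gid_type not in sources:
        raise ValueError("unknown gid_type: %r" % (gid_type,))
    return sources[gid_type]
```

Using an enum with a dispatch like this (rather than a boolean) is what keeps the door open for more gid types later.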

          Alastair Porter added a comment:

          An index will be needed for this field. Whenever we use the field in an SQL query, the index helps keep that query fast.
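If filtering by identifier type does become common, the index could be a plain b-tree, or a partial index if non-MBID rows turn out to be the rare case being filtered for. A sketch, assuming the column is named `gid_type` as discussed:

```sql
-- Plain index on the type column:
CREATE INDEX lowlevel_gid_type_idx ON lowlevel (gid_type);

-- Alternative: a partial index, useful if queries mostly look for the
-- (presumably rarer) non-MBID submissions:
CREATE INDEX lowlevel_msid_idx ON lowlevel (gid) WHERE gid_type = 'msid';
```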

          Daniele Scarano added a comment (edited):

          This is the related PR: https://github.com/metabrainz/acousticbrainz-server/pull/195

          The index is not in the PR; we don't need it at this stage. We will use this field to mark dataset item submissions.

            Assignee: hellska (Daniele Scarano)
            Reporter: hellska (Daniele Scarano)