Image Archives / IMG-129

CAA state on archive.org out-of-sync with site


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal

      While working on the re-derivation of thumbnails (CAA-88) and voting (e.g., edit #78591137) I started noticing some problems with the state of index.json files on IA. I decided to run a large-scale "audit" of CAA on IA, and I'll dump the results in this ticket. Spoiler alert: They weren't great.

      I've written a tool to automate this process and scan all items in bulk. I ran this tool on more than 1.4 million items, including all releases that were in the 2021-04-17 data dump, in addition to all items in CAA as of 2021-04-18 that weren't linked to releases in the DB anymore. Of these 1.4M items, 63% (~919k) had at least one issue, ranging from minor inconsistencies to very severe problems. In total, nearly 140M individual checks were performed, of which more than 4M (3%) failed.

      The results

      The table below only includes checks that had at least one failure. In reality, more checks are implemented, but those all passed.

      Check #checks #checked rels #failed %failed #failed rels %failed rels
      DeletedItem::derivatives are absent 4050 4050 3864 95.41% 3864 95.41%
      DeletedItem::images are absent 4050 4050 3981 98.30% 3981 98.30%
      DeletedItem::index is absent 4050 4050 4050 100.00% 4050 100.00%
      DeletedItem::mb_metadata is absent 4050 4050 4049 99.98% 4049 99.98%
      DeletedItem::release url is absent 4050 4050 4037 99.68% 4037 99.68%
      EmptyItem::CAAIndex::Image::id is int 52 40 41 78.85% 32 80.00%
      EmptyItem::CAAIndex::Image::order 38734 38734 40 0.10% 40 0.10%
      EmptyItem::CAAIndex::Image::unexpected image 52 40 52 100.00% 40 100.00%
      EmptyItem::CAAIndex::is present 43966 43966 5232 11.90% 5232 11.90%
      EmptyItem::CAAIndex::release url correct 38734 38734 32102 82.88% 32102 82.88%
      EmptyItem::Files::index.json exists 43966 43966 5232 11.90% 5232 11.90%
      EmptyItem::Files::mb_metadata.xml exists 43966 43966 5179 11.78% 5179 11.78%
      EmptyItem::Metadata::creators correct 43966 43966 7764 17.66% 7764 17.66%
      EmptyItem::Metadata::date correct 43966 43966 10609 24.13% 10609 24.13%
      EmptyItem::Metadata::item is noindex 43966 43966 1 0.00% 1 0.00%
      EmptyItem::Metadata::language correct 43966 43966 4418 10.05% 4418 10.05%
      EmptyItem::Metadata::mediatype is image 43966 43966 2 0.00% 2 0.00%
      EmptyItem::Metadata::missing external id::asin 14054 13919 1871 13.31% 1846 13.26%
      EmptyItem::Metadata::missing external id::mb_artist_id 49853 43966 9184 18.42% 7034 16.00%
      EmptyItem::Metadata::missing external id::mb_release_id 43966 43966 5196 11.82% 5196 11.82%
      EmptyItem::Metadata::missing external id::upc 17526 17526 2597 14.82% 2597 14.82%
      EmptyItem::Metadata::title correct 43966 43966 5850 13.31% 5850 13.31%
      EmptyItem::Metadata::unexpected external id::asin 12719 12719 536 4.21% 536 4.21%
      EmptyItem::Metadata::unexpected external id::mb-artist-id 1986 1914 1986 100.00% 1914 100.00%
      EmptyItem::Metadata::unexpected external id::mb-release-id 1913 1913 1913 100.00% 1913 100.00%
      EmptyItem::Metadata::unexpected external id::mb_artist_id 41733 38770 1064 2.55% 946 2.44%
      EmptyItem::Metadata::unexpected external id::upc 15193 15193 264 1.74% 264 1.74%
      Item::CAAIndex::Image::approved correct 2858928 1373065 10906 0.38% 5689 0.41%
      Item::CAAIndex::Image::back correct 2858928 1373065 294 0.01% 263 0.02%
      Item::CAAIndex::Image::comment correct 2858928 1373065 8 0.00% 7 0.00%
      Item::CAAIndex::Image::front correct 2858928 1373065 3255 0.11% 3184 0.23%
      Item::CAAIndex::Image::id correct 2858928 1373065 927819 32.45% 480041 34.96%
      Item::CAAIndex::Image::id is int 2859226 1373076 927967 32.46% 480048 34.96%
      Item::CAAIndex::Image::missing image 2859110 1373082 182 0.01% 79 0.01%
      Item::CAAIndex::Image::order 1373082 1373082 309 0.02% 309 0.02%
      Item::CAAIndex::Image::thumbnails correct 2858928 1373065 1284910 44.94% 646190 47.06%
      Item::CAAIndex::Image::types correct 2858928 1373065 6748 0.24% 5204 0.38%
      Item::CAAIndex::Image::unexpected image 2859226 1373076 298 0.01% 235 0.02%
      Item::CAAIndex::is present 1373853 1373853 757 0.06% 757 0.06%
      Item::CAAIndex::is well-formed 1373096 1373096 14 0.00% 14 0.00%
      Item::CAAIndex::release url correct 1373082 1373082 367718 26.78% 367718 26.78%
      Item::Files::1200px thumbnail exists 2860728 1373853 642 0.02% 366 0.03%
      Item::Files::250px thumbnail exists 2860728 1373853 436 0.02% 216 0.02%
      Item::Files::500px thumbnail exists 2860728 1373853 436 0.02% 216 0.02%
      Item::Files::image id is unique 2860728 1373853 76 0.00% 33 0.00%
      Item::Files::index.json exists 1373853 1373853 755 0.05% 755 0.05%
      Item::Files::mb_metadata.xml exists 1373853 1373853 757 0.06% 757 0.06%
      Item::Files::original image exists 2860728 1373853 76 0.00% 33 0.00%
      Item::Metadata::creators correct 1373853 1373853 171845 12.51% 171845 12.51%
      Item::Metadata::date correct 1373853 1373853 122963 8.95% 122963 8.95%
      Item::Metadata::in caa collection 1373853 1373853 1 0.00% 1 0.00%
      Item::Metadata::item is noindex 1373853 1373853 334 0.02% 334 0.02%
      Item::Metadata::language correct 1373853 1373853 121355 8.83% 121355 8.83%
      Item::Metadata::mediatype is image 1373853 1373853 362 0.03% 362 0.03%
      Item::Metadata::missing external id::asin 333319 328021 25674 7.70% 24467 7.46%
      Item::Metadata::missing external id::mb_artist_id 1626853 1373853 30604 1.88% 21269 1.55%
      Item::Metadata::missing external id::mb_release_id 1373853 1373853 1537 0.11% 1537 0.11%
      Item::Metadata::missing external id::upc 599968 599968 25017 4.17% 25017 4.17%
      Item::Metadata::title correct 1373853 1373853 47876 3.48% 47876 3.48%
      Item::Metadata::unexpected external id::asin 313115 313102 5470 1.75% 5470 1.75%
      Item::Metadata::unexpected external id::mb-artist-id 455 307 455 100.00% 307 100.00%
      Item::Metadata::unexpected external id::mb-release-id 238 238 238 100.00% 238 100.00%
      Item::Metadata::unexpected external id::mb_artist_id 1613265 1372315 17016 1.05% 13931 1.02%
      Item::Metadata::unexpected external id::upc 577462 577457 2511 0.43% 2511 0.43%
      Item::exists 1386166 1386166 13 0.00% 13 0.00%
      MergedItem::derivatives are absent 12238 12238 7321 59.82% 7321 59.82%
      MergedItem::images are absent 12238 12238 1463 11.95% 1463 11.95%
      MergedItem::index is absent 12238 12238 12238 100.00% 12238 100.00%
      MergedItem::mb_metadata is absent 12238 12238 12235 99.98% 12235 99.98%
      SKIPPED ITEMS            
      DeletedItem::darkened     10      
      DeletedItem::ia modified     1017      
      DeletedItem::test item     984      
      EmptyItem::darkened     27      
      EmptyItem::ia modified     456      
      Item::darkened     151      
      Item::ia modified     12149      
      MergedItem::darkened     7      
      MergedItem::ia modified     31      
      MergedItem::test item     979      
      TOTAL 139513434 1449931 4262005 3.05% 918741 63.36%
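
Since the table is flattened to whitespace-separated text and check names themselves contain spaces, a row is easiest to read by splitting the six numeric columns off the right-hand side. The following is a minimal sketch of that, which also recomputes the two percentage columns as #failed / #checks and #failed rels / #checked rels. The function name and dictionary keys are illustrative and not part of the auditing tool.

```python
def parse_row(line: str) -> dict:
    """Parse one flattened table row into named fields.

    Check names contain spaces, so the six numeric columns are split off
    from the right rather than splitting the whole line on whitespace.
    """
    name, checks, rels, failed, pct, failed_rels, pct_rels = line.strip().rsplit(None, 6)
    return {
        "name": name,
        "checks": int(checks),
        "checked_rels": int(rels),
        "failed": int(failed),
        "pct_failed": float(pct.rstrip("%")),
        "failed_rels": int(failed_rels),
        "pct_failed_rels": float(pct_rels.rstrip("%")),
    }


def recompute_percentages(row: dict) -> tuple:
    """Recompute both percentage columns from the raw counts."""
    return (
        round(100 * row["failed"] / row["checks"], 2),
        round(100 * row["failed_rels"] / row["checked_rels"], 2),
    )


row = parse_row("EmptyItem::CAAIndex::Image::id is int 52 40 41 78.85% 32 80.00%")
print(row["name"], recompute_percentages(row))
```

The rows under SKIPPED ITEMS carry only a single count, so they would need separate handling.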

      Why these checks fail, and how to fix them

      This may seem like a dire situation, but in reality many of these issues can be fixed by reindexing the release. I could go ahead and reorder images for each of these releases, or add a dummy comment to their covers, but that would be a suboptimal solution.

      There are a few checks that are less interesting and are included only for reference. Ideally, they would be fixed too.

      • The DeletedItem::* checks relate to items for releases that (presumably) once existed on MB but have since been removed. These items haven’t been fully purged (CAA-126).
      • The MergedItem::* checks are similar. Some of these items still have images left over, others still have thumbnails and an index.json file (CAA-128).
      • The EmptyItem::* checks relate to releases that still exist but have no covers (they may have been removed previously). There are some items where the index.json still lists an image although none are on MB.

      The more interesting checks are the Item::* checks, which check items for releases that still exist and still have cover art.

      • *::Metadata::* checks verify the metadata attached to the item, which IA extracts from the mb_metadata.xml file (specifically: release name, artists, release date, language, barcode, ASINs). It seems that not all of these properties have the correct triggers set up, so as they were edited over time, the IA and MB metadata drifted apart. Fixing these would be nice, but this information isn’t used in MB as far as I know. Reindexing should fix these. Some of the triggers in the DB seem to be written specifically to update this information, so if that's indeed desired, let me know and I'll open a ticket for the missing ones (and maybe brush up on my SQL and submit a PR).
      • *::Metadata::item is noindex checks whether the noindex flag, which was requested by IA, is set on the item. I’ll submit that handful of items to IA, since changing the flag requires an IA admin. The failures probably arose between the time IA set this flag in bulk and the time the change was made to MBS. Diagnosed as CAA-130.
      • *::Metadata::mediatype is image is similar to the previous one and also requires an IA admin to fix. These will be submitted too; they are also mainly caused by CAA-130.
      • *::Metadata::in caa collection checks whether the item is in the CAA collection. I still have to check why the one failure is not in there, and will request a fix for that one too if it’s necessary.
      • *::Files::* checks verify that the expected files are present on IA. In the case of index.json and mb_metadata.xml, reindexing would create those files. For thumbnails, a re-derivation would need to happen (this likely includes many of the failed re-derivations in CAA-88, and they likely failed for reasons uncovered by other checks, so the other issues would need to be fixed first). For original images, a failing check means that an image is on MB, but does not exist in IA and can therefore not be loaded. That’s bad…
      • *::CAAIndex::* checks verify the index.json file. The high number of failures for id correct (and id is int) is because the old index schema used strings for image IDs, whereas now they’re integers (also reported in the comments of CAA-88, I believe). Similarly for release url correct: these URLs used to be http://, now they’re https://, but that change was never retroactively applied. The other issues are likely just one-off cases where a change wasn’t propagated to IA. These would all be fixed via reindexing (even the missing index.json ones).
      • Item::exists simply checks whether the item exists at all. Apparently, there are some items that don’t. Again, that’s bad…
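
To make the *::CAAIndex::* checks more concrete, here is a minimal sketch of how a few of them could be expressed. It is not the code of the auditing tool. The index.json field names (`release`, `images`, `id`), the `https://musicbrainz.org/release/<MBID>` URL form, and the function signature are assumptions made for illustration; the failure labels are borrowed from the table above.

```python
def check_index(index: dict, release_mbid: str, expected_image_ids: set) -> list:
    """Return the labels of the checks that fail for one parsed index.json.

    `expected_image_ids` are the integer image IDs known for the release on
    the MB side (hypothetical input).
    """
    failures = []

    # Old indexes used http://, current ones are expected to use https://.
    if index.get("release") != f"https://musicbrainz.org/release/{release_mbid}":
        failures.append("release url correct")

    seen = set()
    for image in index.get("images", []):
        image_id = image.get("id")
        # The old schema stored IDs as strings, the current one as integers.
        if type(image_id) is not int:
            failures.append("Image::id is int")
        try:
            seen.add(int(image_id))
        except (TypeError, ValueError):
            continue  # not even convertible, cannot be matched to an image

    if expected_image_ids - seen:
        failures.append("Image::missing image")
    if seen - expected_image_ids:
        failures.append("Image::unexpected image")
    return failures


old_style = {"release": "http://musicbrainz.org/release/MBID", "images": [{"id": "123"}]}
print(check_index(old_style, "MBID", {123}))
```

An index written with the old schema fails both the URL and the ID type check even though its content is otherwise right, which matches the observation that a plain reindex is enough to fix these.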

      In terms of item skips, here’s what each skip reason means:

      • darkened means that the item was darkened (temporarily taken down) by IA, so no info can be retrieved
      • ia modified means that the item on IA was last modified after the TIMESTAMP in the DB dump; it was skipped so as not to compare against outdated data.
      • test item (DeletedItem::test item and MergedItem::test item in the table) means that there’s no index.json in either the root or history/files/ of the item, so it was likely an item uploaded from the test instance. Ideally, these should be removed when the test instance is reset.
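
The skip reasons above amount to a small decision function. The sketch below assumes the relevant facts about an item (whether it is darkened, when it was last modified, whether any index.json was ever present) have already been gathered; all parameter names are hypothetical, and the order in which the reasons are tested is a guess rather than the tool's actual behaviour.

```python
from datetime import datetime, timezone
from typing import Optional


def skip_reason(
    is_darkened: bool,
    last_modified: datetime,
    dump_timestamp: datetime,
    has_index_or_history: bool = True,
) -> Optional[str]:
    """Return the reason an item is skipped, or None if it should be audited."""
    if is_darkened:
        return "darkened"  # no information can be retrieved from IA
    if last_modified > dump_timestamp:
        return "ia modified"  # the DB dump may be older than the IA state
    if not has_index_or_history:
        return "test item"  # no index.json in the root or history/files/
    return None


dump = datetime(2021, 4, 17, tzinfo=timezone.utc)
print(skip_reason(False, datetime(2021, 4, 18, tzinfo=timezone.utc), dump))
```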

      Actionable results

      I’ve taken the liberty of categorising each of the check failures into groups based on how to resolve them. The result is here. Each file is structured as one release per line, with release ID, MB URL, IA item URL, and fail reasons, separated by tabs.

      • reindex_high_priority contains IDs that need to be reindexed fairly urgently. This includes releases where the index.json is missing, is malformed, is missing entries for one or more images, or lists one or more images that aren’t on MB. I categorised these as urgent because Picard relies on this information. These could be fed into the CAA-Indexer queue, which should fix them automatically.
      • reindex is a superset of the above, containing also those releases where the index.json has wrong information (such as types, comments, approved status, and order). Again, queueing a reindex should fix these. It also includes those releases where the index.json schema is outdated (thumbnails missing 250/500/1200 keys, id as string).
      • reindex_w_metadata is a superset of reindex, including those releases where the index itself isn’t wrong, but the mb_metadata.xml-extracted metadata is. Again, these can be fixed by a reindex.
      • ia_set_mediatype and ia_set_noindex are files I’ll submit to IA for an auto_submit task. Submitted and fixed.
      • deleted_properly_delete, merged_properly_delete, and properly_emptied relate to the deleted, merged, and empty item checks. Those would require more work to bulk-fix.
      • manual_check contains items that I’ll manually check (the one not in the CAA collection, for example), as well as those that need a manual fix (missing images and missing items). I’ll check those later today. Hopefully I can find out where those images have gone; in the worst case, I’ll have to remove the images from MB since they’re currently unusable anyway.
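
For anyone wanting to consume these files programmatically, the following is a minimal parsing sketch based on the line structure described above (release ID, MB URL, IA item URL, and fail reasons, separated by tabs). It assumes that every field after the third is a fail reason; the class and function names are illustrative, and the sample line is made up.

```python
from typing import NamedTuple


class ActionableEntry(NamedTuple):
    release_id: str
    mb_url: str
    ia_url: str
    fail_reasons: list


def parse_actionable_line(line: str) -> ActionableEntry:
    """Split one tab-separated line; trailing fields are treated as fail reasons."""
    release_id, mb_url, ia_url, *reasons = line.rstrip("\n").split("\t")
    return ActionableEntry(release_id, mb_url, ia_url, reasons)


# Made-up sample line with placeholder URLs.
sample = "some-release-id\thttps://example.org/mb\thttps://example.org/ia\tImage::id is int\n"
entry = parse_actionable_line(sample)
print(entry.release_id, entry.fail_reasons)
```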

      There’s also the darkened_items file, which doesn’t follow the same structure (it’s just one ID per line, and not all IDs are guaranteed to exist). Those are all darkened items. It should be the same file as the one described in the comments of MBS-6567; it has the same number of IDs, too. That file could be used to set the darkened status in the DB (but I’ll propose a better solution further down).

      Data

      I’ve made available all the auditing data. Here’s an overview of the files:

      • actionable_results.tar.xz is the same file as I’ve attached to this ticket.
      • all_results.xz expands to a file which lists all of the 140M checks and their status (warning: >10GB uncompressed!)
      • audit_data.tar.xz is an archive of the per-item results produced by the auditing tool. Each item is in its own subdirectory, spread out across 3 levels. In each subdir, failures.log contains a semi-structured list of failed checks, audit_log is (fairly verbose) logging output of the auditor for this item, ia_metadata.json is the response from IA’s metadata endpoint for this item, and index.json is the item’s index, if it exists. Warning: Decompression bomb, expands to 5M+ files and 40GB+ of data.
      • audit_results.tar.xz is an archive of aggregated results for all items, which were later postprocessed into the actionable results.
      • input_data.tar.xz is the input data provided to the auditor.

      Since the audit_data.tar.xz archive contains the IA metadata, which lists all files and their sizes, these could be used to backfill image and thumbnail sizes in the DB. Those columns already exist, but seem to be empty.
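
As a rough illustration of that backfill, the sketch below pulls image and thumbnail sizes out of one ia_metadata.json document. It rests on several assumptions that should be verified against real data: that the metadata response has a `files` list whose entries carry `name` and `size` (with the size possibly encoded as a string), and that images are named `mbid-<release MBID>-<image id>.<ext>` with thumbnails using a `_thumb<size>` suffix. The function name and the shape of the result are hypothetical.

```python
import re

# Assumed naming scheme: mbid-<36-char MBID>-<image id>[_thumb<size>].<ext>
FILE_RE = re.compile(
    r"^mbid-(?P<mbid>[0-9a-f-]{36})-(?P<image_id>\d+)"
    r"(?:_thumb(?P<thumb>\d+))?\.(?P<ext>\w+)$"
)


def file_sizes(ia_metadata: dict) -> dict:
    """Map (image_id, kind) to size in bytes; kind is 'original' or 'thumb<N>'."""
    sizes = {}
    for entry in ia_metadata.get("files", []):
        match = FILE_RE.match(entry.get("name", ""))
        if match is None or "size" not in entry:
            continue  # index.json, mb_metadata.xml, and other non-image files
        kind = f"thumb{match['thumb']}" if match["thumb"] else "original"
        sizes[(int(match["image_id"]), kind)] = int(entry["size"])
    return sizes


mbid = "00000000-0000-0000-0000-000000000000"
metadata = {"files": [
    {"name": f"mbid-{mbid}-123.jpg", "size": "1000"},
    {"name": f"mbid-{mbid}-123_thumb250.jpg", "size": "50"},
    {"name": "index.json", "size": "10"},
]}
print(file_sizes(metadata))
```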

      The tool

      I’ve plugged the tool before, but I’ll go a bit more in depth again. It’s available in this repo, along with some documentation on how to run it and how it works (TL;DR: A bunch of async code and a queue to handle massive concurrency). Feel free to reuse this tool to re-run a similar analysis in the future. At the moment, it’s built to scan all of these items reasonably quickly with a high number of concurrent requests. For reference, this analysis took about 10h at 1000 concurrent tasks; the main bottleneck was single-core speed.
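
For readers unfamiliar with the "async code and a queue" pattern, here is a generic sketch of it using Python's asyncio. It is not taken from the tool's repository: `audit_item` is a placeholder for the real network-bound checks, and all names are illustrative. A fixed pool of worker tasks pulls item IDs from a shared queue, so the number of in-flight items never exceeds the pool size.

```python
import asyncio


async def audit_item(item_id: str) -> tuple:
    """Placeholder for the real checks (fetch IA metadata, compare, and so on)."""
    await asyncio.sleep(0)  # stands in for network I/O
    return item_id, []  # (item, list of failed checks)


async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker keeps pulling items until it is cancelled.
    while True:
        item_id = await queue.get()
        try:
            results.append(await audit_item(item_id))
        finally:
            queue.task_done()


async def run_audit(item_ids, concurrency: int = 1000) -> list:
    queue = asyncio.Queue()
    for item_id in item_ids:
        queue.put_nowait(item_id)
    results = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # wait until every queued item has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(run_audit(["item-a", "item-b", "item-c"], concurrency=2)))
```

Because everything runs on one event loop in one thread, CPU-bound work such as JSON parsing is serialised, which is consistent with single-core speed being the bottleneck.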

      I think an interesting idea would be to deploy a similar tool to periodically audit the CAA items, automatically fix (i.e., queue a reindex for) whatever failures can be fixed, and log the others somewhere for manual intervention. However, it’s probably not a good idea to hammer IA servers for 10h straight every month, so such a tool would instead need to run continuously at a low rate, starting a check on a new item every second or so. At that rate, the whole CAA collection could be audited every month without putting heavy load on IA. Further optimisations could be made to skip items which were correct previously and which are known not to have changed since the previous check. This auditor could perhaps also fill the darkened status of MB releases, although there are better ways to handle that.
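
A quick back-of-the-envelope check of the one-check-per-second idea, using the total number of items from the results table (1,449,931). The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def full_pass_days(n_items: int, checks_per_second: float = 1.0) -> float:
    """Days needed to start one check for every item at a fixed rate."""
    return n_items / checks_per_second / SECONDS_PER_DAY


print(round(full_pass_days(1_449_931), 1))  # 16.8
```

So at one new check per second, a full pass would take a little under 17 days, which leaves headroom within a monthly cycle even as the collection grows.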

      I think the existing tool could be used as a basis for such a scanner. Some changes would need to be made; I documented some of them in one of the docs in the repo. If such a thing would be desirable, I’d be happy to help out.

            Assignee: Unassigned
            Reporter: ROpdebee