-
Bug
-
Resolution: Unresolved
-
Normal
While working on the re-derivation of thumbnails (CAA-88) and voting (e.g., edit #78591137) I started noticing some problems with the state of index.json files on IA. I decided to run a large-scale "audit" of CAA on IA, and I'll dump the results in this ticket. Spoiler alert: They weren't great.
I've written a tool to automate this process and scan all items in bulk. I ran this tool on more than 1.4 million items, including all releases that were in the 2021-04-17 data dump, in addition to all items in CAA as of 2021-04-18 that weren't linked to releases in the DB anymore. Of these 1.4M items, 63% or ~919k had at least one issue, ranging from minor inconsistencies to very severe problems. In total, nearly 140M individual checks were performed, of which more than 4M (3%) failed.
The results
The table below only includes checks that had at least one failure. In reality, more checks are implemented, but those all passed.
Check | #checks | #checked rels | #failed | %failed | #failed rels | %failed rels |
---|---|---|---|---|---|---|
DeletedItem::derivatives are absent | 4050 | 4050 | 3864 | 95.41% | 3864 | 95.41% |
DeletedItem::images are absent | 4050 | 4050 | 3981 | 98.30% | 3981 | 98.30% |
DeletedItem::index is absent | 4050 | 4050 | 4050 | 100.00% | 4050 | 100.00% |
DeletedItem::mb_metadata is absent | 4050 | 4050 | 4049 | 99.98% | 4049 | 99.98% |
DeletedItem::release url is absent | 4050 | 4050 | 4037 | 99.68% | 4037 | 99.68% |
EmptyItem::CAAIndex::Image::id is int | 52 | 40 | 41 | 78.85% | 32 | 80.00% |
EmptyItem::CAAIndex::Image::order | 38734 | 38734 | 40 | 0.10% | 40 | 0.10% |
EmptyItem::CAAIndex::Image::unexpected image | 52 | 40 | 52 | 100.00% | 40 | 100.00% |
EmptyItem::CAAIndex::is present | 43966 | 43966 | 5232 | 11.90% | 5232 | 11.90% |
EmptyItem::CAAIndex::release url correct | 38734 | 38734 | 32102 | 82.88% | 32102 | 82.88% |
EmptyItem::Files::index.json exists | 43966 | 43966 | 5232 | 11.90% | 5232 | 11.90% |
EmptyItem::Files::mb_metadata.xml exists | 43966 | 43966 | 5179 | 11.78% | 5179 | 11.78% |
EmptyItem::Metadata::creators correct | 43966 | 43966 | 7764 | 17.66% | 7764 | 17.66% |
EmptyItem::Metadata::date correct | 43966 | 43966 | 10609 | 24.13% | 10609 | 24.13% |
EmptyItem::Metadata::item is noindex | 43966 | 43966 | 1 | 0.00% | 1 | 0.00% |
EmptyItem::Metadata::language correct | 43966 | 43966 | 4418 | 10.05% | 4418 | 10.05% |
EmptyItem::Metadata::mediatype is image | 43966 | 43966 | 2 | 0.00% | 2 | 0.00% |
EmptyItem::Metadata::missing external id::asin | 14054 | 13919 | 1871 | 13.31% | 1846 | 13.26% |
EmptyItem::Metadata::missing external id::mb_artist_id | 49853 | 43966 | 9184 | 18.42% | 7034 | 16.00% |
EmptyItem::Metadata::missing external id::mb_release_id | 43966 | 43966 | 5196 | 11.82% | 5196 | 11.82% |
EmptyItem::Metadata::missing external id::upc | 17526 | 17526 | 2597 | 14.82% | 2597 | 14.82% |
EmptyItem::Metadata::title correct | 43966 | 43966 | 5850 | 13.31% | 5850 | 13.31% |
EmptyItem::Metadata::unexpected external id::asin | 12719 | 12719 | 536 | 4.21% | 536 | 4.21% |
EmptyItem::Metadata::unexpected external id::mb-artist-id | 1986 | 1914 | 1986 | 100.00% | 1914 | 100.00% |
EmptyItem::Metadata::unexpected external id::mb-release-id | 1913 | 1913 | 1913 | 100.00% | 1913 | 100.00% |
EmptyItem::Metadata::unexpected external id::mb_artist_id | 41733 | 38770 | 1064 | 2.55% | 946 | 2.44% |
EmptyItem::Metadata::unexpected external id::upc | 15193 | 15193 | 264 | 1.74% | 264 | 1.74% |
Item::CAAIndex::Image::approved correct | 2858928 | 1373065 | 10906 | 0.38% | 5689 | 0.41% |
Item::CAAIndex::Image::back correct | 2858928 | 1373065 | 294 | 0.01% | 263 | 0.02% |
Item::CAAIndex::Image::comment correct | 2858928 | 1373065 | 8 | 0.00% | 7 | 0.00% |
Item::CAAIndex::Image::front correct | 2858928 | 1373065 | 3255 | 0.11% | 3184 | 0.23% |
Item::CAAIndex::Image::id correct | 2858928 | 1373065 | 927819 | 32.45% | 480041 | 34.96% |
Item::CAAIndex::Image::id is int | 2859226 | 1373076 | 927967 | 32.46% | 480048 | 34.96% |
Item::CAAIndex::Image::missing image | 2859110 | 1373082 | 182 | 0.01% | 79 | 0.01% |
Item::CAAIndex::Image::order | 1373082 | 1373082 | 309 | 0.02% | 309 | 0.02% |
Item::CAAIndex::Image::thumbnails correct | 2858928 | 1373065 | 1284910 | 44.94% | 646190 | 47.06% |
Item::CAAIndex::Image::types correct | 2858928 | 1373065 | 6748 | 0.24% | 5204 | 0.38% |
Item::CAAIndex::Image::unexpected image | 2859226 | 1373076 | 298 | 0.01% | 235 | 0.02% |
Item::CAAIndex::is present | 1373853 | 1373853 | 757 | 0.06% | 757 | 0.06% |
Item::CAAIndex::is well-formed | 1373096 | 1373096 | 14 | 0.00% | 14 | 0.00% |
Item::CAAIndex::release url correct | 1373082 | 1373082 | 367718 | 26.78% | 367718 | 26.78% |
Item::Files::1200px thumbnail exists | 2860728 | 1373853 | 642 | 0.02% | 366 | 0.03% |
Item::Files::250px thumbnail exists | 2860728 | 1373853 | 436 | 0.02% | 216 | 0.02% |
Item::Files::500px thumbnail exists | 2860728 | 1373853 | 436 | 0.02% | 216 | 0.02% |
Item::Files::image id is unique | 2860728 | 1373853 | 76 | 0.00% | 33 | 0.00% |
Item::Files::index.json exists | 1373853 | 1373853 | 755 | 0.05% | 755 | 0.05% |
Item::Files::mb_metadata.xml exists | 1373853 | 1373853 | 757 | 0.06% | 757 | 0.06% |
Item::Files::original image exists | 2860728 | 1373853 | 76 | 0.00% | 33 | 0.00% |
Item::Metadata::creators correct | 1373853 | 1373853 | 171845 | 12.51% | 171845 | 12.51% |
Item::Metadata::date correct | 1373853 | 1373853 | 122963 | 8.95% | 122963 | 8.95% |
Item::Metadata::in caa collection | 1373853 | 1373853 | 1 | 0.00% | 1 | 0.00% |
Item::Metadata::item is noindex | 1373853 | 1373853 | 334 | 0.02% | 334 | 0.02% |
Item::Metadata::language correct | 1373853 | 1373853 | 121355 | 8.83% | 121355 | 8.83% |
Item::Metadata::mediatype is image | 1373853 | 1373853 | 362 | 0.03% | 362 | 0.03% |
Item::Metadata::missing external id::asin | 333319 | 328021 | 25674 | 7.70% | 24467 | 7.46% |
Item::Metadata::missing external id::mb_artist_id | 1626853 | 1373853 | 30604 | 1.88% | 21269 | 1.55% |
Item::Metadata::missing external id::mb_release_id | 1373853 | 1373853 | 1537 | 0.11% | 1537 | 0.11% |
Item::Metadata::missing external id::upc | 599968 | 599968 | 25017 | 4.17% | 25017 | 4.17% |
Item::Metadata::title correct | 1373853 | 1373853 | 47876 | 3.48% | 47876 | 3.48% |
Item::Metadata::unexpected external id::asin | 313115 | 313102 | 5470 | 1.75% | 5470 | 1.75% |
Item::Metadata::unexpected external id::mb-artist-id | 455 | 307 | 455 | 100.00% | 307 | 100.00% |
Item::Metadata::unexpected external id::mb-release-id | 238 | 238 | 238 | 100.00% | 238 | 100.00% |
Item::Metadata::unexpected external id::mb_artist_id | 1613265 | 1372315 | 17016 | 1.05% | 13931 | 1.02% |
Item::Metadata::unexpected external id::upc | 577462 | 577457 | 2511 | 0.43% | 2511 | 0.43% |
Item::exists | 1386166 | 1386166 | 13 | 0.00% | 13 | 0.00% |
MergedItem::derivatives are absent | 12238 | 12238 | 7321 | 59.82% | 7321 | 59.82% |
MergedItem::images are absent | 12238 | 12238 | 1463 | 11.95% | 1463 | 11.95% |
MergedItem::index is absent | 12238 | 12238 | 12238 | 100.00% | 12238 | 100.00% |
MergedItem::mb_metadata is absent | 12238 | 12238 | 12235 | 99.98% | 12235 | 99.98% |
SKIPPED ITEMS | ||||||
DeletedItem::darkened | 10 | |||||
DeletedItem::ia modified | 1017 | |||||
DeletedItem::test item | 984 | |||||
EmptyItem::darkened | 27 | |||||
EmptyItem::ia modified | 456 | |||||
Item::darkened | 151 | |||||
Item::ia modified | 12149 | |||||
MergedItem::darkened | 7 | |||||
MergedItem::ia modified | 31 | |||||
MergedItem::test item | 979 | |||||
TOTAL | 139513434 | 1449931 | 4262005 | 3.05% | 918741 | 63.36% |
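For reference, the two percentage columns follow directly from the raw counts: %failed is failed checks over checks, and %failed rels is failed releases over checked releases. A minimal sketch, using the TOTAL row as input (the function name is made up for illustration and is not part of the auditing tool):

```python
def failure_rates(checks: int, checked_rels: int, failed: int, failed_rels: int) -> tuple[float, float]:
    """Return (%failed, %failed rels), rounded to two decimals like the table."""
    return (
        round(100 * failed / checks, 2),
        round(100 * failed_rels / checked_rels, 2),
    )


# Counts copied from the TOTAL row of the table above.
total_rates = failure_rates(139513434, 1449931, 4262005, 918741)
print(total_rates)
```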
Why these checks fail, and how to fix them
This may seem like a dire situation, but in reality many of these issues can be fixed by reindexing the release. I could go ahead and reorder images for each of these releases, or add a dummy comment to their covers, but that would be a suboptimal solution.
There’s a few checks that are not as interesting, and are just included for reference. Ideally, they would be fixed though.
- The DeletedItem::* checks relate to items that (presumably) once were releases on MB, that have now been removed. They haven’t been fully purged (CAA-126).
- The MergedItem::* checks are similar. Some of them still have some images left over, others still have thumbnails and an index.json file (CAA-128).
- The EmptyItem::* checks relate to releases that still exist, but have no covers (may have been removed previously). There’s some items where the index.json still lists an image although none are on MB.
The more interesting checks are the Item::* checks, which check items for releases that still exist and still have cover art.
- *::Metadata::* checks verify the metadata attached to the item, which IA extracts from the mb_metadata.xml file (specifically: release name, artists, release date, language, barcode, ASINs). It seems that not all of these properties have the correct triggers set up, so as they were edited over time, the IA and MB metadata have drifted apart. Fixing these would be nice, but this information isn’t used in MB as far as I know. Reindexing should fix these. Some of the triggers in the DB seem to be written specifically to update this information, so if that's indeed desired, let me know and I'll open a ticket for the missing ones (and maybe brush up on my SQL and submit a PR).
- *::Metadata::item is noindex checks whether the noindex flag is set on the item, which was requested by IA. I’ll submit that handful of items to IA, since it requires an IA admin to change that. This may have happened in between the time IA set this flag in bulk and the time the change was made to MBS. Diagnosed as CAA-130.
- *::Metadata::mediatype is image is similar to the previous one, also requiring an IA admin to fix. Will be submitted too, also mainly caused by CAA-130.
- *::Metadata::in caa collection checks whether the item is in the CAA collection. I still have to check why the one failure is not in there, and will request a fix for that one too if it’s necessary.
- *::Files::* checks verify that the expected files are present on IA. In the case of index.json and mb_metadata.xml, reindexing would create those files. For thumbnails, a re-derivation would need to happen (this likely includes many of the failed re-derivations in CAA-88, and they likely failed for reasons uncovered by other checks, so the other issues would need to be fixed first). For original images, a failing check means that an image is on MB, but does not exist in IA and can therefore not be loaded. That’s bad…
- *::CAAIndex::* checks verify the index.json file. The high number of failures for id correct is because the old index schema used strings for image IDs, whereas they’re now integers (also reported in the comments of CAA-88, I believe). Similarly for release url correct, these used to be http://, now they’re https://, but that change was never retroactively applied. The other issues are likely just one-off cases where a change wasn’t propagated to IA. These would all be fixed via reindexing (even the missing index.json ones).
- Item::exists simply checks whether the item exists at all. Apparently, there’s some items which don’t. Again, that’s bad…
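To make the schema-related CAAIndex failures more concrete, here is a rough sketch of what such checks could look like for a single parsed index.json. This is not the auditing tool's actual code. The key names (`release`, `images`, `id`, `thumbnails`) and the expected release URL format are assumptions about the index schema, and `check_index` is a made-up name.

```python
from typing import Any

# Assumed thumbnail keys of the current index schema.
EXPECTED_THUMB_KEYS = {"250", "500", "1200"}


def check_index(index: dict[str, Any], release_mbid: str) -> list[str]:
    """Return the names of the schema checks that fail for one parsed index.json."""
    failures = []
    # Old indexes still point at http://, so they fail this comparison.
    expected_url = f"https://musicbrainz.org/release/{release_mbid}"
    if index.get("release") != expected_url:
        failures.append("release url correct")
    for image in index.get("images", []):
        image_id = image.get("id")
        # Old schema stored the id as a string. bool is excluded explicitly
        # because it is a subclass of int in Python.
        if not isinstance(image_id, int) or isinstance(image_id, bool):
            failures.append("id is int")
        # Old schema lacks the numeric thumbnail size keys.
        if not EXPECTED_THUMB_KEYS <= set(image.get("thumbnails") or {}):
            failures.append("thumbnails correct")
    return failures


# Hypothetical old-style index, for illustration only.
old_index = {
    "release": "http://musicbrainz.org/release/RELEASE-ID",
    "images": [{"id": "123", "thumbnails": {"small": "...", "large": "..."}}],
}
print(check_index(old_index, "RELEASE-ID"))
```

An index like the one above would trip all three checks at once, which fits with the id, thumbnails, and release url checks having the highest failure counts in the table.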
In terms of item skips, here’s what each skip reason means:
- darkened means that the item was darkened (temporarily taken down) by IA, so no info can be retrieved
- ia modified means that the item on IA was last modified after the TIMESTAMP in the DB dump; such items are skipped so as not to compare against outdated data.
- DeletedItem::test item means that there’s no index.json in either the root or history/files/ of the item, so it was likely an item uploaded from the test instance. Ideally, these should be removed when the test instance is reset.
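The first two skip reasons could be derived from IA's metadata response roughly as follows. This is only a sketch: it assumes that the response carries an `is_dark` flag for darkened items and an `item_last_updated` Unix timestamp, and `skip_reason` is a hypothetical helper, not a function from the tool.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def skip_reason(ia_metadata: dict[str, Any], dump_time: datetime) -> Optional[str]:
    """Return why an item should be skipped, or None if it can be audited."""
    # Assumption: darkened items are flagged with "is_dark" in the response.
    if ia_metadata.get("is_dark"):
        return "darkened"
    # Assumption: "item_last_updated" is a Unix timestamp in seconds.
    last_updated = ia_metadata.get("item_last_updated")
    if last_updated is not None:
        modified = datetime.fromtimestamp(int(last_updated), tz=timezone.utc)
        if modified > dump_time:
            return "ia modified"
    return None


# Placeholder dump time; the real value would come from the dump's TIMESTAMP.
dump_time = datetime(2021, 4, 17, tzinfo=timezone.utc)
print(skip_reason({"is_dark": True}, dump_time))
```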
Actionable results
I’ve taken the liberty of categorising each of the check failures into groups based on how to resolve them. The result is here. Each file is structured as one release per line, with release ID, MB URL, IA item URL, and fail reasons, separated by tabs.
- reindex_high_priority are IDs that need to be reindexed fairly urgently, which includes those where the index.json is missing, malformed, is missing entries for one or more images, or has one or more images that aren’t on MB. I categorised these as urgent as Picard relies on this information. These could be fed into the CAA-Indexer queue and that should fix it automatically.
- reindex is a superset of the above, containing also those releases where the index.json has wrong information (such as types, comments, approved status, and order). Again, queueing a reindex should fix these. It also includes those releases where the index.json schema is outdated (thumbnails missing 250/500/1200 keys, id as string).
- reindex_w_metadata is a superset of reindex, including those releases where the index itself isn’t wrong, but the mb_metadata.xml-extracted metadata is. Again, can be fixed by a reindex.
- ia_set_mediatype and ia_set_noindex are files I’ll submit to IA for an auto_submit task. Submitted and fixed.
- deleted_properly_delete, merged_properly_delete, and properly_emptied relate to the deleted, merged, and empty item checks. Those would require more work to bulk-fix.
- manual_check are items that I’ll manually check (the one not in the CAA collection, for example), as well as those that need a manual fix (missing images and missing items). I’ll check those later today. Hopefully I can find out where those images have gone; in the worst case, I’ll have to remove the images from MB since they’re currently unusable anyway.
There’s also the darkened_items file, which doesn’t follow the same structure (it’s just one ID per line, not all IDs are guaranteed to exist). Those are all darkened items. That should be the same file as the one described in the comments of MBS-6567, it’s the same number of IDs too. That file could be used to set the darkened status in the DB (but I’ll propose a better solution further down).
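For anyone who wants to feed these files into a script (to queue reindexes, say), reading them could look something like the sketch below. It assumes that everything after the third tab-separated column is a fail reason. The sample line is made up, and `read_actionable` and `ActionableRow` are hypothetical names.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class ActionableRow(NamedTuple):
    release_id: str
    mb_url: str
    ia_url: str
    fail_reasons: list[str]


def read_actionable(lines: Iterable[str]) -> Iterator[ActionableRow]:
    """Parse one release per line: ID, MB URL, IA URL, then fail reasons."""
    # QUOTE_NONE: treat any quote characters in fail reasons as plain text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        yield ActionableRow(fields[0], fields[1], fields[2], fields[3:])


# Made-up sample line, not taken from the real files.
sample = io.StringIO(
    "RELEASE-ID\thttps://musicbrainz.org/release/RELEASE-ID\t"
    "https://archive.org/details/mbid-RELEASE-ID\t"
    "Item::CAAIndex::is present\tItem::Files::index.json exists\n"
)
rows = list(read_actionable(sample))
print(rows[0].release_id, rows[0].fail_reasons)
```

Note that this does not apply to the darkened_items file, which is just one ID per line.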
Data
I’ve made available all the auditing data. Here’s an overview of the files:
- actionable_results.tar.xz is the same file as I’ve attached to this ticket.
- all_results.xz expands to a file which lists all of the 140M checks and their status (warning: >10GB uncompressed!)
- audit_data.tar.xz is an archive of the per-item results produced by the auditing tool. Each item is in its own subdirectory, spread out across 3 levels. In each subdir, failures.log contains a semi-structured list of failed checks, audit_log is (fairly verbose) logging output of the auditor for this item, ia_metadata.json is the response from IA’s metadata endpoint for this item, and index.json is the item’s index, if it exists. Warning: Decompression bomb, expands to 5M+ files and 40GB+ of data.
- audit_results.tar.xz is an archive of aggregated results for all items, which were later postprocessed into the actionable results.
- input_data.tar.xz is the input data provided to the auditor.
Since the audit_data.tar.xz archive contains the IA metadata, which lists all files and their sizes, these could be used to backfill image and thumbnail sizes in the DB. Those columns already exist, but seem to be empty.
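As a rough idea of how such a backfill could start, the sketch below pulls image and thumbnail sizes out of one ia_metadata.json response. It assumes the response has a `files` list whose entries carry `name` and `size`, and that CAA files are named `mbid-<MBID>-<image id>.<ext>` with a `_thumb<size>` suffix for thumbnails. Both are assumptions on my part, `file_sizes` is a made-up name, and a real backfill would also have to deal with whatever other files an item holds.

```python
import re
from typing import Any

# Assumed naming: mbid-<MBID>-<image id>.<ext> for originals and
# mbid-<MBID>-<image id>_thumb<size>.<ext> for thumbnails.
FILE_RE = re.compile(
    r"^mbid-[0-9a-f-]{36}-(?P<image_id>\d+)(?:_thumb(?P<thumb>\d+))?\.\w+$"
)


def file_sizes(ia_metadata: dict[str, Any]) -> dict[tuple[int, str], int]:
    """Map (image id, 'original' or thumbnail size) to the file size in bytes."""
    sizes = {}
    for entry in ia_metadata.get("files", []):
        match = FILE_RE.match(entry.get("name", ""))
        if match and "size" in entry:
            kind = match.group("thumb") or "original"
            # Sizes are converted with int() in case they arrive as strings.
            sizes[(int(match.group("image_id")), kind)] = int(entry["size"])
    return sizes


# Made-up metadata with a placeholder MBID and image id.
mbid = "00000000-0000-0000-0000-000000000000"
metadata = {
    "files": [
        {"name": f"mbid-{mbid}-123.jpg", "size": "1000"},
        {"name": f"mbid-{mbid}-123_thumb250.jpg", "size": "50"},
        {"name": "index.json", "size": "10"},
    ]
}
print(file_sizes(metadata))
```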
The tool
I’ve already plugged the tool previously, but I’ll go a bit more in depth again. It’s available in this repo and there’s some documentation available on how to run it and how it works (TL;DR: A bunch of async code and a queue to handle massive concurrency). Feel free to reuse this tool to re-run a similar analysis in the future. At the moment, it’s built to scan all of these items reasonably quickly with a high number of concurrent requests. For reference, this analysis took about 10h at 1000 concurrent tasks, the main bottleneck was single-core speed.
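The "async code and a queue" approach can be pictured as a pool of workers draining a shared queue. The following is a simplified sketch of that general pattern, not the tool's actual implementation. `run_audit` is a made-up name, and `fake_audit` is a placeholder for the real per-item checks, which would fetch data from IA.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def run_audit(
    item_ids: Iterable[str],
    audit_item: Callable[[str], Awaitable[list[str]]],
    concurrency: int = 1000,
) -> dict[str, list[str]]:
    """Audit all items with at most `concurrency` checks in flight."""
    queue = asyncio.Queue()
    for item_id in item_ids:
        queue.put_nowait(item_id)
    results: dict[str, list[str]] = {}

    async def worker() -> None:
        # Each worker pulls items until the queue is drained.
        while True:
            try:
                item_id = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results[item_id] = await audit_item(item_id)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results


async def fake_audit(item_id: str) -> list[str]:
    """Placeholder check: pretend items named 'missing-*' do not exist."""
    await asyncio.sleep(0)  # stands in for network I/O
    return ["Item::exists"] if item_id.startswith("missing-") else []


audit_results = asyncio.run(run_audit(["item-1", "missing-2"], fake_audit, concurrency=2))
print(audit_results)
```

Since all workers share one event loop on one thread, a design like this would be consistent with single-core speed being the bottleneck.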
I think an interesting idea would be to deploy a similar tool to periodically audit the CAA items and automatically fix (i.e., queue a reindex) whatever failures can be fixed, and log the others somewhere for manual intervention. However, it’s probably not a good idea to hammer IA servers for 10h straight every month, so such a tool would instead need to check items continuously at a low rate, say one item every second or so. If a new check is started every second, the whole CAA collection could be audited every month without putting heavy load on IA. Further optimisations could be made to not check items which were correct previously and which are known not to have changed since the previous check. This auditor could perhaps also fill the darkened status of MB releases, although there's better ways to handle that.
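As a back-of-the-envelope check of the one-check-per-second idea: with the roughly 1.45M items from the TOTAL row, a full pass would take just under 17 days, so a monthly cycle leaves some headroom. The sketch below shows that arithmetic along with one way to pace the checks. Both function names are hypothetical, and the pacing loop is only one possible design.

```python
import asyncio
from typing import Any, Callable, Coroutine, Iterable


def days_per_pass(item_count: int, checks_per_second: float = 1.0) -> float:
    """Days needed to start one check for every item at a fixed rate."""
    return item_count / checks_per_second / 86_400


async def paced_audit(
    item_ids: Iterable[str],
    audit_item: Callable[[str], Coroutine[Any, Any, None]],
    interval: float = 1.0,
) -> None:
    """Start one check every `interval` seconds without waiting for it to finish."""
    pending = set()
    for item_id in item_ids:
        task = asyncio.create_task(audit_item(item_id))
        pending.add(task)
        task.add_done_callback(pending.discard)  # drop finished checks
        await asyncio.sleep(interval)
    # Wait for any checks that are still running after the last one started.
    await asyncio.gather(*pending)


# Item count taken from the TOTAL row of the results table.
print(round(days_per_pass(1_449_931), 1))
```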
I think the existing tool could be used as a basis for such a scanner. Some changes would need to be made, I documented some of them in one of the docs in the repo. If such a thing would be desirable, I’d be happy to help out.
- is a dependency of IMG-155 "1200 resolution covers won't download - not being returned as options" (Closed)