[IMG-123] Change "Add Cover Art" form to upload images so they are available faster

Type: Improvement
Resolution: Unresolved
Priority: Normal
Component/s: None
Labels: None

I found out today that the Internet Archive automatically runs a process on uploaded files that delays the contents becoming available, sometimes for quite a long time. The files it produces are called "derived" files, and the process is documented here:

https://archive.org/services/docs/api/ias3.html

The API documentation notes that this delay can be avoided by adding an x-archive-queue-derive:0 header to the upload, which skips the derive process.

https://archive.org/services/docs/api/ias3.html#skip-derive-process
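
As a rough illustration of what the header looks like on an upload, here is a minimal sketch that builds (but does not send) a PUT request in the style of the IA S3-like API. The item name, file name, and credentials are hypothetical placeholders, and how the "Add Cover Art" form would actually pass the header is not covered here.

```python
from urllib.request import Request

# Endpoint of the IA S3-like API (see the ias3 documentation linked above).
IAS3_ENDPOINT = "https://s3.us.archive.org"


def build_upload_request(item, filename, data, access_key, secret, skip_derive=True):
    """Build a PUT request for one file. Nothing is sent over the network.

    `item`, `filename`, `access_key` and `secret` are placeholders for
    illustration; they are not taken from the ticket.
    """
    headers = {"authorization": f"LOW {access_key}:{secret}"}
    if skip_derive:
        # Ask IA not to queue a derive task for this upload.
        headers["x-archive-queue-derive"] = "0"
    url = f"{IAS3_ENDPOINT}/{item}/{filename}"
    return Request(url, data=data, method="PUT", headers=headers)


# Hypothetical usage: prepare an image upload that skips the derive.
request = build_upload_request(
    "example-item", "example-image.jpg", b"...", "ACCESS", "SECRET"
)
```

When `skip_derive` is false the header is simply left out, so IA's default behaviour applies.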

This is a suggestion to add this header to the upload so that the images are available faster, which will improve the user feedback and confirmation that image uploads were successful when adding new cover art.

ROpdebee added a comment - 2021-04-12 08:45

There might be some merit to this, but it would be in exceptional cases only.

Basically, each image that is uploaded spawns an individual "archive" task in IA's queue. Those normally get picked up quite quickly, and they "snowball", meaning that the task being executed looks ahead in the queue to find newer archive tasks to process as well. That snowballing stops when 1) no other archive tasks exist, 2) a limit of snowballed tasks has been reached (I think it's 50), or 3) the lookahead hits a non-archive task (such as derive). A derive task is queued only after snowballing is done, so as long as snowballing works properly, derives will always be queued after all the images have been imported.
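
The snowballing described above can be sketched as a toy queue simulation. This is only an illustration of the three stop conditions as described in this comment, not IA's actual implementation. The function name, the task labels, and the default limit of 50 (which the comment itself is unsure about) are assumptions.

```python
from collections import deque


def run_archive_task(queue, limit=50):
    """Run the archive task at the head of `queue` and snowball later ones.

    `queue` is a deque of task labels such as "archive" or "derive".
    Returns (number of snowballed tasks, reason snowballing stopped).
    A derive task is appended to the end of the queue once snowballing stops.
    """
    if not queue or queue[0] != "archive":
        raise ValueError("head of queue must be an archive task")
    queue.popleft()  # the archive task that is being executed
    snowballed = 0
    while True:
        if not queue:
            reason = "no other archive tasks"  # stop condition 1
            break
        if queue[0] != "archive":
            reason = "hit non-archive task"  # stop condition 3
            break
        if snowballed >= limit:
            reason = "limit reached"  # stop condition 2
            break
        queue.popleft()  # look ahead and process the next archive task too
        snowballed += 1
    queue.append("derive")  # derive is queued only after snowballing is done
    return snowballed, reason


# Three uploads already queued: all are handled in one go, then a derive.
tasks = deque(["archive", "archive", "archive"])
result = run_archive_task(tasks)
```

In this simplified model a second derive is appended even if one is already waiting in the queue; whether IA deduplicates derives is not stated in the comment.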

However, this will not be the case when IA is processing the archive tasks faster than you can upload the images. In that case, it will hit snowballing stop condition 1 and will queue up a derive at the end of the queue. If you're still uploading images at that time, the derive will have to run before the new images can be imported. Nonetheless, since archive tasks run at a much higher priority than most derive tasks, the derive will likely be "rushed", meaning that it is started but forcefully interrupted shortly afterwards so that later non-derive tasks can be executed sooner. The later archive tasks will then queue up a new derive when done.

You can probably avoid having an early derive block the queue for a while by indeed not queuing derives for images when they are uploaded, and only having the upload of index.json queue the derive, since that one always happens at the very end, as far as I know. But, in my opinion, because of rushing and snowballing, such cases would be rare enough that making such a change may cause more harm than good. Moreover, since deriving is vital to create the thumbnails, the images will still display as "This image is not available yet" until a derive is done anyway. Finally, derives for CAA items often take less than a minute, so the delay shouldn't be too large either.
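
For illustration only, the alternative mentioned above (skip the derive for each image and let only the final index.json upload queue it) could be planned along these lines. The function name and the example file names are hypothetical; only the x-archive-queue-derive header and index.json come from this ticket.

```python
def plan_upload_headers(filenames, final_name="index.json"):
    """Return (filename, extra_headers) pairs for a batch of uploads.

    Every file except `final_name` gets the header that skips the derive;
    the final file carries no such header, so IA's default applies and a
    derive is queued after it.
    """
    plan = []
    for name in filenames:
        if name == final_name:
            extra = {}  # default behaviour: derive is queued
        else:
            extra = {"x-archive-queue-derive": "0"}  # skip derive for images
        plan.append((name, extra))
    return plan


# Hypothetical batch: two images followed by the index file.
plan = plan_upload_headers(["front.jpg", "back.jpg", "index.json"])
```

This relies on index.json really being uploaded last, which the comment only states "as far as I know".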

Robo Tardis added a comment - 2020-09-02 23:15

It seemed to me that adding this header would only keep the "derive" process from being done before the original contents were made available/visible. I guess I had assumed that the derive to produce the thumbnails would still be done, but not until some later postprocessing step.

Michael Wiencek added a comment - 2020-09-02 23:07

Unless I'm misunderstanding, that sounds like it skips the derive process entirely. The derive process is essential to the CAA because it's what generates the 250px, 500px, and 1200px thumbnails made available via our API.
