Type: Improvement
Resolution: Unresolved
Priority: Normal
This is something that regularly manages to annoy me for one reason or another, so I'm making a ticket to collect ideas and, hopefully, to find out whether it's just me who wishes it would change or whether other people would like it too.
Right now the edit dump is 2.8 GB, which is huge: the rest of the dumps put together are only just over 2 GB. We don't seem to have a problem splitting other things out even though they're tiny (e.g. wikidocs at 7.1 KB or documentation at 18 KB), yet we still provide the edit history data as one monolithic file.
Not everyone has tons of space available (I just tried to extract the edit dump on my server, only to be met with "Cannot write: No space left on device" after extracting 15 GB of the edit table). Not everyone has a super fast internet connection either (as dufferzafar coincidentally mentioned on IRC while I was writing this).
Some use cases where smaller dumps would help:
I wanted to extract URLs from edit notes and get the IA to archive them. I could download the huge dump as a one-off, but if I want to do it regularly, I would have to download the edit dump regularly too, even though the data I'm interested in is only a fraction of it. Replicating the edit history too (MBS-4510) would kinda help there, except that I'm not actually interested in most of the data and don't really have enough space for it all.
I wanted to search edit notes for references to medium formats we'd just added to the list so I could fix them. Being able to search in edit notes (SEARCH-148) would work for that specific case, but it would still be nice to be able to load that data locally too.
I wanted to create some statistics about votes over time. The votes, however, are again only available in the edit dump.
I wanted to create some statistics about votes over the past year. Similar to the previous one.
I wanted to create some statistics about edits over the past year. Finally, something that requires something from the edit table... but if I want to know about edits in 2014, I don't need the 10+ years prior to that.
I wanted to look at the distribution of open edits. This is very similar to the previous one.
I wanted to test some changes which affect the display of edits. I don't need the entire edit history, but I also don't want to create a load of edits myself. Being able to download a section of the edit history would most likely provide enough data for what I need.
I wanted to look at the way the data is stored in the edit. I can look at individual edits via the raw data link, but that's not very good for looking at a larger sample.
These days I can use rika for quite a few of those, provided I don't mind how old the data is (which rules it out for open edits, for example), but that isn't a general solution.
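To give an idea of how little of the dump the URL use case above actually needs, here is a minimal Python sketch. It assumes the edit notes have already been loaded as (edit id, note text) pairs; the function name `extract_urls`, the regular expression and the sample rows are all made up for illustration and don't reflect the real dump format.

```python
import re

# Deliberately simple URL pattern (an assumption, not a full URL grammar):
# it stops at whitespace and at a few common closing delimiters.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")

def extract_urls(notes):
    """Yield (edit_id, url) pairs from an iterable of (edit_id, note_text) rows."""
    for edit_id, text in notes:
        for url in URL_RE.findall(text):
            # Trim trailing sentence punctuation that is rarely part of the URL.
            yield edit_id, url.rstrip(".,;:!?")

# Hypothetical sample rows standing in for the edit note table.
sample = [
    (1, "Source: http://example.com/release/123."),
    (2, "No links here"),
    (3, "See https://example.org/a and (https://example.org/b)"),
]
urls = list(extract_urls(sample))
```

Only the note text and the edit id are touched here, which is why a separate, smaller edit note dump would be enough for this kind of job.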
Some possible ideas:
Separate edit notes and votes from the main edit table. I'm not sure how big they are, but I imagine they're a much more manageable size. This seems like it would be an easy option, since they're separate tables anyway; they'd just be grouped differently.
Separate edits by year. We rarely change historical edits, so this would potentially mean we could stop dumping the entire edit table twice a week and only dump the more recent edits. People who do like to import all of the edit history would only need to download the more recent dump if they already downloaded the older files. Of course, if we do ever change historical edits, we would need to redo the older dumps, and people who want them would need to redownload them, but that's still an improvement over dumping the entire thing twice a week and downloading the entire thing every time you want to do a full import. This seems more complicated, since it would need, among other things, a way to import multiple files into the same table.
Have a partial edit dump with only recent edits (e.g. from n months ago to now) alongside the existing full edit dump. This option would be more either/or than the previous one: people would pick whether they want the full edit history, partial edit history or no edit history at all.
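As a rough illustration of the last two ideas, here is a Python sketch of what "split by year" and "recent edits only" could look like. It assumes tab-separated rows with a plain timestamp (no time zone) in a known column; the column index, the function names `split_by_year` and `recent_only`, the 30-day month and the sample data are all assumptions for illustration, and the real dump format (escaping, NULL markers, time zones) would need more care.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta

def split_by_year(rows, time_col):
    """Group rows by the year taken from the timestamp in column `time_col`."""
    buckets = defaultdict(list)
    for row in rows:
        year = int(row[time_col][:4])  # timestamp assumed to start with YYYY
        buckets[year].append(row)
    return dict(buckets)

def recent_only(rows, time_col, now, months):
    """Keep rows whose timestamp falls within roughly the last `months` months."""
    cutoff = now - timedelta(days=30 * months)  # a month is approximated as 30 days
    return [r for r in rows if datetime.fromisoformat(r[time_col]) >= cutoff]

# Hypothetical sample: edit id, open time.
SAMPLE = (
    "1\t2013-11-02 10:00:00\n"
    "2\t2014-01-15 08:30:00\n"
    "3\t2014-06-20 19:45:00\n"
)
rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="\t"))

by_year = split_by_year(rows, time_col=1)
recent = recent_only(rows, time_col=1, now=datetime(2014, 7, 1), months=3)
```

With per-year files, importing the full history would just mean loading each year's file into the same table in turn, while the partial dump corresponds to publishing only the output of something like `recent_only`.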