An EPrints repository administrator wanted to make a fuss over the author of the 50,000th fulltext download in their repository, and reached out. After some time looking at the problem, I discovered that finding out the 50,000th fulltext download from an EPrints repository is not as simple as it sounds. Ironically, if IRStats wasn’t installed, it would have been far easier.
The Simple Approach
EPrints stores every view of an abstract page or download of a fulltext document in its ‘access’ table. A query on this will give us the 50,000th download:
SELECT referent_id FROM access WHERE service_type_id = '?fulltext=yes' LIMIT 50000,1
This will give you an answer that is in some way correct, but if you have IRStats, you can do better.
The Subtleties of IRStats
IRStats 2 starts with the access table discussed above and filters out spam and optimises the data for its visualisations, creating a new set of tables. The most useful table that IRStats maintains is the ‘irstats2_downloads’ table. To get the total number of downloads from this table, we need to run this query:
SELECT SUM(count) FROM irstats2_downloads
The reason for this is that this table has a granularity of a day. Each row represents the number of downloads of a given eprint’s fulltext on any given day. To find the 50,000th download I repeatedly ran queries like this:
SELECT SUM(count) FROM irstats2_downloads WHERE datestamp <= 20150413
…until I found the first day on which the sum of the count column was greater than 50,000. Then, I checked the UIDs of the rows for that day, and repeated the process again using ascending UIDs until the sum was >= 50,000.
Differently Inaccurate
I wrapped both of these approaches into a script (https://github.com/gobfrey/nth_download/blob/master/nth_download.pl), which produced the following output for the repository in question:
EPrint 3160 was download number 50000 on 20150711 This result was generated from the IRStats2 downloads table EPrint 1064 was download number 50000 on 20141027 This result was generated from the EPrints access table
These are obviously very different numbers, but each is flawed in different ways. The access table method doesn’t take spam into account (repeat downloaders, spiders, etc), but more importantly, it significantly disagrees with IRStats, which is the primary source of repository statistics.
The IRStats result is flawed in a far more subtle way. There’s a margin for error which is related to the number of downloads on the day that the 50,000th download occurred. The reason for this is that each download of an item is sequentially inserted into the irstats table with a count of 1, but if a second download happens for any given item, the count on the existing row is incremented. This means we have no way of knowing if it happened immediately after the first download, or whether it was the last download of the day. The approach I’ve taken with the queries above means that I treat all downloads of the same item on the same day to have happened concurrently.
Here’s what the day in question actually looks like:
select `count`, COUNT(*) from irstats2_downloads where datestamp = '20150711' group by `count`; +-------+----------+ | count | COUNT(*) | +-------+----------+ | 1 | 82 | | 2 | 17 | | 3 | 2 | +-------+----------+ 3 rows in set (0.00 sec)
Around 19% of downloads were repeats, which will skew the result. There’s quite a high chance we’ve chosen the wrong one (in fact, it’s quite unlikely that we have the right answer).
The final piece of data is the difference in size of the two sets of download data. The access table has 88310 downloads, which is 58% more than the 51739 in the irstats downloads table.
So, we have one method that will guarantee the item is in the right place in the sequence of downloads, but the sequence is somewhat bloated. We also have another method that uses a better sequence, but chooses with a random offset.
More Development Required
The way to solve this is to develop an IRStats plugin that explicitly stores Nth downloads for important values of N. That, however, is a job for another day.