EPrints User Group Meetings Roundup

I’ve had a busy couple of weeks, with a UKCoRR Members’ Day, the German Language User Group Meeting, an EPrints Hack Day and the EPrints UK User Group meeting all occurring in rapid succession.

UKCoRR Members’ Day

Hosted at the University of Glasgow, this event was a chance to hear talks on the policy issues that keep repository managers up at night. The quality of the presentations was very high, and I particularly enjoyed hearing Ben Johnson from HEFCE talk about Open Access.

I caught up with a few of the repository community’s usual suspects, and presented on the direction EPrints is heading, both from the perspective of the software and the community [ slides ].

German Language User Group Meeting

A few days later, I found myself in Zurich with the German-speaking part of the EPrints Community. I presented the same slides as I did at the UKCoRR meeting, but went into much greater detail. We had an active discussion where the following requests were made:

  • jQuery integration (i.e. throw away Prototype)
  • Features to make responsive templates simpler to implement
  • Bazaar package documentation
    • List of repositories that have each package installed
    • Some kind of indication of how easy a package is to install

While the first two requests are squarely in the camp of core development, the Bazaar package documentation requests are quite easily achievable. EPrints repositories make no secret of which bazaar packages are installed (check the /cgi/counter URL on your repository), so we can harvest bazaar package usage. However, there are some concerns that advertising who has what installed may have privacy or security implications. We may start by simply showing a count of how many repositories have each bazaar package installed, and then we can do cool things like order bazaar packages by popularity.

The second bazaar documentation request stems from the lack of indication on the repository of how easy it is to install a package. Some packages are an easy one-click install. Some require some configuration (API keys and the like). Some require a systems administrator to install additional libraries on your server. Watch this space for new accolades in the bazaar to help with this.

There was also talk about creating more formal channels of communication from the community through to EPrints, and I’m looking forward to seeing how these conversations develop.

It was really great to meet with some of the non-UK community and to hear about their successes, issues and concerns. After the meeting we went to a coffee shop on the roof of the University next door where the view over Zurich was spectacular.

EPrints Hack Day

John Salter decided at the Repository Fringe that an EPrints developer event would be a good idea. We put out the call and were able to gather a handful of developers, who met the day before the UK User Group Meeting with the stated goal of closing bugs and pull requests on GitHub. We chose that day because EPrints developers might want to go to the User Group Meeting, too.

John ran a tight ship, and there were a number of important outputs to the day:

  • Bugs got fixed
  • We learned how to build EPrints development environments (training video to follow soon)
  • The developer community got stronger

It’s the third point on the list above that excites me the most. We plan to try to have hack days piggy-backing onto user group meetings in future. We also may be running remote hack days, where there’s no meeting, but there is EPrints development. I also think we need to come up with a better name than ‘Hack Day’.

The slides that John presented the following day at the UK User Group Meeting are [ here ]

UK User Group Meeting

The final of my three meetings in three days was the UK User Group Meeting, hosted at the University of Southampton. I got the first presentation slot, and presented some of my ideas for how we can build a stronger community [ Slides ]. Among these:

  • Showing off my Training Videos and encouraging the community to produce some.
  • Talking about how to create wiki documentation from interactions on the eprints-tech mailing list.
  • Soliciting ideas for EPrints features on tricider.

I’ve promised the community that I will attempt to build whichever tricider idea gets the most votes. To my bemusement, there were requests for me to livestream the development.

The rest of the day went well, with interesting presentations from community members, including JISC presenting about the REF plugin, and Peter West showing ORCID integration.

It’s been a busy couple of weeks. I’ve really enjoyed meeting so many of the community in so many contexts, but I’m looking forward to a few quiet days in the office now.

EPrints Nth Fulltext Download

An EPrints repository administrator wanted to make a fuss over the author of the 50,000th fulltext download in their repository, and reached out.  After some time looking at the problem, I discovered that finding out the 50,000th fulltext download from an EPrints repository is not as simple as it sounds.  Ironically, if IRStats wasn’t installed, it would have been far easier.

The Simple Approach

EPrints stores every view of an abstract page or download of a fulltext document in its ‘access’ table.  A query on this will give us the 50,000th download:

SELECT referent_id FROM access WHERE service_type_id = '?fulltext=yes' ORDER BY accessid LIMIT 49999,1

This will give you an answer that is in some way correct, but if you have IRStats, you can do better.
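The offset arithmetic is easy to get wrong, so here is a minimal sketch of it.  It uses Python with an in-memory SQLite database standing in for the repository’s MySQL database, and it assumes that the access table has an accessid column that reflects insertion order.  The function name and the toy rows are made up for illustration.  The point is that the offset is zero-based, so the Nth row sits at offset N - 1.

```python
import sqlite3


def nth_fulltext_download(conn, n):
    """Return the referent_id of the n-th fulltext download (1-based), or None."""
    row = conn.execute(
        "SELECT referent_id FROM access "
        "WHERE service_type_id = '?fulltext=yes' "
        "ORDER BY accessid LIMIT 1 OFFSET ?",
        (n - 1,),  # OFFSET is zero-based, so the n-th row sits at offset n - 1
    ).fetchone()
    return row[0] if row else None


# Toy data only: an in-memory SQLite table with the three columns we need.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access (accessid INTEGER PRIMARY KEY, "
    "referent_id INTEGER, service_type_id TEXT)"
)
conn.executemany(
    "INSERT INTO access VALUES (?, ?, ?)",
    [
        (1, 10, "?fulltext=yes"),
        (2, 11, "?abstract=yes"),  # abstract views are filtered out
        (3, 12, "?fulltext=yes"),
        (4, 13, "?fulltext=yes"),
    ],
)

print(nth_fulltext_download(conn, 2))  # second fulltext download
```

With the toy rows above, the second fulltext download is the row with accessid 3, because the abstract view in between is skipped.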

The Subtleties of IRStats

IRStats 2 starts with the access table discussed above and filters out spam and optimises the data for its visualisations, creating a new set of tables.  The most useful table that IRStats maintains is the ‘irstats2_downloads’ table.  To get the total number of downloads from this table, we need to run this query:

SELECT SUM(count) FROM irstats2_downloads

The reason for this is that this table has a granularity of a day.  Each row represents the number of downloads of a given eprint’s fulltext on any given day.  To find the 50,000th download I repeatedly ran queries like this:

SELECT SUM(count) FROM irstats2_downloads WHERE datestamp <= 20150413

…until I found the first day on which the sum of the count column reached 50,000 or more.  Then, I checked the UIDs of the rows for that day, and repeated the process using ascending UIDs until the sum was >= 50,000.
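The repeated querying amounts to a running total.  Here is a rough Python sketch of the same idea, not the script I actually ran.  It assumes each row can be read as a (uid, eprintid, datestamp, count) tuple; the eprintid column name and the toy rows are assumptions for illustration.

```python
def find_nth_download(rows, n):
    """Find the row containing the n-th download.

    rows: iterable of (uid, eprintid, datestamp, count) tuples.
    Walks the rows in (datestamp, uid) order, keeping a running total of
    count, and returns (eprintid, datestamp) for the first row at which
    the total reaches n.  Returns None if there are fewer than n downloads.
    """
    total = 0
    for uid, eprintid, datestamp, count in sorted(rows, key=lambda r: (r[2], r[0])):
        total += count
        if total >= n:
            return eprintid, datestamp
    return None


# Toy rows: running totals are 2, 3, 4 and 7.
toy_rows = [
    (1, 101, 20150710, 2),
    (2, 102, 20150710, 1),
    (3, 101, 20150711, 1),
    (4, 103, 20150711, 3),
]

print(find_nth_download(toy_rows, 5))
```

In the toy data the fifth download falls inside the last row, which holds downloads five to seven; that is the same blurring that affects the real result.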

Differently Inaccurate

I wrapped both of these approaches into a script (https://github.com/gobfrey/nth_download/blob/master/nth_download.pl), which produced the following output for the repository in question:

EPrint 3160 was download number 50000 on 20150711
This result was generated from the IRStats2 downloads table

EPrint 1064 was download number 50000 on 20141027
This result was generated from the EPrints access table

These are obviously very different numbers, but each is flawed in different ways.  The access table method doesn’t take spam into account (repeat downloaders, spiders, etc), but more importantly, it significantly disagrees with IRStats, which is the primary source of repository statistics.

The IRStats result is flawed in a far more subtle way.  There’s a margin for error which is related to the number of downloads on the day that the 50,000th download occurred.  The reason for this is that each download of an item is sequentially inserted into the irstats table with a count of 1, but if a second download happens for any given item on the same day, the count on the existing row is incremented. This means we have no way of knowing if it happened immediately after the first download, or whether it was the last download of the day.  The approach I’ve taken with the queries above means that I treat all downloads of the same item on the same day as having happened concurrently.
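A small simulation makes the loss of ordering concrete.  This is a sketch of the behaviour as described above, not IRStats code; the function name and the event lists are invented for illustration.

```python
def record_downloads(events):
    """Fold ordered (datestamp, eprintid) download events into daily rows.

    The first download of an item on a day creates a row with count 1;
    later downloads of the same item on the same day increment that row.
    """
    rows = {}  # dicts keep insertion order, mirroring sequential inserts
    for datestamp, eprintid in events:
        key = (datestamp, eprintid)
        rows[key] = rows.get(key, 0) + 1
    return [(d, e, c) for (d, e), c in rows.items()]


# Item 1 is downloaded twice: straight away in one case, last in the other.
early_repeat = [(20150711, 1), (20150711, 1), (20150711, 2), (20150711, 3)]
late_repeat = [(20150711, 1), (20150711, 2), (20150711, 3), (20150711, 1)]

print(record_downloads(early_repeat) == record_downloads(late_repeat))
```

Both sequences produce identical rows, so the table alone cannot tell us whether the repeat was the second or the fourth download of the day.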

Here’s what the day in question actually looks like:

select `count`, COUNT(*) from irstats2_downloads where datestamp = '20150711' group by `count`;
| count | COUNT(*) |
|     1 |       82 |
|     2 |       17 |
|     3 |        2 |
3 rows in set (0.00 sec)

Around 19% of the rows for that day (19 of 101) record more than one download, which will skew the result.  There’s quite a high chance we’ve chosen the wrong one (in fact, it’s quite unlikely that we have the right answer).
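For anyone who wants to check the figures, the histogram can be summarised with a few lines of Python.  The numbers are copied from the query output above; the variable names are mine.

```python
# count value -> number of rows with that count, from the query output
histogram = {1: 82, 2: 17, 3: 2}

rows = sum(histogram.values())
downloads = sum(count * n for count, n in histogram.items())
rows_with_repeats = sum(n for count, n in histogram.items() if count > 1)
repeat_downloads = downloads - rows  # downloads beyond the first per item

print(rows, downloads)                                # 101 122
print(round(100 * rows_with_repeats / rows, 1))       # 18.8
print(round(100 * repeat_downloads / downloads, 1))   # 17.2
```

So roughly 19% of the rows have a count above 1, and about 17% of the day’s 122 downloads are repeats whose position in the sequence is unknown.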

The final piece of data is the difference in size of the two sets of download data.  The access table has 88310 downloads, which is 71% more than the 51739 in the irstats downloads table (put the other way round, IRStats keeps about 59% of them).

So, we have one method that will guarantee the item is in the right place in the sequence of downloads, but the sequence is somewhat bloated.  We also have another method that uses a better sequence, but chooses with a random offset.

More Development Required

The way to solve this is to develop an IRStats plugin that explicitly stores Nth downloads for important values of N.  That, however, is a job for another day.

A day of fun

Best. Day. Ever!

My OR2015

OR2015 was great.  I think I enjoyed myself more than I have at any other Open Repositories.  I was involved in a number of things that kept me really busy throughout the conference:

Developer Track

The conference committee had decided that a full developers track would replace the Developer Challenge of previous years, with paper presentation sessions in the main conference.  The amazing and well-organised Claire Knowles co-chaired the track with me, and we sent out our call with an emphasis on practical demonstrations and informal presentations.

We filled two sessions with submissions, and we enjoyed a broad set of technology and process demonstrations.  Hardy Pottinger gets a special mention for really entering into the spirit of things: live source management, compilation and yo-yo demonstrations, all while handling with aplomb the inevitable hiccoughs in his live demonstration of DSpace development within Vagrant.

Ideas Challenge

The Developer Challenge had been replaced by the Developer Track, but that left a Developer Challenge-sized hole in the conference, which Claire and I thought was important to fill.  The Developer Challenge has, to my mind, been one of the crown jewels of Open Repositories.  It was where repository developers would form small teams and build prototypes of new features.  It’s provided a huge number of benefits to the community, among them:

  • A forum to talk about the current on-the-ground state and future of repository software
  • A networking opportunity for developers
  • A list of good ideas that might influence repository platform development
  • An event at which developers are the stars

So, as part of the Developer Track, we designed the Ideas Challenge, which involved producing a 4-slide PowerPoint presentation outlining a piece of development that could be done.  A scoring system was designed to encourage working with new people and the participation of non-developers.  We ended up with 9 entries, and some very interesting ideas.  Full blogpost at http://www.or2015.net/2015/06/17/ideas-challenge-winners/


Vendors Table

EPrints Services had contributed as a supporter of OR2015, which entitled us to a vendors table.  This worked really well as a walk-up point for EPrints, and Will and I staffed the desk during breaks.  A number of our customers and users made a point of seeking us out for a chat, and it was nice to connect with new and old faces.  We had some postcards highlighting some of our services, which are downloadable below.


EPrints Interest Group

Discussions with various community members that had begun at the EPrints Services Vendor Table continued through the EPrints Interest Group sessions.  In my “State of the Union” presentation, we talked about the future of the software and the community.  I showed off the community work I had been doing over the past six months, including the training videos.  I then asked the room what they wanted to see, and the most interesting idea that was raised was the possibility of a community steering group for EPrints.

I have, since returning to EPrints Services, been trying to find ways to promote better documentation.  My Community Development presentation was a report on work Tomasz Neugebauer and I had been doing over the past few months.  I helped Tomasz build a bazaar package by providing advice and debugging support, and he created a bazaar package with good documentation.  My presentation encouraged a fairly wide-ranging discussion of documentation, and Meg Eastwood from NAU volunteered to come up with some recommendations on how the wiki could be improved.


EPrints UK User Group Meeting Report

The Winter 2015 EPrints UK User Group Meeting was hosted by ULCC and was my first EPrints User Group Meeting with my EPrints Community Lead hat on.  The programme, organised by David McElroy, was published on the EPrints UK User Group Google group.

This was my first public outing wearing my new hats:

  • EPrints Services Business Relationship Manager
  • EPrints Community Lead

What struck me most about the event was the evident health of the EPrints community.  The free tickets to the event were snapped up in a matter of days; on the day, the room was happily bustling with delegates from all over the UK; the presentations were varied and interesting; and a good crowd met for drinks after the event.

As well as a short presentation introducing myself and the Community Lead role, I had also been tasked with presenting the EPrints Development Roadmap to the community. The presentations are available on YouTube (roadmap presentation at 2:34:07).

To end the event, I chaired a feedback session where we talked about what the community would like to see, and I invited people to email me directly after the session with further feedback.  Here’s what was received:

  • EPrints Development
    • Drag & Drop file upload
    • Click-editable metadata fields
    • Thread-awareness in EPrints v4
    • Metadata extraction and workflow auto-population (perhaps from DOIs)
    • Improvements for Visual Arts items (further development of Kultur work)
    • Better infrastructure for dealing with duplicate records
    • Generic metadata schemas for Research Data Repositories
  • Other Requests
    • Technical Training Sessions (currently, EPrints Services routinely offers administrator training)
    • Technical Webinars
    • Regular updates from EPrints

Open Repositories Twitter Trends

I attended the Open Repositories 2014 conference last week, and harvested the conference Twitter hashtag using an EPrints repository with the Tweepository package installed.  During the conference I generated wordles which I tweeted (the Tweepository package makes that a two-click process).  These proved to be quite popular, so I thought I’d archive them here.  Anyone interested in the trends of the conference can do a comparison.  Here they are with their original tweet texts:



I’m in Helsinki for a conference, and I’ve been walking around, taking some pictures.  Here are five of them.

White Faced Saki

I haven’t been out with my camera lately, but I did get a nice shot of a White Faced Saki at Marwell Zoo.  I had a play with my black-and-white processing software, and then with the colour processing software.  Here are the results for the same photograph.  I thought the black-and-white one was best at first, but the colour one is winning me over.

Marwell Portraits

On Noah’s birthday, we bought season tickets to Marwell Zoo, and we’ve been making good use of the tickets ever since.  I’ve also been playing (again) with black-and-white processing, just for fun :)  Here are some animal portraits.