Barriers to Perfect Open Access

To get a bit of insight into what is on the mind of the Repositories Community, I performed a brief survey and asked as many people as I could what barriers to perfect Open Access they came across in their day-to-day professional lives.  The scale of the survey (13 responses to two open-ended questions) doesn’t make it statistically significant, but it did spark a lot of interesting discussion.

Interviewees

I put no plan into who I interviewed — just whoever I bumped into at the conference.  I made a note of the job titles of the people I talked to, and have grouped them into three broad categories:

  • Strategic Leadership
    • Associate Dean for Digital Strategies
    • Director Information Management
    • Head of E-Science
  • Tactical Leadership
    • Library Digital Development Manager
    • Manager Digital Library Product and Services
    • Manager of Leicester Research Archive
    • Project and Community Manager
    • Research Data Support Manager
    • Research Support Manager
  • Technical
    • Applications Specialist
    • Digital Repository Developer
    • Systems Developer
    • Technical Support Staff

This seems like a fairly representative cross section of the conference!  Unfortunately, none are academics — it would have been great to get some input from our end-users.

Perceived Problems

Interviewees were not limited to one response, and many gave more, so the number of answers is higher than the number of interviewees.  Answers have been grouped into rough classes of problem:

  • Academics
    • Academics’ understanding of what OA is, because they have very mixed ideas of what it is. Some are very worried that it will “take away their things”.
    • Awareness of researchers that OA is important to them for their careers.
    • Educating fellow librarians and faculty about OA issues.
    • Embargoes and Engagement with Academics — some academics are great, but some would rather just be doing their research.
    • Engagement of the researcher and their understanding of what it’s for, how it benefits them and that it requires them to spend time on the process.
    • No direct contact — all second-hand, which is good. Academics don’t understand the full consequences of OA and what green, gold, etc mean for them. Whether it’s just ‘we need to wait for a new generation’ or not, it’s taking a long time for people to understand.
    • Rights issues; knowledge of the rights, versions that can be used
    • The technology and the way that research is actually shared at the moment (i.e. requirement to formally publish and have processes of review) — it’s very specific and formulaic. In Perfect OA, there wouldn’t be these barriers. The way people communicate doesn’t support perfect OA.
    • The way Academics are measured and promoted in Australia isn’t supportive of OA.
    • We simply don’t see publications. Our academics write a paper, and we just don’t know it exists, so we can’t ask for it.
  • Infrastructure
    • Accessing content in a machine-readable form.
    • Ease of deposit
    • Grand ideas at the institution, but no developers to implement them, e.g. long-term preservation.
    • Lack of APIs or consistent APIs.
    • Systems are clunky
    • Universities not adhering to standards.
    • We don’t have a clear message and a clear channel for overcoming challenges.
  • Publishers
    • It’s a matter of control. There are relatively few players that are holding very tightly onto the access and control of the flow. We need to have ownership across the whole system (‘we’ being the scholarly research communities, the universities; those that have the public good in mind).
    • Political & Financial issues. Need to find business models to fund and support their work. Established business models are a barrier. Compare with changes in music and media (move to streaming and downloads).
    • Voracious publishers eager to suck up anything that’s put online and then charge libraries for it.

What I find interesting about this list of problems is that the clear winner by weight of numbers is a feeling that Academics still don’t know enough, or aren’t willing enough, to engage fully with Open Access.

Solutions?

The final question was about ways forward: ways in which organisations (like us at SHERPA Services) can help with these problems.

  • Communication
    • Advocacy, education resources
    • Face-to-face advocacy is handled by the library — providing training for advocates on national issues. Advising on narratives would be very useful. We have to make advocacy easier.
    • Part of this is having really frank conversation about what is happening. Elsevier with platforms like PURE and its add-ons is taking huge areas of control over the process. We need to say that this is happening.  Really collaborate internationally to break this. Both financially and intellectually.
    • Technology is very community-driven, and requires knowledge and expertise at various places in the community. At the institutional level, there often isn’t the expertise to drive technology forwards (e.g. upgrades). Perhaps training for developers.
    • We need to get a handle on stopping people publishing in hybrid journals. We should tell people which those are. Raising awareness of the consequences of this is essential.
  • Research
    • Case studies across different disciplines. A lot of the problems vary over disciplines due to different requirements.
  • Technical
    • Accurate machine readable embargo information.
    • APIs (Sherpa services are awesome)
    • Coordination of platform provision would be useful. Help with coordinating and harvesting data to present a community-wide view for institutions.
    • Help determine rights of items, including digitised items
    • Help researchers and organisations have less work in publishing and making their work available. Increases in efficiency will reduce the costs and help to open the data.
    • Making systems more interoperable
    • Notifications from Jisc when publishers accept articles OR advise the authors on what to do next (i.e. please give your repository this article)
    • SHERPA is getting more UK centric. It should be modular to allow international concerns to be served.

My $0.02

People are very passionate about the problems, and it was not difficult to engage people and get them to talk about what they saw as the problems and limitations in Open Access.  The two key ones, which I think have become almost cliché, are that academics don’t deposit, and publishers are just in it for the money.  Two-thirds of the issues mentioned can be put into those two buckets.

I understand the combative relationship we have with publishers.  We’re disrupting their business model, and they’re understandably resisting that.

I think the relationship the community has with the academics is far more complex.  In running the developer challenge at the conference I had one half-serious proposal for an entry.  A couple of delegates suggested that we just wait 30 years until the next generation of academics grows up ready to eat their vegetables.  This view is unhealthy for our movement.  We’re in the business of scholarly communications and our priority should be to serve the academics.  If they don’t understand how to use our services after more than 10 years, that’s our failing, not theirs.

With that in mind, my pick of the solutions is:

Face-to-face advocacy is handled by the library — providing training for advocates on national issues. Advising on narratives would be very useful. We have to make advocacy easier.

Dublin By Bike

I was in Dublin for OR2016, and the day after the conference I was able to squeeze in a great tour of the city with Peter, Kim and Alan.  It was great.

Planning an Ideas Challenge

Open Repositories 2016 is almost upon us, and that means I once again have the pleasure of organising the Ideas Challenge with my co-chair, Claire Knowles.  This is the most fun I have all year!

The Challenge

The Ideas Challenge asks the conference to form small teams and to propose a specific solution to an existing problem.  In my mind, there are three main strategic goals to the challenge, and it’s hard for me to decide which is the most important.

  • It’s an opportunity to discuss the future of repositories.  Ideas proposed at previous conferences have become features of repositories.  Emerging technologies and infrastructure have been demonstrated to the conference, sparking discussion and debate about their merits.
  • It’s a way for developers and non-developers to meet, engage and work together in a relaxed context, outside of their jobs.  They get to run a mini-project and bat around blue sky ideas.  It is one of the key events that helps to bring the developer community together (developers don’t get let out much under normal circumstances) and many of my contacts in the International Repository Community were formed and grew through developer challenges.  This includes my co-chair Claire Knowles.
  • It provides a forum for developers and the development process to take centre stage.  This should not be underestimated.  While policy and business requirements are in the hands of librarians and managers, the software tools that enable Open Access are in the hands of the developers.  Having a general conference session at which software, developers and development collaborations are the stars is an important part of the community calendar.

The Themes

Challenges usually have themes.  Last year (the first year that Claire and I were responsible), the theme was ‘achievable ideas’.  Many winning ideas have historically been large-scale, sector-changing plans for world domination.  While big ideas are valuable and can show strategic priorities, we felt the need to have a look at what progress could be made with small-scale ideas.  Last year’s winner proposed allowing authors to submit fulltext documents by responding to automatic emails from the repository requesting they do so (e.g. “Your publication is missing a document, please reply to this email and attach a PDF”).  Genius!

This year’s theme will look at ideas to make the lives of academics easier.  As another Open Repositories first, we will include some academics from outside of the conference on the judging panel, including a postgraduate and a professor!

Two meta-themes have always run through the challenges: encouraging new collaborations by breaking down barriers between developers and non-developers, and having a fun, informal presentation session.

Min Maxing Effort and Utility

The ideas challenge is supposed to fit into the gaps in the conference, perhaps starting with a conversation during a coffee-break, and then throwing together some slides during a lunch.  The process is:

  1. Form a team
  2. Think of a problem
  3. Think of a specific technical solution
  4. Build a four slide presentation
    1. Introduce the Team
    2. Outline the Problem
    3. Overview of the Solution
    4. Technologies Required
  5. Give a 3-minute presentation to the conference.

Speaking with a member of the idea team that placed second last year, he estimated the total time at around 90 minutes, but said that he expected it would be shorter with a smaller team (he had 5 members).  Their slides were created by photographing scribbles on paper.

I think the presentation session is greater than the sum of its parts, and its parts are really good.  Each presentation offers an informed opinion of the future of repositories.  As an aggregate, the session gives an overview of what is on the minds of the community, what problems the non-developers see, and what technologies and platforms the developer community favours.

The Scoresheet

As a board-game player and some-time game designer, I like the idea that the priorities of the challenge can be codified in a scoring system.  Last year we created a scoresheet that would minimise the possibility of a tie.  We originally did this to minimise judging effort, and that really paid off.  We had the results within 15 minutes of the final presentation, an unprecedented occurrence at Open Repositories.  Thanks to this efficiency, this year the Ideas Challenge presentations will be just before the closing session, meaning teams will have even more time to have ideas and create presentations.

This year’s scoresheet has now been created.  The scoring system prioritises the main themes of the challenge and minimises the possibility of a tie.  We have also added a ‘newbie’ bonus to encourage those who have never participated to get involved.

Start Your Engines…

How can we make repositories serve the Academic Community better?

Carry this thought around in your head in the run-up to the conference.  I am excited to see what you come up with.

 

Brodsworth Hall

We have English Heritage membership, and we took the opportunity to see a fine old Victorian country house.

More information at:

Welcome to Jisc

At the beginning of this year, I started working for Jisc. My first months have been nice and busy, and full of mainly one challenge: SHERPA Ref.

Context

For those of you not in scholarly communications (Hi Mum), here’s a bit of background.

The Research Excellence Framework (REF) is a big deal in UK academia. It’s a sector-wide exercise to rank all Universities by the quality of their research outputs. How well an institution does in the REF will have a direct impact on the research funding it will receive through the next cycle. HEFCE (the UK government’s Higher Education quango) administer it and set the rules.

A word on Research Outputs: Without getting into a debate about the scope and nature of a Research Output, for the purposes of this post, we’ll be focusing on Journal Articles.

As far as I can tell, the product that a journal is selling is reputation. An academic that can get a Journal Article published in one of the top journals is respected by his peers, and more importantly, by research funders.

Journals make their money by selling subscriptions to university libraries, and the traditional model of scholarly communications is funded in this way.  When a researcher gives a journal her article, the journal handles dissemination of the research.  The business I have been in for the last ten years is about shaking this up by enabling academics to publish their work via the web so that it’s available for everyone, for free.  It’s called Open Access.  Publishers have a variety of views on this, and make those views formally known through Publisher Policies, which are legal wording defining what authors can and can’t do if they publish in their journal.

However, we can now add the British government to the growing list of entities that support Open Access to research.  HEFCE has decreed that for a research output to be considered in the REF, it must be Open Access.

Understanding whether a particular Publisher’s Policy allows an author to make his article Open Access is actually quite difficult. The author would have to read through potentially pages of policy document to understand this. To assist, the SHERPA Ref service provides a simple answer as to whether a Publisher Policy allows an author to make his article Open Access and how he can do that. We’ve had our crack admin team read the policies, we’ve stored them in a database, and we’re providing an interface that allows anyone to query by journal.

In at The Deep End

Just as you don’t want to know about the processes that go into making laws or sausages, you probably don’t want to know what’s going on behind the web services you use every day. A lot of what is shipped is not beautiful to behold.  This is normal, and right up to the wire, managing a project can be duck-work — calm on the surface but paddling furiously below.

I inherited technical responsibility for SHERPA Ref quite late in its project lifecycle and it was my responsibility to work through the last remaining technical hurdles and get it release-ready.

The Challenge

The first piece of critical analysis was to identify which parts of the system were ready for launch and which weren’t. On the whole, the system was robust. There was a sophisticated administration and data entry infrastructure that was functional and appropriate, and a database structure that was a little inelegant but functional (as a developer, I’d love to live in a world where elegance was one of the bars all software had to clear), but there were still some issues with the front-end.

Results from our closed beta testing were generally positive, but there were criticisms that came back to the issues we identified with the front-end.

Problem Solving in Broad Strokes

Of course, some of these issues were bugs that beta-testing had shaken out, but some of the issues were structural.  An early misunderstanding of the requirements by the development side of the project team had led to some subtle issues in the way decisions and recommendations were being made by the system.  This was a critical issue that affected a low number of journals, so had not been caught earlier in the process.

So, there was a decision to make.  Repair the existing front-end or build a new one.  The front-end of the system was a fairly small (in terms of lines of code) component of the system and is essentially a view on the data that is curated by the larger components of the system, which made either option a viable one.  In the end it was decided to rebuild due to the lack of current staff that had a detailed knowledge of the framework that was used to build the system.

Building a REST interface to access data stored in a database is the bread-and-butter of a Web Systems Developer, which is one of the hats I wear. The most complex part of the build was reading from the database and normalising to a more elegant structure.  Once that was done, it didn’t take long to produce an equivalent system, and it was a simple matter to also push JSON data out over predictable URLs to create an API.
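To give a flavour of what “normalising to a more elegant structure” and “JSON over predictable URLs” can look like, here is a minimal Python sketch. It is not the SHERPA Ref code: the column names, the journals and the URL pattern are all invented for illustration.

```python
import json

# Hypothetical flat rows, as they might come back from a legacy policy table.
# Every column name and value here is invented for illustration.
ROWS = [
    {"issn": "1234-5678", "title": "Journal of Examples", "version": "accepted", "embargo_months": 12},
    {"issn": "1234-5678", "title": "Journal of Examples", "version": "published", "embargo_months": None},
    {"issn": "8765-4321", "title": "Annals of Samples", "version": "accepted", "embargo_months": 0},
]


def normalise(rows):
    """Group flat rows into one nested record per journal, keyed by ISSN."""
    journals = {}
    for row in rows:
        journal = journals.setdefault(
            row["issn"],
            {"issn": row["issn"], "title": row["title"], "policies": []},
        )
        journal["policies"].append(
            {"version": row["version"], "embargo_months": row["embargo_months"]}
        )
    return journals


def journal_url(issn):
    """A predictable URL pattern (hypothetical): one resource per ISSN."""
    return f"/api/journal/{issn}"


def render(issn, journals):
    """Return the JSON body for one journal, or None if the ISSN is unknown."""
    journal = journals.get(issn)
    return None if journal is None else json.dumps(journal, sort_keys=True)


if __name__ == "__main__":
    journals = normalise(ROWS)
    print(journal_url("1234-5678"))
    print(render("1234-5678", journals))
```

Once the data is in the nested shape, the HTML front-end and the JSON API can both be thin views over the same structure, which is why adding the API is cheap.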

The Benefit of Automated Tests

As part of the new development, a comprehensive set of Unit and Integration tests was written. By far the most important of these were created in association with Jane Anders, the SHERPA Services Development Officer, and by far the most expert person I’ve ever met on the subject of publisher policies. Her job is to ensure that the data that powers SHERPA Services accurately reflects reality. Together we chose a set of representative publications and wrote down what the system should say about each of them. We ended up with 28 journals. We used these to test against the system, and quickly isolated the conditions under which the system did not perform as expected.

The biggest benefit of tests when dealing with software projects is that they give provable confidence. There is now a process for determining that the software is correct at this point, and furthermore, we can recheck it at any point in the future.
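The “write down the expected answers first, then check the system against them” approach can be sketched in a few lines of Python. This is only an illustration of the pattern: the journal names, the expected answers and the `toy_decide` function are placeholders, not the real test set or the real decision logic.

```python
from typing import Callable, Dict, List, Tuple

# Expected answers, written down in advance with a domain expert.
# Journal names and outcomes are invented placeholders.
EXPECTED: Dict[str, bool] = {
    "Journal of Examples": True,
    "Annals of Samples": False,
    "Review of Placeholders": True,
}


def check_against_expected(
    decide: Callable[[str], bool], expected: Dict[str, bool]
) -> List[Tuple[str, bool, bool]]:
    """Run the system under test on every journal and collect mismatches.

    Each mismatch is returned as (journal, expected answer, actual answer).
    """
    failures = []
    for journal, want in sorted(expected.items()):
        got = decide(journal)
        if got != want:
            failures.append((journal, want, got))
    return failures


def toy_decide(journal: str) -> bool:
    """Stand-in for the real decision logic, with a deliberate bug."""
    return journal == "Journal of Examples"


if __name__ == "__main__":
    for journal, want, got in check_against_expected(toy_decide, EXPECTED):
        print(f"{journal}: expected {want}, got {got}")
```

Because the expectations live in data rather than in the code being tested, the same check can be rerun after every change, which is where the long-term confidence comes from.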

The End of the Process

SHERPA Ref was launched yesterday in a small blitz of Jisc publicity. I ran the tests once more before launch time and kept my fingers crossed; I’m never fully convinced that my software works, and I think that’s a healthy attitude.

I learn best by doing, and while there was no Rocket Science in this project, it was an opportunity to stretch my legs with the technical management of a project while gaining understanding of the organisational context of my job. Throughout this process, I had the support of colleagues in other offices in Jisc. We were provided with infrastructure to run our test and live services on, and with development support from Rachel Witwicki in the Open Access Scholarly Comms Development Team.

All in all, this has been a good quarter, and I’m happily ensconced in my role, and I’m now looking at the next challenge: version 2 of SHERPA’s other Open Access Services…

 

EPrints User Group Meetings Roundup

I’ve had a busy couple of weeks, with a UKCoRR Members’ Day, the German Language User Group Meeting, an EPrints Hack Day and the EPrints UK User Group meeting all occurring in rapid succession.

UKCoRR Members’ Day

Hosted at the University of Glasgow, this event was a chance to hear talks on the policy issues that keep repository managers up at night. The quality of the presentations was very high, and I particularly enjoyed hearing Ben Johnson from HEFCE talk about Open Access.

I caught up with a few of the repository community’s usual suspects, and I presented on the direction EPrints is heading, both from the perspective of the software and the community [ slides ].

German Language User Group Meeting

A few days later, I found myself in Zurich with the German Speaking part of the EPrints Community. I presented the same slides as I did in the UKCoRR meeting, but went into much greater detail while presenting them. We had an active discussion where the following requests were made:

  • jQuery integration (i.e. throw away Prototype)
  • Features to make responsive templates simpler to implement
  • Bazaar package documentation
    • List of repositories that have each package installed
    • Some kind of indication of how easy a package is to install

While the first two requests are squarely in the camp of core development, the Bazaar package documentation requests are quite easily achievable. EPrints repositories make no secrets about which bazaar packages are installed (check the /cgi/counter url on your repository), so we can harvest bazaar package usage. However, there are some concerns that advertising who has what installed may have privacy or security implications. We may start by simply showing a count of how many repositories have each bazaar package installed, and then we can do cool things like order bazaar packages by popularity.
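As a rough sketch of the counting step, assume the harvest has already been done and parsed into a mapping of repository to installed package names. The repository hosts and package names below are made up, and I make no assumption here about what the /cgi/counter output actually looks like.

```python
from collections import Counter

# Hypothetical harvest result: repository host -> installed bazaar packages.
# Hosts and package names are invented for illustration.
HARVEST = {
    "repo-a.example.org": ["stats_package", "orcid_package", "ref_package"],
    "repo-b.example.org": ["stats_package"],
    "repo-c.example.org": ["stats_package", "ref_package"],
}


def package_popularity(harvest):
    """Return (package, number of repositories) pairs, most installed first."""
    counts = Counter()
    for packages in harvest.values():
        # set() ensures a repository is counted at most once per package
        counts.update(set(packages))
    # Sort by descending count, then alphabetically so ties are stable
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for package, installs in package_popularity(HARVEST):
        print(f"{package}: {installs}")
```

Only the totals come out of this function, so the list can be published without saying which repository has which package installed.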

The second bazaar documentation request stems from the lack of indication on the repository of how easy it is to install a package. Some packages are an easy one-click install. Some require some configuration (API keys and the like). Some require a systems administrator to install additional libraries on your server. Watch this space for new accolades in the bazaar to help with this.

There was also talk about creating more formal channels of communication from the community through to EPrints, and I’m looking forward to seeing how these conversations develop.

It was really great to meet with some of the non-UK community and to hear about their successes, issues and concerns. After the meeting we went to a coffee shop on the roof of the University next door where the view over Zurich was spectacular.

EPrints Hack Day

John Salter decided at the Repository Fringe that an EPrints developer event would be a good idea. We put out the call and were able to gather a handful of developers, who met with the stated goal of closing bugs and pull requests on GitHub. We put it the day before the UK User Group Meeting because EPrints Developers might want to go to that, too.

John ran a tight ship, and there were a number of important outputs to the day:

  • Bugs got fixed
  • We learned how to build EPrints development environments (training video to follow soon)
  • The developer community got stronger

It’s the third point on the list above that excites me the most. We plan to try to have hack days piggy-backing onto user group meetings in future. We also may be running remote hack days, where there’s no meeting, but there is EPrints development. I also think we need to come up with a better name than ‘Hack Day’.

The slides that John presented the following day at the UK User Group Meeting are [ here ]

UK User Group Meeting

The final of my three meetings in three days was the UK User Group Meeting, hosted at the University of Southampton. I got the first presentation slot, and presented some of my ideas for how we can build a stronger community [ Slides ]. Among these:

  • Showing off my Training Videos and encouraging the community to produce some.
  • Talking about how to create wiki documentation from interactions on the eprints-tech mailing list.
  • Soliciting ideas for EPrints features on tricider

I’ve promised the community that I will attempt to build whichever tricider idea gets the most votes. To my bemusement, there were requests for me to livestream the development.

The rest of the day went well, with interesting presentations from community members, including JISC presenting about the REF plugin, and Peter West showing ORCID integration.

It’s been a busy couple of weeks. I’ve really enjoyed meeting so many of the community in so many contexts, but I’m looking forward to a few quiet days in the office now.

EPrints Nth Fulltext Download

An EPrints repository administrator wanted to make a fuss over the author of the 50,000th fulltext download in their repository, and reached out.  After some time looking at the problem, I discovered that finding out the 50,000th fulltext download from an EPrints repository is not as simple as it sounds.  Ironically, if IRStats wasn’t installed, it would have been far easier.

The Simple Approach

EPrints stores every view of an abstract page or download of a fulltext document in its ‘access’ table.  A query on this will give us the 50,000th download (the LIMIT offset is zero-based, so the 50,000th row sits at offset 49,999):

SELECT referent_id FROM access WHERE service_type_id = '?fulltext=yes' LIMIT 49999,1

This will give you an answer that is in some way correct, but if you have IRStats, you can do better.
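For anyone who wants to experiment with the access-table approach without touching a live repository, here is a small Python sketch using an in-memory SQLite table as a stand-in for MySQL. Two things in it are my assumptions rather than part of the query above: the `accessid` column used to pin down the row order, and the cut-down three-column table. The point it illustrates is that the offset is zero-based, so the Nth row lives at offset N - 1.

```python
import sqlite3


def build_demo_access_table():
    """Create a tiny in-memory stand-in for the EPrints access table.

    The real table has many more columns; accessid as the ordering key
    is an assumption made for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE access ("
        "accessid INTEGER PRIMARY KEY, referent_id INTEGER, service_type_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO access VALUES (?, ?, ?)",
        [
            (1, 10, "?fulltext=yes"),
            (2, 11, "?abstract=yes"),
            (3, 12, "?fulltext=yes"),
            (4, 13, "?fulltext=yes"),
        ],
    )
    return conn


def nth_fulltext_download(conn, n):
    """Return the referent_id of the nth (1-based) fulltext download, or None."""
    row = conn.execute(
        "SELECT referent_id FROM access "
        "WHERE service_type_id = ? "
        "ORDER BY accessid "   # explicit order, so 'nth' is well defined
        "LIMIT 1 OFFSET ?",    # zero-based offset: the nth row is at n - 1
        ("?fulltext=yes", n - 1),
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    conn = build_demo_access_table()
    print(nth_fulltext_download(conn, 2))
```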

The Subtleties of IRStats

IRStats 2 starts with the access table discussed above and filters out spam and optimises the data for its visualisations, creating a new set of tables.  The most useful table that IRStats maintains is the ‘irstats2_downloads’ table.  To get the total number of downloads from this table, we need to run this query:

SELECT SUM(count) FROM irstats2_downloads

The reason for this is that this table has a granularity of a day.  Each row represents the number of downloads of a given eprint’s fulltext on any given day.  To find the 50,000th download I repeatedly ran queries like this:

SELECT SUM(count) FROM irstats2_downloads WHERE datestamp <= 20150413

…until I found the first day on which the sum of the count column reached 50,000.  Then, I checked the UIDs of the rows for that day, and repeated the process using ascending UIDs until the sum was >= 50,000.
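The repeated queries boil down to a running total over the rows in date order, then UID order. Here is a single-pass Python sketch of that idea against an in-memory SQLite table. It is not the script I used: the column names `uid` and `eprintid` and the demo rows are assumptions made for illustration.

```python
import sqlite3


def build_demo_downloads_table():
    """Create a tiny stand-in for irstats2_downloads (one row per item per day)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE irstats2_downloads ("
        'uid INTEGER PRIMARY KEY, datestamp INTEGER, eprintid INTEGER, "count" INTEGER)'
    )
    conn.executemany(
        "INSERT INTO irstats2_downloads VALUES (?, ?, ?, ?)",
        [
            (1, 20150410, 5, 2),
            (2, 20150410, 6, 1),
            (3, 20150411, 5, 3),
            (4, 20150411, 7, 1),
        ],
    )
    return conn


def nth_download_irstats(conn, n):
    """Find the row whose count takes the running total to n or beyond.

    Rows are walked by day and then by ascending uid. All downloads counted
    in one row are treated as a single block, so the answer is approximate
    whenever a row's count is greater than 1.
    """
    total = 0
    cursor = conn.execute(
        'SELECT uid, datestamp, eprintid, "count" '
        "FROM irstats2_downloads ORDER BY datestamp, uid"
    )
    for uid, datestamp, eprintid, count in cursor:
        total += count
        if total >= n:
            return {"eprintid": eprintid, "datestamp": datestamp, "uid": uid}
    return None  # fewer than n downloads recorded


if __name__ == "__main__":
    conn = build_demo_downloads_table()
    print(nth_download_irstats(conn, 4))
```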

Differently Inaccurate

I wrapped both of these approaches into a script (https://github.com/gobfrey/nth_download/blob/master/nth_download.pl), which produced the following output for the repository in question:

EPrint 3160 was download number 50000 on 20150711
This result was generated from the IRStats2 downloads table

EPrint 1064 was download number 50000 on 20141027
This result was generated from the EPrints access table

These are obviously very different numbers, but each is flawed in different ways.  The access table method doesn’t take spam into account (repeat downloaders, spiders, etc), but more importantly, it significantly disagrees with IRStats, which is the primary source of repository statistics.

The IRStats result is flawed in a far more subtle way.  There’s a margin for error which is related to the number of downloads on the day that the 50,000th download occurred.  The reason for this is that each download of an item is sequentially inserted into the irstats table with a count of 1, but if a second download happens for any given item, the count on the existing row is incremented. This means we have no way of knowing if it happened immediately after the first download, or whether it was the last download of the day.  The approach I’ve taken with the queries above means that I treat all downloads of the same item on the same day as having happened concurrently.

Here’s what the day in question actually looks like:

select `count`, COUNT(*) from irstats2_downloads where datestamp = '20150711' group by `count`;
+-------+----------+
| count | COUNT(*) |
+-------+----------+
|     1 |       82 |
|     2 |       17 |
|     3 |        2 |
+-------+----------+
3 rows in set (0.00 sec)

Around 19% of the items downloaded that day (19 of 101 rows) were downloaded more than once, which will skew the result.  There’s quite a high chance we’ve chosen the wrong one (in fact, it’s quite unlikely that we have the right answer).
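The arithmetic behind that figure can be checked from the query output above. This short Python sketch uses only the three rows of that output; the variable names are mine.

```python
# (count value, number of rows with that count) for 20150711,
# copied from the query output above
DISTRIBUTION = [(1, 82), (2, 17), (3, 2)]

# One row per item downloaded that day
rows = sum(n_rows for _, n_rows in DISTRIBUTION)

# Total downloads that day: each row contributes its count
downloads = sum(count * n_rows for count, n_rows in DISTRIBUTION)

# Rows where the item was downloaded more than once
repeated_rows = sum(n_rows for count, n_rows in DISTRIBUTION if count > 1)

# Downloads beyond the first for each item
repeat_downloads = downloads - rows

share_of_rows_repeated = repeated_rows / rows
share_of_downloads_repeat = repeat_downloads / downloads

if __name__ == "__main__":
    print(f"{rows} rows, {downloads} downloads")
    print(f"{share_of_rows_repeated:.1%} of rows have a count above 1")
    print(f"{share_of_downloads_repeat:.1%} of downloads are repeats")
```

Measured per row the share is about 19%; measured per download it is about 17%. Either way, roughly one in five entries for the day has an ordering we cannot recover.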

The final piece of data is the difference in size of the two sets of download data.  The access table has 88310 downloads, which is 71% more than the 51739 in the irstats downloads table.

So, we have one method that will guarantee the item is in the right place in the sequence of downloads, but the sequence is somewhat bloated.  We also have another method that uses a better sequence, but chooses with a random offset.

More Development Required

The way to solve this is to develop an IRStats plugin that explicitly stores Nth downloads for important values of N.  That, however, is a job for another day.

A day of fun

Best. Day. Ever!

My OR2015

OR2015 was great.  I think I enjoyed myself more than I have at any other Open Repositories.  I was involved in a number of things that kept me really busy throughout the conference:

Developer Track

The conference committee had decided that a full developers track would replace the Developer Challenge of previous years, with paper presentation sessions in the main conference.  The amazing and well-organised Claire Knowles co-chaired the track with me, and we sent out our call with an emphasis on practical demonstrations and informal presentations.

We filled two sessions with submissions, and we enjoyed a broad set of technology and process demonstrations.  Hardy Pottinger gets a special mention for really entering into the spirit of things: his live demonstration of DSpace development within Vagrant took in live source management, compilation and yo-yo demonstrations, and he dealt with the inevitable hiccoughs with aplomb.

Ideas Challenge

The Developer Challenge had been replaced by the Developer Track, but that left a Developer Challenge-sized hole in the conference, which Claire and I thought was important to fill.  The Developer Challenge has, to my mind, been one of the crown jewels of Open Repositories.  It was where repository developers would form small teams and build prototypes of new features.  It’s provided a huge number of benefits to the community, among them:

  • A forum to talk about the current on-the-ground state and future of repository software
  • A networking opportunity for developers
  • A list of good ideas that might influence repository platform development
  • An event at which developers are the stars

So, as part of the Developer Track, we designed the Ideas Challenge, which involved producing a 4-slide PowerPoint presentation outlining a piece of development that could be done.  A scoring system was designed to encourage working with new people and the participation of non-developers.  We ended up with 9 entries, and some very interesting ideas.  Full blogpost at http://www.or2015.net/2015/06/17/ideas-challenge-winners/

Documents:

Vendors Table

EPrints Services had contributed as a supporter of OR2015, which entitled us to a vendors table.  This worked really well as a walk-up point for EPrints, and Will and I staffed the desk during breaks.  A number of our customers and users made a point of seeking us out for a chat, and it was nice to connect with new and old faces.  We had some postcards highlighting some of our services, which are downloadable below.

Documents:

EPrints Interest Group

Discussions with various community members that had begun at the EPrints Services Vendor Table continued through the EPrints Interest Group sessions.  In my “State of the Union” presentation, we talked about the future of the software and the community.  I showed off the community work I had been doing over the past six months, including the training videos.  I then asked the room what they wanted to see, and the most interesting idea that was raised was the possibility of a community steering group for EPrints.

I have, since returning to EPrints Services, been trying to find ways to promote better documentation.  My Community Development presentation was a report on work Tomasz Neugebauer and I had been doing over the past few months.  I helped Tomasz build a bazaar package by providing advice and debugging support, and he created a bazaar package with good documentation.  My presentation encouraged a fairly wide-ranging discussion of documentation, and Meg Eastwood from NAU volunteered to come up with some recommendations on how the wiki could be improved.

Documents: