Text-mining and open access

October 18, 2012

There is an excellent opinion piece in the latest edition of Research Fortnight by Professor Doug Kell on text-mining and open access. Since the article will be behind a paywall for many readers (the irony…), I thought I would summarize the argument and post a few quotes here.

The argument goes like this:

  • New research findings are being added to the body of literature at a rate that means it is impossible for anyone to read it all, let alone assimilate and make sense of it all. The only solution is to use text-mining.
  • There are clear benefits for researchers, business and policy-makers in using text-mining of the scientific literature. For example, a recent report from JISC concludes that “there is clear potential for significant productivity gains, with benefit both to the sector and to the wider economy”.
  • But for text-mining to be effective, access to the full text is needed. Abstracts are not enough, and embargo periods are a problem when new research needs to be interpreted rapidly.

And here are some key paragraphs from the article:

The PubMed database records two new peer-reviewed papers in the life sciences every minute. Across all the sciences, the number is five.

Such is the rate at which scholarly papers are produced that only computers can read them all. As a result, text-mining techniques are infiltrating every field of research, from genomics to the social sciences and humanities. Historians are using text mining to analyse court records from the Old Bailey. Business has been mining newswires since the 1980s to acquire competitive intelligence and today companies use text mining, including of social media, to discover what customers think of their products and services.


To get the most from text mining requires open access to the literature. And it requires it as soon after publication as possible. In the life sciences, six months—the maximum embargo allowed in Research Councils UK’s policy on ‘green’ open access—is a very long time.

This is one reason why the research councils’ policy on open access announced this July made the ‘gold’ model the preferred route. Pursuing gold open access will help the UK to get ahead of the curve in exploiting the opportunities, including text mining, that come from open access.

I gave a presentation on RCUK Strategy to the Winter meeting of the Heads of University Biological Science departments last week. Here are the slides, together with audio of my talk (direct link):

RCUK Strategy



Slide 5: SET statistics 2011
Slide 6, 24: Royal Society, The Scientific Century 2010
Slide 7: Spending review 2010 [pdf]
Slide 8: Allocation of science and research funding 2010 [pdf]
Slides 14-19, 25: BIS/Elsevier, Performance of the UK research base 2011; Thomson Reuters, Global Research Report UK 2011
Slide 26: Innovation Union Scoreboard 2010
Slide 27: Science, Technology and Industry Scoreboard – Innovation and knowledge flows
Slide 28: OECD 2011, Science, Technology and Industry Scoreboard – Public/private cross-funding of research
Slide 30: RCUK data principles
Slide 32: RCUK Concordat on Public Engagement
Slide 35: Times Higher Education
Slide 36: RCUK demand management principles
Slide 37: ESRC consultation responses
Slide 38: Nature

BIS published their annual ‘SET statistics’ last week, which provide a wealth of information on the UK’s investment in science, engineering and technology. The Campaign for Science and Engineering have published their take on the numbers. For me, one of the interesting aspects of the dataset is the reasonably long time series it provides, giving insights into long-term trends. I compiled the following graph from the data to show how the general pattern of research investment has varied over time:

The investment levels are inflation-corrected figures, converted to 2009/10 prices. Some striking features of the patterns of investment are:

  • In real terms, the investment in 2009/10 is equivalent to that of 1986/87. Total levels of investment have been reasonably constant in recent years.
  • The changing pattern of investment has been towards increasing expenditure by the Research Councils and the Higher Education Funding Councils, largely at the expense of spending on defence. Spending by the civil departments has been steady and has increased slightly in recent years.
  • Defence spending is more volatile than other areas, with bigger year-on-year fluctuations.
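
For anyone wanting to repeat the conversion to 2009/10 prices, here is a minimal sketch of how a cash figure is usually restated in base-year prices using a price index (deflator). The function name and the index values are my own illustrative assumptions, not figures from the SET statistics, and this is not necessarily the exact method BIS used.

```python
def to_base_year_prices(nominal, deflator_year, deflator_base):
    """Restate a cash figure in base-year prices.

    nominal        -- spending in the cash prices of its own year
    deflator_year  -- price index for the year of the spending
    deflator_base  -- price index for the base year (here 2009/10)
    """
    return nominal * deflator_base / deflator_year


# Made-up numbers for illustration only: prices double between the
# spending year (index 50) and the base year (index 100).
real = to_base_year_prices(nominal=100.0, deflator_year=50.0, deflator_base=100.0)
print(real)  # 200.0
```

Applying the same function to every year in a series puts all the figures on a common price basis, which is what makes a comparison between, say, 1986/87 and 2009/10 meaningful.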

I would be interested in your views on the statistics, so please comment.

Nature has recently published a fascinating article (paywall) developing the argument that theoretical work in mathematics that has no apparent application can prove to be really useful in the future. These quotes summarise the argument:

The mathematician develops topics that no one else can see any point in pursuing, or pushes ideas far into the abstract, well beyond where others would stop.

There is no way to guarantee in advance what pure mathematics will later find application. We can only let the process of curiosity and abstraction take place, let mathematicians obsessively take results to their logical extremes, leaving relevance far behind, and wait to see which topics turn out to be extremely useful. If not, when the challenges of the future arrive, we won’t have the right piece of seemingly pointless mathematics to hand.

These points are then illustrated with seven examples where advances in mathematics precede, sometimes by centuries, their use in new innovations or products. One of the examples explains that the mathematics of quaternions, which were first described in the nineteenth century, turns out to be really useful in computer game programming.
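
To give a flavour of the quaternion example, here is a minimal sketch of how a quaternion can rotate a point in three dimensions, which is the kind of calculation game engines carry out. The function names are my own and the code is illustrative only; it is not taken from the Nature article or from any particular game engine.

```python
import math


def quat_multiply(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def rotate(vector, axis, angle):
    """Rotate a 3D vector about a unit-length axis by `angle` radians."""
    half = angle / 2.0
    s = math.sin(half)
    # Unit quaternion representing the rotation, and its conjugate.
    q = (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)
    q_conj = (q[0], -q[1], -q[2], -q[3])
    # Embed the vector as a quaternion with zero real part, then
    # compute q * v * q_conj and read the rotated vector back out.
    v = (0.0, vector[0], vector[1], vector[2])
    _, x, y, z = quat_multiply(quat_multiply(q, v), q_conj)
    return (x, y, z)


# A quarter turn about the z-axis takes the x-axis onto the y-axis
# (up to floating-point rounding).
print(rotate((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2))
```

Part of the appeal for programmers is that a rotation is stored as just four numbers, and two rotations are combined with a single quaternion multiplication.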

The examples provide evidence that abstract developments can prove useful, but I was left with a question. If the new understanding hadn’t happened first, would the application itself have driven the new mathematics? This is a hypothetical question, and there is no doubt that having the maths in place already will have speeded up the application. In order to get a better picture, though, it would be interesting to know how easy it is to find examples where new advances in maths have been catalysed by a pressing need to solve a practical problem. If you can think of examples like this, please add them to the comments.


It was recently announced that the UK Centre for Medical Research and Innovation has been renamed the Francis Crick Institute. While the reduction in the alphabet soup of UK research policy is to be applauded, I find the obsession with naming scientific institutes and facilities after famous individuals problematic for science and its relationship with society. It is part of a wider personality cult in science, manifest in the emphasis given to personal awards like fellowships of the major academies or big international prizes, of which the Nobel Prize is probably the best known.

I think that the focus on individuals raises a number of problems:

  • It suggests that advances in science are dependent on the particular insight of special individuals, but the history of science shows that the cultural context within which scientists operate is at least as influential as individual genius. It is the rule, rather than the exception, that new ideas emerge in parallel in multiple places, and the name we associate with discoveries often reflects accidents of history or aptitudes for self-publicity, rather than some unique contribution.
  • The focus on the individual ignores the importance of teams. Almost any major scientific advance is now dependent on a team effort, and while every effective team needs a leader, to single out individuals misses the point and devalues the wider contributions. And even beyond the research team, science progresses through the development of a body of evidence to which many researchers contribute. This is equally relevant to the current focus on delivering impact from research, as pointed out recently by Jack Stilgoe and Alice Bell: impact comes from people and the interactions between them, rather than from journal articles or individuals.
  • Perhaps most importantly, the focus on individuals leads to a perception outside of the research community that there are some special characteristics that are needed to be a successful scientist, and can reinforce stereotypes about age, gender or social background. If we want to attract young people into science, focusing on the fact that scientific research is an exciting career that is open to many would seem a better strategy than building a cult of ‘special’ individuals.


  1. Communicate about the process of science as well as the content. Many of the controversies around science and its interface with society are really about the processes of science. But often the background is not well explained. Peer review should be explained clearly, covering both the formal and informal aspects, and being honest about the weaknesses as well as the strengths of the system. The ‘weight of evidence’ approach should be discussed as a real strength of science. So often our understanding of the world depends on the alignment of a large number of small pieces of evidence. None of these on their own is particularly compelling but taken together… And when one piece of evidence turns out to be in error it may only have a minor impact on the overall story. Finally, we need a wider understanding of Kuhn’s Scientific Revolutions. Sometimes the lone voice is right and the consensus wrong, although history tells us that this doesn’t happen often.
  2. Make research outputs available to all for free
  3. Publish negative results and unsuccessful experiments too
  4. Publish peer review comments with research outputs
  5. Attach a summary for non-experts to research outputs
  6. Make raw data available as early as possible
  7. Use new technology to open research conferences to all

Science inspirations

September 6, 2009

Why did you become a scientist? I am sure most scientists have been asked this at some point, and with the drive to maintain and increase the number of young people choosing science, it’s also a central question for policymakers. In a recent post on 2020science, Andrew Maynard revealed some of the key inspirations that got him hooked on science. Following a Twitter challenge to do the same, I tweeted my top three inspirations, but I thought I would expand a little here. So my top three are:

  1. David Attenborough. Or more specifically, the television programmes he presented. These programmes provided a window onto the natural world, and I loved the exploration, the exoticism and the obvious enthusiasm of Attenborough himself. It was through watching Attenborough that I gained an appreciation of the diversity of the natural world, and became fascinated by it. Why and how had that diversity arisen? How is the diversity maintained? And what will happen in the future?
  2. My chemistry teacher. Mr Jones taught me chemistry throughout my secondary school years, and he was a truly inspiring person. One of the things he passed on was a love of the experimental side of his subject. His chemistry demonstrations were legendary, and the explosions could often be heard across the school campus. He also made sure we spent lots of time actually doing experiments, and I learned my chemistry at the bench not at the blackboard or from a book. When I look back, he also taught me the scientific method – the ‘how?’ of science as well as the ‘what?’. In retrospect I think this was the most valuable thing I learned at school.
  3. ‘The Selfish Gene’ by Richard Dawkins. I can remember reading this book during my later years at school. I had always thought my interests in biodiversity and molecules were separate, but Dawkins helped me see the links between them. From this point I knew that life sciences was the discipline for me. Even 30 years on, ‘The Selfish Gene’ is still a great book – if you haven’t read it, pick up a copy here.

I would be interested to hear about your inspirations to become a scientist in the comments.
