A machine-learning model can be used to predict the future ‘impact’ of work published in the scientific literature, according to a paper in Nature Biotechnology. The model, whose score is used to predict the ‘top 5% of papers’ published in any year, could complement existing bibliographic systems that rely on metrics employing paper citations to gauge the potential impact of a scientist’s work.

Many systems have been employed to assess the scientific output of researchers, including metrics based on the number of citations accrued by the papers they author. With the advent of machine learning, it is now possible to draw on more aspects of researcher output when estimating the potential impact of published work. This has led to the proposal that a machine-learning model that predicts time-scaled ‘PageRank’ scores, similar to the metric used to rank the importance of webpages, could be applied to researcher output.
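To make the PageRank idea concrete, the following is a minimal sketch of the standard algorithm applied to a toy citation graph. The graph, the publication years and the `time_scaled` helper are illustrative assumptions; in particular, dividing by years since publication is only one simple way to scale by time and is not necessarily the scaling used by the authors.

```python
def pagerank(citations, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict {paper: [papers it cites]}."""
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Papers that cite nothing spread their rank evenly over all papers.
        dangling = sum(rank[p] for p in nodes if not citations.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in nodes}
        for citing, cited_list in citations.items():
            if cited_list:
                share = damping * rank[citing] / len(cited_list)
                for cited in cited_list:
                    new_rank[cited] += share
        rank = new_rank
    return rank


def time_scaled(rank, pub_year, current_year):
    """Hypothetical time scaling (an assumption): rank per year since publication."""
    return {p: r / max(1, current_year - pub_year[p]) for p, r in rank.items()}


# Toy graph (made-up data): each key cites the papers in its list.
citations = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "C"]}
pub_year = {"A": 2010, "B": 2014, "C": 2016, "D": 2018}

rank = pagerank(citations)
scaled = time_scaled(rank, pub_year, current_year=2020)
top = max(rank, key=rank.get)  # the most-cited paper, "A", ranks highest
```

Scaling by time matters because raw PageRank favours older papers, which have had longer to accumulate citations.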

James Weis and Joseph Jacobson implemented this idea by employing a model called DELPHI (Dynamic Early-warning by Learning to Predict High Impact), which was trained on the scientific research graph. Using a pool of 1,687,850 unique papers published between 1980 and 2019, they derived a set of 29 features relating to each paper, author, journal and network for 1 to 5 years post-publication. The features for each paper were then used to train a machine-learning model that produced an ‘early warning’ score of impact.
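The step from per-paper features to an ‘early warning’ score can be sketched as follows. This is not the DELPHI implementation: the text above does not specify the model architecture, so a small logistic regression is swapped in purely for illustration, and the two features and all data values are hypothetical stand-ins for the 29 features described.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights with batch gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def early_warning_score(x, w, b):
    """Score in (0, 1); higher means more likely to be high impact."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical features per paper, scaled to [0, 1]:
# [early citation growth, author track record]. Values are made up.
features = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
            [0.8, 0.7], [0.9, 0.9], [0.7, 0.8]]
# Label 1 marks papers that later reached the top tier of impact.
labels = [0, 0, 0, 1, 1, 1]

w, b = train(features, labels)
high = early_warning_score([0.85, 0.8], w, b)
low = early_warning_score([0.15, 0.15], w, b)
```

In this sketch a paper is flagged when its score exceeds a chosen threshold; a real system would pick that threshold on held-out data.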

The authors’ model correctly identified 19 out of 20 seminal biotechnologies from the 1980–2014 period in a blind, retrospective study. The model also flagged 50 papers, published in 2018 across 42 biotechnology-related journals, that it predicts will rank in the top 5% in the future. Such predictions could be used to identify ‘hidden gem’ research and channel funding to it in a data-driven manner. Further extensive testing will be needed to evaluate the performance of the approach in fields outside biotechnology, and against traditional impact indicators such as field-normalized citation scores, before such models can be adopted in other areas of research.

