Distilled database identifies genetic links to rare diseases

Published online 24 March 2023

A unique method of condensing genetic data enables the rapid discovery of genetic causes for rare diseases. 

Lara Reid

A variant of the ERG protein (green) and cell nuclei (blue) present in some lymphoedema patients. In healthy individuals, the protein exists only inside the nuclei of cells.

Daniela Pirri and Graeme Birdsey, Imperial College London (2023)

There are approximately 10,000 catalogued rare diseases that affect humans, but only half of these have a known genetic cause. Now, a unique database developed by Daniel Greene at the Icahn School of Medicine at Mount Sinai, New York, and an international team including scientists in Saudi Arabia, has enabled the identification of 19 new associations between genes and rare diseases, with many more to follow.

“From a statistical perspective, large numbers of patients and their families are needed for research studies to help pinpoint the genetic causes of rare diseases,” says Ernest Turro at Mount Sinai, senior author of the study. “It cost one billion dollars to sequence one whole genome 25 years ago. Now it costs just a few hundred dollars, making large studies such as the 100,000 Genomes Project (100KGP) feasible.”

The 100KGP involved sequencing the genomes and collecting clinical data for 34,523 patients in the United Kingdom, and 43,016 unaffected relatives across 29,741 families. To analyse this vast dataset efficiently, the researchers developed a computational approach to distil the most pertinent genetic information into a relatively small database, which they called the ‘Rareservoir’. 

“We took advantage of the fact that the genetic variants responsible for rare diseases are typically kept rare in the human population by natural selection, because affected individuals tend to have few children, if any,” says Turro. “This meant that we could discard the genetic information corresponding to common variants in the human population without throwing away the key disease-causing variants.”
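
The filtering idea Turro describes can be illustrated with a minimal sketch. This is not the Rareservoir's actual implementation: the `Variant` record, the `keep_rare` helper, the gene names and the 0.001 allele-frequency cutoff are all hypothetical, chosen only to show how discarding common variants shrinks a dataset while retaining rare candidates.

```python
from dataclasses import dataclass

# Hypothetical cutoff for illustration; the study's real threshold is not given here.
MAX_ALLELE_FREQUENCY = 0.001


@dataclass(frozen=True)
class Variant:
    gene: str                # placeholder gene label, not a real gene
    position: int            # made-up genomic coordinate
    allele_frequency: float  # fraction of chromosomes in a reference population carrying the variant


def keep_rare(variants, max_af=MAX_ALLELE_FREQUENCY):
    """Return only the variants whose population allele frequency is below max_af."""
    return [v for v in variants if v.allele_frequency < max_af]


# Invented example data: two common and two rare variants.
variants = [
    Variant("GENE_A", 101, 0.25),
    Variant("GENE_A", 202, 0.00002),
    Variant("GENE_B", 303, 0.04),
    Variant("GENE_B", 404, 0.0005),
]

rare = keep_rare(variants)
for v in rare:
    print(v.gene, v.position, v.allele_frequency)
```

In this toy dataset, only the variants at positions 202 and 404 survive the filter. On a real cohort, the same principle would remove the large majority of variants, which is what makes a compact database possible.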

The team examined associations between 20,000 genes and 269 classes of rare diseases. They confirmed 241 known associations and identified 19 new ones. From these, they selected the three most promising associations to validate. 

“Working with our international collaborators, we identified additional families with these three diseases and performed experiments to confirm our results,” says Turro. “This included our colleagues in Saudi Arabia, who contributed excellent data from a family affected by congenital deafness.”

The team verified that variants in a gene called GPR156 cause congenital deafness, and ERG gene variants cause primary lymphoedema – a condition involving fluid retention and swelling in the body’s tissues. Finally, they showed that PMEPA1 gene variants cause familial thoracic aortic aneurysm disease – hereditary swelling of the main artery in the chest.

“The remaining 16 gene-disease relationships merit further exploration,” says Turro. “Of the 269 disease classes analysed, 28 contained fewer than five families, limiting our ability to make discoveries. Boosting sample sizes for such ultra-rare disorders by encouraging enrolment is crucial.”

Turro hopes that the Rareservoir will be an invaluable tool to hasten discoveries across other large genetic datasets from rare disease patients.


Greene, D. et al. Genetic association analysis of 77,539 genomes reveals rare disease etiologies. Nat. Med. (2023).