Illumina trains AI on primate data to estimate risk of rare variants in humans

Dive Brief:

Illumina has trained a neural network on sequencing data from 233 different primate species to create PrimateAI-3D.
Because primate proteins are nearly identical to human proteins, the sequencing results helped the researchers to overcome the lack of labeled data for training large machine learning models and create an artificial intelligence capable of making predictions about humans.
In a pair of papers published in Science, PrimateAI-3D successfully distinguished between benign and pathogenic variants and estimated the pathogenicity of rare coding variants.

Dive Insight:

Millions of people have now undergone genome and exome sequencing, often on machines developed by Illumina, but the effects of most genetic variants identified by that sequencing remain unknown. AI’s ability to operate at speeds and scales beyond human capabilities suggests models may be able to help identify variants that cause disease, but large, labeled datasets are needed for training.

Illumina identified primate studies as a way to increase the size of the training dataset. The researchers obtained whole-genome sequencing data for 809 individuals from 233 primate species. In that dataset, the scientists cataloged 4.3 million common missense variants, or changes that alter an amino acid.

In one of the papers, the researchers showed that 99% of human missense variants found in at least one nonhuman primate species were annotated as benign in the ClinVar database. Variants from mammals other than primates were less likely to be benign, suggesting that only data from our closest relatives is useful for AI training.

The researchers classified 4 million human missense variants as likely benign and showed PrimateAI-3D is better than 15 other published machine learning methods at distinguishing between benign and pathogenic variants. Having developed the model, the collaborators applied it to human exome data in the second Science paper.

PrimateAI-3D revealed 73% more significant gene-phenotype associations in the UK Biobank exome dataset than analyses that did not use the model, suggesting the AI can help improve genetic risk prediction. The scientists framed the study as evidence of the “utility of personal genome sequencing for otherwise healthy individuals in the general population,” a conclusion that may boost demand for Illumina devices.

The publications come at a turbulent time for the company, with its shareholders voting to replace board Chair John Thompson with a candidate put forward by activist investor Carl Icahn and the fate of Grail still to be determined.