deepvariant
DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
DeepVariant is a tool built by Google that reads DNA sequencing data and identifies locations in a genome where an individual's DNA differs from a reference. These differences, called variants, include single-letter changes and small insertions and deletions. Finding them accurately is a core step in genomics research and medical genetics.
What sets DeepVariant apart from older variant-calling tools is its use of a convolutional neural network, the same kind of AI model used in image recognition. For each candidate location, the pipeline converts the sequencing reads that overlap it into an image-like representation, then passes that image through the neural network, which classifies the most likely genotype at that site: no variant, a variant on one copy of the chromosome, or a variant on both copies. This approach was published in the journal Nature Biotechnology and won awards in the precisionFDA Truth Challenges, accuracy competitions run by the US Food and Drug Administration.
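The idea of turning aligned reads into an image-like grid can be sketched with a toy example. This is a simplified illustration, not DeepVariant's actual encoding: the function name `encode_pileup`, the base codes, and the two-channel layout are all assumptions made for this sketch, and the real encoding carries more per-read information than this.

```python
# Toy pileup encoder. NOT DeepVariant's real encoding; names and layout are hypothetical.
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}  # 0 is reserved for "no coverage"


def encode_pileup(reference, reads, window_start=0):
    """Return (bases, mismatches): two grids with one row per read.

    reference is the reference sequence for the window.
    reads is a list of (start, sequence) pairs with ungapped alignments.
    """
    width = len(reference)
    bases, mismatches = [], []
    for start, seq in reads:
        base_row = [0] * width
        mismatch_row = [0] * width
        for offset, base in enumerate(seq):
            col = start - window_start + offset
            if 0 <= col < width:
                base_row[col] = BASE_CODE.get(base, 0)
                # Mark positions where the read disagrees with the reference.
                mismatch_row[col] = int(base != reference[col])
        bases.append(base_row)
        mismatches.append(mismatch_row)
    return bases, mismatches


reference = "ACGTAC"
reads = [(0, "ACGTAC"), (1, "CGAAC"), (2, "GAAC")]
bases, mismatches = encode_pileup(reference, reads)

# Count how many reads disagree with the reference in each column.
mismatch_counts = [sum(row[col] for row in mismatches) for col in range(len(reference))]
print(mismatch_counts)
```

In this toy window, two of the three reads carry an A where the reference has a T, so the mismatch grid lights up in a single column. A classifier looking at such a grid sees a vertical stripe, which is the kind of visual pattern a convolutional network is well suited to recognize.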
The tool works with several types of DNA sequencing technology. It supports short-read data from Illumina instruments, long-read data from PacBio and Oxford Nanopore sequencers, and hybrid combinations of short-read and long-read data. There are also specialized modes for whole genome sequencing, whole exome sequencing, and RNA sequencing. A companion tool called DeepTrio extends DeepVariant to analyze genetic data from a child and one or both parents together, which can improve accuracy by using family relationships.
The models included with DeepVariant were trained on human data, so users working with other organisms need to take additional steps, such as checking accuracy on their own data or retraining a model. The tool currently supports only diploid organisms, those in which each chromosome comes in two copies, which covers humans and many other animals.
DeepVariant is typically run through Docker, a packaging system that bundles the software and its dependencies into a portable container. Users point the tool at their input files and an output directory and specify which sequencing type they used. GPU support is available for faster processing. The repository includes detailed case studies for each supported data type.
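As a rough sketch of what such an invocation looks like, the snippet below assembles a Docker command line without running it. The helper `build_deepvariant_command`, the host paths, and the file names are hypothetical, and the image tag is left as a placeholder. The flags follow the pattern used in the project's quick start, but the accepted values can differ between releases, so check the documentation for the version you install.

```python
import shlex


def build_deepvariant_command(input_dir, output_dir, ref_name, reads_name,
                              model_type, image, num_shards=1):
    """Build (but do not run) a docker argument list for run_deepvariant.

    Hypothetical helper for illustration; it only formats strings.
    """
    return [
        "docker", "run",
        "-v", f"{input_dir}:/input",     # mount host input directory
        "-v", f"{output_dir}:/output",   # mount host output directory
        image,
        "/opt/deepvariant/bin/run_deepvariant",
        f"--model_type={model_type}",    # sequencing type, e.g. WGS or WES
        f"--ref=/input/{ref_name}",
        f"--reads=/input/{reads_name}",
        "--output_vcf=/output/output.vcf.gz",
        f"--num_shards={num_shards}",
    ]


command = build_deepvariant_command(
    input_dir="/data/input",             # hypothetical host paths
    output_dir="/data/output",
    ref_name="reference.fasta",          # hypothetical file names
    reads_name="sample.bam",
    model_type="WGS",
    image="google/deepvariant:VERSION",  # replace VERSION with a real release tag
    num_shards=4,
)
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each argument separate, so paths containing spaces are quoted correctly when printed and can be passed directly to `subprocess.run` once the placeholders are filled in.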