Accurate detection of genetic variants -- SNVs, indels, and structural/copy number variation -- is a critical step in next-generation sequence analysis. VarScan 2 is a platform-independent software package for variant calling in massively parallel sequencing data generated by Illumina, IonTorrent, Roche/454, and similar instruments. Given data from an individual or a cohort of individuals, it identifies germline SNPs and indels using heuristic, user-defined thresholds such as minimum coverage, variant allele frequency, and supporting read count. Given data from a tumor-normal pair, VarScan 2 identifies somatic mutations (SNVs and indels), loss-of-heterozygosity events, and somatic copy number alterations. The package is accompanied by a filtering tool that removes likely artifactual calls based on strand representation, mapping differences, and other criteria.
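To make the threshold-based approach concrete, the following is a minimal Python sketch of a heuristic germline filter over read counts at a single site. It is an illustration only, not VarScan's implementation: the names `SiteCounts` and `passes_thresholds` are hypothetical, and the default threshold values are placeholders chosen for the example rather than the package's documented defaults.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Read counts at one position (hypothetical container for this sketch)."""
    ref_reads: int  # reads supporting the reference allele
    var_reads: int  # reads supporting the variant allele


def passes_thresholds(site, min_coverage=8, min_var_reads=2, min_var_freq=0.20):
    """Return True if the site meets all three user-defined thresholds.

    The default values are illustrative assumptions, not VarScan defaults.
    """
    coverage = site.ref_reads + site.var_reads
    if coverage < min_coverage:
        return False  # too few reads to evaluate the site
    if site.var_reads < min_var_reads:
        return False  # too few reads support the variant
    # Variant allele frequency = variant-supporting reads / total reads
    return site.var_reads / coverage >= min_var_freq


if __name__ == "__main__":
    print(passes_thresholds(SiteCounts(ref_reads=30, var_reads=10)))  # VAF 0.25
    print(passes_thresholds(SiteCounts(ref_reads=5, var_reads=1)))    # low coverage
```

Because every criterion is an explicit cutoff, a user can loosen or tighten each one independently, for example lowering the minimum variant allele frequency when analyzing pooled samples.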
VarScan combines heuristic (coverage, variant allele frequency, supporting read counts, strand representation) and statistical (Fisher's Exact Test of read counts) methods for variant detection and classification. This strategy can offer advantages over Bayesian variant callers that are sensitive to extreme read depths and make assumptions about expected allele frequencies. VarScan 2 is thus better suited to analyses involving pooled samples, variable read depth, or heterogeneous samples such as tumors.
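The statistical component can be illustrated with a short Python sketch that applies Fisher's Exact Test to a 2x2 table of reference- and variant-supporting read counts in a normal and a tumor sample. This is a sketch under stated assumptions rather than VarScan's own code: the function names are hypothetical, and a two-sided test is used here because the text does not say which tail the software evaluates.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


def tumor_normal_p_value(normal_ref, normal_var, tumor_ref, tumor_var):
    """Compare allele read counts between a normal and a tumor sample.

    Hypothetical helper: the two-sided choice is an assumption of this sketch.
    """
    return fisher_exact_two_sided(normal_ref, normal_var, tumor_ref, tumor_var)


if __name__ == "__main__":
    # Variant reads present in the tumor but absent from the normal
    print(tumor_normal_p_value(normal_ref=20, normal_var=0, tumor_ref=12, tumor_var=8))
```

A small p-value indicates that the variant allele is distributed differently between the two samples, which is the kind of evidence needed to separate a somatic change from a germline variant shared by both. Working on raw read counts requires no prior on the expected allele frequency, consistent with the contrast drawn above with Bayesian callers.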
Our software is used by large-scale sequencing centers, commercial entities, and numerous research groups. It was employed for somatic mutation calling in The Cancer Genome Atlas (TCGA) studies of ovarian and breast carcinoma, and had been cited by more than 180 publications as of March 2013.
Based on comparisons with high-density SNP array data, VarScan 2 provides high accuracy (>99.5%) for germline variant calling. We have also performed extensive orthogonal validation of somatic mutation calls in solid tumors, which supports VarScan 2's high sensitivity (~90%) to detect valid somatic mutations with a relatively low (~15%) false positive rate. Finally, a comparison of VarScan 2 exome-based copy number analysis to SNP array and WGS data for the same tumors indicates that VarScan's SCNA calls are highly concordant with those of other platforms, while potentially identifying additional focal events in coding regions.
The ongoing development and improvement of VarScan 2 benefit from a wide community of active users. More than a dozen code updates have been released since 2010, many in response to user feedback and bug reports. For example, we made VarScan compatible with Variant Call Format (VCF) upon request. The web site at http://varscan.sourceforge.net includes variant calling tutorials, documentation, and software downloads. We interact with users by e-mail, on the SourceForge Help forum, and in bioinformatics communities such as BioStars and SeqAnswers.