Quickly create your own ultra-fast pipeline for batch processing of up to 100 files. Use defaults or define specific parameters. It's your choice!
Meet EvE, the first universal genetic adaptor that aligns, calls, annotates, converts and interprets almost any genetic data file to EvErything. EvE includes a straightforward user interface and powerful, dynamic parallel processing so that using your custom EvE pipeline is effortless and fast.
With EvE Premium Batch, you can easily create your own ultra-fast custom pipeline for batch processing. For example, you can select Isaac for alignment and SamTools for variant calling. Or you can select CutAdapt for pre-processing, TopHat2 for alignment, Isaac for variant calling, SnpEff for annotation and ClinVar for interpretation.
Once you select your pipeline, you can use defaults or easily modify almost any parameter. EvE Premium Batch will then run your batch in parallel through your custom pipeline using multithreaded cluster cloud-based processing.
You no longer have to worry about whether your computer hardware and servers are powerful enough to process your data. With EvE, you also don't have to worry about bandwidth charges or obscure cloud computing fees. All you need is an internet connection (such as from a laptop or mobile device) and EvE will process your genetic data.
There are three editions of EvE:
1. EvE Free
- Cost: Free
- Performs all pre-processing, alignments, variant discovery, annotations and conversions.
- With just a few clicks, EvE will process your data.
- EvE's Default Settings process data according to the most commonly used settings, while Advanced Settings allow you to define most parameters.
- Select any human reference genome from hg2 through GRCh38.p6 or simply use the default 'Auto Detect'
- Reference Genome Auto Detect
- automatically uses GRCh38.p6 if your pipeline includes alignment
- automatically detects the reference genome that was used to create your genetic data file if the file format is downstream of alignment (such as BAM, SAM, CRAM, VCF, CSV, TXT, etc.)
2. EvE Premium
- Cost: $0.99/use
- All of the features of EvE Free and:
- incredible speed
- Premium utilizes dedicated multithreaded cluster cloud computing technology to significantly speed up processing so you receive results in a fraction of the time.
- Premium includes the option for interpretation of human genetic data including identification of pathologic variants.
- Example: converting a whole genome 100GB FASTQ file to VCF using SamTools or GATK can take 12 hours or more but with EvE Premium the conversion time is around 60 minutes.
3. EvE Premium Batch (this app)
- Cost: $99/use
- All the features of EvE Premium and
- batch processing
- Premium Batch includes the ability to process multiple files at a single time, all with incredible speed.
- Simultaneously convert a batch of up to 100 whole genomes from FASTQ to VCF in around 60 minutes total.
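This page does not describe how 'Auto Detect' identifies the reference genome of a file that is downstream of alignment. As a rough illustration only, the sketch below shows one plausible approach: reading the assembly name from a VCF header (the `##reference=` line or the `assembly=` key of a `##contig=` line) or from the `AS:` tag of a SAM `@SQ` header line. The function name `guess_reference` is hypothetical and is not part of EvE.

```python
import re
from typing import Iterable, Optional


def guess_reference(header_lines: Iterable[str]) -> Optional[str]:
    """Return the assembly named in a VCF or SAM header, or None if absent.

    Hypothetical helper for illustration; EvE's real detection logic
    is not documented on this page.
    """
    for raw in header_lines:
        line = raw.rstrip("\n")
        # Stop at the first line that is not a header line.
        if not line.startswith(("#", "@")):
            break
        if line.startswith("##reference="):
            # VCF: ##reference=<name or path of the reference>
            return line.split("=", 1)[1]
        if line.startswith("##contig="):
            # VCF: ##contig=<ID=...,length=...,assembly=...>
            match = re.search(r"assembly=([^,>]+)", line)
            if match:
                return match.group(1)
        if line.startswith("@SQ"):
            # SAM: optional AS tag holds the genome assembly identifier.
            for field in line.split("\t")[1:]:
                if field.startswith("AS:"):
                    return field[3:]
    return None


if __name__ == "__main__":
    vcf_header = ["##fileformat=VCFv4.2", "##reference=GRCh38", "#CHROM\tPOS"]
    print(guess_reference(vcf_header))
```

Headers frequently omit these fields, so a real detector would need a fallback, such as comparing contig names and lengths against known assemblies.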
EvE Premium Batch v2 supports the following:
- Batch sizes up to 100 different files
- All files must be of the same file format (for example, all files must be FASTQ files or all files must be BAM files)
- Each file will be processed according to your custom pipeline.
- The app will process all files in parallel using dynamic multithreaded cloud processing
- FASTQ to VCF (Supports both single end reads and paired-end reads)
- FASTA to VCF (Supports both single end reads and paired-end reads)
- FASTQ to BAM
- FASTA to BAM
- FASTQ to SAM
- FASTA to SAM
- SAM to FASTQ
- BAM to FASTQ
- BAM to VCF
- SAM to VCF
- SAM to BAM
- BAM to SAM
- BAM to SVG
- BAM to CRAM
- SAM to CRAM
- BED to VCF
- VCF to gVCF
- VCF to GVF
- GVF to VCF
- VCF to WT (Wormtable)
- GVCF to VCF
- Genome VCF to VCF
- VCF to Genome VCF
- Text region lists to VCF: When a region list is supplied then data for those regions will be extracted.
- CSV to VCF (specific formatting required)
- TXT to VCF (specific formatting required)
- FASTA/ FASTQ/ SAM or BAM to Clinical plus VCF: This is a Sequencing.com VCF format that includes call and no-call data but excludes reference calls.
- FASTA/ FASTQ/ SAM or BAM to annotated VCF file
- FASTA/ FASTQ/ SAM/ BAM and VCF file to GVF: Supports conversion to a Genome Variation Format file
- FASTA/ FASTQ/ SAM/ BAM and VCF to Wormtable format
- Array to VCF: Converts a gene array file (including files from 23andMe, Ancestry.com, Family Tree DNA and The Genographic Project (National Geographic)) into a VCF file
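The two batch rules listed above (at most 100 files, all in the same file format) can be checked locally before uploading. The following is a minimal sketch, assuming that the format can be inferred from the file extension and that a trailing .gz or .bz2 suffix should be ignored. The function names and the extension aliases (`fq`, `fa`) are illustrative assumptions and are not part of EvE.

```python
from pathlib import Path
from typing import List

MAX_BATCH_SIZE = 100                      # batch limit stated on this page
COMPRESSION_SUFFIXES = {".gz", ".bz2"}    # compression suffixes to ignore
ALIASES = {"fq": "fastq", "fa": "fasta"}  # assumed common short extensions


def base_format(filename: str) -> str:
    """Return the lower-case format of a file name, ignoring compression."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        raise ValueError(f"cannot infer a file format from {filename!r}")
    fmt = suffixes[-1].lstrip(".")
    return ALIASES.get(fmt, fmt)


def validate_batch(filenames: List[str]) -> str:
    """Check the batch rules and return the single shared format."""
    if not filenames:
        raise ValueError("batch is empty")
    if len(filenames) > MAX_BATCH_SIZE:
        raise ValueError(f"batch has {len(filenames)} files; limit is {MAX_BATCH_SIZE}")
    formats = {base_format(name) for name in filenames}
    if len(formats) != 1:
        raise ValueError(f"mixed file formats in batch: {sorted(formats)}")
    return formats.pop()


if __name__ == "__main__":
    print(validate_batch(["s1.fastq.gz", "s2.fq", "s3.fastq.bz2"]))
```

Extension-based checks are only a convenience; a file with a misleading extension would still need to be caught by inspecting its contents.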
EvE Premium Batch accepts almost all file formats including .bz2 and .gz compression.
EvE also accepts inputs generated using any reference genome from hg2 through GRCh38.p3.
- gVCF (genomic VCF)
- GVF (Genome variant format)
- WT (Wormtable)
- 23andMe data file
- AncestryDNA data file
- Family Tree DNA data file
- Genographic Project (National Geographic) data file
Outputs can be produced using any reference genome of your choice from hg2 through GRCh38.p3.
- VCF with reference SNP IDs (rs<number>) added
- Annotated VCF (includes reference SNP IDs rs<number>)
- gVCF (genomic VCF)
- GVF (Genome variant format)
- Clinical+ VCF
- WT (Wormtable)
- ADAM (coming soon)
- GFF3 (coming soon)
- SQLite (coming soon)
- MongoDB (coming soon)
- -6, --illumina1.3+: Assume the quality is in the Illumina 1.3+ encoding.
- -A, --count-orphans: Do not skip anomalous read pairs in variant calling.
- -b, --bam-list FILE: List of input BAM files, one file per line [null]
- -B, --no-BAQ: Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.
- -C, --adjust-MQ INT: Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50. 
- -d, --max-depth INT: At a position, read at most INT reads per input BAM.
- -E, --redo-BAQ: Recalculate BAQ on the fly, ignore existing BQ tags
- -f, --fasta-ref FILE: The faidx-indexed reference file in the FASTA format. The file can be optionally compressed by bgzip. [null]
- -G, --exclude-RG FILE: Exclude reads from readgroups listed in FILE (one @RG-ID per line)
- -l, --positions FILE: BED or position list file containing a list of regions or sites where pileup or BCF should be generated. If BED, positions are 0-based half-open [null]
- -q, --min-MQ INT: Minimum mapping quality for an alignment to be used
- -Q, --min-BQ INT: Minimum base quality for a base to be considered 
- -r, --region STR: Only generate pileup in region STR. Requires the BAM files to be indexed. If used in conjunction with -l, considers the intersection of the two requests. [all sites]
- -R, --ignore-RG: Ignore RG tags. Treat all reads in one BAM as one sample.
- --rf, --incl-flags STR|INT: Required flags: skip reads with mask bits unset [null]
- --ff, --excl-flags STR|INT: Filter flags: skip reads with mask bits set [UNMAP,SECONDARY,QCFAIL,DUP]
- -x, --ignore-overlaps: Disable read-pair overlap detection.
- -o, --output FILE: Write pileup or VCF/BCF output to FILE, rather than the default of standard output.
- -g, --BCF: Compute genotype likelihoods and output them in the binary call format (BCF). As of v1.0, this is BCF2 which is incompatible with the BCF1 format produced by previous (0.1.x) versions of samtools.
- -v, --VCF: Compute genotype likelihoods and output them in the variant call format (VCF). Output is bgzip-compressed VCF unless -u option is set.
- -O, --output-BP: Output base positions on reads.
- -s, --output-MQ: Output mapping quality.
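The mapping-quality adjustment described for -C above can be worked through numerically. The sketch below evaluates the expression sqrt((INT-q)/INT)*INT exactly as written in the option description; it is not the internal samtools implementation. Returning 0 when q is at or above INT is an assumption of this sketch, made only to avoid taking the square root of a negative number.

```python
import math


def adjusted_mapping_quality(q: float, coeff: int) -> float:
    """Evaluate sqrt((coeff - q) / coeff) * coeff from the -C description.

    coeff == 0 disables the adjustment, so q is returned unchanged.
    """
    if coeff < 0:
        raise ValueError("coeff must be non-negative")
    if coeff == 0:
        return q
    if q >= coeff:
        # Assumption of this sketch: clamp instead of failing on sqrt(<0).
        return 0.0
    return math.sqrt((coeff - q) / coeff) * coeff


if __name__ == "__main__":
    # With the value recommended for BWA (50): (50 - 37.5) / 50 = 0.25,
    # sqrt(0.25) = 0.5, and 0.5 * 50 = 25.0.
    print(adjusted_mapping_quality(37.5, 50))
```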
Output Options for VCF/BCF format (with -g or -v):
- -D: Output per-sample read depth [DEPRECATED - use -t DP instead]
- -S: Output per-sample Phred-scaled strand bias P-value [DEPRECATED - use -t SP instead]
- -t, --output-tags LIST
Options for SNP/INDEL Genotype Likelihood Computation (for -g or -v):
- -e, --ext-prob INT: Phred-scaled gap extension sequencing error probability. Reducing INT leads to longer indels. 
- -F, --gap-frac FLOAT: Minimum fraction of gapped reads [0.002]
- -h, --tandem-qual INT: Coefficient for modeling homopolymer errors. Given an l-long homopolymer run, the sequencing error of an indel of size s is modeled as INT*s/l. 
- -I, --skip-indels: Do not perform INDEL calling
- -L, --max-idepth INT
- -m, --min-ireads INT: Minimum number of gapped reads for indel candidates.
- -o, --open-prob INT: Phred-scaled gap open sequencing error probability. Reducing INT leads to more indel calls. 
- -p, --per-sample-mF: Apply -m and -F thresholds per sample to increase sensitivity of calling. By default both options are applied to reads pooled from all samples.
- -P, --platforms STR: Comma-delimited list of platforms (determined by @RG-PL) from which indel candidates are obtained. It is recommended to collect indel candidates from sequencing technologies that have low indel error rate such as ILLUMINA. [all]
- -R: Reference File Name
- -I: BAM file name
- --alleles: The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES
- --comp: Comparison VCF file
- --dbsnp, -D: dbSNP file
- --annotation, -A: One or more specific annotations to apply to variant calls
- --contamination_fraction_to_filter, -contamination: Fraction of contamination in sequencing data (for all samples) to aggressively remove
- --excludeAnnotation, -XA: One or more specific annotations to exclude
- --genotype_likelihoods_model, -glm: Genotype likelihoods calculation model to employ. SNP is the default option, while INDEL is also available for calling indels and BOTH is available for calling both together
- --genotyping_mode, -gt_mode: Specifies how to determine the alternate alleles to use for genotyping
- --group, -G: One or more classes/groups of annotations to apply to variant calls. The single value 'none' removes the default group
- --heterozygosity, -hets: Heterozygosity value used to compute prior likelihoods for any locus.
- --indel_heterozygosity, -indelHeterozygosity: Heterozygosity for indel calling.
- --max_deletion_fraction, -deletions: Maximum fraction of reads with deletions spanning this locus for it to be callable
- --min_base_quality_score, -mbq: Minimum base quality required to consider a base for calling
- --min_indel_count_for_genotyping, -minIndelCnt: Minimum number of consensus indels required to trigger genotyping run
- --min_indel_fraction_per_sample, -minIndelFrac: Minimum fraction of all reads at a locus that must contain an indel (of any allele) for that sample to contribute to the indel count for alleles
- --output_mode, -out_mode: Specifies which type of calls we should output
- --pair_hmm_implementation, -pairHMM: The PairHMM implementation to use for -glm INDEL genotype likelihood calculations
- --pcr_error_rate, -pcr_error: The PCR error rate to be used for computing fragment-based likelihoods
- --standard_min_confidence_threshold_for_calling, -stand_call_conf: The minimum phred-scaled confidence threshold at which variants should be called.
- --standard_min_confidence_threshold_for_emitting, -stand_emit_conf: The minimum phred-scaled confidence threshold at which variants should be emitted (and filtered with LowQual if less than the calling threshold)
- -annotateNDA, -nda: If provided, we will annotate records with the number of alternate alleles that were discovered (but not necessarily genotyped) at a given site
- --computeSLOD, -slod: If provided, we will calculate the SLOD (SB annotation)
- -indelGCP: Indel gap continuation penalty, as Phred-scaled probability. For example, 30 => 10^(-30/10)
- --indelGapOpenPenalty, -indelGOP: Indel gap open penalty, as Phred-scaled probability. For example, 30 => 10^(-30/10)
- --input_prior, -inputPrior: Input prior for calls
- --max_alternate_alleles, -maxAltAlleles: Maximum number of alternate alleles to genotype
- -noShiftHgvs: Do not shift variants towards most 3-prime position (as required by HGVS).
- -canon: Only use canonical transcripts.
- -interval: Use custom intervals from a TXT/BED/BigBed/VCF/GFF file (you may use this option many times)
- -motif: Annotate using motifs (requires Motif database).
- -noMotif: Disable motif annotations.
- -noNextProt: Disable NextProt annotations.
- -onlyReg: Only use regulation tracks.
- -onlyProtein: Only use protein coding transcripts. Default: false
- -onlyTR: Only use the transcripts in this file. Format: One transcript ID per line.
- -reg: Regulation track to use (this option can be used several times).
- -ss, -spliceSiteSize: Set size for splice sites (donor and acceptor) in bases. Default: 2
- -strict: Only use 'validated' transcripts (i.e. sequence has been checked). Default: false
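Several of the options listed above (for example the gap open and gap extension probabilities, the indel penalties and the confidence thresholds) are given as Phred-scaled values. As a short worked example, a Phred value Q corresponds to a probability of 10^(-Q/10), so 30 corresponds to 0.001. The helper names below are illustrative and do not belong to any of the tools.

```python
import math


def phred_to_probability(q: float) -> float:
    """Convert a Phred-scaled value Q to a probability: 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def probability_to_phred(p: float) -> float:
    """Convert a probability in (0, 1] to its Phred-scaled value."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_probability(q))
```

Each increase of 10 in the Phred value corresponds to a tenfold decrease in probability.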
When you use EvE Premium, your data files and result file(s) will be stored in your Sequencing.com account, which provides free, unlimited storage of genetic data.
This simple yet secure approach means you no longer have to use your own storage or computing power to conduct genetic analysis and you also no longer have to use USB drives or FTP sites to move and share genetic data.
EvE includes an integration of the following:
Samtools
A suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:
- Samtools: Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format
- BCFtools: Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants
- HTSlib: A C library for reading/writing high-throughput sequencing data
GATK (Genome Analysis Toolkit)
GATK analyzes high-throughput sequencing data. The toolkit, which focuses on providing high quality data, offers a wide variety of tools including variant discovery and genotyping.
Isaac Variant Caller
Isaac Variant Caller (IVC) is an analysis package designed to detect SNVs and small indels from the aligned sequencing reads of a single diploid sample.
SnpEff predicts the effects of genetic variations and also provides annotations. Annotations can be simple, such as the variant name, or more complex, such as the site of the variation (for example, exon or intron) and its effect on gene expression.
UGENE offers conversions between different file formats, but these are at times a multistep process, so we use our own scripts to pipe data and create some file types. Specifically, the Clinical plus VCF files and gVCF to VCF conversions are executed by Sequencing.com custom scripts.
Wormtable is a format for storing large scale tabular data and interacting with it. It generates an index file as well that can be used repeatedly. Wormtable files are considered very Python friendly and can be used in downstream Python based analysis.
GVF (Genome Variation Format) is gaining popularity as a standard for sequence ontology based datasets. It is a successor of the GFF3 format and includes pragmas for defining sequence alterations at genomic locations as compared to the reference genome.
Human clinical applications require sequencing information for both variant and non-variant positions, yet there is currently no common exchange format for such data. Genomic VCF (gVCF) addresses this issue. gVCF is a set of conventions applied to the standard variant call format (VCF) that include genotype, annotation and other information across all sites in the genome in a reasonably compact format.
- The Sequence alignment/map (SAM) format and SAMtools Li H., et al., 2009 Bioinformatics 25: 2078-9. [Pubmed]
- A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Li H. Bioinformatics 2011 Nov 1;27(21): 2987-93. Epub 2011 Sep 8. [Pubmed]
- Multiallelic calling model in bcftools (-m) Danecek P., et al. [Link]
- Improving SNP discovery by base alignment quality. Li H. Bioinformatics 2011 Apr 15;27(8):1157-8. doi: 10.1093/bioinformatics/btr076. Epub 2011 Feb 13. [Pubmed]
- Segregation based metric for variant call QC Durbin R. [Link]
- Mathematical Notes on SAMtools Algorithms Li H. [Link]
- The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, McKenna A, et al., Genome Research 2010 20:1297-303 [Article] [Pubmed]
- A framework for variation discovery and genotyping using next-generation DNA sequencing data, DePristo M, et al., Nature Genetics 2011 43:491-498 [Article] [Pubmed]
- From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline, Van der Auwera GA, et al., Current Protocols in Bioinformatics 2013 43:11.10.1-11.10.33 [Article] [Pubmed]
Isaac Variant Caller
- Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms Bioinformatics June 4, 2013 29(16): 2041-2043 [Article] [Pubmed]
- Isaac Genome Alignment and Isaac Variant Caller Illumina Technical Support Note. Pub. No. 770-2013-009 Current as of July 3, 2013 [Link]
- A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012 Apr-Jun;6(2):80-92. doi: 10.4161/fly.19695. [PubMed]
- Clinical+ VCF [Link]
- GVF Convertor
- Processing genome scale tabular data with wormtable [Article]