Tutorials · 8 min read

Variant Calling from Whole Genome Sequencing: A Biologist's Guide

A practical guide for wet-lab researchers generating whole genome sequencing data who need to identify genetic variants but lack command-line bioinformatics skills.

GeneChef Team · February 24, 2026
Tags: variant-calling, wgs, deepvariant, tutorial

You just got an email from your sequencing core. Your whole genome sequencing run is done — 30x coverage across 12 samples, exactly what you asked for. The FASTQ files are sitting on the server. Now you need to find the variants: the SNPs, the indels, maybe a structural rearrangement that explains why your knockout mice have that unexpected phenotype.

You open a bioinformatics tutorial and the first instruction says: bwa mem -t 8 -R '@RG\tID:sample1' hg38.fa sample1_R1.fq.gz sample1_R2.fq.gz | samtools sort -o sample1.bam. You close the tab.

This is where most bench scientists get stuck. Not because variant calling is conceptually hard — it's actually pretty intuitive once you understand what's happening — but because the tools were built by and for people who live in the terminal. Let's fix that.

What Is Variant Calling, and Why Should You Care?

Variant calling is the process of finding where your sample's DNA differs from a reference genome. That's it. You're comparing your sequencing reads against the "standard" human (or mouse, or zebrafish) genome and flagging every position where something is different.

Those differences fall into a few categories:

  • SNPs (single nucleotide polymorphisms): One base is swapped for another. The most common type. Think of the classic sickle cell mutation — a single A→T change in the HBB gene.
  • Indels (insertions and deletions): Small stretches of DNA are added or removed. These often cause frameshifts in coding regions, which can be devastating to protein function.
  • Structural variants: Larger rearrangements — deletions of thousands of bases, duplications, inversions, translocations. These are harder to detect and we'll be honest about that later.
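If it helps to see the categories as a rule, here is a minimal Python sketch that labels a variant from its reference and alternate alleles. The function name `classify_variant` is made up for illustration, and the rule only covers simple sequence alleles; structural variants are often written with symbolic alleles such as `<DEL>`, which this sketch does not handle.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a simple variant by comparing the REF and ALT allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    # Same length but longer than one base: several adjacent bases changed.
    return "multi-base substitution"


# Illustrative alleles only; they are not taken from a real dataset.
examples = [("A", "T"), ("A", "ATG"), ("CTT", "C"), ("AC", "GT")]
for ref, alt in examples:
    print(f"{ref} > {alt}: {classify_variant(ref, alt)}")
```

The first example mirrors the A→T change mentioned above: one base in, one base out, so it is a SNP.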

If you're studying rare disease genetics, cancer biology, population diversity, or even just confirming that your CRISPR edit landed where you intended, variant calling is how you get answers from raw sequencing data.

The Traditional Pipeline (and Why It's Painful)

A standard variant calling workflow looks something like this:

  1. Quality control — Check your raw reads for adapter contamination, low-quality bases, and GC bias (FastQC, Trimmomatic or fastp).
  2. Alignment — Map your reads to a reference genome using BWA-MEM or Bowtie2. This produces a BAM file — essentially a giant table showing where each read landed on the reference.
  3. Duplicate marking — Flag PCR duplicates so they don't inflate your variant confidence (Picard MarkDuplicates or samtools markdup).
  4. Variant calling — The actual detection step. GATK HaplotypeCaller has been the gold standard for years. DeepVariant, Google's deep learning approach, is increasingly popular and often more accurate for SNPs and indels.
  5. Annotation — Your raw variant calls are just genomic coordinates. You need tools like VEP (Ensembl) or SnpEff to tell you which gene each variant hits, whether it changes an amino acid, and whether it's been seen before in population databases.
  6. Filtering — Not every variant call is real. You need to filter by quality scores, read depth, and strand bias to separate true variants from sequencing artifacts.
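To make step 6 concrete, here is a rough Python sketch of a hard filter. Everything in it is an assumption for illustration: the `Call` record, the field names, and the thresholds (QUAL 30, depth 10, 90% of supporting reads on one strand). Real pipelines usually rely on the caller's own filters or on statistical tests for strand bias, so treat this as the idea, not as a recommended setting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    pos: int
    qual: float        # Phred-scaled call quality
    depth: int         # total reads covering the site
    alt_forward: int   # ALT-supporting reads on the forward strand
    alt_reverse: int   # ALT-supporting reads on the reverse strand


# Illustrative thresholds, not tuned values.
MIN_QUAL = 30.0
MIN_DEPTH = 10
MAX_STRAND_FRACTION = 0.9


def failure_reasons(call: Call) -> List[str]:
    """Return the reasons a call looks suspicious (empty list means keep it)."""
    reasons = []
    if call.qual < MIN_QUAL:
        reasons.append("low_qual")
    if call.depth < MIN_DEPTH:
        reasons.append("low_depth")
    alt_total = call.alt_forward + call.alt_reverse
    if alt_total > 0:
        dominant = max(call.alt_forward, call.alt_reverse) / alt_total
        if dominant > MAX_STRAND_FRACTION:
            reasons.append("strand_bias")
    return reasons


calls = [
    Call(pos=1000, qual=52.0, depth=31, alt_forward=8, alt_reverse=7),
    Call(pos=2000, qual=12.0, depth=30, alt_forward=6, alt_reverse=5),
    Call(pos=3000, qual=45.0, depth=4, alt_forward=2, alt_reverse=2),
    Call(pos=4000, qual=40.0, depth=28, alt_forward=12, alt_reverse=0),
]
for call in calls:
    reasons = failure_reasons(call)
    print(call.pos, "PASS" if not reasons else ",".join(reasons))
```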

Each of these steps requires installing software, downloading reference genomes (the human reference alone is ~3 GB), writing command-line scripts, and managing intermediate files that can easily reach hundreds of gigabytes. For 12 WGS samples at 30x, you're looking at roughly 1–2 TB of intermediate data.
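To give a sense of what "writing command-line scripts" means in practice, the sketch below builds the alignment command quoted at the top of this post for each of the 12 samples. It only prints the commands and does not run anything. The sample names and the `_R1`/`_R2` file naming are assumptions copied from that one example command; your sequencing core may name files differently, and a production pipeline would normally put more into the read group than an ID.

```python
import shlex


def alignment_command(sample: str, reference: str = "hg38.fa", threads: int = 8) -> str:
    """Build the bwa mem | samtools sort command line for one paired-end sample."""
    bwa = [
        "bwa", "mem", "-t", str(threads),
        "-R", rf"@RG\tID:{sample}",  # raw string keeps the literal \t that bwa expects
        reference, f"{sample}_R1.fq.gz", f"{sample}_R2.fq.gz",
    ]
    sort = ["samtools", "sort", "-o", f"{sample}.bam"]
    # shlex.join quotes any argument the shell would otherwise misread.
    return f"{shlex.join(bwa)} | {shlex.join(sort)}"


# Hypothetical sample names: sample1 through sample12.
samples = [f"sample{i}" for i in range(1, 13)]
for sample in samples:
    print(alignment_command(sample))
```

That is one step out of six, for one tool, before any error handling or cluster scheduling.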

A bioinformatician might set this up in a day and let it run overnight. If you're learning from scratch, budget two to four weeks just to get a working pipeline — and that's before you start troubleshooting the inevitable errors.

How GeneChef Handles This

GeneChef lets you describe your analysis in plain English, and an AI assistant builds the Galaxy workflow for you.

Here's what that actually looks like:

  1. Upload your FASTQ files. Drag and drop, or paste a URL if your sequencing core provides download links. GeneChef handles the cloud storage.
  2. Describe your experiment. In the AI chat, you might type something like: "I have paired-end whole genome sequencing data from 12 mouse samples, 30x coverage, on the mm39 reference. I need to call SNPs and indels using DeepVariant, then annotate with SnpEff."
  3. Review the workflow. The AI builds a Galaxy workflow with the right tools in the right order — FastQC for QC, BWA-MEM for alignment, DeepVariant for variant calling, SnpEff for annotation. You can see every step, adjust parameters if you want, or just trust the defaults.
  4. Run it. Click run. DeepVariant executes on GPU-accelerated NVIDIA L4 hardware, which makes it roughly 10x faster than running on a standard CPU server. A single 30x human genome that takes 8–10 hours on CPU finishes in under an hour on GPU.
  5. Get your results. Annotated VCF files, summary statistics, and quality reports — all downloadable from your browser.

For those 12 mouse samples, you're looking at roughly 4–6 hours of total compute time with GPU acceleration, versus 2–3 days on a typical academic cluster (assuming you don't hit a queue).

Reading Your VCF Results: A Plain-Language Guide

The VCF (Variant Call Format) file is what you actually care about. It looks intimidating at first, but the key fields are straightforward:

  • CHROM / POS: The chromosome and position of the variant. This is your genomic coordinate.
  • REF / ALT: The reference base(s) and what your sample has instead. REF=A, ALT=G means your sample has a G where the reference has an A.
  • QUAL: A Phred-scaled confidence score. Higher is better. A QUAL of 30 corresponds to a 1-in-1,000 chance that the call is wrong, and anything above that is generally considered reliable.
  • FILTER: Either PASS or a reason the variant was flagged as suspicious. Focus on the PASS variants first.
  • GT (genotype): A per-sample field, listed in the FORMAT column and filled in under each sample's column. 0/0 means homozygous reference (no variant), 0/1 means heterozygous (one copy of the variant), 1/1 means homozygous alternate (both copies carry the variant).
  • AD (allelic depth): Also a per-sample field. How many reads support the reference vs. the alternate allele. If you see AD=30,28, that's a solid heterozygous call with good support on both alleles.
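If you ever want to check a single line by hand, the fields above can be pulled apart with a few lines of Python. This is a minimal sketch, assuming a single-sample, tab-separated VCF data line whose FORMAT column includes GT and AD with one alternate allele; the example line and the function names are invented for illustration.

```python
GENOTYPE_LABELS = {
    "0/0": "homozygous reference",
    "0/1": "heterozygous",
    "1/1": "homozygous alternate",
}


def phred_to_error_probability(qual: float) -> float:
    """Convert a Phred-scaled score into the probability that the call is wrong."""
    return 10 ** (-qual / 10)


def parse_vcf_line(line: str) -> dict:
    """Extract the commonly used fields from one single-sample VCF data line."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, filt, _info, fmt, sample = columns[:10]
    # FORMAT names the per-sample fields; the sample column holds their values.
    per_sample = dict(zip(fmt.split(":"), sample.split(":")))
    ref_reads, alt_reads = (int(n) for n in per_sample["AD"].split(","))
    total = ref_reads + alt_reads
    genotype = per_sample["GT"].replace("|", "/")  # treat phased calls the same way
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "passed": filt == "PASS",
        "genotype": GENOTYPE_LABELS.get(genotype, genotype),
        "ref_reads": ref_reads,
        "alt_reads": alt_reads,
        "alt_fraction": alt_reads / total if total else 0.0,
    }


# Made-up example line, using the AD=30,28 case described above.
line = "chr1\t12345\t.\tA\tG\t48.2\tPASS\t.\tGT:AD\t0/1:30,28"
record = parse_vcf_line(line)
print(record["genotype"], round(record["alt_fraction"], 2))
print(phred_to_error_probability(30))
```

For a heterozygous call you would expect the alternate-allele fraction to sit near one half, which is what 28 out of 58 reads gives you.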

SnpEff annotation adds information like which gene the variant falls in, whether it's missense, nonsense, or synonymous, and its predicted impact (HIGH, MODERATE, LOW, MODIFIER).
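SnpEff writes these annotations into the INFO column as an `ANN=` entry, with pipe-separated fields that start with the allele, the effect, the impact, and the gene name. The sketch below reads just those first four fields and picks the most severe annotation. The gene names, transcript IDs, and the example INFO string are made up for illustration, and the helper names are not part of SnpEff.

```python
from typing import Dict, List, Optional

ANN_FIELDS = ("allele", "annotation", "impact", "gene")
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def parse_ann(info: str) -> List[Dict[str, str]]:
    """Return the first four ANN sub-fields for every annotation in an INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            annotations = []
            # One variant can carry several annotations, separated by commas.
            for item in entry[len("ANN="):].split(","):
                parts = item.split("|")
                annotations.append(dict(zip(ANN_FIELDS, parts[:4])))
            return annotations
    return []


def most_severe(annotations: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Pick the annotation with the highest predicted impact, if there is one."""
    return min(
        annotations,
        key=lambda ann: IMPACT_RANK.get(ann["impact"], len(IMPACT_RANK)),
        default=None,
    )


# Invented INFO string with two annotations for the same variant.
info = (
    "DP=58;ANN="
    "G|upstream_gene_variant|MODIFIER|GENE2|GENE2|transcript|TX2|protein_coding||c.-500A>G|||||500|,"
    "G|missense_variant|MODERATE|GENE1|GENE1|transcript|TX1|protein_coding|2/5|c.100A>G|p.Thr34Ala|||||"
)
annotations = parse_ann(info)
top = most_severe(annotations)
print(top["gene"], top["annotation"], top["impact"])
```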

Common Use Cases

Rare disease diagnosis. You're looking for the one or two variants (out of ~4 million in a typical human genome) that explain a patient's phenotype. Trio analysis — sequencing the patient plus both parents — helps you filter down to de novo or recessive candidates.
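The trio logic is simple enough to write down. Here is a minimal sketch, assuming biallelic genotypes in the 0/0, 0/1, 1/1 notation and ignoring read depth, sex chromosomes, and genotype quality, all of which matter in a real analysis. The function names are hypothetical.

```python
from typing import Optional


def alt_copies(genotype: str) -> int:
    """Count copies of the alternate allele in a genotype such as '0/1'."""
    return sum(allele == "1" for allele in genotype.replace("|", "/").split("/"))


def inheritance_pattern(child: str, mother: str, father: str) -> Optional[str]:
    """Flag genotype combinations consistent with de novo or recessive inheritance."""
    c, m, f = alt_copies(child), alt_copies(mother), alt_copies(father)
    if c == 1 and m == 0 and f == 0:
        return "de novo candidate"    # present in the child, absent from both parents
    if c == 2 and m == 1 and f == 1:
        return "recessive candidate"  # both parents are carriers
    return None


print(inheritance_pattern("0/1", "0/0", "0/0"))
print(inheritance_pattern("1/1", "0/1", "0/1"))
print(inheritance_pattern("0/1", "0/1", "0/0"))
```

The third case returns `None`: the child simply inherited the variant from a heterozygous mother, so it is not flagged.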

Cancer somatic mutations. Comparing tumor vs. normal tissue to find mutations that are unique to the cancer. This requires a slightly different pipeline (Mutect2 or similar somatic callers), and GeneChef's AI will select the right tools when you describe a tumor-normal experiment.

Population genetics. Calling variants across dozens or hundreds of samples to study allele frequencies, selection, or population structure. Joint calling across samples improves accuracy for rare variants.
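As a small worked example, an allele frequency is just the number of alternate alleles divided by the number of alleles that were successfully called. The sketch below assumes diploid, biallelic genotypes and treats "." as a missing call; the cohort of 12 genotypes is invented.

```python
from typing import Iterable, Optional


def alt_allele_frequency(genotypes: Iterable[str]) -> Optional[float]:
    """Fraction of called alleles that are the alternate allele."""
    alt = 0
    total = 0
    for genotype in genotypes:
        alleles = genotype.replace("|", "/").split("/")
        called = [allele for allele in alleles if allele != "."]  # skip missing calls
        alt += sum(allele == "1" for allele in called)
        total += len(called)
    return alt / total if total else None


# Invented cohort: 8 homozygous reference, 3 heterozygous, 1 homozygous alternate.
cohort = ["0/0"] * 8 + ["0/1"] * 3 + ["1/1"]
print(alt_allele_frequency(cohort))  # 5 alternate alleles out of 24
```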

CRISPR validation. Confirming your edit is on-target and checking for off-target effects elsewhere in the genome.

When You Still Need a Human Expert

Let's be honest about the limitations:

  • Structural variants (large deletions, duplications, inversions, translocations) are significantly harder to detect from short-read WGS. Tools like Manta or DELLY can find some of them, but sensitivity is lower than for SNPs and indels. If structural variants are your primary interest, long-read sequencing (PacBio or Oxford Nanopore) with specialized callers is a better approach.
  • Novel or non-model organisms without a well-assembled reference genome make alignment-based variant calling unreliable. You may need a de novo assembly first.
  • Clinical-grade variant interpretation — deciding whether a specific variant actually causes disease — requires domain expertise, database lookups (ClinVar, gnomAD), and often manual review. GeneChef gives you the variant calls and annotations, but interpreting pathogenicity is still a job for a geneticist or genetic counselor.
  • Complex experimental designs like large multi-family linkage studies or somatic variant calling in low-purity tumor samples benefit from a bioinformatician who can tune parameters and validate results.

Cost and Time: GeneChef vs. Traditional Approaches

| Approach | Setup Time | Run Time (12 samples, 30x WGS) | Cost |
|----------|-----------|-------------------------------|------|
| Learn bioinformatics yourself | 2–4 weeks | 2–3 days on academic cluster | Free (but your time isn't) |
| Hire a bioinformatician | 1–2 weeks (if available) | 1–2 days | $80–150/hr consulting |
| Commercial analysis service | 1–2 weeks turnaround | Included | $200–500 per sample |
| GeneChef (Professional tier) | 15 minutes | 4–6 hours (GPU-accelerated) | $299/month flat rate |

The Professional tier includes 10 GPU hours per month, which covers roughly 10 individual WGS variant calling runs with DeepVariant. For larger projects, additional GPU time is billed at $2.50/hour — still far cheaper than most alternatives.
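The arithmetic is easy to check. The sketch below uses the prices quoted above and assumes, for illustration only, that each WGS run uses about one GPU hour (the figure implied by "10 GPU hours covers roughly 10 runs"); your actual usage will vary with genome size and coverage.

```python
BASE_FEE = 299.00          # Professional tier, per month
INCLUDED_GPU_HOURS = 10    # GPU hours included in the tier
OVERAGE_RATE = 2.50        # dollars per additional GPU hour


def monthly_cost(gpu_hours: float) -> float:
    """Flat fee plus any GPU time beyond the included hours."""
    extra_hours = max(0.0, gpu_hours - INCLUDED_GPU_HOURS)
    return BASE_FEE + extra_hours * OVERAGE_RATE


# Assumption: roughly one GPU hour per sample.
samples = 12
print(f"GeneChef, {samples} samples: ${monthly_cost(samples):.2f}")

# Commercial service range from the table: $200 to $500 per sample.
print(f"Commercial service: ${200 * samples} to ${500 * samples}")
```

Under that assumption, 12 samples means 2 extra GPU hours, or $5 on top of the flat fee.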

Getting Started

If you have FASTQ files from a WGS experiment and you want to find variants, here's the fastest path:

  1. Sign up for a free 14-day trial at genechef.io.
  2. Upload your FASTQ files.
  3. Open the AI assistant and describe your experiment — species, reference genome, what you're looking for.
  4. Review the workflow, hit run, and check back in a few hours.

You don't need to install anything. You don't need to learn the command line. You don't need to provision a server. You just need your data and a question.


GeneChef is a managed bioinformatics platform built on Galaxy that lets researchers run complex genomic analyses without writing code. AI-powered workflow building, GPU-accelerated tools, and 100+ bioinformatics tools — all in your browser at genechef.io.
