
GATK workflow for Cancer

I am just starting to learn to use bioinformatics tools. My university has a limited and expensive bioinformatics team, so I'm mostly on my own except for big questions.

I am planning to use GATK to run 58 tumor/normal pairs of exome sequencing data (Illumina), starting from FASTQ or BAM files, through the pipeline, with VCF and MAF output for analysis.

The GATK pipeline we currently run is set up for (germline) disease studies rather than cancer, so I was wondering if anyone knew whether changes should be made for cancer. Here's the current pipeline, starting from BAM files:

  • (Non-GATK) Picard MarkDuplicates or samtools rmdup
  • Indel Realignment (RealignerTargetCreator + IndelRealigner)
  • Base Quality Score Recalibration (BaseRecalibrator + PrintReads)
  • HaplotypeCaller
  • VQSR (VariantRecalibrator and ApplyRecalibration in SNP and INDEL mode)
  • Annotation using Oncotator (?)

I'd like some verification that this pipeline will output what I need to run my samples on MuTect, MutSig, or some other analysis program. I appreciate any advice.


MuTect2 was just released into beta as part of GATK 3.5. It's based on HaplotypeCaller but makes somatic SNV and INDEL calls. You can find more information about MuTect2 on the GATK blog and ask any additional questions on the forum.

As a note: indel realignment is not needed with MuTect2 (it performs its own local reassembly), and there is no VQSR available for somatic calls.

MarkDuplicates -> BQSR -> Mutect2 -> Oncotator is a good basic workflow for somatic variant calling.
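If it helps, here is a minimal sketch of that basic workflow using GATK4-style command lines (the thread above refers to GATK 3.x, whose invocations differ; the reference, known-sites files, sample names and file names below are placeholders):

    # Mark duplicates (Picard tool, bundled with GATK4)
    gatk MarkDuplicates -I tumor.sorted.bam -O tumor.dedup.bam -M tumor.dup_metrics.txt

    # Base quality score recalibration
    gatk BaseRecalibrator -R ref.fasta -I tumor.dedup.bam \
        --known-sites dbsnp.vcf.gz --known-sites mills_indels.vcf.gz \
        -O tumor.recal.table
    gatk ApplyBQSR -R ref.fasta -I tumor.dedup.bam \
        --bqsr-recal-file tumor.recal.table -O tumor.recal.bam

    # (repeat MarkDuplicates and BQSR for the matched normal BAM)

    # Somatic SNV/indel calling on the tumor-normal pair, then filtering
    # (no VQSR for somatic calls; -normal takes the normal sample's SM read-group name)
    gatk Mutect2 -R ref.fasta -I tumor.recal.bam -I normal.recal.bam \
        -normal normal_sample_name -O somatic.vcf.gz
    gatk FilterMutectCalls -R ref.fasta -V somatic.vcf.gz -O somatic.filtered.vcf.gz

The filtered VCF can then be annotated (e.g. with Oncotator, or its GATK4 successor Funcotator) to produce the MAF needed for downstream tools.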


This post goes over what MuTect requires as input. Mark duplicates and indel realignment will probably have to be run on the BAM file before it can be used as input. BQSR is optional and does not change the base qualities much. HaplotypeCaller is used for germline, not somatic, variant calling.

If you have follow-up bioinformatics questions, you might be able to find answers more quickly on Biostars or the GATK forums.




The GATK joint genotyping workflow is appropriate for calling variants in RNA-seq experiments

The Genome Analysis Toolkit (GATK) is a popular set of programs for discovering and genotyping variants from next-generation sequencing data. The current GATK recommendation for RNA sequencing (RNA-seq) is to perform variant calling from individual samples, with the drawback that only variable positions are reported. Versions 3.0 and above of GATK offer the possibility of calling DNA variants on cohorts of samples using the HaplotypeCaller algorithm in Genomic Variant Call Format (GVCF) mode. Using this approach, variants are called individually on each sample, generating one GVCF file per sample that lists genotype likelihoods and their genome annotations. In a second step, variants are called from the GVCF files through a joint genotyping analysis. This strategy is more flexible and reduces computational challenges in comparison to the traditional joint discovery workflow. Using a GVCF workflow for mining SNPs in RNA-seq data provides substantial advantages, including reporting homozygous genotypes for the reference allele as well as missing data.

Taking advantage of RNA-seq data derived from primary macrophages isolated from 50 cows, researchers from Agriculture and Agri-Food Canada validated the GATK joint genotyping method for calling variants on RNA-seq data by comparing this approach to a so-called “per-sample” method. In addition, pair-wise comparisons of the two methods were performed to evaluate their respective sensitivity, precision and accuracy using DNA genotypes from a companion study including the same 50 cows, genotyped using either genotyping-by-sequencing or the BovineSNP50 BeadChip (imputed to the BovineHD BeadChip density). Results indicate that both approaches are very close in their capacity to detect reference variants and that the joint genotyping method is more sensitive than the per-sample method. Given that the joint genotyping method is more flexible and technically easier, the researchers recommend this approach for variant calling in RNA-seq experiments.
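As a rough sketch of the two-step GVCF strategy in current GATK4 syntax (reference, BAM and sample names are placeholders; an RNA-seq workflow would also include upstream read processing such as SplitNCigarReads):

    # Step 1: call each sample individually in GVCF mode (one GVCF per sample)
    gatk HaplotypeCaller -R ref.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF
    gatk HaplotypeCaller -R ref.fasta -I sample2.bam -O sample2.g.vcf.gz -ERC GVCF

    # Step 2: combine the per-sample GVCFs and jointly genotype the cohort
    gatk CombineGVCFs -R ref.fasta -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O cohort.g.vcf.gz
    gatk GenotypeGVCFs -R ref.fasta -V cohort.g.vcf.gz -O cohort.vcf.gz

Because the GVCFs record reference blocks as well as variant sites, the joint step can emit homozygous-reference and missing genotypes, which is the advantage described above.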

Common variants found in different datasets

(a) Comparison of RNA-seq variants detected using the per-sample and the joint genotyping approaches. (b) Comparison of the two sets of RNA-seq variants with those detected by the BovineHD BeadChip. (c) Comparison of the two sets of RNA-seq variants with those detected by GBS.


Cancer Analysis Workflow v1.0 - from FASTQ to VCF

We have been working for a while on a workflow to analyse tumor-normal WGS pairs. It is pretty complex, but it is now used in research projects. So, here is the very first version: https://github.com/SciLifeLab/CAW

Just curious, what's the performance like? What size FASTQs do you typically have, and given that, how long would the entire workflow take to produce VCFs?

The raw data (both tumour and normal) is about 200 GB of FASTQ at the beginning. The non-realigned BAMs are still about 200 GB, and recalibration roughly doubles that. These are actually relatively low-coverage samples; when starting from a 30x normal / 60x tumor pair, these numbers roughly double again. Starting from 0.5 TB of raw FASTQ you will end up with about 1.5 TB of final data, and you will need about 3 TB of temporary space during processing. On a single node with 16 CPUs and 128 GB of RAM the whole thing takes about a week. But good point, I will have to make a comprehensive benchmark.

I really like what you are trying to do here. Congratulations! Tell us more about Nextflow. Saw it a while ago but never had time to check it out. Also, any plans to put the workflow in the cloud?

Thanks. Regarding Nextflow, we chose this DSL mostly because there is a solid user base already. One thing I like is that if you know Java and/or Groovy, it is easier to debug. There are many things I do not like, but that is true for all programming languages :) The basic concept in Nextflow is the "channel": you create, feed, join, and fork channels that are usually UNIX pipes. Its Slack channel is useful whether you are new or advanced - most of the answers are from Paolo, the main developer behind Nextflow.

I'm working on a GATK-based pipeline for germline variant calling written in WDL, and I've implemented scatter-gather parallelization for BaseRecalibrator, PrintReads and HaplotypeCaller. It improves the execution times by up to 7x for me. PrintReads went from 18.5 hours with -nct 16 to 2.5 hours with single-threaded scatter-gather on an 18-core, 256 GB RAM Docker machine. There's built-in support for it in Nextflow; check out points 8 and 9 here: https://github.com/nextflow-io/examples/blob/master/README.md
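For readers unfamiliar with the idea, the same scatter-gather pattern can be sketched at the shell level with GATK4's interval tools (scatter count, file names and resource settings are placeholders; in practice WDL/Cromwell or Nextflow would manage the parallel jobs):

    # Scatter: split the reference into 16 interval lists
    mkdir -p intervals
    gatk SplitIntervals -R ref.fasta --scatter-count 16 -O intervals/

    # Run the caller on each chunk in parallel (each background job needs its own memory)
    for iv in intervals/*.interval_list; do
        gatk HaplotypeCaller -R ref.fasta -I sample.recal.bam -L "$iv" \
            -O "calls_$(basename "$iv" .interval_list).vcf.gz" &
    done
    wait

    # Gather: merge the per-interval VCFs back into a single callset
    gatk MergeVcfs $(printf -- '-I %s ' calls_*.vcf.gz) -O sample.vcf.gz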

EDIT: I have some questions for you.

I'm not familiar enough with any of the callers except for HaplotypeCaller, and HaplotypeCaller makes IndelRealigner and RealignerTargetCreator unnecessary since it has taken over that function. But perhaps the other callers that you're using depend on them? And if you need IndelRealigner, you can parallelize it with scatter gather parallelization. I don't know how much time you'd save since I've never had to do it myself.

What's the pipeline for though? You're using several different callers, so is it for both germline and somatic mutations? It seems a bit overkill to always look for somatic and germline mutations, but I'm very new to the biology side of all of this so I don't know what a doctor or researcher is looking for. We have two separate pipelines for somatic and germline variant calling.


Flagship DSP software products and services

The DSP develops software products and operates services that are widely used across the biomedical ecosystem, such as:

Terra: an open cloud-based platform for accessing data, performing analyses and collaborating securely in the cloud, developed in collaboration with Microsoft and Verily Life Sciences.

GATK: the leading open-source variant discovery package for analysis of high-throughput sequencing data.

Picard: a popular set of open-source command line tools for processing high-throughput sequencing data.

Cromwell: An execution engine that allows users to run reproducible workflows written in either the Workflow Description Language (WDL, pronounced widdle) or the Common Workflow Language (CWL), portable across local machines, computer clusters, and cloud platforms (e.g., AWS, Microsoft Azure, Google Cloud Platform)

The Data Donation Platform (DDP): A software stack that enables direct participant engagement, including consent and recontact, via intuitive web and mobile interfaces. DDP provides the underlying infrastructure for disease-specific registries such as the Angiosarcoma Project, the Rare Genomes Project, and the Global A-T Family Data Platform.

The Data Use Oversight System (DUOS): A suite of interfaces for managing interactions between data access committees and researchers seeking to access sensitive genomic datasets.


Data analytics

Make genomics data actionable by analyzing and interpreting data generated by modern genomics technologies using open-source software, big-data analytics, and machine learning services on Azure.

Genomics Notebooks

Genomics Notebooks brings the power of Jupyter Notebooks on Azure for genomics data analysis using GATK, Picard, Bioconductor, and Python libraries.

Bioconductor on Azure

Bioconductor provides hundreds of R based bioinformatics tools for the analysis and comprehension of high-throughput genomic data.

Genomics Data Science

Azure Virtual Machine templates provide preinstalled and preconfigured tools, libraries, and SDKs for data exploration, analysis, and modeling.


Data and resource management for workflow-enabled biology

Advancements in sequencing technologies have greatly increased the volume of data available for biological query [58] . Workflow systems, by virtue of automating many of the time-intensive project management steps traditionally required for data-intensive biology, can increase our capacity for data analysis. However, conducting biological analyses at this scale requires a coordinated approach to data and computational resource management. Below, we provide recommendations for data acquisition, management, and quality control that have become especially important as the volume of data has increased. Finally, we discuss securing and managing appropriate computational resources for the scale of your project.

Managing large-scale datasets

Experimental design, finding or generating data, and quality control are quintessential parts of data intensive biology. There is no substitute for taking the time to properly design your analysis, identify appropriate data, and conduct sanity checks on your files. While these tasks are not automatable, many tools and databases can aid in these processes.

Look for appropriate publicly-available data

With vast amounts of sequencing data already available in public repositories, it is often possible to begin investigating your research question by seeking out publicly available data. In some cases, these data will be sufficient to conduct your entire analysis. In other cases, particularly for biologists conducting novel experiments, these data can inform decisions about sequencing type, depth, and replication, and can help uncover potential pitfalls before they cost valuable time and resources.

Most journals now require data for all manuscripts to be made accessible, either at publication or after a short moratorium. Further, the FAIR (findable, accessible, interoperable, reusable) data movement has improved the data sharing ecosystem for data-intensive biology [59,60,61,62,63,64,65]. You can find relevant sequencing data either by starting from the “data accessibility” sections of papers relevant to your research or by directly searching for your organism, environment, or treatment of choice in public data portals and repositories. The International Nucleotide Sequence Database Collaboration (INSDC), which includes the Sequence Read Archive (SRA), the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ), is the largest repository for raw sequencing data, but no longer accepts sequencing data from large consortia projects [66]. These data are instead hosted in consortia-specific databases, which may require some domain-specific knowledge for identifying relevant datasets and have unique download and authentication protocols. For example, raw data from the Tara Oceans expedition is hosted by the Tara Ocean Foundation [67]. Additional curated databases focus on processed data instead, such as gene expression in the Gene Expression Omnibus (GEO) [68]. Organism-specific databases such as WormBase (Caenorhabditis elegans) specialize in curating and integrating sequencing and other data associated with a model organism [69]. Finally, rather than focusing on certain data types or organisms, some repositories are designed to hold any data and metadata associated with a specific project or manuscript (e.g. Open Science Framework, Dryad, Zenodo [70]).
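For example, a minimal download of public raw reads from the SRA with sra-tools might look like the following (the accession is a placeholder):

    # Fetch an SRA run and convert it to compressed FASTQ
    prefetch SRR0000000
    mkdir -p fastq
    fasterq-dump --split-files -O fastq/ SRR0000000
    gzip fastq/SRR0000000_*.fastq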

Consider analysis when generating your own data

If generating your own data, proper experimental design and planning are essential. For cost-intensive sequencing data, there are a range of decisions about experimental design and sequencing (including sequencing type, sequencing depth per sample, and biological replication) that impact your ability to properly address your research question. Conducting discussions with experienced bioinformaticians and statisticians, prior to beginning your experiments if possible, is the best way to ensure you will have sufficient statistical power to detect effects. These considerations will be different for different types of sequence analysis. To aid in early project planning, we have curated a series of domain-specific references that may be useful as you go about designing your experiment (see Table 2). Given the resources invested in collecting samples for sequencing, it’s important to build in a buffer to preserve your experimental design in the face of unexpected laboratory or technical issues. Once generated, it is always a good idea to have multiple independent backups of raw sequencing data, as it typically cannot be easily regenerated if lost to computer failure or other unforeseeable events.

Table 2: References for experimental design and considerations for common sequencing chemistries.
Sequencing type | Resources
RNA-sequencing | [32,71,72]
Metagenomic sequencing | [33,73,74]
Amplicon sequencing | [75,76,77]
Microbial isolate sequencing | [78]
Eukaryotic genome sequencing | [79,80,81,82]
Whole-genome resequencing | [83]
RAD-sequencing | [84,85,86,87,88]
Single-cell RNA-sequencing | [89,90]

As your experiment progresses, keep track of as much information as possible: dates and times of sample collection, storage, and extraction; sample names; aberrations that occurred during collection; kit lot used for extraction; and any other sample and sequencing measurements you might be able to obtain (temperature, location, metabolite concentration, name of collector, well number, plate number, machine your data was sequenced on, etc.). This metadata allows you to keep track of your samples and to control for batch effects that may arise from unintended batching during sampling or experimental procedures, and it makes the data you collect reusable for future applications and analysis by yourself and others. Wherever possible, follow the standard guidelines for formatting metadata for scientific computing to limit downstream processing and simplify analyses requiring these metadata (see: [10]). We have focused here on sequencing data; for data management over long-term ecological studies, we recommend [91].

Getting started with sequencing data

Protect valuable data

Aside from the code itself, raw data are the most important files associated with a workflow, as they cannot be regenerated if accidentally altered or deleted. Keeping a read-only copy of raw data alongside a workflow, as well as multiple backups, protects your data from accidents and computer failure. This also removes the imperative of storing intermediate files, as these can be easily regenerated by the workflow.

When sharing or storing files and results, data version control can keep track of differences in files such as changes from tool parameters or versions. The version control tools discussed in the Workflow-based project management section are primarily designed to handle small files, but GitHub provides support for Git Large File Storage (LFS), and repositories such as the Open Science Framework (OSF), Figshare, Zenodo, and Dryad can be used for storing larger files and datasets [49,70,92,93,94] .

In addition to providing version control for projects and datasets, these tools also facilitate sharing and attribution by enabling generation of digital object identifiers (DOIs) for datasets, figures, presentations, code, and preprints. As free tools often limit the size of files that can be stored, a number of cloud backup and storage services are also available for purchase or via university contract, including Google Drive, Box, Dropbox, Amazon Web Services, and Backblaze. Full computer backups can be conducted to these storage locations with tools like rclone [95].
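As a small illustration (the remote name and paths are placeholders and assume an already-configured rclone remote), protecting and backing up raw data from the command line could look like:

    # Make the local copy of the raw data read-only so it cannot be altered by accident
    chmod -R a-w raw_data/

    # Copy it to a configured cloud remote, using checksums where the backend supports them
    rclone copy raw_data/ backup-remote:project/raw_data --checksum --progress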

Ensure data integrity during transfers

If you’re working with publicly-available data, you may be able to work on a compute system where the data are already available, circumventing time and effort required for downloading and moving the data. Databases such as the Sequence Read Archive (SRA) are now available on commercial cloud computing systems, and open source projects such as Galaxy enable working with SRA sequence files directly from a web browser [12,96] . Ongoing projects such as the NIH Common Fund Data Ecosystem aim to develop a data portal to make NIH Common Fund data, including biomedical sequencing data, more findable, accessible, interoperable, and reusable (FAIR).

In most cases, you’ll still need to transfer some data - either downloading raw data or transferring important intermediate and results files for backup and sharing (or both). Transferring compressed files (gzip, bzip2, BAM/CRAM, etc.) can improve transfer speed and save space, and checksums can be used to ensure file integrity after transfer (see Figure 8).
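A minimal checksum round-trip (file names are placeholders) looks like this:

    # Before transfer: record a checksum for every FASTQ file
    md5sum *.fastq.gz > fastq_checksums.md5

    # After transfer, on the receiving system: verify the files against the recorded checksums
    md5sum -c fastq_checksums.md5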

Perform quality control at every step

The quality of your input data has a major impact on the quality of the output results, no matter whether your workflow analyzes six samples or six hundred. Assessing data at every analysis step can reveal problems and errors early, before they waste valuable time and resources. Using quality control tools that provide metrics and visualizations can help you assess your datasets, particularly as the size of your input data scales up. However, data from different species or sequencing types can produce anomalous quality control results. You are ultimately the single most effective quality control tool that you have, so it is important to critically assess each metric to determine those that are relevant for your particular data.

Look at your files. Quality control can be as simple as looking at the first few and last few lines of input and output data files, or checking the size of those files (see Table 3). To develop an intuition for what proper inputs and outputs look like for a given tool, it is often helpful to first run the test example or data that is packaged with the software. Comparing these input and output file formats to your own data can help identify and address inconsistencies.

Table 3: Some commands to quickly explore the contents of a file. These commands can be used on Unix and Linux operating systems to detect common formatting problems or other abnormalities.
command | function | example
ls -lh | list files with information in a human-readable format | ls -lh *fastq.gz
head | print the first 10 lines of a file to standard out | head samples.csv
tail | print the last 10 lines of a file to standard out | tail samples.csv
less | show the contents of a file in a scrollable screen | less samples.csv
zless | show the contents of a gzipped file in a scrollable screen | zless sample1.fastq.gz
wc -l | count the number of lines in a file | wc -l ecoli.fasta
cat | print a file to standard out | cat samples.csv
grep | find matching text and print the line to standard out | grep ">" ecoli.fasta
cut | cut columns from a table | cut -d"," -f1 samples.csv

Visualize your data. Visualization is another powerful way to pick out unusual or unexpected patterns. Although large abnormalities may be clear from looking at files, others may be small and difficult to find. Visualizing raw sequencing data with FastQC (Figure 9A) and processed sequencing data with tools like the Integrative Genomics Viewer, and plotting tabular results files using Python or R, can make aberrant or inconsistent results easier to track down [98,99].

Pay attention to warnings and log files. Many tools generate log files or messages while running. These files contain information about the quantity, quality, and results from the run, or error messages about why a run failed. Inspecting these files can be helpful to make sure tools ran properly and consistently, or to debug failed runs. Parsing and visualizing log files with a tool like MultiQC can improve interpretability of program-specific log files (Figure 9) [101].
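For example, per-sample FastQC reports and an aggregated MultiQC summary can be generated with two commands (output directory names are placeholders):

    # Per-sample quality reports for all gzipped FASTQ files
    mkdir -p fastqc_reports
    fastqc -o fastqc_reports/ *.fastq.gz

    # Aggregate FastQC (and other tool) outputs into one summary report
    multiqc -o multiqc_report/ .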

Look for common biases in sequencing data. Biases in sequencing data originate from experimental design, methodology, sequencing chemistry, or workflows, and are helpful to target specifically with quality control measures. The exact biases in a specific data set or workflow will vary greatly between experiments, so it is important to understand the sequencing method you have chosen and incorporate appropriate filtration steps into your workflow. For example, PCR duplicates can cause problems in libraries that underwent an amplification step, and often need to be removed prior to downstream analysis [102,103,104,105,106].

Check for contamination. Contamination can arise during sample collection, nucleotide extraction, library preparation, or through sequencing spike-ins like PhiX, and could change data interpretation if not removed [107,108,109]. Libraries sequenced with high concentrations of free adapters or with low concentration samples may have increased barcode hopping, leading to contamination between samples [110].

Consider the costs and benefits of stringent quality control for your data. Good quality data is essential for good downstream analysis. However, stringent quality control can sometimes do more harm than good. For example, depending on sequencing depth, stringent quality trimming of RNA-sequencing data may reduce isoform discovery [111]. To determine what issues are most likely to plague your specific data set, it can be helpful to find recent publications using a similar experimental design, or to speak with experts at a sequencing core.

Because sequencing data and applications are so diverse, there is no one-size-fits-all solution for quality control. It is important to think critically about the patterns you expect to see given your data and your biological problem, and consult with technical experts whenever possible.

Securing and managing appropriate computational resources

Sequence analysis requires access to computing systems with adequate storage and analysis power for your data. For some smaller-scale datasets, local desktop or even laptop systems can be sufficient, especially if using tools that implement data-reduction strategies such as minhashing [112]. However, larger projects require additional computing power, or may be restricted to certain operating systems (e.g., Linux). For these projects, solutions range from research-focused high-performance computing systems to research-integrated commercial analysis platforms. Both research-only and commercial clusters provide avenues for research and educational proposals to enable access to their computing resources (see Table 4). In preparing for data analysis, be sure to allocate sufficient computational resources and funding for storage and analysis, including large intermediate files and resources required for personnel training. Note that workflow systems can greatly facilitate faithful execution of your analysis across the range of computational resources available to you, including distribution across cloud computing systems.

Table 4: Computing resources. Bioinformatic projects often require additional computing resources. If a local or university-run high-performance computing cluster is not available, computing resources are available via a number of grant-based or commercial providers.
Provider | Access model | Restrictions
Amazon Web Services | Paid |
Bionimbus Protected Data Cloud | Research allocation | users with eRA Commons account
CyVerse Atmosphere | Free with limits | storage and compute hours
EGI Federated Cloud | Access by contact | European partner countries
Galaxy | Free with storage limits | data storage limits
Google Cloud Platform | Paid |
Google Colab | Free | computational notebooks, no resource guarantees
Microsoft Azure | Paid |
NSF XSEDE | Research allocation | USA researchers or collaborators
Open Science Data Cloud | Research allocation |
Wasabi | Paid | data storage solution only

Getting started with resource management

As the scale of data increases, the resources required for analysis can balloon. Bioinformatic workflows can be long-running, require high-memory systems, or involve intensive file manipulation. Some of the strategies below may help you manage computational resources for your project.

Apply for research units if eligible. There are a number of cloud computing services that offer grants providing computing resources to data-intensive researchers (Table 4). In some cases, the resources provided may be sufficient to cover your entire analysis.

Develop on a local computer when possible. Since workflows transfer easily across systems, it can be useful to develop individual analysis steps on a local laptop. If the analysis tool will run on your local system, test the step with subsampled data, such as that created in the Getting started developing workflows section. Once working, the new workflow component can be run at scale on a larger computing system. Resource-usage reporting from the workflow system can help determine the resources needed to execute the workflow on larger systems. For researchers without access to free or granted computing resources, this strategy can save significant cost.
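One way to make such a subsampled test dataset (read counts, seed and file names are placeholders) is with seqtk, or with plain shell tools:

    # Randomly subsample 10,000 read pairs; the fixed seed (-s100) keeps R1/R2 in sync
    seqtk sample -s100 sample_R1.fastq.gz 10000 | gzip > test_R1.fastq.gz
    seqtk sample -s100 sample_R2.fastq.gz 10000 | gzip > test_R2.fastq.gz

    # Or simply take the first 10,000 records (4 lines per FASTQ record)
    zcat sample_R1.fastq.gz | head -n 40000 | gzip > test_head_R1.fastq.gz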

Gain quick insights using sketching algorithms. Understanding the basic structure of data, the relationship between samples, and the approximate composition of each sample can be very helpful at the beginning of data analysis, and can often drive analysis decisions in different directions than those originally intended. Although most bioinformatics workflows generate these types of insights, there are a few tools that do so rapidly, allowing the user to generate quick hypotheses that can be further tested by more extensive, fine-grained analyses. Sketching algorithms work with compressed approximate representations of sequencing data and thereby reduce runtimes and computational resources. These approximate representations retain enough information about the original sequence to recapitulate the main findings from many exact but computationally intensive workflows. Most sketching algorithms estimate sequence similarity in some way, allowing you to gain insights from these comparisons. For example, sketching algorithms can be used to estimate all-by-all sample similarity, which can be visualized as a Principal Component Analysis or a multidimensional scaling plot, or can be used to build a phylogenetic tree with accurate topology. Sketching algorithms also dramatically reduce the runtime for comparisons against databases (e.g. all of GenBank), allowing users to quickly compare their data against large public databases.
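As one concrete example of this approach (file names are placeholders), Mash builds MinHash sketches of read sets and estimates all-by-all distances in seconds:

    # Sketch each sample's reads (-r: inputs are reads rather than assemblies)
    mash sketch -r -o samples sample1.fastq.gz sample2.fastq.gz sample3.fastq.gz

    # All-vs-all distance estimates between the sketched samples
    mash dist samples.msh samples.msh > pairwise_distances.tsv

The resulting distance table can then feed the ordination or clustering plots described above.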

Rowe 2019 [113] reviewed programs and genomic use cases for sketching algorithms, and provided a series of tutorial workbooks (e.g. Sample QC notebook: [114]).

Use the right tools for your question. RNA-seq analysis approaches like differential expression or transcript clustering rely on transcript or gene counts. Many tools can be used to generate these counts by quantifying the number of reads that overlap with each transcript or gene. For example, tools like STAR and HISAT2 produce alignments that can be post-processed to generate per-transcript read counts [115,116]. However, these tools generate information-rich output, specifying per-base alignments for each read. If you are only interested in read quantification, quasi-mapping tools provide the desired results while reducing the time and resources needed to generate and store read count information [117,118].
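For instance, transcript-level quantification with a quasi-mapping tool such as salmon (index name, thread count and file names are placeholders) avoids producing and storing per-base alignments:

    # Build a transcriptome index once, then quantify each sample directly from FASTQ
    salmon index -t transcripts.fa -i salmon_index
    salmon quant -i salmon_index -l A \
        -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
        -p 8 -o quant/sample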

Seek help when you need it. In some cases, you may find that your accessible computing system is ill-equipped to handle the type or scope of your analysis. Depending on the system, staff members may be able to help direct you to properly scale your workflow to available resources, or guide you in tailoring computational unit allocations or purchases to match your needs.


Variant Discovery with GATK4

This workshop will focus on the core steps involved in calling germline short variants, somatic short variants, and copy number alterations with the Broad’s Genome Analysis Toolkit (GATK), using “Best Practices” developed by the GATK methods development team. A team of methods developers and instructors from the Data Sciences Platform at Broad will give talks explaining the rationale, theory, and real-world applications of the GATK Best Practices. You will learn why each step is essential to the variant-calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset. If you are an experienced GATK user, you will gain a deeper understanding of how the GATK works under-the-hood and how to improve your results further, especially with respect to the latest innovations.

The hands-on tutorials for learning GATK tools and commands will be on Terra, a new platform developed at Broad in collaboration with Verily Life Sciences for accessing data, running analysis tools and collaborating securely and seamlessly. (If you’ve heard of or been a user of FireCloud, think of Terra as the new and improved user interface for FireCloud that makes doing research easier than before!)

  • Day 1: Introductory topics and hands-on tutorials. We will start off with introductory lectures on sequencing data, preprocessing, variant discovery, and pipelining. Then you will get hands-on with a recreation of a real variant discovery analysis in Terra.
  • Day 2: Germline short variant discovery. Through a combination of lectures and hands-on tutorials, you will learn: germline single nucleotide variants and indels, joint calling, variant filtering, genotype refinement, and callset evaluation.
  • Day 3: Somatic variant discovery. In a format similar to Day 2, you will learn: somatic single nucleotide variants and indels, Mutect2, and somatic copy number alterations.
  • Day 4: Pipelining and performing your analysis end-to-end in Terra. On the final day, you will learn how to write your own pipelining scripts in the Workflow Description Language (WDL) and execute them with the Cromwell workflow management system. You will also be introduced to additional tools that help you do your analysis end-to-end in Terra.

Please note that this workshop is focused on human data analysis. The majority of the materials presented does apply equally to non-human data, and we will address some questions regarding adaptations that are needed for analysis of non-human data, but we will not go into much detail on those points.


Requirements on bioinformatics solutions for clinical oncology

High-throughput NGS allows for time- and cost-effective molecular probing of tumors. However, the resulting sequencing data is challenging to analyze because of its large size and various confounding sources of variation, most notably amplification and sequencing errors. Careful analysis of NGS data is particularly important in the context of MTBs, where treatment suggestions based on mutation calls may have dramatic effects, ranging from recovery to death of a patient. Therefore, strict standards with respect to several aspects described below need to be followed.

First and foremost, experimental noise needs to be distinguished from true biological signals. Treatment decisions have to be based only on validated, real biological alterations and should not be misled by technical artifacts. Toward this end, appropriate computational data analysis pipelines have to be used that cover the entire process from primary analysis of the read data to clinical reporting. To understand the limitations of an implemented pipeline, it needs to be evaluated under defined conditions reflecting realistic use case conditions [20, 21]. Pipelines need to be robust with respect to new sequencing data that may differ in some aspects from previously analyzed samples. In addition, mutation calls should be reported with a confidence estimate. Although some mutation callers report, for example, P-values or posterior probabilities, it remains a major challenge to provide a meaningful notion of confidence for the results of an entire pipeline. This is particularly important, as the overlap of different approaches is often limited, as mentioned in [22].

The results produced by a bioinformatics pipeline have to be reproducible. This requirement entails several technical prerequisites discussed below and includes controlling random seeds for all steps that involve randomization. Another important aspect of reproducibility is a rigorous documentation of each step of the pipeline, including complete documentation of the used tools, their version and parameter settings. This also holds for databases and ensures complete transparency [20]. For instance, in the past, most genomic studies have used as a reference genome GRCh37 from the Genome Reference Consortium or its equivalent from the University of California Santa Cruz, version hg19. Even though there are only minor differences in their genetic information, the naming scheme is different, which can lead to confusion. Moreover, the new human genome assembly GRCh38 not only updated the main chromosomes, and therefore changed their coordinates, but also included new contigs to represent population haplotypes, further complicating reproducibility. Therefore, it is necessary that for each file used in the pipeline, its generation and dependencies are clearly described. Such a setup also guarantees the traceability of all results. For example, it should be possible to trace back the call of a treatment-critical mutation, to assess the call manually and to validate it before recommending the treatment. In addition, genomic alterations in the patient which are not directly linked to cancer, known as incidental variants, may be discovered. As these variants may be reported in various ways with potential ethical implications, a clear strategy needs to be defined, for example, reporting all relevant incidental findings [26].
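As a small illustration of the kind of run documentation this implies (the tool set, reference path and output name are placeholders, not a prescribed standard), a pipeline can write tool versions and reference checksums alongside every analysis:

    # Record when the pipeline ran, which tool versions were used, and which reference build
    {
        echo "run date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        gatk --version
        samtools --version | head -n 1
        echo "reference: GRCh38 ($(md5sum reference/GRCh38.fasta | cut -d' ' -f1))"
    } > provenance_run001.txt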

In addition to these requirements on stability, robustness, reproducibility and traceability of the computational pipeline, the size, sensitivity and complexity of comprehensive clinical data sets combined with the urgency caused by the often critical state of the respective patient result in a set of challenging technical prerequisites for the computational infrastructure and the implemented data analysis software of an MTB.


Snapshots of the code can be found in the GigaScience repository, GigaDB [21].

The authors would like to thank Shadrielle Melijah G. Espiritu and Andre Masella for their feedback on the manuscript/software. This project has been supported by funding from Genome Canada/Genome British Columbia (grant No. 173CIC), the Natural Science and Engineering Research Council of Canada (grant No. RGPGR 488167-2013), and Terry Fox Research Institute - Program Project Grants (grant No. 1021).

