Flowchart of the DRAW workflow

A complete DRAW workflow run consists of three phases as outlined in the following flowchart.

Directory and file structure for DRAW execution

The DRAW pipeline was designed to process the sequence reads of multiple samples sequenced in one flow cell at a time. In this section, we describe the directory and file structure assumed in the program.

Creating sample-level directory and file structure

Although DRAW processes the reads of all samples from one flow cell together, most DRAW commands operate at the per-sample level. Each sample therefore has its own directory and individual sub-directories as shown above and, most importantly, a tailored bash script file (.sh) that contains all the commands needed to complete the analysis. Two steps create this structure: 1) making a configuration file and 2) executing it by running "draw.sh".

  1. Making a configuration file

    Use your favorite editor to make a configuration file and save it in the working/flowcell directory, e.g. "flowcell_1234". You can name the file "flowcell_1234.cfg". This file contains information about the samples sequenced and several attributes they share, such as capture library, reference genome, and research project. Therefore, if samples from different research projects are sequenced on the same flow cell, you should separate the analysis by creating one configuration file per project. Most of the sample information, such as sample name and barcode, can be found in the manifest file, which can be acquired from the sequencing facility. A template file is provided: flowcell_1234.cfg.
    NOTE: The ALLCAPS names are unix/linux environment variables that must be in the file. The following are some of the key variables.

  2. Executing the configuration file

    Once you have prepared the configuration file (e.g. flowcell_1234.cfg), follow these steps to execute it.

    1. Change directory to the working/flowcell directory (e.g. flowcell_1234).
    2. Locate the DRAW package's draw.ini file, e.g. path/to/DRAWnSneakPeek/draw/draw.ini, and "source" it. If you cannot find it, consult the person who installed DRAW/SneakPeek.
    3. To check whether sourcing draw.ini worked, try "echo $DRAW_HOME". It should display the location where the DRAW package resides.
    4. Using SneakPeek is optional for running DRAW. If you use it, please make sure the MySQL connection is specified in draw.ini and works correctly.
    5. Make sure there is a stdout directory in your home directory.
    6. Run draw.sh with a configuration file (-f flowcell_1234.cfg) and, if you use SneakPeek, with -m ("m" stands for MySQL).
    7. The commands described so far are as follows.

      cd path/to/flowcell_1234
      source path/to/DRAWnSneakPeek/draw/draw.ini
      echo $DRAW_HOME
      draw.sh -f flowcell_1234.cfg [-m] [-h DRAW_HOME_DIR] [-p1] [-p2] [-p3]

      • -f flowcell_1234.cfg - specifies the location of the configuration file for a flow cell.
      • -m - flag to begin MySQL import of project, flow cell, and sample information into SneakPeek's MySQL tables.
        If you use DRAW without SneakPeek, leave out this option.
      • -h DRAW_HOME_DIR - optionally specifies the location of the DRAW home directory. Otherwise draw.sh uses the $DRAW_HOME environment variable set in draw.ini.
      • -p1 -p2 -p3 - optionally submit all tasks from each phase for all samples specified in the configuration file.

      This creates the sample-level directory and file structure. Each sample's cmd directory then contains a launching script named after the sample, e.g. sample1.sh. The next several sections describe the different ways of using this launching script.
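The prerequisites listed in the steps above can be checked before draw.sh is run. The following is a minimal sketch, not part of the DRAW package: the function name draw_preflight is hypothetical, and it only tests the three conditions named in this section (DRAW_HOME is set, a stdout directory exists in the home directory, and the configuration file exists).

```shell
# Pre-flight check before running draw.sh (illustrative helper, not part of DRAW).
# Usage: draw_preflight path/to/flowcell_1234.cfg
draw_preflight() {
    cfg=$1
    rc=0
    # Sourcing draw.ini is expected to set DRAW_HOME.
    if [ -z "${DRAW_HOME:-}" ]; then
        echo "DRAW_HOME is not set; source draw.ini first" >&2
        rc=1
    fi
    # DRAW expects a stdout directory in the user's home directory.
    if [ ! -d "$HOME/stdout" ]; then
        echo "missing directory: $HOME/stdout" >&2
        rc=1
    fi
    # The configuration file passed to -f must exist.
    if [ ! -f "$cfg" ]; then
        echo "configuration file not found: $cfg" >&2
        rc=1
    fi
    return "$rc"
}
```

A typical use would be "draw_preflight flowcell_1234.cfg && draw.sh -f flowcell_1234.cfg", so that draw.sh only starts when all three checks pass.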

Pipeline execution: the phases mode

After running draw.sh, each sample has its own directory and individual sub-directories. The cmd directory contains a bash script file, e.g. sample1.sh. The entire DRAW pipeline can be divided into four phases, which should be run sequentially for every sample. More details on the four phases, such as the steps (or "tasks", as we call them) that constitute each phase and the intermediate outputs, can be found in the section All phases and tasks of DRAW explained.
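When a flow cell holds many samples, it can help to list every sample's launching script first. The sketch below assumes the layout Sample_<name>/cmd/<name>.sh, inferred from the Sample_sample1/cmd/sample1.sh example in this document; the function name list_launchers is hypothetical.

```shell
# Print the launching script of every sample under a flow cell directory,
# assuming the layout Sample_<name>/cmd/<name>.sh (an inference from the
# Sample_sample1/cmd/sample1.sh example, not a documented guarantee).
list_launchers() {
    flowcell_dir=$1
    for cmd_dir in "$flowcell_dir"/Sample_*/cmd; do
        [ -d "$cmd_dir" ] || continue        # glob matched nothing
        sample_dir=${cmd_dir%/cmd}           # .../Sample_sample1
        sample=${sample_dir##*/Sample_}      # sample1
        launcher=$cmd_dir/$sample.sh
        if [ -f "$launcher" ]; then
            echo "$launcher"
        fi
    done
    return 0
}
```

Each printed path can then be run from its cmd directory with the phase or task options described in the following sections.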

Debugging option: the debugging mode

Running selected portions of the pipeline: the task mode

All phases and tasks of DRAW explained

Phase 1: Read Mapping

Phase 1 takes the unmapped reads received and aligns them to the reference genome.
  1. Aligning single reads to the reference genome.
    cd Sample_sample1/cmd
    ./sample1.sh -t bwaAln

    The output files are stored in the sai directory.

  2. Combining mate pairs.
    ./sample1.sh -t bwaSamp

    The output files are stored in the sam directory.

  3. Adding readgroup information to all reads.
    ./sample1.sh -t addReadGroup

    The output files are stored in the bam directory.

  4. Merging multiple alignment files (*.BAM) from the same sample into one alignment file.
    ./sample1.sh -t mergeBam

    The output file is stored in the cmd directory: *_merged.bam
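The four Phase 1 tasks depend on each other's output, so they need to run in the order listed. The sketch below chains tasks through the launching script and stops at the first failure. It assumes that "-t" runs the task in the foreground and returns a non-zero exit status on failure; neither is stated in this document, and if the launcher only submits jobs to a scheduler the exit status says nothing about task completion. The function name run_tasks is hypothetical.

```shell
# Run DRAW tasks one after another through a sample's launching script,
# stopping at the first task that exits with a non-zero status.
# Assumption: "-t TASK" blocks until the task finishes and reports
# failure through its exit status.
# Usage: run_tasks ./sample1.sh bwaAln bwaSamp addReadGroup mergeBam
run_tasks() {
    launcher=$1
    shift
    for task in "$@"; do
        echo "running task: $task"
        if ! "$launcher" -t "$task"; then
            echo "task failed: $task" >&2
            return 1
        fi
    done
    return 0
}
```

For Phase 1, this would be called from Sample_sample1/cmd as "run_tasks ./sample1.sh bwaAln bwaSamp addReadGroup mergeBam".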

Phase 2: Quality Control

Phase 2 takes the aligned reads and further processes them. First, it marks the duplicate reads that may be due to PCR artifacts. Then, it realigns reads in problematic regions (i.e. around known indels) to obtain a better alignment. Finally, it recalibrates the base call quality scores after alignment.
  1. Marking duplicate reads (see Picard for the definition of a duplicate)
    cd Sample_sample1/cmd
    ./sample1.sh -t MarkDuplicates

    The output file is stored in the cmd directory: s_sample1_markdup.bam

  2. Counting the number of reads in the sample
    ./sample1.sh -t "countRead s_${LINE}_markdup"

    Replace "${LINE}" with the sample name.
    The output files are stored in the cmd/stat directory: *.readct

  3. Local realignment at known indel regions (see GATK)
    ./sample1.sh -t localRealignIndel

    The output file is stored in the cmd directory: *_markdup.indel_realign.bam

  4. Base quality score recalibration (see GATK)
    ./sample1.sh -t recal

    The output file is stored in the cmd directory: *_recal.FULL.bam
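The countRead task in step 2 takes a file prefix that embeds the sample name. As a small illustration of the substitution, the sketch below builds that argument from a sample name; the function name count_read_task is hypothetical and only reproduces the "countRead s_${LINE}_markdup" pattern given above.

```shell
# Build the countRead task argument for a given sample name, i.e. the
# string this document writes as "countRead s_${LINE}_markdup".
count_read_task() {
    sample=$1
    printf 'countRead s_%s_markdup\n' "$sample"
}

# Example (from the cmd directory of sample1):
#   ./sample1.sh -t "$(count_read_task sample1)"
```

The double quotes around the command substitution matter: the task name and its prefix must reach the launcher as a single argument to -t.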

Phase 3: Variant Calling and Depth Coverage Statistics

Phase 3 calculates the depth of coverage for the sample at the target interval level and the genome level. This phase also calls the variants (SNVs and indels) and then annotates these variants (dbSNP ID, functional and gene annotation).
  1. Calculating the depth of coverage in the target region
    cd Sample_sample1/cmd
    ./sample1.sh -t target_coverages

    The output files are stored in the cmd/stat directory: *_recal_target.sample*

  2. Calculating the genomic depth of coverage
    ./sample1.sh -t genome_coverages

    The output files are stored in the cmd/stat directory: *_recal_genome.sample*

  3. Variant detection of SNVs and Indels
    ./sample1.sh -t snpcal

    The output file is stored in the cmd/vcf directory: *_recal.FULL_mbq_20_mmq_30_filtered_nomask.vcf

  4. Variant annotation
    ./sample1.sh -t snpAnnotation

    The output file is stored in the cmd/vcf directory: *_recal.FULL_mbq_20_mmq_30_filtered_nomask_snpeff.vcf

  5. Collecting quality metrics from various output files
    ./sample1.sh -t collectSPeekMetrics

    The output file is stored in the cmd directory: sample_speek.txt
    This file has information useful for assessing sequencing quality of the sample in this run, such as read count, duplicate rate, depth of coverage, etc.
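After Phase 3 it can be useful to confirm that the coverage and variant files listed above were actually produced. The following is a minimal sketch under the assumption that the stat and vcf directories sit directly inside the sample's cmd directory, as the paths in this section suggest; the function name check_phase3_outputs is hypothetical and the patterns are copied from the task descriptions.

```shell
# Report which expected Phase 3 outputs are missing from a sample's cmd
# directory. Prints one line per pattern with no matching file and
# returns non-zero if anything is missing.
check_phase3_outputs() {
    cmd_dir=$1
    missing=0
    for pattern in \
        'stat/*_recal_target.sample*' \
        'stat/*_recal_genome.sample*' \
        'vcf/*_recal.FULL_mbq_20_mmq_30_filtered_nomask.vcf' \
        'vcf/*_recal.FULL_mbq_20_mmq_30_filtered_nomask_snpeff.vcf'
    do
        # Leave $pattern unquoted so the shell expands the glob; if nothing
        # matches, the literal pattern remains and the -e test fails.
        set -- "$cmd_dir"/$pattern
        if [ ! -e "$1" ]; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    return "$missing"
}
```

Run from the flow cell directory, "check_phase3_outputs Sample_sample1/cmd" prints nothing when all four patterns match at least one file.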

Phase 5: Importing quality metrics into MySQL tables

Phase 5 connects to SneakPeek's MySQL database and imports the data in sample_speek.txt.
    cd Sample_sample1/cmd
    ./sample1.sh -p5