CONTRA: Copy Number Analysis for Targeted Resequencing
From contra-cnv
** 21-3-2016 updates (v2.0.8): New workflows - Null Distribution Estimation (NDE) and Whole-gene analysis (WGCNV) with plots
Contents[hide] |
Introduction
CONTRA is a tool for copy number variation (CNV) detection for
targeted resequencing data such as those from whole-exome capture data.
CONTRA calls copy number gains and losses for each target region with
key strategies include the use of base-level log-ratios to remove
GC-content bias, correction for an imbalanced library size effect on
log-ratios, and the estimation of log-ratio variations via binning and
interpolation. It takes standard alignment formats (BAM/SAM) and output
in variant call format (VCF 4.0) for easy integration with other next
generation sequencing analysis package.
21-3-2016: VCF has failed to become a standard format for Copy Number analysis output. So the tabular output format from contra.py is all you need. For versions prior to 2.0.8, ignore all warnings about reference ".fasta" file; it is required for VCF generation only. Newer versions do not require reference fasta any more. Use the largest tab-delimited output. Learn about our new workflow below regarding plots and gene-level calls.
For support, please contact jason dot li at petermac dot org and richard dot lupat at petermac dot org.
Citing CONTRA
CONTRA has been widely used and cited since publication. If you have used contra.py or baseline.py, please cite the CONTRA paper:
If you have used the WGCNV or NDE workflows, please cite the CONTRA paper and state explicitly the use of WGCNV and/or NDE workflows.
Download
- Source File
- User Guide
- Use this web page (most up-to-date)
- Test Files
- Baseline Files
- There are seven baseline files that we have generated from either our samples or HapMap. Baseline file for the HapMap is generated from the 6 of the CEU samples that uses NimbleGen SeqCap EZ Exome V2 (Exome Center = BGI).
Requirements
To run CONTRA, you need the following programs:
- Python 2.6+
- Most of the scripts for CONTRA are written in Python. It requires version 2.6 in order to use the multiprocessing module. Python Website
- R (now support all versions of R)
- BEDTools (Included in CONTRA package. See Installation Guide)
- The original source of the BEDtools can be found in BEDtools source
- SAMtools
- [Optional] DNACopy (R-library that will be used for predicting large CNV)
Installation Guide
Download CONTRA tarball and decompress it with the following command:
- tar –xvzf CONTRA.<version>.tar.gz
For users who DO NOT have BEDTools installed, follow these steps:
- cd CONTRA.<version>
- tar -zxvf BEDTools.tar.gz
- cd BEDTools
- make clean
- make all
- sudo cp bin/* /usr/local/bin/
Format of Input Files
The analysis itself requires several files as following:
- BAM or SAM files for the test and control samples.
- User can also provide a baseline file (in BED format) as the control sample. Please refer to “ Baseline” section for example how to use the baseline script.
- Target File in BED format (UCSC BED FORMAT FAQ)
- A tab-delimited file specifying the target/capture regions. Four columns must be specified: Chromosome, ChromStart, ChromEnd and Name. Name is typically gene symbols. It is used to map regions or exons to a gene, and is required by WGCNV and NDE workflows (v2.0.8+). No header line is expected.
- The ChromStart is 0 base (the first base in a chromosome is numbered 0). The ChromEnd base is not included in the display of the feature. For example, chromStart= 0 and chromEnd = 100 is implying the first 100 bases of a chromosome (0-99).
- Below is the example of first few lines of a target.BED file:
- 1 357512 357632 uc001aaw.1
- 1 357571 357691 uc001aaw.1
- 1 357633 357753 uc001aaw.1
- 1 357694 357814 uc001aaw.1
- 1 357756 357876 uc001aaw.1
- 1 357816 357936 uc001aaw.1
- (DEPRECATED; Fasta not required in v2.0.8+) Fasta file for reference genome (e.g. human_g1k_v37.fasta)
CONTRA Workflow
21-3-2016: New workflows added for CONTRA v2.0.8+
Whole Exome:
For Whole-Exome data (i.e. having data points consistenly throughout the genome), ADTEx works brilliantly. Use it instead. See http://adtex.sourceforge.net/. For custom targets, CONTRA is the right tool.Gene-level specific calls (WGCNV):
Noise is an intrinsic property of capture data, and high false positive rates at the exon-level is inevitable. Most capture is designed to target whole-genes consisting of multiple exons. We can take advantage of this fact to generate more specific calls at the gene-level (will miss exon scale event).- Step-1
- Generate CONTRA results using a bed file in which the fourth column contains the gene symbols. These gene symbols will determine how exons are grouped.
- Step-2 (for a single-case; without NDE)
- WGCNV-SINGLE (see new package in CONTRA v2.0.8+) will summarise all exon-level log-ratios generated by CONTRA, and perform a student t-test against a hardcoded threshold of +/-0.2 to compute statistical significance. Use the adjusted p.value and "n" (number of windows) to filter the results. If somatic variants are provided, B-Allele Frequencies will be plotted alongside the log-ratios for visual inspection. These new plots make CONTRA much more user-friendly. If the program complaints your variant file not having the right format, contact us.
- OR Step-2 (for multiple-cases; with NDE)
- You would have first followed the Null Distribution Estimation workflow to generate background estimations. WGCNV-NDE (see new package in CONTRA v2.0.8+) will then summarise all exon-level log-ratios generated by CONTRA, and compute a z-score against the null distribution to derive the statistical significance. Use the adjusted p.value and "n" (number of windows) to filter the results. If somatic variants are provided, B-Allele Frequencies will be plotted alongside the log-ratios for visual inspection. These new plots make CONTRA much more user-friendly. If the program complaints your variant file not having the right format, contact us.
Null Distribution Estimation workflow:
Capture efficiency varies across target regions. This means some regions are associated with a lower variation in their depth of coverage, are hence are more reliable for copy number calling. Estimating the variance of DOC specific to each target region is possible given a large number (N > 50) of samples derived from the same capture design. This workflow can be applied in both scenarios where a pooled baseline is used or a matched control is used.- Step-1
- Generate CONTRA results using a bed file in which the fourth column contains the gene symbols. These gene symbols will determine how exons are mapped to genes.
- Step-2
- The NullDistEstimation package (found in v2.0.8+) will perform the following: 1) re-normalise the DOC levels after excluding regions with very high amplification. This step is particularly useful for small captures (around 50 genes) as a high gain in one single gene can steal a significant amount of reads from the other target regions, affecting library size correction. 2) Estimate the null distribution of log-ratios from all given samples. This is done at both the exon-level and gene-level.
- Step-3
- Use WGCNV-NDE (see above) to call CNVs based on the estimated Null.
Standard CONTRA:
- Step-1
- CONTRA takes all the required files from the user [target, test BAM/SAM, control BAM/SAM/baseline file, fasta], and do all the pre-processing steps to ensure all the input files are compatible with the CONTRA script (such as files need to be in sorted order). If maxRegionSize is specified, large regions will be broken down into smaller regions depending on user specified parameters
- Step-2
- Short-read alignment information (BAM or SAM formats) from a test and a control sample is converted into read count per base pair for a list of target regions (BED format) using BEDTools. Target regions with too few reads in the control sample (by default, < 10 base pairs with read count > 10) are excluded from further analysis. The remaining read counts are then scaled based on the geometric mean of the total read counts of the two samples
- Step-3
- For each target region, a set of base-level log-ratios between test and control is calculated based on the scaled read counts, the mean of which is used to estimate the region’s log-ratio. Library size bias is then removed based on a linear relationship between log-ratio and log-coverage estimated from the data. The regions are then binned based on their similarity in log-coverage
- Step-4
- Significance is then computed for each region and is adjusted to reduce false discovery rates. Results are reported with other details in either tab-delimited or the VCF4.0 format (Variant Call Format; see www.1000genomes.org)
- Step-5
- If large deletion option is specified, circular binary segmentation will be performed on region log-ratios, using different parameters to achieve different resolutions of segmentation. Segmentation results from different resolutions are combined to make the final call
Output Format
VCF (Deprecated: Not the best output format for copy number results)
- VCF4.0 format (Use tabular file described below instead; v2.0.8+ does not support VCF any more)
- Note:
- QUAL = 10Log10 Adjusted p-value
- Other columns are described in the VCF file header.
- The details explanation for each column can be found in 1000Genomes VCF 4.0 Format
Tab Delimited
Three tab-delimited files will be generated:
- Full details of the analysis, excluding target regions that do not pass the minimum read depth & number of bases thresholds (See Step 2 in CONTRA workflow)
- Including only target regions that pass the p-value threshold
- If largeDeletion option is specified, a summary of large CNV prediction with significant copy number changes will be presented in the tab-delimited file
The full details analysis file contains:
target region ID, exon number, gene symbol, chromosome, original start
coordinate, original end coordinate, mean, standard deviation and median
of the log-ratios, number of bases included in the analysis for that
target region, p-value, adjusted p-value, gain/loss, average test
sample’s scaled read depth, average control sample’s scaled read depth,
average test sample’s original read depth, average control sample’s
original read depth, minimum and maximum log ratios on that target
regions and the bin number.
Plot
If --plot option is specified when running the code, plot(s) for the distribution of log ratios will be included in the output folder.
Examples
Example 1: Running CONTRA with default optional parameters
We assume we have a target file target_test.BED, two BAM files test_sample.BAM and control_sample.BAM and a reference file human_ref.fasta. Our intended output folder is in ~/ContraTest/sampleName/
To run the analysis on this sample, the command line argument is:
contra.py --target target_test.BED --test test_sample.BAM --control control_sample.BAM --fasta human_ref.fasta --outfolder ~/ContraTest/sampleName/
This will create a folder call sampleName inside ~/ContraTest/. Inside the folder, there will be two subfolders, plot and table. The table folder will contain the VCF file and the analysis’ details.
Note: CONTRA will always attempt to create the folder specified last (i.e. sampleName in this example). If the folder exists, there will be an error message when running the script. It is setup this way to avoid the data in the existing folder being overwrite. However, it will not create the parents folder (i.e. ContraTest in this example) with the assumption this folder has already existed. An attempt to put the result folders in a directory that has not been created will generate an error message.
Example 2: Running CONTRA with other optional parameters
We assume we have a target file target_test.BED, two BAM files test_sample.BAM and control_sample.BAM and a reference file human_ref.fasta. Our intended output folder is in ~/ContraTest/sampleName/
The options we want to change for this example:
- Number of bins to : 1,5,10,15 and 20 bins
- Minimum read depth : 5
- Minimum number of bases : 20
- SampleName : sample123
- Remove multi mapped reads
To run the analysis on this sample, the command line argument is:
contra.py --target target_test.BED --test test_sample.BAM --control control_sample.BAM --fasta human_ref.fasta --outfolder ~/ContraTest/sampleName/ --numBin 1,5,10,15,20 --minReadDepth 5 --minNBases 20 --sampleName sample123 --nomultimapped
Note1: The --nomultimapped option will use samtools to filter out alignment with mapping quality = 0.
Note2: With the --sampleName option, all the results’ filename will be appended with the sample name in front of the default filename.
Example 3: Common workflows
Targetting ~1000 cancer genes, ongoing experiments. A case of tumour with matched germline:
RUN:
contra.py -t target_test.BED -s test_sample.bam -c control.bam -p --minExon=100 --maxRegionSize=100 --targetRegionSize=75 --removeDups --sampleName SampName -o ContraOutputFolder
THEN:
WGCNV-SINGLE/wgcnv.py ContraOutputFolder human VariantFile WGCNV_OUTDIR
**The format of VariantFile is described in README.txt under WGCNV-SINGLE.
**Without matched germline, one can use a known diploid sample as the germline (e.g. NA12878). Pooled baseline can be used if no germline is available at all. And then run WGCNV-SINGLE/wgcnv_noBAF.py instead of wgcnv.py
--------------
Having collected 30 to 50 cases, run above contra.py command on all, and then:
RUN:
NullDistEstimation/nde_wrapper.py ContraOut_List.txt NDE_OUTDIR BED_ANNOTATED T T
**ContraOut_List.txt is a text file listing all samples' contra output directory to be used for null estimation, one sample per line. (parent directory to the 'table' directory generated by contra.py)
**BED_ANNOTATED is the bed file appended with a fourth column containing gene symbols
For each current and future sample which you want CNV results for:
RUN normal contra.py first, THEN:
WGCNV-NDE/wgcnv_wrapper.py ContraOutputDir OUTDIR NDE_OUTDIR/thresh_wg.txt NDE_OUTDIR/thresh_ex.txt BED_ANNOTATED T human
**thresh_wg.txt and thresh_ex.txt are gene-level and exon-level null estimates generated by nde_wrapper.py in the above steps
Data Simulator
TargetedSim Project Website - a tool to simulate targeted resequencing data. (by Kaushalya Amarasinghe 2011)
List of CONTRA parameters
Required
-t, --target
- Target region definition file [BED format]
-s, --test
- Alignment file for the test sample [BAM/SAM]
-c, --control Alignment file for the control sample [BAM/SAM/BED* – baseline file]
- *--bed option has to be supplied for control with baseline file.
-f, --fasta
- Reference genome [FASTA] (NOT REQUIRED since v2.0.8)
-o, --outFolder
- the folder name (and its path) to store the output of the analysis (this new folder will be created – error message occur if the folder exists)
Optional
--numBin
- Numbers of bins to group the regions. User can specify multiple experiments with different numbers of bins (comma separated). [Default: 20]
--minReadDepth
- The threshold for minimum read depth for each bases (see Step 2 in CONTRA workflow) [Default: 10]
--minNBases
- The threshold for minimum number of bases for each target regions (see Step 2 in CONTRA workflow) [Default: 10]
--sam
- If the specified test and control samples are in SAM format. [Default: False] (It will always take BAM samples as default)
--bed
- If specified, control will be a baseline file in BED format. [Default: False]
- *Please refer to the Baseline Script section for instruction how to create baseline files from set of BAMfiles. A set of baseline files from different platform have also been provided in the CONTRA download page.
--pval
- The p-value threshold for filtering. Based on Adjusted P-Values. Only regions that pass this threshold will be included in the VCF file. [Default: 0.05]
--sampleName
- The name to be appended to the front of the default output name. By default, there will be nothing appended.
--nomultimapped
- The option to remove multi-mapped reads (using SAMtools with mapping quality > 0). [default: FALSE]
-p, --plot
- If specified, plots of log-ratio distribution for each bin will be included in the output folder [default: FALSE]
--minExon
- Minimum number of exons in one bin (if less than this number, bin that contains small number of exons will be merged to the adjacent bins) [Default : 2000]
--minControlRdForCall
- Minimum Control ReadDepth for call [Default: 5]
--minTestRdForCall
- Minimum Test ReadDepth for call [Default: 0]
--minAvgForCall
- Minimum average coverage for call [Default: 20]
--maxRegionSize
- Maximum region size in target region (for breaking large regions into smaller regions. By default, maxRegionSize=0 means no breakdown). [Default : 0]
--targetRegionSize
- Target region size for breakdown (if maxRegionSize is non-zero) [Default: 200]
-l, --largeDeletion
- If specified, CONTRA will run large deletion analysis (CBS). User must have DNAcopy R-library installed to run the analysis. [False]
--smallSegment
- CBS segment size for calling large variations [Default : 1]
--largeSegment
- CBS segment size for calling large variations [Default : 25]
--lrCallStart
- Log ratios start range that will be used to call CNV [Default : -0.3]
--lrCallEnd
- Log ratios end range that will be used to call CNV [Default : 0.3]
--passSize
- Size of exons that passed the p-value threshold compare to the original exons size [Default: 0.5]
--removeDups (new since 2.0.6)
- if specified, will remove PCR duplicates [False]
--version
- Returns version
Baseline
Creating a baseline control from multiple samples is can be useful when a matched control is not available. In the CONTRA download page, we have provided several baseline files for some of the platforms that we have tried. Alternatively, the “baseline.py” script that comes with CONTRA can be used to generate a custom baseline file.
List of parameters
-t, --target
- Target region definition file [REQUIRED] [BED format]
-f, --files
- Files to be converted to baselines [REQUIRED] [BAM]
-o, --output
- Output folder [REQUIRED]
-c, --trim
- Portion of outliers to be removed before calculating average [Default: 0.2]
-n, --name
- Output baseline file name [Default: baseline]
Example
We assume we have a target file target_test.BED and four BAM files that we want to turn into baseline file test1.BAM, test2.BAM , test3.BAM and test4.BAM. Our intended output folder is in ~/Baseline/sampleBaseline/ with the final baseline file name baseline_test.
To run the baseline script on this sample, the command line argument is:
python baseline.py --target target_test.BED --files test1.BAM test2.BAM test3.BAM test4.BAM --output ~/Baseline/sampleBaseline/ --name baseline_test