TRUFFLE - Fast shared segment and ancestry estimation.
Quickstart
Introduction
Estimating relatedness and co-ancestry among pairs of individuals is a common task in association and population genetics studies. TRUFFLE enables simple and accurate identification of IBD1 and IBD2 segments, calculates total IBD1/IBD2 probabilities, and provides graphical reporting of the distribution of shared segments across pairs of individuals.
Important Notes for this release
Truffle is currently under development and a beta version has been released. This means that while the main algorithm has been validated in numerous datasets, there might still be some loose ends.
We ask you to notify us of any problems or errors encountered. If truffle works perfectly you can also send us an email with your experience.
Input file requirements
For the most common use case, truffle requires a single whole-genome vcf input file with genotypes for every individual.
Most vcf files should be readable as input. If you have a plink bed/bim file that you previously ld-pruned and filtered, truffle can use it as long as you convert it back to vcf using the plink command --recode-vcf.
It is also possible to run a single-chromosome or single-region analysis to identify segment locations and sizes. In such cases it might be necessary to adjust the thresholds for detecting segments.
Number of marker requirements
For this pre-release version we recommend filtering down the markers of the vcf file to 100-500k.
If you have an unfiltered vcf file with fewer than 10 million markers, truffle's built-in filtering options can reduce it before analysis.
Default filtering criteria
By default truffle filters out markers with minor allele frequency < 5% or missing data > 5%. If you would like to include all markers in the analysis, you can use the option --nofiltering.
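The default criteria above can be illustrated with a short Python sketch. This is not truffle's implementation: the function names are hypothetical, genotypes are assumed to be VCF-style GT strings (for example 0/1 or ./.), and only biallelic sites are handled.

```python
def site_stats(genotypes):
    """Return (minor allele frequency, missing rate) for one biallelic site.

    genotypes: list of VCF-style GT strings such as "0/0", "0|1" or "./.".
    """
    alt = 0      # ALT alleles among called genotypes
    called = 0   # number of called alleles
    missing = 0  # samples with a missing genotype
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            missing += 1
            continue
        called += len(alleles)
        alt += sum(1 for a in alleles if a != "0")
    alt_freq = alt / called if called else 0.0
    maf = min(alt_freq, 1.0 - alt_freq)
    missing_rate = missing / len(genotypes) if genotypes else 1.0
    return maf, missing_rate


def passes_default_filter(genotypes, min_maf=0.05, max_missing=0.05):
    """Keep a site unless MAF < min_maf or missing rate > max_missing."""
    maf, missing_rate = site_stats(genotypes)
    return maf >= min_maf and missing_rate <= max_missing
```

The two thresholds default to the 5% values stated above; whether truffle treats values exactly at the threshold the same way is an assumption.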
Quickstart
VCF file with 60-200k markers:
./truffle --vcf input.vcf.gz --cpu 4
Unfiltered VCF file, with millions of markers:
./truffle --vcf input.vcf.gz --cpu 4 --mindist 2000 --maf 0.1
Also generate a list of segments identified:
./truffle --vcf input.vcf.gz --cpu 4 --segments
Example data
You can download a VCF file used for the examples below from:
fs-and-po-pairs-from-1000genomes.vcf.gz 4.5MB
This dataset includes 47 selected CEU samples from 1000 Genomes sequencing data, including some parent-offspring and full-sibling pairs.
Other types of analysis
Truffle can read most common vcf files as input to the segment detection algorithm. The file should, generally, contain all chromosomes from all individuals. Currently, there is no support for joining vcf files from different chromosomes.
For most common use cases, the vcf file should contain autosomal chromosomes only.
Running a single chromosome
If you would like to perform segment identification on a single chromosome, the sensitivity should generally be adjusted; see the options --ibs1markers and --ibs2markers below.
Output a list of segments
To generate a list of segments in the file truffle.segments, use the option --segments:
./truffle --vcf input.vcf.gz --cpu 4 --segments
Output files description
truffle.ibd
An example output file is as follows:
ID1 ID2 NMARK NCOMMON IBD0 IBD1_MAX IBD1_NSEGS IBD1 IBD2_MAX IBD2_NSEGS IBD2
P7_HG02657 P5_HG02085 169413 0 1.000000 430 0 0.000000 38 0 0.000000
P7_HG02658 P1_HG00269 169413 0 1.000000 456 0 0.000000 40 0 0.000000
P7_HG02658 P6_HG02429 169413 0 1.000000 225 0 0.000000 29 0 0.000000
P8_HG03343 P1_HG00269 169413 0 1.000000 233 0 0.000000 28 0 0.000000
P8_HG03343 P6_HG02429 169413 0 1.000000 464 0 0.000000 39 0 0.000000
P10_HG03754 P10_HG03750 169413 0 0.001765 13475 23 0.987498 270 18 0.010737
P14_NA19334 P14_NA19331 169413 0 0.263604 9753 47 0.498734 1102 115 0.237662
The output file columns are:
ID1, ID2 : IDs of the pair.
NMARK : number of markers that were used.
NCOMMON : not used.
IBD0, IBD1, IBD2 : the computed proportion of the genome that is IBD0, IBD1 or IBD2.
IBD1_MAX : the maximum length of a segment found to be IBD1 or IBD2.
IBD2_MAX : the maximum length of a segment found to be IBD2.
IBD1_NSEGS : number of segments identified as IBD1 or IBD2.
IBD2_NSEGS : number of segments identified as IBD2.
In the above output file, the first five pairs are unrelated (IBD1 and IBD2 < 0.001).
The pair P10_HG03754 P10_HG03750 is identified as a parent-offspring pair (IBD1 > 0.95).
The last pair is a full-sibling pair, with IBD1 > 0.4 and IBD2 > 0.15.
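The interpretation above can be applied to a whole truffle.ibd file with a few lines of Python. This is a minimal sketch that uses the cut-offs quoted in the text as rules of thumb; the function names and the "other" label for pairs that match none of the three rules are illustrative additions, not part of truffle.

```python
def classify_pair(ibd1, ibd2):
    """Label a pair using the rule-of-thumb cut-offs from the text."""
    if ibd1 < 0.001 and ibd2 < 0.001:
        return "unrelated"
    if ibd1 > 0.95:
        return "parent-offspring"
    if ibd1 > 0.4 and ibd2 > 0.15:
        return "full-sibling"
    return "other"  # not covered by the three rules above


def classify_ibd(lines):
    """Yield (ID1, ID2, label) for each data row of a truffle.ibd file."""
    header = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if header is None:
            header = fields  # first non-empty line holds the column names
            continue
        row = dict(zip(header, fields))
        label = classify_pair(float(row["IBD1"]), float(row["IBD2"]))
        yield row["ID1"], row["ID2"], label


example = [
    "ID1 ID2 NMARK NCOMMON IBD0 IBD1_MAX IBD1_NSEGS IBD1 IBD2_MAX IBD2_NSEGS IBD2",
    "P7_HG02657 P5_HG02085 169413 0 1.000000 430 0 0.000000 38 0 0.000000",
    "P10_HG03754 P10_HG03750 169413 0 0.001765 13475 23 0.987498 270 18 0.010737",
    "P14_NA19334 P14_NA19331 169413 0 0.263604 9753 47 0.498734 1102 115 0.237662",
]
for id1, id2, label in classify_ibd(example):
    print(id1, id2, label)
```

To run it on real output, pass an open file handle instead of the example list, for instance classify_ibd(open("truffle.ibd")).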
truffle.segments
TYPE ID1 ID2 CHROM VARSTART VAREND POS Mbp LENGTH Mbp NMARKERS
IBD1 P0_HG00120 P0_HG00116 6 59059 59814 0.2765 Mbp 7.4560 Mbp 756
IBD2 P0_HG00120 P0_HG00116 6 64814 64872 104.6593 Mbp 1.3802 Mbp 59
IBD2 P0_HG00120 P0_HG00116 6 65197 65264 110.7780 Mbp 1.7072 Mbp 68
IBD1 P0_HG00120 P0_HG00116 6 62715 65328 52.5659 Mbp 61.1437 Mbp 2614
IBD2 P0_HG00120 P0_HG00116 6 66651 66706 139.8456 Mbp 1.8712 Mbp 56
IBD1 P0_HG00120 P0_HG00116 6 66064 66818 130.3246 Mbp 13.5776 Mbp 755
The output file columns are:
ID1, ID2 : IDs of the pair.
TYPE : whether the segment is IBD1 or IBD2.
CHROM : chromosome name.
VARSTART : the index of the marker at the start of the segment. This corresponds to the nth marker in the VCF file (if filtering is applied, this might not be the same as the nth marker in the original vcf file).
VAREND : the index of the marker at the end of the segment.
POS : position of the start of the segment in Mbp.
LENGTH : length of the segment in Mbp.
NMARKERS : the length of the segment in number of markers.
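A truffle.segments file can be read with a short Python sketch such as the one below. It assumes the whitespace-separated row layout of the example above, where the literal token Mbp follows both the POS and LENGTH values; the function names and dictionary keys are illustrative.

```python
from collections import defaultdict


def parse_segment(line):
    """Parse one data row of truffle.segments into a dict."""
    f = line.split()
    # Layout: TYPE ID1 ID2 CHROM VARSTART VAREND POS "Mbp" LENGTH "Mbp" NMARKERS
    return {
        "type": f[0],
        "id1": f[1],
        "id2": f[2],
        "chrom": f[3],
        "varstart": int(f[4]),
        "varend": int(f[5]),
        "pos_mbp": float(f[6]),
        "length_mbp": float(f[8]),
        "nmarkers": int(f[10]),
    }


def total_length_by_type(lines):
    """Sum segment lengths (Mbp) per (ID1, ID2, TYPE), skipping the header."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "TYPE":
            continue
        seg = parse_segment(line)
        totals[(seg["id1"], seg["id2"], seg["type"])] += seg["length_mbp"]
    return dict(totals)
```

In the example rows, NMARKERS equals VAREND - VARSTART + 1, and each IBD2 segment lies inside an IBD1 segment, so the IBD1 and IBD2 totals should not simply be added together.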
Notes for using truffle
Segment detection sensitivity
Truffle has some built-in measures to adjust the sensitivity of segment detection.
The default sensitivity threshold should produce reasonable, but maybe not optimal, results for whole-genome autosomal marker panels. In many cases you might need to adjust it, for example when:
- running exome sequencing data,
- analyzing studies with mixed ethnicity individuals,
- analyzing single chromosomes.
In most of these cases truffle reports more relatedness than expected, and you might want to increase the parameter L to 1.5-3. For example:
./truffle --vcf input.vcf.gz --L 2
For more precise minimum segment size control, you can specify exactly the minimum number of markers that a segment must have before it is considered for reporting in the output.
There are two parameters to control the minimum lengths of IBS1 and IBS2 segments: --ibs1markers n1 and --ibs2markers n2. For example:
./truffle --vcf input.vcf.gz --ibs1markers 4000 --ibs2markers 500
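The effect of a minimum marker count can be previewed on an existing truffle.segments file before rerunning truffle. The Python sketch below is a post-hoc filter only, under the assumption that rows follow the truffle.segments layout with TYPE first and NMARKERS last. It does not recompute the proportions in truffle.ibd, so it is not equivalent to running truffle with --ibs1markers and --ibs2markers. The function name is hypothetical.

```python
def filter_segments(lines, min_ibd1_markers, min_ibd2_markers):
    """Keep segment rows whose NMARKERS meets the minimum for their TYPE."""
    minimum = {"IBD1": min_ibd1_markers, "IBD2": min_ibd2_markers}
    kept = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] not in minimum:
            continue  # skip blank lines and the header row
        nmarkers = int(fields[-1])  # NMARKERS is the last column
        if nmarkers >= minimum[fields[0]]:
            kept.append(line.rstrip("\n"))
    return kept
```

Trying a few thresholds this way shows how many reported segments would survive a stricter setting.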
OPTIONAL: LD-pruning and filtering steps for genotyping array data
The following commands show a suggested pipeline for performing LD-pruning of variants and minor allele frequency pruning before running truffle:
plink --bfile input --maf 0.05 --mind 0.2 --make-bed --out /tmp/t
plink --bfile /tmp/t --maf 0.05 --geno 0.05 --make-bed --out /tmp/t
plink --bfile /tmp/t --indep-pairwise 2000 100 0.5 --out /tmp/t
awk '$1 >= 1 && $1 <= 22 {print $2}' /tmp/t.bim > /tmp/autosom.txt
plink --bfile /tmp/t --extract /tmp/t.prune.in --make-bed --out /tmp/t2
plink --bfile /tmp/t2 --extract /tmp/autosom.txt --recode-vcf --out /tmp/filtered
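The autosome selection step of the pipeline above can also be written in Python. This is a sketch that assumes the standard PLINK .bim layout, with the chromosome code in the first column and the variant ID in the second, and numeric chromosome codes 1-22 for autosomes; the function name is illustrative.

```python
def autosomal_variant_ids(bim_lines):
    """Return variant IDs of .bim rows whose chromosome code is 1-22."""
    ids = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        chrom, variant_id = fields[0], fields[1]
        if chrom.isdigit() and 1 <= int(chrom) <= 22:
            ids.append(variant_id)
    return ids
```

Writing the returned IDs one per line gives a list of variant IDs, the form plink reads with --extract.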
Commonly used command line options
Input/Output control
--vcf vcffile
Reads the input vcffile for processing.
--segments
Output a list of identified segments in the file truffle.segments
--out X
Change the name of the output files. The results will be stored in the files X.ibd and X.segments
CPU usage control
--cpu N
Use N cpus when running truffle.
Segment length threshold
--L X
Adjust the sensitivity threshold for detecting segments to X. Increasing L causes only larger segments to be reported. Default: 1.
For more specific marker length detection adjustments:
--ibs1markers n
Set the minimum segment length to report to n consecutive markers, for IBS1 segments.
--ibs2markers n
Set the minimum segment length to n consecutive markers, for IBS2 segments. This is generally 2-3 times smaller than the value of ibs1markers.
Variant filtering
--maf X
Remove variants that have a minor allele frequency less than X. For example: --maf 0.05
--missing X
Exclude variants with a missing rate of more than X. For example: --missing 0.02
--mindist X
Remove variants that are less than X base pairs apart. For example: --mindist 2000
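One way to picture distance-based thinning is a greedy pass that keeps a variant only if it lies at least X base pairs after the previously kept variant on the same chromosome. The Python sketch below illustrates that idea. Truffle's exact rule is not described here, so the greedy strategy and the function name are assumptions.

```python
def thin_by_distance(variants, mindist):
    """Greedy thinning of (chrom, pos) pairs.

    variants: iterable of (chrom, pos), sorted by position within each
    chromosome, with each chromosome's variants contiguous.
    """
    kept = []
    last_chrom, last_pos = None, None
    for chrom, pos in variants:
        # Always keep the first variant of a chromosome; otherwise require
        # at least mindist base pairs since the last kept variant.
        if chrom != last_chrom or pos - last_pos >= mindist:
            kept.append((chrom, pos))
            last_chrom, last_pos = chrom, pos
    return kept
```

With mindist set to 2000, variants at positions 100, 1500 and 2100 on one chromosome would be thinned to 100 and 2100 under this rule.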
Other options
--pair ID1 ID2
Analyze only a single pair from the data