TRUFFLE - Fast shared segment and ancestry estimation.

Quickstart

Introduction

Estimating relatedness and co-ancestry among pairs of individuals is a common task in association and population genetics studies. TRUFFLE enables simple and accurate identification of IBD1 and IBD2 segments, calculates total IBD1/IBD2 probabilities, and provides graphical reporting of the distribution of shared segments across pairs of individuals.

Important Notes for this release

Truffle is currently under development and a beta version has been released. This means that while the main algorithm has been validated in numerous datasets, there might still be some loose ends.

We ask you to notify us of any problems or errors encountered. If truffle works perfectly you can also send us an email with your experience.

Input file requirements

For the most common use case, truffle requires a single whole-genome vcf input file with genotypes for every individual.

Most vcf files should be readable as input. If you have a plink bed/bim file that you previously ld-pruned and filtered, truffle can use it as long as you convert it back to vcf using the plink command --recode-vcf.

It is also possible to run a single-chromosome or single-region analysis to identify segment locations and sizes. In such cases it might be necessary to adjust the thresholds for detecting segments.

Number of marker requirements

For this pre-release version we recommend filtering down the markers of the vcf file to 100-500k.

If you have an unfiltered vcf file with < 10 million markers, truffle has embedded functionality for filtering the vcf file before analysis.

Default filtering criteria

By default truffle filters out markers with minor allele frequency < 5% or missing data > 5%. If you would like to include all markers in analysis you can use the option --nofiltering.
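To make the two default thresholds concrete, the following Python sketch computes the minor allele frequency and the missing rate of one biallelic variant from its VCF genotype strings and applies the same cut-offs. It is an illustration only, not truffle's internal code; the function names variant_stats and passes_default_filter are hypothetical, and the handling of values exactly at the threshold is an assumption.

```python
def variant_stats(genotypes):
    """Return (maf, missing_rate) for one biallelic variant.

    genotypes: list of VCF GT strings such as "0/1", "1|1" or "./.".
    """
    alt = 0      # alternate alleles observed
    called = 0   # non-missing alleles observed
    missing = 0  # samples with a missing genotype
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            missing += 1
            continue
        called += len(alleles)
        alt += sum(1 for a in alleles if a != "0")
    alt_freq = alt / called if called else 0.0
    maf = min(alt_freq, 1.0 - alt_freq)
    missing_rate = missing / len(genotypes) if genotypes else 0.0
    return maf, missing_rate


def passes_default_filter(genotypes, min_maf=0.05, max_missing=0.05):
    """Keep a variant only if MAF >= min_maf and missing rate <= max_missing."""
    maf, missing_rate = variant_stats(genotypes)
    return maf >= min_maf and missing_rate <= max_missing


# A variant with 10% minor allele frequency and no missing genotypes is kept.
print(passes_default_filter(["0/0"] * 8 + ["0/1"] * 2))
```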

Quickstart

VCF file with 60-200k markers:

./truffle --vcf input.vcf.gz --cpu 4

Unfiltered VCF file, with millions of markers:

./truffle --vcf input.vcf.gz --cpu 4 --mindist 2000 --maf 0.1

Also generate a list of segments identified:

./truffle --vcf input.vcf.gz --cpu 4 --segments

Example data

You can download a VCF file used for the examples below from:

fs-and-po-pairs-from-1000genomes.vcf.gz 4.5MB

This dataset includes 47 selected CEU samples from 1000 genomes sequencing data, including some parent-offspring and full-sibling pairs.

Other types of analysis

Truffle can read most common vcf files as input to the segment detection algorithm. The file should, generally, contain all chromosomes from all individuals. Currently, there is no support for joining vcf files from different chromosomes.

For most common use cases, the vcf file should contain autosomal chromosomes only.

Running a single chromosome

If you would like to perform segment identification on a single chromosome, the sensitivity should generally be adjusted; see the options --ibs1markers and --ibs2markers below.

Output a list of segments

To generate a list of segments in the file truffle.segments use the option --segments:

./truffle --vcf input.vcf.gz --cpu 4 --segments

Output files description

truffle.ibd

An example output file is as follows:

           ID1                ID2    NMARK  NCOMMON         IBD0        IBD1_MAX IBD1_NSEGS      IBD1        IBD2_MAX IBD2_NSEGS      IBD2    
    P7_HG02657         P5_HG02085   169413        0     1.000000             430          0  0.000000              38          0  0.000000    
    P7_HG02658         P1_HG00269   169413        0     1.000000             456          0  0.000000              40          0  0.000000    
    P7_HG02658         P6_HG02429   169413        0     1.000000             225          0  0.000000              29          0  0.000000    
    P8_HG03343         P1_HG00269   169413        0     1.000000             233          0  0.000000              28          0  0.000000    
    P8_HG03343         P6_HG02429   169413        0     1.000000             464          0  0.000000              39          0  0.000000    
   P10_HG03754        P10_HG03750   169413        0     0.001765           13475         23  0.987498             270         18  0.010737    
   P14_NA19334        P14_NA19331   169413        0     0.263604            9753         47  0.498734            1102        115  0.237662

The output file columns are:

ID1,ID2 : IDs of the pair.
NMARK : number of markers that were used.
NCOMMON : not used.
IBD0,IBD1,IBD2 : The computed proportion of the genome that is IBD0, IBD1 or IBD2.
IBD1_MAX : The maximum length of a segment found to be IBD1 or IBD2.
IBD2_MAX : The maximum length of a segment found to be IBD2.
IBD1_NSEGS : Number of segments that are identified as IBD1 or IBD2.
IBD2_NSEGS : Number of segments that are identified as IBD2.

In the above output file, the first five pairs are unrelated (IBD1 and IBD2 < 0.001).

The pair P10_HG03754 P10_HG03750 is identified as a parent-offspring pair (IBD1 > 0.95).

The last pair is definitely a full-sibling pair, having IBD1 > 0.4 and IBD2 > 0.15.
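The interpretation above can be sketched as a small Python script that reads truffle.ibd and labels each pair. The cut-offs are simply the values quoted for the example pairs, used here as rough rules of thumb rather than validated thresholds, and the helper names read_ibd and classify are hypothetical.

```python
def read_ibd(lines):
    """Parse whitespace-delimited truffle.ibd lines into a list of dicts."""
    rows = [line.split() for line in lines if line.strip()]
    header, records = rows[0], rows[1:]
    return [dict(zip(header, rec)) for rec in records]


def classify(ibd1, ibd2):
    """Rough labels based on the example thresholds (an assumption)."""
    if ibd1 > 0.95:
        return "parent-offspring"
    if ibd1 > 0.4 and ibd2 > 0.15:
        return "full-sibling"
    if ibd1 < 0.001 and ibd2 < 0.001:
        return "unrelated"
    return "other"


# Three rows copied from the example output file.
example = """\
ID1 ID2 NMARK NCOMMON IBD0 IBD1_MAX IBD1_NSEGS IBD1 IBD2_MAX IBD2_NSEGS IBD2
P7_HG02657 P5_HG02085 169413 0 1.000000 430 0 0.000000 38 0 0.000000
P10_HG03754 P10_HG03750 169413 0 0.001765 13475 23 0.987498 270 18 0.010737
P14_NA19334 P14_NA19331 169413 0 0.263604 9753 47 0.498734 1102 115 0.237662
""".splitlines()

labels = {}
for row in read_ibd(example):
    pair = (row["ID1"], row["ID2"])
    labels[pair] = classify(float(row["IBD1"]), float(row["IBD2"]))

for (id1, id2), label in labels.items():
    print(id1, id2, label)
```

To run it on real output, pass an open file instead of the example lines, for instance read_ibd(open("truffle.ibd")).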

truffle.segments

TYPE              ID1             ID2  CHROM  VARSTART   VAREND              POS Mbp      LENGTH Mbp    NMARKERS     
IBD1       P0_HG00120      P0_HG00116      6     59059    59814           0.2765 Mbp      7.4560 Mbp         756
IBD2       P0_HG00120      P0_HG00116      6     64814    64872         104.6593 Mbp      1.3802 Mbp          59
IBD2       P0_HG00120      P0_HG00116      6     65197    65264         110.7780 Mbp      1.7072 Mbp          68
IBD1       P0_HG00120      P0_HG00116      6     62715    65328          52.5659 Mbp     61.1437 Mbp        2614
IBD2       P0_HG00120      P0_HG00116      6     66651    66706         139.8456 Mbp      1.8712 Mbp          56
IBD1       P0_HG00120      P0_HG00116      6     66064    66818         130.3246 Mbp     13.5776 Mbp         755

The output file columns are:

ID1,ID2 : IDs of the pair.
TYPE : whether the segment was IBD1 or IBD2.
CHROM : chromosome name.
VARSTART : the index of the marker at the start of the segment. This corresponds to the n’th marker in the VCF file (if filtering is applied, this might not be the same as the n’th marker in the original vcf file).
VAREND : the index of the marker at the end of the segment.
POS : Position of the start of the segment in Mbp.
LENGTH : Length of the segment in Mbp.
NMARKERS : The length of the segment in number of markers.
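As a sketch of how this file might be post-processed, the Python snippet below sums the LENGTH column per pair and segment type. It assumes the whitespace-separated layout shown in the example, where POS and LENGTH are each followed by a separate "Mbp" token, so data rows have 11 fields. The function name total_segment_length is hypothetical.

```python
from collections import defaultdict


def total_segment_length(lines):
    """Sum LENGTH (Mbp) per (ID1, ID2, TYPE) from truffle.segments lines."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header line.
        if not fields or fields[0] == "TYPE":
            continue
        # Token order in a data row:
        # TYPE ID1 ID2 CHROM VARSTART VAREND POS Mbp LENGTH Mbp NMARKERS
        seg_type, id1, id2 = fields[0], fields[1], fields[2]
        length_mbp = float(fields[8])
        totals[(id1, id2, seg_type)] += length_mbp
    return dict(totals)


# Three rows copied from the example segments file.
example = """\
TYPE ID1 ID2 CHROM VARSTART VAREND POS Mbp LENGTH Mbp NMARKERS
IBD1 P0_HG00120 P0_HG00116 6 59059 59814 0.2765 Mbp 7.4560 Mbp 756
IBD2 P0_HG00120 P0_HG00116 6 64814 64872 104.6593 Mbp 1.3802 Mbp 59
IBD1 P0_HG00120 P0_HG00116 6 62715 65328 52.5659 Mbp 61.1437 Mbp 2614
""".splitlines()

totals = total_segment_length(example)
for (id1, id2, seg_type), mbp in sorted(totals.items()):
    print(id1, id2, seg_type, round(mbp, 4))
```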

Notes for using truffle

Segment detection sensitivity

Truffle has some built-in measures to adjust the sensitivity of segment detection.

The default sensitivity threshold should produce reasonable, but maybe not optimal, results for whole-genome autosomal marker panels. In many cases you might need to adjust it, for example when:

running exome sequencing data,
analyzing studies with mixed ethnicity individuals,
analyzing single chromosomes.

In most of these cases, truffle reports more relatedness than expected, and you might want to consider raising the parameter L to 1.5-3. For example:

./truffle --vcf input.vcf.gz --L 2

For more precise minimum segment size control, you can specify exactly the minimum number of markers that a segment must have before it is considered for reporting in the output.

There are two parameters to control the minimum lengths for IBS1 and IBS2 segments: --ibs1markers n1 and --ibs2markers n2. For example:

./truffle --vcf input.vcf.gz --ibs1markers 4000 --ibs2markers 500

OPTIONAL: LD-pruning and filtering steps for genotyping array data

The following commands show a suggested pipeline for LD-pruning of variants and minor allele frequency filtering before running truffle:

plink --bfile input --maf 0.05 --mind 0.2 --make-bed --out /tmp/t
plink --bfile /tmp/t --maf 0.05 --geno 0.05 --make-bed --out /tmp/t

plink --bfile /tmp/t --indep-pairwise 2000 100 0.5 --out /tmp/t
awk '$1<=22 {print $2}' /tmp/t.bim >/tmp/autosom.txt
plink --bfile /tmp/t --extract /tmp/t.prune.in --make-bed --out /tmp/t2

plink --bfile /tmp/t2 --extract /tmp/autosom.txt --recode-vcf --out /tmp/filtered

Commonly used command line options

Input/Output control

--vcf vcffile
Reads the input vcffile for processing.
--segments
Output a list of identified segments in the file truffle.segments.
--out X
Change the name of the output files. The results will be stored in the files X.ibd and X.segments

CPU usage control

--cpu N
Use N cpus when running truffle.

Segment length threshold

--L X
Adjust the sensitivity threshold for detecting segments to X. Increasing L causes only larger segments to be reported.
Default: 1. For more specific marker length detection adjustments:
--ibs1markers n
Set the minimum segment length to report to n consecutive markers, for IBS1 segments.
--ibs2markers n
Set the minimum segment length to report to n consecutive markers, for IBS2 segments. This is generally 2-3 times smaller than the value of ibs1markers.

Variant filtering

--maf X
Remove variants that have a minor allele frequency less than X. For example: --maf 0.05
--missing X
Exclude variants with a missing rate greater than X. For example: --missing 0.02
--mindist X
Remove variants that are less than X base-pairs apart. For example: --mindist 2000
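
One way to picture distance-based thinning is a greedy pass that keeps a variant only if it lies at least X base-pairs after the last kept variant on the same chromosome. The Python sketch below illustrates that idea; whether truffle applies exactly this rule is an assumption, and the function name thin_by_distance is hypothetical.

```python
def thin_by_distance(variants, mindist):
    """Greedy thinning of (chrom, pos) tuples sorted by chromosome and position."""
    kept = []
    last_chrom, last_pos = None, None
    for chrom, pos in variants:
        # Always keep the first variant of a chromosome; afterwards keep a
        # variant only if it is at least mindist bp from the last kept one.
        if chrom != last_chrom or pos - last_pos >= mindist:
            kept.append((chrom, pos))
            last_chrom, last_pos = chrom, pos
    return kept


variants = [("1", 1000), ("1", 1500), ("1", 3000), ("1", 4999), ("2", 100)]
print(thin_by_distance(variants, 2000))
```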

Other options

--pair ID1 ID2
Analyze only a single pair from the data.