layout: post title: VCFTools date: 2020-07-10 —

How to subset or filter individuals from a vcf file.

A vcf file format is a very complicated format and manually editing the file is really not a good idea. Often I find myself in a situation where I want analyze a subgroup of samples separately or create separate groups of samples that are in a single vcf file. VCFTools has a very useful functionality that can allow the user to choose which individuals to remove/keep in a vcf file while maintaining the format.

If you want to keep a single or a just a few individuals you can specify them directly on commandline after --indiv option.

vcftools --vcf <original vcf file> --indiv <indiv name> --recode --recode-INFO-all --out <new vcf file>

The --recode option generates a new vcf file with the changes. If you want a bcf file instead you can use --recode-bcf to output a bcf file. --recode-INFO-all option is used to keep all the information in the original vcf exactly the same in the outputvcf which is most often the requirement since the goal is to subset only. --out specifies the prefix for the outfile. By default VCFTools adds a suffix of “.recode.vcf” to the prefix specified by the user.

You can also provide a list of individuals to keep from the original vcf file by making a text file with one name per line and using the --keep option.

vcftools --vcf <original vcf file> --keep <file with indiv names> --recode --recode-INFO-all --out <new vcf file>

If you want to filter out an individual you can use

vcftools --vcf <original vcf file> --remove-indiv <indiv name> --recode --recode-INFO-all --out <new vcf file>

or if you want a list of individuals removed use

vcftools --vcf <original vcf file> --remove <file with indiv names> --recode --recode-INFO-all --out <new vcf file>

An important thing to remember is that the individual names have to exactly match the ones in the header of the original vcf file.