SNP filtering full data set
Step2: Minimum mean depth Testing at 2 values
With the full dataset testing 2 values of minDP
Minimum mean depth = 5
vcftools --vcf raw.g5mac3.recode.vcf --minDP 5 --recode --recode-INFO-all --out raw.g5mac3dp5
After filtering, kept 124 out of 124 Individuals
Outputting VCF file...
After filtering, kept 288777 out of a possible 288777 Sites`
Step3: Filter missing indiv post minDP =5
vcftools --vcf raw.g5mac3dp5.recode.vcf --missing-indv
cat out.imiss
mawk '!/IN/' out.imiss | cut -f5 > totalmissing
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
unset label
set title "Histogram of % missing data per individual"
set ylabel "Number of Occurrences"
set xlabel "% of missing data"
set yr [0:100000]
binwidth=0.01
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes
pause -1
EOF
Histogram of % missing data per individual
25 +-----------------------------------------------------------------------------------------------------------+
| * * + + + + + + + + |
| * * 'totalmissing' using (bin($1,binwidth)):(1.0) ******* |
| * * |
| * * |
20 |-+ * * +-|
| ** ** |
| ** ** |
| ** ** |
15 |-+ ** ** +-|
| ** ** |
| ** *** |
| ** *** |
| **** *** |
10 |-+ **** *** +-|
| **** *** |
| **** *** |
| **** *** |
5 |-+ **** **** +-|
| **** **** |
| ***** ***** |
| ***** ******** *|
| ***** ***** ********************************************************************************************|
0 +-----------------------------------------------------------------------------------------------------------+
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
% of missing data
With cutoff at 25%, the number of lowDP individuals are 11.
mawk '$5 > 0.25' out.imiss | cut -f1 | less
INDV
EOB_182_epi
EOB_492_ddr
PBF_159_ddr
PBF_161_epi
WOB_35_epi
WOB_38_ddr
WOB_45_ddr
WOB_46_ddr
WOB_61_ddr
WOB_65_epi
WOF_232_epi
With cutoff at 30%, the number of lowDP indiv are 10
mawk '$5 > 0.3' out.imiss | cut -f1 | less
EOB_182_epi
EOB_492_ddr
PBF_159_ddr
PBF_161_epi
WOB_35_epi
WOB_38_ddr
WOB_45_ddr
WOB_46_ddr
WOB_61_ddr
WOB_65_epi
Choosing 25% cutoff
vcftools --vcf raw.g5mac3dp5.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out raw.g5mac3dplm
After filtering, kept 113 out of 124 Individuals
Outputting VCF file...
After filtering, kept 288777 out of a possible 288777 Sites
Step4: Restrict the data to variants called in a high percentage of individuals and filter by mean depth of genotypes
This applied a genotype call rate (95%) across all individuals. Setting maf = 0.001
vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.001 --recode --recode-INFO-all --out DP3g95maf001 --min-meanDP 20
After filtering, kept 113 out of 113 Individuals
Outputting VCF file...
After filtering, kept 126456 out of a possible 288777 Sites
Step5: Filtering by population specific call rate when multiple localities are present
cut -f1 out.imiss|awk -F "_" 'NR>1{print $1"_"$2"_"$3,$1}' > popmap
cat popmap
mawk '$2 == "EOB"' popmap > 1.keep && mawk '$2 == "PBF"' popmap > 2.keep && mawk '$2 == "WOB"' popmap > 3.keep && mawk '$2 == "WOF"' popmap > 4.keep
vcftools --vcf DP3g95maf001.recode.vcf --keep 1.keep --missing-site --out 1
vcftools --vcf DP3g95maf001.recode.vcf --keep 2.keep --missing-site --out 2
vcftools --vcf DP3g95maf001.recode.vcf --keep 3.keep --missing-site --out 3
vcftools --vcf DP3g95maf001.recode.vcf --keep 4.keep --missing-site --out 4
cat 1.lmiss 2.lmiss 3.lmiss 4.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci
vcftools --vcf DP3g95maf001.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf001
After filtering, kept 113 out of 113 Individuals
Outputting VCF file...
After filtering, kept 121233 out of a possible 126456 Sites
Step6: Used dDocent_filters script
./dDocent_filters DP3g95p5maf001.recode.vcf dDocent_filters_out
Number of sites filtered based on allele balance at heterozygous loci, locus quality, and mapping quality / Depth
18603 of 121233
Are reads expected to overlap? In other words, is fragment size less than 2X the read length? Enter yes or no.
no
Number of additional sites filtered based on overlapping forward and reverse reads
12091 of 102630
Is this from a mixture of SE and PE libraries? Enter yes or no.
no
Number of additional sites filtered based on properly paired status
1815 of 90539
Number of sites filtered based on high depth and lower than 2*DEPTH quality score
7746 of 88724
Histogram of mean depth per site
1200 +---------------------------------------------------------------------------------------------------------+
| + + + + + + + + + + + + + + + + + + + |
| 'meandepthpersite' using (bin($1,binwidth)):(1.0) ******* |
| * * |
1000 |-+ *** ** ** * * +-|
| * *** ********** |
| * ************** |
| * ********************** |
800 |-+ * ************************ +-|
| * * ************************** |
| *** ****************************** ** |
600 |-+ *** ********************************* +-|
| ************************************** |
| **************************************** *** |
| ********************************************** ** |
400 |-+ ************************************************* +-|
| *************************************************** |
| ********************************************************** |
| *************************************************************** |
200 |-+ ****************************************************************** +-|
| ************************************************************************* * |
| ******************************************************************************** ***** ** |
| + *************************************************************************************************|
0 +---------------------------------------------------------------------------------------------------------+
11 22 33 44 55 66 77 88 99 110 121 132 143 154 165 176 187 198 209 220 231
Mean Depth
If distrubtion looks normal, a 1.645 sigma cutoff (~90% of the data) would be 25233.65575
The 95% cutoff would be 211
Would you like to use a different maximum mean depth cutoff than 211, yes or no
no
Number of sites filtered based on maximum mean depth
5198 of 88724
Number of sites filtered based on within locus depth mismatch
26 of 83526
Total number of sites filtered
37733 of 121233
Remaining sites
83500
Step7: Convert our variant calls to SNPs
vcfallelicprimitives dDocent_filters_out.FIL.recode.vcf --keep-info --keep-geno > DP3g95p5maf001.prim.vcf
vcftools --vcf DP3g95p5maf001.prim.vcf --remove-indels --recode --recode-INFO-all --out SNP.DP3g95p5maf001
After filtering, kept 113 out of 113 Individuals
Outputting VCF file...
After filtering, kept 90992 out of a possible 95915 Sites
Step8: HWE filter
Setting -h 0.001
.././filter_hwe_by_pop.pl -v SNP.DP3g95p5maf001.recode.vcf -p popmap -o SNP.DP3g95p5maf001.HWE -h 0.001
Processing population: EOB (36 inds)
Processing population: PBF (30 inds)
Processing population: WOB (30 inds)
Processing population: WOF (28 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 87172 of a possible 90992 loci (filtered 3820 loci)
With minDP = 5, maf = 0.001 and h = 0.001 : 87172 loci kept with 113 indiv
Step2: Minimum mean depth Testing at 2 values
Minimum mean depth = 10
vcftools --vcf raw.g5mac3.recode.vcf --minDP 10 --recode --recode-INFO-all --out raw.g5mac3dp10
After filtering, kept 124 out of 124 Individuals
Outputting VCF file...
After filtering, kept 288777 out of a possible 288777 Sites`
Step3: Filter missing indiv post minDP=10
vcftools --vcf raw.g5mac3dp10.recode.vcf --missing-indv
cat out.imiss
mawk '!/IN/' out.imiss | cut -f5 > totalmissing
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
unset label
set title "Histogram of % missing data per individual"
set ylabel "Number of Occurrences"
set xlabel "% of missing data"
set yr [0:100000]
binwidth=0.01
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes
pause -1
EOF
Histogram of % missing data per individual
18 +-----------------------------------------------------------------------------------------------------------+
| **** + + + + + + + + |
| **** 'totalmissing' using (bin($1,binwidth)):(1.0) ******* |
16 |-+ **** +-|
| **** |
14 |-+ **** +-|
| ** ***** |
| ** ***** |
12 |-+ ** ******* +-|
| ** ***** * |
10 |-+ ** ***** * +-|
| ** ***** * |
| ** ***** * |
8 |-+ ** ***** * +-|
| ** ***** * |
6 |-+ ******** *** +-|
| ** ***** * * |
| ** ***** * * |
4 |-+ ** ***** * * +-|
| ** ***** * ** *** |
2 |-+ *** ***** * ** * * *** +-|
| *** ***** * ** * * * * |
| *** ***** * **** *************************************************************************** ** |
0 +-----------------------------------------------------------------------------------------------------------+
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1
% of missing data
With cutoff of 30%, number of lowDP indiv = 12
mawk '$5 > 0.3' out.imiss | cut -f1 | less
INDV
EOB_182_epi
EOB_184_ddr
EOB_492_ddr
PBF_159_ddr
PBF_161_epi
WOB_35_epi
WOB_38_ddr
WOB_45_ddr
WOB_46_ddr
WOB_61_ddr
WOB_65_epi
WOF_232_epi
vcftools --vcf raw.g5mac3dp10.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out raw.g5mac3dplm
After filtering, kept 112 out of 124 Individuals
Outputting VCF file...
After filtering, kept 288777 out of a possible 288777 Sites
Step4: Restrict the data to variants called in a high percentage of individuals and filter by mean depth of genotypes
This applied a genotype call rate (95%) across all individuals. Setting maf = 0.001
vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.001 --recode --recode-INFO-all --out DP3g95maf001 --min-meanDP 20
fter filtering, kept 112 out of 112 Individuals
Outputting VCF file...
After filtering, kept 117809 out of a possible 288777 Sites
Step5: Filtering by population specific call rate when multiple localities are present
cut -f1 out.imiss|awk -F "_" 'NR>1{print $1"_"$2"_"$3,$1}' > popmap
cat popmap
mawk '$2 == "EOB"' popmap > 1.keep && mawk '$2 == "PBF"' popmap > 2.keep && mawk '$2 == "WOB"' popmap > 3.keep && mawk '$2 == "WOF"' popmap > 4.keep
vcftools --vcf DP3g95maf001.recode.vcf --keep 1.keep --missing-site --out 1
vcftools --vcf DP3g95maf001.recode.vcf --keep 2.keep --missing-site --out 2
vcftools --vcf DP3g95maf001.recode.vcf --keep 3.keep --missing-site --out 3
vcftools --vcf DP3g95maf001.recode.vcf --keep 4.keep --missing-site --out 4
cat 1.lmiss 2.lmiss 3.lmiss 4.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci
vcftools --vcf DP3g95maf001.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf001
After filtering, kept 112 out of 112 Individuals
Outputting VCF file...
After filtering, kept 112668 out of a possible 117809 Sites
Step6: dDocent_filters script
.././dDocent_filters DP3g95p5maf001.recode.vcf dDocent_filters_out
Number of sites filtered based on allele balance at heterozygous loci, locus quality, and mapping quality / Depth
17462 of 112668
Are reads expected to overlap? In other words, is fragment size less than 2X the read length? Enter yes or no.
no
Number of additional sites filtered based on overlapping forward and reverse reads
11477 of 95206
Is this from a mixture of SE and PE libraries? Enter yes or no.
no
Number of additional sites filtered based on properly paired status
1725 of 83729
Number of sites filtered based on high depth and lower than 2*DEPTH quality score
7395 of 82004
Histogram of mean depth per site
1200 +---------------------------------------------------------------------------------------------------------+
| + + + + + + + + + + + + + + + + + + + + |
| 'meandepthpersite' using (bin($1,binwidth)):(1.0) ******* |
| ** ** |
1000 |-+ ** *** +-|
| * ** **** **** |
| * ********* **** |
| ***************** |
800 |-+ *********************** +-|
| ************************* |
| *************************** * |
600 |-+ * ******************************** +-|
| ** ********************************** |
| ************************************* |
| ******************************************** |
400 |-+ ********************************************** +-|
| *********************************************** |
| ** ***************************************************** |
| ** ******************************************************** |
200 |-+ ************************************************************ +-|
| ************************************************************** * |
| ********************************************************************** ****** ** |
| + + ********************************************************************************************|
0 +---------------------------------------------------------------------------------------------------------+
11 22 33 44 55 66 77 88 99 110 121 132 143 154 165 176 187 198 209 220 231
Mean Depth
If distrubtion looks normal, a 1.645 sigma cutoff (~90% of the data) would be 25723.2027
The 95% cutoff would be 218
Would you like to use a different maximum mean depth cutoff than 218, yes or no
no
Number of sites filtered based on maximum mean depth
4794 of 82004
Number of sites filtered based on within locus depth mismatch
12 of 77210
Total number of sites filtered
35470 of 112668
Remaining sites
77198
Step7: Convert our variant calls to SNPs
vcfallelicprimitives dDocent_filters_out.FIL.recode.vcf --keep-info --keep-geno > DP3g95p5maf001.prim.vcf
vcftools --vcf DP3g95p5maf001.prim.vcf --remove-indels --recode --recode-INFO-all --out SNP.DP3g95p5maf001
After filtering, kept 112 out of 112 Individuals
Outputting VCF file...
After filtering, kept 84068 out of a possible 88583 Sites
Step8: HWE filter
Setting -h 0.001
.././filter_hwe_by_pop.pl -v SNP.DP3g95p5maf001.recode.vcf -p popmap -o SNP.DP3g95p5maf001.HWE -h 0.001
Processing population: EOB (36 inds)
Processing population: PBF (30 inds)
Processing population: WOB (30 inds)
Processing population: WOF (28 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 80692 of a possible 84068 loci (filtered 3376 loci)
With minDP = 10, maf = 0.001 and h = 0.001 : 80692 loci kept with 112 indiv
Written on April 30, 2020