SNP filtering full data set

SNP filtering

This filtering was performed on full data set per the [ SNP filtering tutorial ] (http://www.ddocent.com/filtering/) in dDocent. The filtration is housed in /home/tejashree/Moorea/ddocent/final/snp_filtering/.

Step1: 50% of individuals, a minimum quality score of 30, and a minor allele count of 3

Filter genotypes called below 50% (across all individuals) the –mac 3 flag tells it to filter SNPs that have a minor allele count less than 3. This is relative to genotypes, so it has to be called in at least 1 homozygote and 1 heterozygote or 3 heterozygotes.

vcftools --vcf TotalRawSNPs.vcf --max-missing 0.5 --mac 3 --minQ 30 --recode --recode-INFO-all --out raw.g5mac3

After filtering, kept 124 out of 124 Individuals
Outputting VCF file...
After filtering, kept 288777 out of a possible 773896 Sites

Step2: Minimum mean depth

This command will recode genotypes that have less than 3 reads.Sophisticated multisample variant callers like FreeBayes and GATK can confidently call genotypes with few reads because variants are assessed across all samples simultaneously. So, the genotype is based on three reads AND prior information from all reads from all individuals.

vcftools --vcf raw.g5mac3.recode.vcf --minDP 3 --recode --recode-INFO-all --out raw.g5mac3dp3`

After filtering, kept 124 out of 124 Individuals
Outputting VCF file...
After filtering, kept 288777 out of a possible 288777 Sites`

Step3: Filter missing indiv

The next step is to get rid of individuals that did not sequence well.

vcftools --vcf raw.g5mac3dp3.recode.vcf --missing-indv
cat out.imiss
mawk '!/IN/' out.imiss | cut -f5 > totalmissing
gnuplot << \EOF 
set terminal dumb size 120, 30
set autoscale 
unset label
set title "Histogram of % missing data per individual"
set ylabel "Number of Occurrences"
set xlabel "% of missing data"
#set yr [0:100000]
binwidth=0.01
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes
pause -1
EOF

                                         Histogram of % missing data per individual

     25 +-----------------------------------------------------------------------------------------------------------+
        |     *** * +           +           +           +           +           +           +           +           |
        |     *** *                                           'totalmissing' using (bin($1,binwidth)):(1.0) ******* |
        |     *** *                                                                                                 |
        |     *** *                                                                                                 |
     20 |-+  **** *                                                                                               +-|
        |    **** *                                                                                                 |
        |    **** *                                                                                                 |
        |    **** *                                                                                                 |
     15 |-+  **** *                                                                                               +-|
        |    **** *                                                                                                 |
        |    **** *                                                                                                 |
        |    **** *                                                                                                 |
        |    **** *                                                                                                 |
     10 |-+  **** *                                                                                               +-|
        |    **** *                                                                                                 |
        |    **** ***                                                                                               |
        |    **** ***                                                                                               |
      5 |-+  **** ***                                                                                             +-|
        |    **** ***                                                                                               |
        | ******* *** *****                                                                                         |
        | * ***** *****   *                                                                                         |
        |** ***** *****   ******************************************************************************************|
      0 +-----------------------------------------------------------------------------------------------------------+
       0.1         0.2         0.3         0.4         0.5         0.6         0.7         0.8         0.9          1
                                                      % of missing data

Most of the individuals have less than 0.3 missing data. Created a list of individuals with more than 30% missing data.

mawk '$5 > 0.3' out.imiss | cut -f1 > lowDP.indv
vcftools --vcf raw.g5mac3dp3.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out raw.g5mac3dplm`

After filtering, kept 117 out of 124 Individuals
Outputting VCF file...
After filtering, kept 288777 out of a possible 288777 Sites

Step4: Restrict the data to variants called in a high percentage of individuals and filter by mean depth of genotypes

This applied a genotype call rate (95%) across all individuals.

vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.05 --recode --recode-INFO-all --out DP3g95maf05 --min-meanDP 20
After filtering, kept 117 out of 117 Individuals
Outputting VCF file...
After filtering, kept 51772 out of a possible 288777 Sites

Step5: Filtering by population specific call rate when multiple localities are present

We want to filter by a population specific call rate.First we need a file to define localities (populations). This is called popmap. Use VCFtools to estimate missing data for loci in each population. Combine the two files and make a list of loci about the threshold of 10% missing data to remove.

cut -f1 out.imiss|awk  -F "_" 'NR>1{print $1"_"$2"_"$3,$1}' > popmap
cat popmap
mawk '$2 == "EOB"' popmap > 1.keep && mawk '$2 == "PBF"' popmap > 2.keep && mawk '$2 == "WOB"' popmap > 3.keep && mawk '$2 == "WOF"' popmap > 4.keep                                         

vcftools --vcf DP3g95maf05.recode.vcf --keep 1.keep --missing-site --out 1
vcftools --vcf DP3g95maf05.recode.vcf --keep 2.keep --missing-site --out 2                                                                                                                     
vcftools --vcf DP3g95maf05.recode.vcf --keep 3.keep --missing-site --out 3
vcftools --vcf DP3g95maf05.recode.vcf --keep 4.keep --missing-site --out 4                                                                                                                     
cat 1.lmiss 2.lmiss 3.lmiss 4.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci
vcftools --vcf DP3g95maf05.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf05

After filtering, kept 117 out of 117 Individuals
Outputting VCF file...
After filtering, kept 49453 out of a possible 51772 Sites

Step6: Used dDocent_filters script

./dDocent_filters DP3g95p5maf05.recode.vcf dDocent_filters_out

Number of sites filtered based on allele balance at heterozygous loci, locus quality, and mapping quality / Depth
 4058 of 49453

Are reads expected to overlap?  In other words, is fragment size less than 2X the read length? 
no
Number of additional sites filtered based on overlapping forward and reverse reads
5570 of 45395

Is this from a mixture of SE and PE libraries? Enter yes or no.
no
Number of additional sites filtered based on properly paired status
939 of 39825

Number of sites filtered based on high depth and lower than 2*DEPTH quality score
2470 of 38886

                                               Histogram of mean depth per site

      500 +---------------------------------------------------------------------------------------------------------+
          |+    +    +     +    +    +    +     +    +    +    +    +     +    +    +    +     +    +    +    +     |
      450 |-+                                      ** *   'meandepthpersite' using (bin($1,binwidth)):(1.0) *******-|
          |                                        ** *                                                             |
          |                                  **  ****** *                                                           |
      400 |-+                             ** *************                                                        +-|
          |                              *******************                                                        |
      350 |-+                          ***********************                                                    +-|
          |                        **  ***********************                                                      |
      300 |-+                      **  ************************ **                                                +-|
          |                       ********************************                                                  |
      250 |-+                     ********************************    **                                          +-|
          |                      ***********************************  **                                            |
          |                      *************************************** *                                          |
      200 |-+                  ***************************************** *                                        +-|
          |                  *********************************************                                          |
      150 |-+               **************************************************                                    +-|
          |               *******************************************************                                   |
      100 |-+          *  *********************************************************                               +-|
          |         ** **************************************************************                               |
          |         *****************************************************************  ***  * **                    |
       50 |-+    ******************************************************************************** *** **    ***  **-|
          |+    +***************************************************************************************************|
        0 +---------------------------------------------------------------------------------------------------------+
           12   24   36    48   60   72   84    96  108  120  132  144   156  168  180  192   204  216  228  240   252
                                                          Mean Depth

If distrubtion looks normal, a 1.645 sigma cutoff (~90% of the data) would be 29166.52005
The 95% cutoff would be 230
Would you like to use a different maximum mean depth cutoff than 230, yes or no 
no
Number of sites filtered based on maximum mean depth
 2370 of 38886

Number of sites filtered based on within locus depth mismatch
 23 of 36516

Total number of sites filtered
 12960 of 49453

Remaining sites
 36493

Filtered VCF file is called Output_prefix.FIL.recode.vcf

Filter stats stored in dDocent_filters_out.filterstats

Step7: Convert our variant calls to SNPs

vcfallelicprimitives dDocent_filters_out.FIL.recode.vcf --keep-info --keep-geno > DP3g95p5maf05.prim.vcf

Feed this VCF file into VCFtools to remove indels

vcftools --vcf DP3g95p5maf05.prim.vcf --remove-indels --recode --recode-INFO-all --out SNP.DP3g95p5maf05

After filtering, kept 38619 out of a possible 40143 Sites

Step8: HWE filter

./filter_hwe_by_pop.pl -v SNP.DP3g95p5maf05.recode.vcf -p popmap -o SNP.DP3g95p5maf05.HWE -h 0.01
Processing population: EOB (36 inds)
Processing population: PBF (30 inds)
Processing population: WOB (30 inds)
Processing population: WOF (28 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 31009 of a possible 38619 loci (filtered 7610 loci)

Written on April 30, 2020