Modified pipeline for identification of clusters and cluster specific analysis
Pipeline modificaitons
After reorganizing the entire pipeline from identification of clusters to individual cluster analysis, this post walks through all the steps performed per the changes.This analysis is housed in mod_pipeline/ on kitt.
The steps refer to the step numbers in the pipeline figure generated for the MS.
Estimating species diversity
KITT: mod_pipeline/species_diversity/
Step1: This step uses all samples post demultiplexing and removal of clones and removal of samples with read count < 400K.
Step2: Generating a reference assembly
KITT: mod_pipeline/species_diversity/reference_assembly/ Five samples per location were randomly chosen to generate the reference assembly. Parameter optimization was performed using Used this as reference for parameter optimization for reference assembly.
- ReferenceOpt.sh
ReferenceOpt.sh 2 6 2 6 PE 20
dDocent ReferenceOpt version 2.7.8
K1 is 2 K2 is 2 c is 0.80
K1 is 2 K2 is 2 c is 0.82
K1 is 2 K2 is 2 c is 0.84
K1 is 2 K2 is 2 c is 0.86
K1 is 2 K2 is 2 c is 0.88
K1 is 2 K2 is 2 c is 0.90
K1 is 2 K2 is 2 c is 0.92
K1 is 2 K2 is 2 c is 0.94
K1 is 2 K2 is 2 c is 0.96
K1 is 2 K2 is 2 c is 0.98
K1 is 2 K2 is 3 c is 0.80
K1 is 2 K2 is 3 c is 0.82
K1 is 2 K2 is 3 c is 0.84
K1 is 2 K2 is 3 c is 0.86
K1 is 2 K2 is 3 c is 0.88
K1 is 2 K2 is 3 c is 0.90
K1 is 2 K2 is 3 c is 0.92
K1 is 2 K2 is 3 c is 0.94
K1 is 2 K2 is 3 c is 0.96
K1 is 2 K2 is 3 c is 0.98
K1 is 2 K2 is 4 c is 0.80
K1 is 2 K2 is 4 c is 0.82
K1 is 2 K2 is 4 c is 0.84
K1 is 2 K2 is 4 c is 0.86
K1 is 2 K2 is 4 c is 0.88
K1 is 2 K2 is 4 c is 0.90
K1 is 2 K2 is 4 c is 0.92
K1 is 2 K2 is 4 c is 0.94
K1 is 2 K2 is 4 c is 0.96
K1 is 2 K2 is 4 c is 0.98
K1 is 2 K2 is 5 c is 0.80
K1 is 2 K2 is 5 c is 0.82
K1 is 2 K2 is 5 c is 0.84
K1 is 2 K2 is 5 c is 0.86
K1 is 2 K2 is 5 c is 0.88
K1 is 2 K2 is 5 c is 0.90
K1 is 2 K2 is 5 c is 0.92
K1 is 2 K2 is 5 c is 0.94
K1 is 2 K2 is 5 c is 0.96
K1 is 2 K2 is 5 c is 0.98
K1 is 2 K2 is 6 c is 0.80
K1 is 2 K2 is 6 c is 0.82
K1 is 2 K2 is 6 c is 0.84
K1 is 2 K2 is 6 c is 0.86
K1 is 2 K2 is 6 c is 0.88
K1 is 2 K2 is 6 c is 0.90
K1 is 2 K2 is 6 c is 0.92
K1 is 2 K2 is 6 c is 0.94
K1 is 2 K2 is 6 c is 0.96
K1 is 2 K2 is 6 c is 0.98
K1 is 3 K2 is 2 c is 0.80
K1 is 3 K2 is 2 c is 0.82
K1 is 3 K2 is 2 c is 0.84
K1 is 3 K2 is 2 c is 0.86
K1 is 2 K2 is 5 c is 0.88
K1 is 2 K2 is 5 c is 0.90
K1 is 2 K2 is 5 c is 0.92
K1 is 2 K2 is 5 c is 0.94
K1 is 2 K2 is 5 c is 0.96
K1 is 2 K2 is 5 c is 0.98
K1 is 2 K2 is 6 c is 0.80
K1 is 2 K2 is 6 c is 0.82
K1 is 2 K2 is 6 c is 0.84
K1 is 2 K2 is 6 c is 0.86
K1 is 2 K2 is 6 c is 0.88
K1 is 2 K2 is 6 c is 0.90
K1 is 2 K2 is 6 c is 0.92
K1 is 2 K2 is 6 c is 0.94
K1 is 2 K2 is 6 c is 0.96
K1 is 2 K2 is 6 c is 0.98
K1 is 3 K2 is 2 c is 0.80
K1 is 3 K2 is 2 c is 0.82
K1 is 3 K2 is 2 c is 0.84
K1 is 3 K2 is 2 c is 0.86
K1 is 3 K2 is 2 c is 0.88
K1 is 3 K2 is 2 c is 0.90
K1 is 3 K2 is 2 c is 0.92
K1 is 3 K2 is 2 c is 0.94
K1 is 3 K2 is 2 c is 0.96
K1 is 3 K2 is 2 c is 0.98
K1 is 3 K2 is 3 c is 0.80
K1 is 3 K2 is 3 c is 0.82
K1 is 3 K2 is 3 c is 0.84
K1 is 3 K2 is 3 c is 0.86
K1 is 3 K2 is 3 c is 0.88
K1 is 3 K2 is 3 c is 0.90
K1 is 3 K2 is 3 c is 0.92
K1 is 3 K2 is 3 c is 0.94
K1 is 3 K2 is 3 c is 0.96
K1 is 3 K2 is 3 c is 0.98
K1 is 3 K2 is 4 c is 0.80
K1 is 3 K2 is 4 c is 0.82
K1 is 3 K2 is 4 c is 0.84
K1 is 3 K2 is 4 c is 0.86
K1 is 3 K2 is 4 c is 0.88
K1 is 3 K2 is 4 c is 0.90
K1 is 3 K2 is 4 c is 0.92
K1 is 3 K2 is 4 c is 0.94
K1 is 3 K2 is 4 c is 0.96
K1 is 3 K2 is 4 c is 0.98
K1 is 3 K2 is 5 c is 0.80
K1 is 3 K2 is 5 c is 0.82
K1 is 3 K2 is 5 c is 0.84
K1 is 3 K2 is 5 c is 0.86
K1 is 3 K2 is 5 c is 0.88
K1 is 3 K2 is 5 c is 0.90
K1 is 3 K2 is 5 c is 0.92
K1 is 3 K2 is 5 c is 0.94
K1 is 3 K2 is 5 c is 0.96
K1 is 3 K2 is 5 c is 0.98
K1 is 3 K2 is 6 c is 0.80
K1 is 3 K2 is 6 c is 0.82
K1 is 3 K2 is 6 c is 0.84
K1 is 3 K2 is 6 c is 0.86
K1 is 3 K2 is 6 c is 0.88
K1 is 3 K2 is 6 c is 0.90
K1 is 3 K2 is 6 c is 0.92
K1 is 3 K2 is 6 c is 0.94
K1 is 3 K2 is 6 c is 0.96
K1 is 3 K2 is 6 c is 0.98
K1 is 4 K2 is 2 c is 0.80
K1 is 4 K2 is 2 c is 0.82
K1 is 4 K2 is 2 c is 0.84
K1 is 4 K2 is 2 c is 0.86
K1 is 4 K2 is 2 c is 0.88
K1 is 4 K2 is 2 c is 0.90
K1 is 4 K2 is 2 c is 0.92
K1 is 4 K2 is 2 c is 0.94
K1 is 4 K2 is 2 c is 0.96
K1 is 4 K2 is 2 c is 0.98
K1 is 4 K2 is 3 c is 0.80
K1 is 4 K2 is 3 c is 0.82
K1 is 4 K2 is 3 c is 0.84
K1 is 4 K2 is 3 c is 0.86
K1 is 4 K2 is 3 c is 0.88
K1 is 4 K2 is 3 c is 0.90
K1 is 4 K2 is 3 c is 0.92
K1 is 4 K2 is 3 c is 0.94
K1 is 4 K2 is 3 c is 0.96
K1 is 4 K2 is 3 c is 0.98
K1 is 4 K2 is 4 c is 0.80
K1 is 4 K2 is 4 c is 0.82
K1 is 4 K2 is 4 c is 0.84
K1 is 4 K2 is 4 c is 0.86
K1 is 4 K2 is 4 c is 0.88
K1 is 4 K2 is 4 c is 0.90
K1 is 4 K2 is 4 c is 0.92
K1 is 4 K2 is 4 c is 0.94
K1 is 4 K2 is 4 c is 0.96
K1 is 4 K2 is 4 c is 0.98
K1 is 4 K2 is 5 c is 0.80
K1 is 4 K2 is 5 c is 0.82
K1 is 4 K2 is 5 c is 0.84
K1 is 4 K2 is 5 c is 0.86
K1 is 4 K2 is 5 c is 0.88
K1 is 4 K2 is 5 c is 0.90
K1 is 4 K2 is 5 c is 0.92
K1 is 4 K2 is 5 c is 0.94
K1 is 4 K2 is 5 c is 0.96
K1 is 4 K2 is 5 c is 0.98
K1 is 4 K2 is 6 c is 0.80
K1 is 4 K2 is 6 c is 0.82
K1 is 4 K2 is 6 c is 0.84
K1 is 4 K2 is 6 c is 0.86
K1 is 4 K2 is 6 c is 0.88
K1 is 4 K2 is 6 c is 0.90
K1 is 4 K2 is 6 c is 0.92
K1 is 4 K2 is 6 c is 0.94
K1 is 4 K2 is 6 c is 0.96
K1 is 4 K2 is 6 c is 0.98
K1 is 5 K2 is 2 c is 0.80
K1 is 5 K2 is 2 c is 0.82
K1 is 5 K2 is 2 c is 0.84
K1 is 5 K2 is 2 c is 0.86
K1 is 5 K2 is 2 c is 0.88
K1 is 5 K2 is 2 c is 0.90
K1 is 5 K2 is 2 c is 0.92
K1 is 5 K2 is 2 c is 0.94
K1 is 5 K2 is 2 c is 0.96
K1 is 5 K2 is 2 c is 0.98
K1 is 5 K2 is 3 c is 0.80
K1 is 5 K2 is 3 c is 0.82
K1 is 5 K2 is 3 c is 0.84
K1 is 5 K2 is 3 c is 0.86
K1 is 5 K2 is 3 c is 0.88
K1 is 5 K2 is 3 c is 0.90
K1 is 5 K2 is 3 c is 0.92
K1 is 5 K2 is 3 c is 0.94
K1 is 5 K2 is 3 c is 0.96
K1 is 5 K2 is 3 c is 0.98
K1 is 5 K2 is 4 c is 0.80
K1 is 5 K2 is 4 c is 0.82
K1 is 5 K2 is 4 c is 0.84
K1 is 5 K2 is 4 c is 0.86
K1 is 5 K2 is 4 c is 0.88
K1 is 5 K2 is 4 c is 0.90
K1 is 5 K2 is 4 c is 0.92
K1 is 5 K2 is 4 c is 0.94
K1 is 5 K2 is 4 c is 0.96
K1 is 5 K2 is 4 c is 0.98
K1 is 5 K2 is 5 c is 0.80
K1 is 5 K2 is 5 c is 0.82
K1 is 5 K2 is 5 c is 0.84
K1 is 5 K2 is 5 c is 0.86
K1 is 5 K2 is 5 c is 0.88
K1 is 5 K2 is 5 c is 0.90
K1 is 5 K2 is 5 c is 0.92
K1 is 5 K2 is 5 c is 0.94
K1 is 5 K2 is 5 c is 0.96
K1 is 5 K2 is 5 c is 0.98
K1 is 5 K2 is 6 c is 0.80
K1 is 5 K2 is 6 c is 0.82
K1 is 5 K2 is 6 c is 0.84
K1 is 5 K2 is 6 c is 0.86
K1 is 5 K2 is 6 c is 0.88
K1 is 5 K2 is 6 c is 0.90
K1 is 5 K2 is 6 c is 0.92
K1 is 5 K2 is 6 c is 0.94
K1 is 5 K2 is 6 c is 0.96
K1 is 5 K2 is 6 c is 0.98
K1 is 6 K2 is 2 c is 0.80
K1 is 6 K2 is 2 c is 0.82
K1 is 6 K2 is 2 c is 0.84
K1 is 6 K2 is 2 c is 0.86
K1 is 6 K2 is 2 c is 0.88
K1 is 6 K2 is 2 c is 0.90
K1 is 6 K2 is 2 c is 0.92
K1 is 6 K2 is 2 c is 0.94
K1 is 6 K2 is 2 c is 0.96
K1 is 6 K2 is 2 c is 0.98
K1 is 6 K2 is 3 c is 0.80
K1 is 6 K2 is 3 c is 0.82
K1 is 6 K2 is 3 c is 0.84
K1 is 6 K2 is 3 c is 0.86
K1 is 6 K2 is 3 c is 0.88
K1 is 6 K2 is 3 c is 0.90
K1 is 6 K2 is 3 c is 0.92
K1 is 6 K2 is 3 c is 0.94
K1 is 6 K2 is 3 c is 0.96
K1 is 6 K2 is 3 c is 0.98
K1 is 6 K2 is 4 c is 0.80
K1 is 6 K2 is 4 c is 0.82
K1 is 6 K2 is 4 c is 0.84
K1 is 6 K2 is 4 c is 0.86
K1 is 6 K2 is 4 c is 0.88
K1 is 6 K2 is 4 c is 0.90
K1 is 6 K2 is 4 c is 0.92
K1 is 6 K2 is 4 c is 0.94
K1 is 6 K2 is 4 c is 0.96
K1 is 6 K2 is 4 c is 0.98
K1 is 6 K2 is 5 c is 0.80
K1 is 6 K2 is 5 c is 0.82
K1 is 6 K2 is 5 c is 0.84
K1 is 6 K2 is 5 c is 0.86
K1 is 6 K2 is 5 c is 0.88
K1 is 6 K2 is 5 c is 0.90
K1 is 6 K2 is 5 c is 0.92
K1 is 6 K2 is 5 c is 0.94
K1 is 6 K2 is 5 c is 0.96
K1 is 6 K2 is 5 c is 0.98
K1 is 6 K2 is 6 c is 0.80
K1 is 6 K2 is 6 c is 0.82
K1 is 6 K2 is 6 c is 0.84
K1 is 6 K2 is 6 c is 0.86
K1 is 6 K2 is 6 c is 0.88
K1 is 6 K2 is 6 c is 0.90
K1 is 6 K2 is 6 c is 0.92
K1 is 6 K2 is 6 c is 0.94
K1 is 6 K2 is 6 c is 0.96
K1 is 6 K2 is 6 c is 0.98
Histogram of number of reference contigs
8 +------------------------------------------------------------------------------------------------------------+
| * * + + + + + + + |
| * * 'plot.kopt.data' using (bin($1,binwidth)):(1.0) ******* |
7 |*** ** +-|
|*** ** |
|*** ** |
6 |***** ** * +-|
|***** ** * |
|***** ** * |
5 |***** ***** * * * ** +-|
|***** ***** * * * ** |
4 |***** ***** * ***** ***** +-|
|***** ***** * ***** ***** |
|***** ***** * ***** ***** |
3 |****** ******* ******** ***** *** * * +-|
|****** ******* ******** ***** *** * * |
|****** ******* ******** ***** *** * * |
2 |****** ******** ********** *********** ** ********* ** +-|
|****** ******** ********** ********* * ** ********* ** |
|****** ******** ********** ********* * ** ********* ** |
1 |*************************************** *********************************************************** +-|
|*************************************** ***** * ****************** * **** * * * * |
|*************************************** ***** * ****************** + * + **** * *+ * * |
0 +------------------------------------------------------------------------------------------------------------+
15000 20000 25000 30000 35000 40000 45000 50000 55000
Number of reference contigs
Average contig number = 25114
The top three most common number of contigs
X Contig number
2 16127
2 15503
1 51276
The top three most common number of contigs (with values rounded)
X Contig number
6 15800
5 16800
5 15200
The plot shows 0.9 as the optimal similarity threshold.
- RefMapOpt.sh
RefMapOpt.sh 2 6 2 6 0.9 PE 20
Used the mapping.results to compare values. We have to pick optimal k1,k2 cutoffs. Ideally, you want to maximize properly paired mappings and number of contigs while minimizing mismatched reads.
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM Mapped SUM Properly Mean Mapped Mean Properly MisMatched
98.2696 155.183 47633 29719 2 2 70214603 63625835 4.68097e+06 4.24172e+06 164558
145.521 192.78 31099 23272.4 2 3 67885501 61904868 4.5257e+06 4.12699e+06 116478
175.186 212.939 24689 20193 2 4 64880128 59927449 4.32534e+06 3.99516e+06 77087.4
198.334 228.68 20280 17511.3 2 5 60336126 55456066 4.02241e+06 3.69707e+06 90376.3
218.948 243.525 16823 15074.7 2 6 55253611 51240341 3.68357e+06 3.41602e+06 45771.6
128.898 189.021 36511 24709 3 2 70595010 64493357 4706334 4.29956e+06 134345
159.81 207.222 28177 21605.6 3 3 67546809 61530161 4.50312e+06 4.10201e+06 119537
182.325 220.097 23429 19320.6 3 4 64078229 58592299 4.27188e+06 3.90615e+06 116625
205.33 235.896 19461 16877.4 3 5 59941970 55216232 3.99613e+06 3.68108e+06 81705.3
224.496 249.18 16198 14552.2 3 6 54549227 50362975 3.63662e+06 3.35753e+06 56075.1
132.589 193.627 35354 24042.6 4 2 70315301 64103940 4.68769e+06 4273596 138974
162.541 210.089 27562 21207.3 4 3 67201915 61276951 4.48013e+06 4.08513e+06 126061
186.141 224.259 22978 18988.8 4 4 64159865 58770869 4.27732e+06 3.91806e+06 111165
208.017 238.534 19067 16569.6 4 5 59496810 54478379 3966454 3.63189e+06 90448.1
227.155 251.868 15856 14261.7 4 6 54030051 49858645 3.602e+06 3.32391e+06 54826.3
134.182 195.396 34859 23774.8 5 2 70163844 63618028 4.67759e+06 4.2412e+06 146547
164.281 211.924 27162 20942.1 5 3 66935536 61025545 4.46237e+06 4.06837e+06 123960
188.826 226.936 22572 18700.7 5 4 63935453 58827777 4.26236e+06 3.92185e+06 94486.3
209.67 240.056 18738 16309.8 5 5 58935223 54142343 3.92901e+06 3.60949e+06 72980.3
230.211 254.952 15564 14015.5 5 6 53748447 49230968 3.58323e+06 3.28206e+06 54917.5
135.847 197.179 34379 23525.5 6 2 70056427 63833654 4.67043e+06 4.25558e+06 127198
166.401 213.862 26829 20762.7 6 3 66968079 61438385 4.46454e+06 4.09589e+06 111078
190.838 228.795 22231 18463.8 6 4 63640815 58681873 4242721 3.91212e+06 79363.3
212.021 242.331 18463 16098.6 6 5 58721215 53718516 3.91475e+06 3.58123e+06 87104.8
232.112 256.639 15250 13755.3 6 6 53099038 48668830 3.53994e+06 3.24459e+06 50503.3
❯ sort -k 10 -n -r mapping.results | column -t -s $'\t'
132.589 193.627 35354 24042.6 4 2 70315301 64103940 4.68769e+06 4273596 138974
128.898 189.021 36511 24709 3 2 70595010 64493357 4706334 4.29956e+06 134345
135.847 197.179 34379 23525.5 6 2 70056427 63833654 4.67043e+06 4.25558e+06 127198
98.2696 155.183 47633 29719 2 2 70214603 63625835 4.68097e+06 4.24172e+06 164558
134.182 195.396 34859 23774.8 5 2 70163844 63618028 4.67759e+06 4.2412e+06 146547
145.521 192.78 31099 23272.4 2 3 67885501 61904868 4.5257e+06 4.12699e+06 116478
159.81 207.222 28177 21605.6 3 3 67546809 61530161 4.50312e+06 4.10201e+06 119537
166.401 213.862 26829 20762.7 6 3 66968079 61438385 4.46454e+06 4.09589e+06 111078
162.541 210.089 27562 21207.3 4 3 67201915 61276951 4.48013e+06 4.08513e+06 126061
164.281 211.924 27162 20942.1 5 3 66935536 61025545 4.46237e+06 4.06837e+06 123960
175.186 212.939 24689 20193 2 4 64880128 59927449 4.32534e+06 3.99516e+06 77087.4
188.826 226.936 22572 18700.7 5 4 63935453 58827777 4.26236e+06 3.92185e+06 94486.3
186.141 224.259 22978 18988.8 4 4 64159865 58770869 4.27732e+06 3.91806e+06 111165
190.838 228.795 22231 18463.8 6 4 63640815 58681873 4242721 3.91212e+06 79363.3
182.325 220.097 23429 19320.6 3 4 64078229 58592299 4.27188e+06 3.90615e+06 116625
198.334 228.68 20280 17511.3 2 5 60336126 55456066 4.02241e+06 3.69707e+06 90376.3
205.33 235.896 19461 16877.4 3 5 59941970 55216232 3.99613e+06 3.68108e+06 81705.3
208.017 238.534 19067 16569.6 4 5 59496810 54478379 3966454 3.63189e+06 90448.1
209.67 240.056 18738 16309.8 5 5 58935223 54142343 3.92901e+06 3.60949e+06 72980.3
212.021 242.331 18463 16098.6 6 5 58721215 53718516 3.91475e+06 3.58123e+06 87104.8
218.948 243.525 16823 15074.7 2 6 55253611 51240341 3.68357e+06 3.41602e+06 45771.6
224.496 249.18 16198 14552.2 3 6 54549227 50362975 3.63662e+06 3.35753e+06 56075.1
227.155 251.868 15856 14261.7 4 6 54030051 49858645 3.602e+06 3.32391e+06 54826.3
230.211 254.952 15564 14015.5 5 6 53748447 49230968 3.58323e+06 3.28206e+06 54917.5
232.112 256.639 15250 13755.3 6 6 53099038 48668830 3.53994e+06 3.24459e+06 50503.3
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM Mapped SUM Properly Mean Mapped Mean Properly MisMatched
Comparing top choices:
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM Mapped SUM Properly Mean Mapped Mean Properly MisMatched
128.898 189.021 36511 24709 3 2 70595010 64493357 4706334 4.29956e+06 134345
135.847 197.179 34379 23525.5 6 2 70056427 63833654 4.67043e+06 4.25558e+06 127198
98.2696 155.183 47633 29719 2 2 70214603 63625835 4.68097e+06 4.24172e+06 164558
134.182 195.396 34859 23774.8 5 2 70163844 63618028 4.67759e+06 4.2412e+06 146547
With K1=3 and K2=2 we have the most mean properly paired mappings, second highest number of contigs and the fewer mismatched reads. So K1 =3 and K2 = 2 were chosen for reference assembly. These parameters were used for the reference assembly that was built using dDocent.
cat config.file
Number of Processors [15/4491]
20
Maximum Memory
0
Trimming
no
Assembly?
yes
Type_of_Assembly
PE
Clustering_Similarity%
0.9
Minimum within individual coverage level to include a read for assembly (K1)
3
Minimum number of individuals a read must be present in to include for assembly (K2)
2
Mapping_Reads?
no
Mapping_Match_Value
Mapping_MisMatch_Value
Mapping_GapOpen_Penalty
Calling_SNPs?
no
Email
xx@uri.edu
####Filtering reference assembly for coral contigs
Blast the reference assembly with a database made from 3 coral genomes (Pocillopora acuta, P. verrucosa and P. damicornis).
blastn -db /home/tejashree/Moorea/genomes/coral -query /home/tejashree/Moorea/ddocent/mod_pipeline/species_diversity/reference_assembly/reference.fasta -evalue 0.001 -outfmt 6 -out /home/tejashree/Moorea/blast/mod_pipeline/species_diversity/coral_hits -num_threads 20
Get unique hits
cut -f1 coral_hits| uniq > uniq_coral_hits.txt
Filter reference assembly to keep only those contigs that mapped to atleast one coral genome. These are listed in the uniq_coral_hits.txt. I wrote a script subset_fasta.py to subset the reference assembly that is run as following:
python3 subset_fasta.py -i /home/tejashree/Moorea/ddocent/mod_pipeline/species_diversity/reference_assembly/reference.fasta -f /home/tejashree/Moorea/blast/mod_pipeline/species_diversity/uniq_coral_hits.txt -o /home/tejashree/Moorea/ddocent/mod_pipeline/species_diversity/reference_assembly/coral_reference.fasta
This was moved to species_diversity and renamed as reference.fasta since dDocent will look for the specific file name.
cp reference_assembly/coral_reference.fasta .
mv coral_reference.fasta reference.fasta
Step3: Mapping reads to filtered reference:
I created symlinks for all samples to here and used dDocent for trimming and mapping to reference.
❯ cat config.file
Number of Processors
20
Maximum Memory
0
Trimming
yes
Assembly?
no
Type_of_Assembly
PE
Clustering_Similarity%
0.9
Minimum within individual coverage level to include a read for assembly (K1)
3
Minimum number of individuals a read must be present in to include for assembly (K2)
2
Mapping_Reads?
yes
Mapping_Match_Value
1
Mapping_MisMatch_Value
3
Mapping_GapOpen_Penalty
5
Calling_SNPs?
yes
Email
xxx@uri.edu
Step4:SNP calling and filtering for ddRAD samples
SNP calling Housed in dir mod_pipeline/species_diversity/ddr_snp_calling Created symlinks from mapping step for the the files required for SNP calling to avoid clutter. Created a config.file to only call SNPs using dDocent.
❯ cat config.file
Number of Processors
20
Maximum Memory
0
Trimming
no
Assembly?
no
Type_of_Assembly
Clustering_Similarity%
Minimum within individual coverage level to include a read for assembly (K1)
Minimum number of individuals a read must be present in to include for assembly (K2)
Mapping_Reads?
no
Mapping_Match_Value
Mapping_MisMatch_Value
Mapping_GapOpen_Penalty
Calling_SNPs?
yes
Email
xxx@uri.edu
SNP filtering
Housed in /home/tejashree/Moorea/ddocent/mod_pipeline/species_diversity/ddr_snp_calling/snp_filtering
Generated a symlink to the final vcf output from SNP calling:
ln -s ../TotalRawSNPs.vcf
- Step1: 50% of individuals, a minimum quality score of 30, and a minor allele count of 3
vcftools --vcf TotalRawSNPs.vcf --max-missing 0.5 --mac 3 --minQ 30 --recode --recode-INFO-all --out raw.g5mac3
After filtering, kept 56 out of 56 Individuals
Outputting VCF file...
After filtering, kept 205536 out of a possible 625287 Sites
- Step2: Minimum mean depth: minDP 5
vcftools --vcf raw.g5mac3.recode.vcf --minDP 5 --recode --recode-INFO-all --out raw.g5mac3dp5
After filtering, kept 56 out of 56 Individuals
Outputting VCF file...
After filtering, kept 205536 out of a possible 205536 Sites
- Step3: Filter missing indiv post minDP =5
./filter_missing_ind.sh raw.g5mac3dp5.recode.vcf raw.g5mac3dplm
After filtering, kept 205536 out of a possible 205536 Sites
Run Time = 3.00 seconds
Histogram of % missing data per individual
14 +---------------------------------------------------------------------------------------------------------+
| + + + + + + |
| ******** 'totalmissing' using (bin($1,binwidth)):(1.0) ******* |
12 |-+ * * +-|
| * * |
| * * |
| * * |
10 |-+ * * +-|
| ******** * * |
| * * * * |
8 |-+ * * * * +-|
| * * * * |
| * * * * |
6 |-+ ********* * * * ******** +-|
| * * * * * * * |
| * * * * ********* * * |
4 |-+ * * ********* * ********* * +-|
| * * * * * * * * |
| ******** * * * * * * ********* |
| * * * * * * * * * * |
2 |-+ * * * * * * * * * ************ +-|
| * * * * * * * * * * ********* |
| * * * * * * * * * * * + * |
0 +---------------------------------------------------------------------------------------------------------+
0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.26
% of missing data
The 85% cutoff would be 0.205789
Would you like to set a different cutoff, yes or no
yes
Please enter new cutoff
0.25
All individuals with more than 25.0% missing data will be removed.
After filtering, kept 56 out of 56 Individuals
Outputting VCF file...
After filtering, kept 205536 out of a possible 205536 Sites
mawk '$5 > 0.25' raw.g5mac3dplm.imiss | cut -f1 | less
INDV
- Step4: Restrict the data to variants called in a high percentage of individuals and filter by mean depth of genotypes
This applied a genotype call rate (95%) across all individuals. Setting maf = 0.001
vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.001 --recode --recode-INFO-all --out DP3g95maf001 --min-meanDP 20
After filtering, kept 56 out of 56 Individuals
Outputting VCF file...
After filtering, kept 90254 out of a possible 205536 Sites
- Step5: Filtering by population specific call rate when multiple localities are present
ln -s ../popmap .
mawk '$2 == "EOB"' popmap > 1.keep && mawk '$2 == "PBF"' popmap > 2.keep && mawk '$2 == "WOB"' popmap > 3.keep && mawk '$2 == "WOF"' popmap > 4.keep
vcftools --vcf DP3g95maf001.recode.vcf --keep 1.keep --missing-site --out 1
vcftools --vcf DP3g95maf001.recode.vcf --keep 2.keep --missing-site --out 2
vcftools --vcf DP3g95maf001.recode.vcf --keep 3.keep --missing-site --out 3
vcftools --vcf DP3g95maf001.recode.vcf --keep 4.keep --missing-site --out 4
cat 1.lmiss 2.lmiss 3.lmiss 4.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci
vcftools --vcf DP3g95maf001.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf001
After filtering, kept 56 out of 56 Individuals
Outputting VCF file...
After filtering, kept 87616 out of a possible 90254 Sites
- Step6: Used dDocent_filters script
./dDocent_filters DP3g95p5maf001.recode.vcf dDocent_filters_out
This script will automatically filter a FreeBayes generated VCF file using criteria related to site depth,
quality versus depth, strand representation, allelic balance at heterzygous individuals, and paired read representation.
The script assumes that loci and individuals with low call rates (or depth) have already been removed.
Contact Jon Puritz (jpuritz@gmail.com) for questions and see script comments for more details on particular filters
Number of sites filtered based on allele balance at heterozygous loci, locus quality, and mapping quality / Depth
11141 of 87616
Are reads expected to overlap? In other words, is fragment size less than 2X the read length? Enter yes or no.
no
Number of additional sites filtered based on overlapping forward and reverse reads
16957 of 76475
Is this from a mixture of SE and PE libraries? Enter yes or no.
no
Number of additional sites filtered based on properly paired status
1507 of 59518
Number of sites filtered based on high depth and lower than 2*DEPTH quality score
5618 of 58011
Histogram of mean depth per site
900 +---------------------------------------------------------------------------------------------------------+
|+ + + + + + + + + + + + + + + + + + + + +|
| ** 'meandepthpersite' using (bin($1,binwidth)):(1.0) ******* |
800 |-+ *** *** +-|
| ******** |
700 |-+ ********* * +-|
| ** ********** * |
| ** ************** |
600 |-+ ** ************** +-|
| ******************** |
500 |-+ ********************** +-|
| ********************** |
| ************************ |
400 |-+ ************************** ** +-|
| ****************************** * |
300 |-+ ************************************** ** +-|
| * ************************************** ** |
| ********************************************* * * |
200 |-+ *************************************************** * * +-|
| *********************************************************** |
100 |-+ ************************************************************ ** +-|
| ******************************************************************* ***** * * * * |
|+ + + ************************************************************************************ ********|
0 +---------------------------------------------------------------------------------------------------------+
12 24 36 48 60 72 84 96 108 120 132 144 156 168 180 192 204 216 228 240 252
Mean Depth
If distrubtion looks normal, a 1.645 sigma cutoff (~90% of the data) would be 13239.1453
The 95% cutoff would be 232
Would you like to use a different maximum mean depth cutoff than 232, yes or no
no
Number of sites filtered based on maximum mean depth
3520 of 58011
Number of sites filtered based on within locus depth mismatch
26 of 54491
Total number of sites filtered
33151 of 87616
Remaining sites
54465
Filtered VCF file is called dDocent_filters_out.FIL.recode.vcf
Filter stats stored in dDocent_filters_out.filterstats
- Step7: Convert our variant calls to SNPs
vcfallelicprimitives dDocent_filters_out.FIL.recode.vcf --keep-info --keep-geno > DP3g95p5maf001.prim.vcf
vcftools --vcf DP3g95p5maf001.prim.vcf --remove-indels --recode --recode-INFO-all --out SNP.DP3g95p5maf001
After filtering, kept 56 out of 56 Individuals
Outputting VCF file...
After filtering, kept 58541 out of a possible 61467 Sites
- Step8: HWE filter
Setting -h 0.001
./filter_hwe_by_pop.pl -v SNP.DP3g95p5maf001.recode.vcf -p popmap -o SNP.DP3g95p5maf001.HWE -h 0.001
Processing population: EOB (17 inds)
Processing population: PBF (14 inds)
Processing population: WOB (11 inds)
Processing population: WOF (14 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 56824 of a possible 58541 loci (filtered 1717 loci)
- Haplotyping using rad_haplotyper
Made symlinks for bam and their indexes from ddr_snp_calling to snpPfitlering since they are required by rad_haplotyper
rad_haplotyper.pl -v SNP.DP3g95p5maf001.HWE.recode.vcf -x 20 -mp 1 -u 20 -ml 4 -n -r reference.fasta
Removed 146 loci (3541 SNPs) with more than 20 SNPs at a locus
Building haplotypes for EOB_174_ddr
Building haplotypes for EOB_175_ddr
Building haplotypes for EOB_176_ddr
Building haplotypes for EOB_177_ddr
Building haplotypes for EOB_178_ddr
Building haplotypes for EOB_182_ddr
Building haplotypes for EOB_183_ddr
Building haplotypes for EOB_184_ddr
Building haplotypes for EOB_185_ddr
Building haplotypes for EOB_186_ddr
Building haplotypes for EOB_188_ddr
Building haplotypes for EOB_189_ddr
Building haplotypes for EOB_190_ddr
Building haplotypes for EOB_191_ddr
Building haplotypes for EOB_192_ddr
Building haplotypes for EOB_493_ddr
Building haplotypes for EOB_494_ddr
Building haplotypes for PBF_157_ddr
Building haplotypes for PBF_158_ddr
Building haplotypes for PBF_160_ddr
Building haplotypes for PBF_161_ddr
Building haplotypes for PBF_164_ddr
Building haplotypes for PBF_165_ddr
Building haplotypes for PBF_167_ddr
Building haplotypes for PBF_168_ddr
Building haplotypes for PBF_169_ddr
Building haplotypes for PBF_171_ddr
Building haplotypes for PBF_172_ddr
Building haplotypes for PBF_489_ddr
Building haplotypes for PBF_490_ddr
Building haplotypes for PBF_491_ddr
Building haplotypes for WOB_35_ddr
Building haplotypes for WOB_41_ddr
Building haplotypes for WOB_42_ddr
Building haplotypes for WOB_47_ddr
Building haplotypes for WOB_48_ddr
Building haplotypes for WOB_49_ddr
Building haplotypes for WOB_50_ddr
Building haplotypes for WOB_59_ddr
Building haplotypes for WOB_60_ddr
Building haplotypes for WOB_62_ddr
Building haplotypes for WOB_65_ddr
Building haplotypes for WOF_217_ddr
Building haplotypes for WOF_224_ddr
Building haplotypes for WOF_225_ddr
Building haplotypes for WOF_232_ddr
Building haplotypes for WOF_234_ddr
Building haplotypes for WOF_236_ddr
Building haplotypes for WOF_237_ddr
Building haplotypes for WOF_238_ddr
Building haplotypes for WOF_239_ddr
Building haplotypes for WOF_240_ddr
Building haplotypes for WOF_241_ddr
Building haplotypes for WOF_243_ddr
Building haplotypes for WOF_244_ddr
Building haplotypes for WOF_245_ddr
Filtered 11 loci below missing data cutoff
Filtered 913 possible paralogs
Filtered 417 loci with low coverage or genotyping errors
Filtered 0 loci with an excess of haplotypes
head stats.out
rad_haplotyper -v SNP.DP3g95p5maf001.HWE.recode.vcf -x 20 -mp 1 -u 20 -ml 4 -n -r reference.fasta
Locus Sites Haplotypes Inds_Haplotyped Total_Inds Prop_Haplotyped Status Poss_Paralog Low_Cov/Geno_Err Miss_Geno Comment
dDocent_Contig_10000 1 2 56 56 1.000 PASSED 0 0 0
dDocent_Contig_10002 11 6 55 56 0.982 PASSED 0 0 1
dDocent_Contig_10005 2 3 56 56 1.000 PASSED 0 0 0
dDocent_Contig_10008 4 5 56 56 1.000 PASSED 0 0 0
dDocent_Contig_10009 10 9 56 56 1.000 PASSED 0 0 0
dDocent_Contig_10011 4 5 56 56 1.000 PASSED 0 0 0
dDocent_Contig_10016 2 4 56 56 1.000 PASSED 0 0 0
dDocent_Contig_10017 10 14 36 56 0.643 FILTERED 18 0 2 Possible paralog
We can use this file to create a list of loci to filter
grep FILTERED stats.out | mawk '!/Complex/' | cut -f1 > loci.to.remove
Remove bad loci identified by rad_haplotyper Now that we have the list we can parse through the VCF file and remove the bad RAD loci. Use script from dDocent to do this: remove.bad.hap.loci.sh
./remove.bad.hap.loci.sh loci.to.remove SNP.DP3g95p5maf001.HWE.recode.vcf
mawk '!/#/' SNP.DP3g95p5maf001.HWE.filtered.vcf | wc -l
40251
- Handling techinical replicates Used this markdown as reference This data set contains technical replicates. We will use a custom script dup_sample_filter.sh to automatically remove sites in VCF files that do not have congruent genotypes across duplicate individuals. It will only consider genotypes that have at least 5 reads. The techinical reps are listed in file techinical_reps. EOB_492 had less than 400K reads and was filtered out so took that out of this list.
./dup_sample_filter.sh SNP.DP3g95p5maf001.HWE.filtered.vcf technical_reps
head mismatched.loci
4 dDocent_Contig_5321 108
2 dDocent_Contig_8662 75
3 dDocent_Contig_20246 89
2 dDocent_Contig_2543 119
3 dDocent_Contig_17193 21
2 dDocent_Contig_25017 178
3 dDocent_Contig_4016 74
2 dDocent_Contig_22187 137
2 dDocent_Contig_7004 162
2 dDocent_Contig_8454 22
echo -e "Mismatches\tNumber_of_Loci" > mismatch.txt
for i in {2..20}
do
paste <(echo $i) <(mawk -v x=$i '$1 > x' mismatched.loci | wc -l) >> mismatch.txt
done
wc -l bad.loci*
27 bad.loci.EOB_176_ddr.EOB_493_ddr
31 bad.loci.EOB_176_ddr.EOB_494_ddr
54 bad.loci.EOB_493_ddr.EOB_494_ddr
76 bad.loci.PBF_157_ddr.PBF_489_ddr
20 bad.loci.PBF_157_ddr.PBF_490_ddr
14 bad.loci.PBF_157_ddr.PBF_491_ddr
74 bad.loci.PBF_489_ddr.PBF_490_ddr
76 bad.loci.PBF_489_ddr.PBF_491_ddr
22 bad.loci.PBF_490_ddr.PBF_491_ddr
394 total
cat mismatch.txt
Mismatches Number_of_Loci
2 87
3 14
4 2
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
I chose a cut off of 4,to filter out loci with mismatched values > 4 to remove most affected loci.
mawk '$1 > 4' mismatched.loci | cut -f2,3 > mismatchedloci
vcftools --vcf SNP.DP3g95p5maf001.HWE.filtered.vcf --exclude-positions mismatchedloci --recode --recode-INFO-all --out SNP.DP3g95p5maf001.HWE.filtered.MM
After filtering, kept 56 out of 56 Individuals
Outputting VCF file...
After filtering, kept 40249 out of a possible 40251 Sites
Step 5: Cluster identification
The final vcf file was used in script identify_clusters.R to identify putative species clusters using DAPC.
> grps <- find.clusters(SNPs) #45 PCs and k=5 clusters
Choose the number PCs to retain (>= 1):
45
Choose the number of clusters (>=2:
5
> names(grps)
[1] "Kstat" "stat" "grp" "size"
> grps$Kstat
K=1 K=2 K=3 K=4 K=5 K=6
581.0997 571.7620 566.6947 564.7860 563.6012 564.6559
> grps$grp
EOB_174_ddr EOB_175_ddr EOB_176_ddr EOB_177_ddr EOB_178_ddr EOB_182_ddr EOB_183_ddr EOB_184_ddr EOB_185_ddr EOB_186_ddr
1 1 2 3 3 3 3 3 1 3
EOB_188_ddr EOB_189_ddr EOB_190_ddr EOB_191_ddr EOB_192_ddr EOB_493_ddr EOB_494_ddr PBF_157_ddr PBF_158_ddr PBF_160_ddr
3 1 1 3 1 2 2 4 1 3
PBF_161_ddr PBF_164_ddr PBF_165_ddr PBF_167_ddr PBF_168_ddr PBF_169_ddr PBF_171_ddr PBF_172_ddr PBF_489_ddr PBF_490_ddr
1 1 5 5 5 1 1 1 4 4
PBF_491_ddr WOB_35_ddr WOB_41_ddr WOB_42_ddr WOB_47_ddr WOB_48_ddr WOB_49_ddr WOB_50_ddr WOB_59_ddr WOB_60_ddr
4 1 3 3 3 3 3 3 3 3
WOB_62_ddr WOB_65_ddr WOF_217_ddr WOF_224_ddr WOF_225_ddr WOF_232_ddr WOF_234_ddr WOF_236_ddr WOF_237_ddr WOF_238_ddr
3 3 5 1 1 1 1 1 1 1
WOF_239_ddr WOF_240_ddr WOF_241_ddr WOF_243_ddr WOF_244_ddr WOF_245_ddr
1 1 1 1 1 1
Levels: 1 2 3 4 5
> dapc <- dapc(SNPs, grps$grp)# 20 PCs and 2 Discriminant functions to retain
Choose the number PCs to retain (>=1):
20
Choose the number discrmod_pipeline/cluster_variation/iminant functions to retain (>=1):
2
Thus according to the above DAPC result we have 5 clusters in total out of which clusters 1 and 3 have the highest membership. These will be further used to estimate genetic and epigenetic variation in STEP 6. Below are the compositions of the 2 clusters. Cluster 2 has only one PBF sample. To keep the comparison between east vs west back reef, fringe sample PBF160 will be removed from this cluster. Cluster
- Cluster 1: EOB_174_ddr,EOB_175_ddr,EOB_185_ddr,EOB_189_ddr,EOB_190_ddr,EOB_192_ddr,PBF_158_ddr,PBF_161_ddr,PBF_164_ddr,PBF_169_ddr,PBF_171_ddr,PBF_172_ddr,WOB_35_ddr,WOF_224_ddr WOF_225_ddr,WOF_232_ddr,WOF_234_ddr,WOF_236_ddr,WOF_237_ddr,WOF_238_ddr,WOF_239_ddr,WOF_240_ddr,WOF_241_ddr,WOF_243_ddr,WOF_244_ddr,WOF_245_ddr.
- Cluster 2: EOB_177_ddr EOB_178_ddr EOB_182_ddr EOB_183_ddr EOB_184_ddr,EOB_186_ddr,EOB_188_ddr,EOB_191_ddr,WOB_41_ddr,WOB_42_ddr,WOB_47_ddr,WOB_48_ddr,WOB_49_ddr,WOB_50_ddr,WOB_59_ddr,WOB_60_ddr,WOB_62_ddr,WOB_65_ddr.
Step 6: Determining genetic and epigenetic variation per species cluster
Housed in: mod_pipeline/cluster_variation/ Analysis for cluster1: mod_pipeline/cluster_variation/clust1 Analysis for cluster2: mod_pipeline/cluster_variation/clust2
Step 7: Reference assembly per cluster
Housed in: Analysis for cluster1: mod_pipeline/cluster_variation/clust1/reference_assembly/ Analysis for cluster2: mod_pipeline/cluster_variation/clust2/reference_assembly/ A reference set of up to 5 samples per location were used to build the reference assembly. Parameter optimization was performed as before.
Cluster 1:
ReferenceOpt.sh 2 6 2 6 PE 20
All required software is installed!
dDocent ReferenceOpt version 2.7.8
K1 is 2 K2 is 2 c is 0.80
K1 is 2 K2 is 2 c is 0.82
K1 is 2 K2 is 2 c is 0.84
K1 is 2 K2 is 2 c is 0.86
K1 is 2 K2 is 2 c is 0.88
K1 is 2 K2 is 2 c is 0.90
K1 is 2 K2 is 2 c is 0.92
K1 is 2 K2 is 2 c is 0.94
K1 is 2 K2 is 2 c is 0.96
K1 is 2 K2 is 2 c is 0.98
K1 is 2 K2 is 3 c is 0.80
K1 is 2 K2 is 3 c is 0.82
K1 is 2 K2 is 3 c is 0.84
K1 is 2 K2 is 3 c is 0.86
K1 is 2 K2 is 3 c is 0.88
K1 is 2 K2 is 3 c is 0.90
K1 is 2 K2 is 3 c is 0.92
K1 is 2 K2 is 3 c is 0.94
K1 is 2 K2 is 3 c is 0.96
K1 is 2 K2 is 3 c is 0.98
K1 is 2 K2 is 4 c is 0.80
K1 is 2 K2 is 4 c is 0.82
K1 is 2 K2 is 4 c is 0.84
K1 is 2 K2 is 4 c is 0.86
K1 is 2 K2 is 4 c is 0.88
K1 is 2 K2 is 4 c is 0.90
K1 is 2 K2 is 4 c is 0.92
K1 is 2 K2 is 4 c is 0.94
K1 is 2 K2 is 4 c is 0.96
K1 is 2 K2 is 4 c is 0.98
K1 is 2 K2 is 5 c is 0.80
K1 is 2 K2 is 5 c is 0.82
K1 is 2 K2 is 5 c is 0.84
K1 is 2 K2 is 5 c is 0.86
K1 is 2 K2 is 5 c is 0.88
K1 is 2 K2 is 5 c is 0.90
K1 is 2 K2 is 5 c is 0.92
K1 is 2 K2 is 5 c is 0.94
K1 is 2 K2 is 5 c is 0.96
K1 is 2 K2 is 5 c is 0.98
K1 is 2 K2 is 6 c is 0.80
K1 is 2 K2 is 6 c is 0.82
K1 is 2 K2 is 6 c is 0.84
K1 is 2 K2 is 6 c is 0.86
K1 is 2 K2 is 6 c is 0.88
K1 is 2 K2 is 6 c is 0.90
K1 is 2 K2 is 6 c is 0.92
K1 is 2 K2 is 6 c is 0.94
K1 is 2 K2 is 6 c is 0.96
K1 is 2 K2 is 6 c is 0.98
K1 is 3 K2 is 2 c is 0.80
K1 is 3 K2 is 2 c is 0.82
K1 is 3 K2 is 2 c is 0.84
K1 is 3 K2 is 2 c is 0.86
K1 is 3 K2 is 2 c is 0.88
K1 is 3 K2 is 2 c is 0.90
K1 is 3 K2 is 2 c is 0.92
K1 is 3 K2 is 2 c is 0.94
K1 is 3 K2 is 2 c is 0.96
K1 is 3 K2 is 2 c is 0.98
K1 is 3 K2 is 3 c is 0.80
K1 is 3 K2 is 3 c is 0.82
K1 is 3 K2 is 3 c is 0.84
K1 is 3 K2 is 3 c is 0.86
K1 is 3 K2 is 3 c is 0.88
K1 is 3 K2 is 2 c is 0.90
K1 is 3 K2 is 2 c is 0.92
K1 is 3 K2 is 2 c is 0.94
K1 is 3 K2 is 2 c is 0.96
K1 is 3 K2 is 2 c is 0.98
K1 is 3 K2 is 3 c is 0.80
K1 is 3 K2 is 3 c is 0.82
K1 is 3 K2 is 3 c is 0.84
K1 is 3 K2 is 3 c is 0.86
K1 is 3 K2 is 3 c is 0.88
K1 is 3 K2 is 3 c is 0.90
K1 is 3 K2 is 3 c is 0.92
K1 is 3 K2 is 3 c is 0.94
K1 is 3 K2 is 3 c is 0.96
K1 is 3 K2 is 3 c is 0.98
K1 is 3 K2 is 4 c is 0.80
K1 is 3 K2 is 4 c is 0.82
K1 is 3 K2 is 4 c is 0.84
K1 is 3 K2 is 4 c is 0.86
K1 is 3 K2 is 4 c is 0.88
K1 is 3 K2 is 4 c is 0.90
K1 is 3 K2 is 4 c is 0.92
K1 is 3 K2 is 4 c is 0.94
K1 is 3 K2 is 4 c is 0.96
K1 is 3 K2 is 4 c is 0.98
K1 is 3 K2 is 5 c is 0.80
K1 is 3 K2 is 5 c is 0.82
K1 is 3 K2 is 5 c is 0.84
K1 is 3 K2 is 5 c is 0.86
K1 is 3 K2 is 5 c is 0.88
K1 is 3 K2 is 5 c is 0.90
K1 is 3 K2 is 5 c is 0.92
K1 is 3 K2 is 5 c is 0.94
K1 is 3 K2 is 5 c is 0.96
K1 is 3 K2 is 5 c is 0.98
K1 is 3 K2 is 6 c is 0.80
K1 is 3 K2 is 6 c is 0.82
K1 is 3 K2 is 6 c is 0.84
K1 is 3 K2 is 6 c is 0.86
K1 is 3 K2 is 6 c is 0.88
K1 is 3 K2 is 6 c is 0.90
K1 is 3 K2 is 6 c is 0.92
K1 is 3 K2 is 6 c is 0.94
K1 is 3 K2 is 6 c is 0.96
K1 is 3 K2 is 6 c is 0.98
K1 is 4 K2 is 2 c is 0.80
K1 is 4 K2 is 2 c is 0.82
K1 is 4 K2 is 2 c is 0.84
K1 is 4 K2 is 2 c is 0.86
K1 is 4 K2 is 2 c is 0.88
K1 is 4 K2 is 2 c is 0.90
K1 is 4 K2 is 2 c is 0.92
K1 is 4 K2 is 2 c is 0.94
K1 is 4 K2 is 2 c is 0.96
K1 is 4 K2 is 2 c is 0.98
K1 is 4 K2 is 3 c is 0.80
K1 is 4 K2 is 3 c is 0.82
K1 is 4 K2 is 3 c is 0.84
K1 is 4 K2 is 3 c is 0.86
K1 is 4 K2 is 3 c is 0.88
K1 is 4 K2 is 3 c is 0.90
K1 is 4 K2 is 3 c is 0.92
K1 is 4 K2 is 3 c is 0.94
K1 is 4 K2 is 3 c is 0.96
K1 is 4 K2 is 3 c is 0.98
K1 is 4 K2 is 4 c is 0.80
K1 is 4 K2 is 4 c is 0.82
K1 is 4 K2 is 4 c is 0.84
K1 is 4 K2 is 4 c is 0.86
K1 is 4 K2 is 4 c is 0.88
K1 is 4 K2 is 4 c is 0.90
K1 is 4 K2 is 4 c is 0.92
K1 is 4 K2 is 4 c is 0.94
K1 is 4 K2 is 4 c is 0.96
K1 is 4 K2 is 4 c is 0.98
K1 is 4 K2 is 5 c is 0.80
K1 is 4 K2 is 5 c is 0.82
K1 is 4 K2 is 5 c is 0.84
K1 is 4 K2 is 5 c is 0.86
K1 is 4 K2 is 5 c is 0.88
K1 is 4 K2 is 5 c is 0.90
K1 is 4 K2 is 5 c is 0.92
K1 is 4 K2 is 5 c is 0.94
K1 is 4 K2 is 5 c is 0.96
K1 is 4 K2 is 5 c is 0.98
K1 is 4 K2 is 6 c is 0.80
K1 is 4 K2 is 6 c is 0.82
K1 is 4 K2 is 6 c is 0.84
K1 is 4 K2 is 6 c is 0.86
K1 is 4 K2 is 6 c is 0.88
K1 is 4 K2 is 6 c is 0.90
K1 is 4 K2 is 6 c is 0.92
K1 is 4 K2 is 6 c is 0.94
K1 is 4 K2 is 6 c is 0.96
K1 is 4 K2 is 6 c is 0.98
K1 is 5 K2 is 2 c is 0.80
K1 is 5 K2 is 2 c is 0.82
K1 is 5 K2 is 2 c is 0.84
K1 is 5 K2 is 2 c is 0.86
K1 is 5 K2 is 2 c is 0.88
K1 is 5 K2 is 2 c is 0.90
K1 is 5 K2 is 2 c is 0.92
K1 is 5 K2 is 2 c is 0.94
K1 is 5 K2 is 2 c is 0.96
K1 is 5 K2 is 2 c is 0.98
K1 is 5 K2 is 3 c is 0.80
K1 is 5 K2 is 3 c is 0.82
K1 is 5 K2 is 3 c is 0.84
K1 is 5 K2 is 3 c is 0.86
K1 is 5 K2 is 3 c is 0.88
K1 is 5 K2 is 3 c is 0.90
K1 is 5 K2 is 3 c is 0.92
K1 is 5 K2 is 3 c is 0.94
K1 is 5 K2 is 3 c is 0.96
K1 is 5 K2 is 3 c is 0.98
K1 is 5 K2 is 4 c is 0.80
K1 is 5 K2 is 4 c is 0.82
K1 is 5 K2 is 4 c is 0.84
K1 is 5 K2 is 4 c is 0.86
K1 is 5 K2 is 4 c is 0.88
K1 is 5 K2 is 4 c is 0.90
K1 is 5 K2 is 4 c is 0.92
K1 is 5 K2 is 4 c is 0.94
K1 is 5 K2 is 4 c is 0.96
K1 is 5 K2 is 4 c is 0.98
K1 is 5 K2 is 5 c is 0.80
K1 is 5 K2 is 5 c is 0.82
K1 is 5 K2 is 5 c is 0.84
K1 is 5 K2 is 5 c is 0.86
K1 is 5 K2 is 5 c is 0.88
K1 is 5 K2 is 5 c is 0.90
K1 is 5 K2 is 5 c is 0.92
K1 is 5 K2 is 5 c is 0.94
K1 is 5 K2 is 5 c is 0.96
K1 is 5 K2 is 5 c is 0.98
K1 is 5 K2 is 6 c is 0.80
K1 is 5 K2 is 6 c is 0.82
K1 is 5 K2 is 6 c is 0.84
K1 is 5 K2 is 6 c is 0.86
K1 is 5 K2 is 6 c is 0.88
K1 is 5 K2 is 6 c is 0.90
K1 is 5 K2 is 6 c is 0.92
K1 is 5 K2 is 6 c is 0.94
K1 is 5 K2 is 6 c is 0.96
K1 is 5 K2 is 6 c is 0.98
K1 is 6 K2 is 2 c is 0.80
K1 is 6 K2 is 2 c is 0.82
K1 is 6 K2 is 2 c is 0.84
K1 is 6 K2 is 2 c is 0.86
K1 is 6 K2 is 2 c is 0.88
K1 is 6 K2 is 2 c is 0.90
K1 is 6 K2 is 2 c is 0.92
K1 is 6 K2 is 2 c is 0.94
K1 is 6 K2 is 2 c is 0.96
K1 is 6 K2 is 2 c is 0.98
K1 is 6 K2 is 3 c is 0.80
K1 is 6 K2 is 3 c is 0.82
K1 is 6 K2 is 3 c is 0.84
K1 is 6 K2 is 3 c is 0.86
K1 is 6 K2 is 3 c is 0.88
K1 is 6 K2 is 3 c is 0.90
K1 is 6 K2 is 3 c is 0.92
K1 is 6 K2 is 3 c is 0.94
K1 is 6 K2 is 3 c is 0.96
K1 is 6 K2 is 3 c is 0.98
K1 is 6 K2 is 4 c is 0.80
K1 is 6 K2 is 4 c is 0.82
K1 is 6 K2 is 4 c is 0.84
K1 is 6 K2 is 4 c is 0.86
K1 is 6 K2 is 4 c is 0.88
K1 is 6 K2 is 4 c is 0.90
K1 is 6 K2 is 4 c is 0.92
K1 is 6 K2 is 4 c is 0.94
K1 is 6 K2 is 4 c is 0.96
K1 is 6 K2 is 4 c is 0.98
K1 is 6 K2 is 5 c is 0.80
K1 is 6 K2 is 5 c is 0.82
K1 is 6 K2 is 5 c is 0.84
K1 is 6 K2 is 5 c is 0.86
K1 is 6 K2 is 5 c is 0.88
K1 is 6 K2 is 5 c is 0.90
K1 is 6 K2 is 5 c is 0.92
K1 is 6 K2 is 5 c is 0.94
K1 is 6 K2 is 5 c is 0.96
K1 is 6 K2 is 5 c is 0.98
K1 is 6 K2 is 6 c is 0.80
K1 is 6 K2 is 6 c is 0.82
K1 is 6 K2 is 6 c is 0.84
K1 is 6 K2 is 6 c is 0.86
K1 is 6 K2 is 6 c is 0.88
K1 is 6 K2 is 6 c is 0.90
K1 is 6 K2 is 6 c is 0.92
K1 is 6 K2 is 6 c is 0.94
K1 is 6 K2 is 6 c is 0.96
K1 is 6 K2 is 6 c is 0.98
Histogram of number of reference contigs
10 +-----------------------------------------------------------------------------------------------------------+
| * + + + + + + |
| * 'plot.kopt.data' using (bin($1,binwidth)):(1.0) ******* |
| * |
| * |
8 |-+* +-|
| * |
| ** ** |
| ** ** |
6 |-** * **** *** +-|
| ** * **** *** |
|*** * ******* ***** |
|*** * ******* ***** |
|*** * ******* ***** |
4 |*** * ******* ***** * **** ** +-|
|*** * ******* ***** * **** ** |
|***** ******* ***** ** ****** **** *** ** |
|***** ******* ***** ** ****** **** *** ** |
2 |************** ******** *********** **** ************ * +-|
|************** ******** ********* * * ** ************ * |
|************** ******** ********* * * ** ************ * |
|************************************ *** ************************************************************* |
|************************************ * * ** **************** * * **+** * ** |
0 +-----------------------------------------------------------------------------------------------------------+
15000 20000 25000 30000 35000 40000 45000 50000
Number of reference contigs
Average contig number = 23432.5
The top three most common number of contigs
X Contig number
2 18025
2 15972
1 48007
The top three most common number of contigs (with values rounded)
X Contig number
7 15800
6 16000
5 18600
The plot shows 0.9 as the optimal similarity threshold.
- RefMapOpt.sh
RefMapOpt.sh 2 6 2 6 0.9 PE 20
❯ column -t -s $'\t' mapping.results
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM Mapped SUM Properly Mean Mapped Mean Properly MisMatched
115.139 169.395 45006 30450.6 2 2 62184592 57056335 5.18205e+06 4.75469e+06 152962
173.658 212.952 28860 23475.2 2 3 60143223 55563249 5.01194e+06 4.63027e+06 112633
210.025 237.405 22786 20128.4 2 4 57430233 53335471 4.78585e+06 4.44462e+06 85266.1
232.32 251.93 19314 17794.6 2 5 53847160 49871279 4.48726e+06 4.15594e+06 78210.1
250.174 264.634 16629 15710.7 2 6 49924792 46070374 4.1604e+06 3.8392e+06 71423.3
157.101 209.089 32874 24652.8 3 2 61976199 56849424 5.16468e+06 4737452 148131
193.611 230.792 25818 21633.2 3 3 59986094 55261401 4.99884e+06 4.60512e+06 125352
219.978 246.746 21657 19289.2 3 4 57171343 53050666 4.76428e+06 4.42089e+06 87394.7
240.112 259.609 18646 17234.1 3 5 53728519 49724237 4.47738e+06 4.14369e+06 82374.8
258.866 273.334 16104 15244.2 3 6 50028488 46307348 4.16904e+06 3.85895e+06 65866.4
163.566 216.162 31568 23850 4 2 61963309 56926275 5.16361e+06 4.74386e+06 141518
197.21 234.517 25269 21225.9 4 3 59802044 55220282 4.9835e+06 4.60169e+06 113698
222.599 249.277 21264 18971.8 4 4 56802927 52438826 4.73358e+06 4.3699e+06 108330
243.425 262.806 18330 16967.1 4 5 53546681 49696912 4.46222e+06 4.14141e+06 72349.9
260.319 274.708 15843 15006.3 4 6 49493872 45781538 4.12449e+06 3.81513e+06 70571.8
165.98 218.947 31098 23540.9 5 2 61941838 56820418 5.16182e+06 4.73503e+06 152504
199.335 236.588 24936 20986.8 5 3 59649854 54966459 4.97082e+06 4.58054e+06 122234
224.413 250.952 20977 18742.3 5 4 56492927 51928373 4.70774e+06 4.32736e+06 123467
245.133 264.278 18073 16753 5 5 53166501 49169068 4.43054e+06 4.09742e+06 77644.9
262.851 277.083 15598 14790.4 5 6 49202472 45474791 4100206 3.78957e+06 70078.3
167.45 220.307 30723 23319 6 2 61736794 56652019 5.14473e+06 4.721e+06 151801
200.662 237.693 24652 20789.4 6 3 59362905 54651770 4.94691e+06 4.55431e+06 125970
226.587 252.808 20730 18563.8 6 4 56368602 51884947 4.69738e+06 4.32375e+06 110975
246.752 265.647 17855 16574.5 6 5 52872061 48949740 4.40601e+06 4079145 78545.2
264.483 278.337 15368 14597.2 6 6 48777980 45077024 4.06483e+06 3.75642e+06 68517.8
❯ sort -k 10 -n -r mapping.results | column -t -s $'\t'
157.101 209.089 32874 24652.8 3 2 61976199 56849424 5.16468e+06 4737452 148131
246.752 265.647 17855 16574.5 6 5 52872061 48949740 4.40601e+06 4079145 78545.2
115.139 169.395 45006 30450.6 2 2 62184592 57056335 5.18205e+06 4.75469e+06 152962
163.566 216.162 31568 23850 4 2 61963309 56926275 5.16361e+06 4.74386e+06 141518
165.98 218.947 31098 23540.9 5 2 61941838 56820418 5.16182e+06 4.73503e+06 152504
167.45 220.307 30723 23319 6 2 61736794 56652019 5.14473e+06 4.721e+06 151801
173.658 212.952 28860 23475.2 2 3 60143223 55563249 5.01194e+06 4.63027e+06 112633
193.611 230.792 25818 21633.2 3 3 59986094 55261401 4.99884e+06 4.60512e+06 125352
197.21 234.517 25269 21225.9 4 3 59802044 55220282 4.9835e+06 4.60169e+06 113698
199.335 236.588 24936 20986.8 5 3 59649854 54966459 4.97082e+06 4.58054e+06 122234
200.662 237.693 24652 20789.4 6 3 59362905 54651770 4.94691e+06 4.55431e+06 125970
210.025 237.405 22786 20128.4 2 4 57430233 53335471 4.78585e+06 4.44462e+06 85266.1
219.978 246.746 21657 19289.2 3 4 57171343 53050666 4.76428e+06 4.42089e+06 87394.7
222.599 249.277 21264 18971.8 4 4 56802927 52438826 4.73358e+06 4.3699e+06 108330
224.413 250.952 20977 18742.3 5 4 56492927 51928373 4.70774e+06 4.32736e+06 123467
226.587 252.808 20730 18563.8 6 4 56368602 51884947 4.69738e+06 4.32375e+06 110975
232.32 251.93 19314 17794.6 2 5 53847160 49871279 4.48726e+06 4.15594e+06 78210.1
240.112 259.609 18646 17234.1 3 5 53728519 49724237 4.47738e+06 4.14369e+06 82374.8
243.425 262.806 18330 16967.1 4 5 53546681 49696912 4.46222e+06 4.14141e+06 72349.9
245.133 264.278 18073 16753 5 5 53166501 49169068 4.43054e+06 4.09742e+06 77644.9
258.866 273.334 16104 15244.2 3 6 50028488 46307348 4.16904e+06 3.85895e+06 65866.4
250.174 264.634 16629 15710.7 2 6 49924792 46070374 4.1604e+06 3.8392e+06 71423.3
260.319 274.708 15843 15006.3 4 6 49493872 45781538 4.12449e+06 3.81513e+06 70571.8
262.851 277.083 15598 14790.4 5 6 49202472 45474791 4100206 3.78957e+06 70078.3
264.483 278.337 15368 14597.2 6 6 48777980 45077024 4.06483e+06 3.75642e+06 68517.8
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM Mapped SUM Properly Mean Mapped Mean Properly MisMatched
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM Mapped SUM Properly Mean Mapped Mean Properly MisMatched
115.139 169.395 45006 30450.6 2 2 62184592 57056335 5.18205e+06 4.75469e+06 152962
163.566 216.162 31568 23850 4 2 61963309 56926275 5.16361e+06 4.74386e+06 141518
165.98 218.947 31098 23540.9 5 2 61941838 56820418 5.16182e+06 4.73503e+06 152504
167.45 220.307 30723 23319 6 2 61736794 56652019 5.14473e+06 4.721e+06 151801
With the same logic as before K1=4 and K2=2 were chosen for reference assembly.
Cluster 2:
- ReferenceOpt.sh
ReferenceOpt.sh 2 6 2 6 PE 20
All required software is installed!
dDocent ReferenceOpt version 2.7.8
K1 is 2 K2 is 2 c is 0.80
K1 is 2 K2 is 2 c is 0.82
K1 is 2 K2 is 2 c is 0.84
K1 is 2 K2 is 2 c is 0.86
K1 is 2 K2 is 2 c is 0.88
K1 is 2 K2 is 2 c is 0.90
K1 is 2 K2 is 2 c is 0.92
K1 is 2 K2 is 2 c is 0.94
K1 is 2 K2 is 2 c is 0.96
K1 is 2 K2 is 2 c is 0.98
K1 is 2 K2 is 3 c is 0.80
K1 is 2 K2 is 3 c is 0.82
K1 is 2 K2 is 3 c is 0.84
K1 is 2 K2 is 3 c is 0.86
K1 is 2 K2 is 3 c is 0.88
K1 is 2 K2 is 3 c is 0.90
K1 is 2 K2 is 3 c is 0.92
K1 is 2 K2 is 3 c is 0.94
K1 is 2 K2 is 3 c is 0.96
K1 is 2 K2 is 3 c is 0.98
K1 is 2 K2 is 4 c is 0.80
K1 is 2 K2 is 4 c is 0.82
K1 is 2 K2 is 4 c is 0.84
K1 is 2 K2 is 4 c is 0.86
K1 is 2 K2 is 4 c is 0.88
K1 is 2 K2 is 4 c is 0.90
K1 is 2 K2 is 4 c is 0.92
K1 is 2 K2 is 4 c is 0.94
K1 is 2 K2 is 4 c is 0.96
K1 is 2 K2 is 4 c is 0.98
K1 is 2 K2 is 5 c is 0.80
K1 is 2 K2 is 5 c is 0.82
K1 is 2 K2 is 5 c is 0.84
K1 is 2 K2 is 5 c is 0.86
K1 is 2 K2 is 5 c is 0.88
K1 is 2 K2 is 5 c is 0.90
K1 is 2 K2 is 5 c is 0.92
K1 is 2 K2 is 5 c is 0.94
K1 is 2 K2 is 5 c is 0.96
K1 is 2 K2 is 5 c is 0.98
K1 is 2 K2 is 6 c is 0.80
K1 is 2 K2 is 6 c is 0.82
K1 is 2 K2 is 6 c is 0.84
K1 is 2 K2 is 6 c is 0.86
K1 is 2 K2 is 6 c is 0.88
K1 is 2 K2 is 6 c is 0.90
K1 is 2 K2 is 6 c is 0.92
K1 is 2 K2 is 6 c is 0.94
K1 is 2 K2 is 6 c is 0.96
K1 is 2 K2 is 6 c is 0.98
K1 is 3 K2 is 2 c is 0.80
K1 is 3 K2 is 2 c is 0.82
K1 is 3 K2 is 2 c is 0.84
K1 is 3 K2 is 2 c is 0.86
K1 is 3 K2 is 2 c is 0.88
K1 is 3 K2 is 2 c is 0.90
K1 is 3 K2 is 2 c is 0.92
K1 is 3 K2 is 2 c is 0.94
K1 is 3 K2 is 2 c is 0.96
K1 is 3 K2 is 2 c is 0.98
K1 is 3 K2 is 3 c is 0.80
K1 is 3 K2 is 3 c is 0.82
K1 is 3 K2 is 3 c is 0.84
K1 is 3 K2 is 3 c is 0.86
K1 is 3 K2 is 3 c is 0.88
K1 is 3 K2 is 3 c is 0.90
K1 is 3 K2 is 3 c is 0.92
K1 is 3 K2 is 3 c is 0.94
K1 is 3 K2 is 3 c is 0.96
K1 is 3 K2 is 3 c is 0.98
K1 is 3 K2 is 4 c is 0.80
K1 is 3 K2 is 4 c is 0.82
K1 is 3 K2 is 4 c is 0.84
K1 is 3 K2 is 4 c is 0.86
K1 is 3 K2 is 4 c is 0.88
K1 is 3 K2 is 4 c is 0.90
K1 is 3 K2 is 4 c is 0.92
K1 is 3 K2 is 4 c is 0.94
K1 is 3 K2 is 4 c is 0.96
K1 is 3 K2 is 4 c is 0.98
K1 is 3 K2 is 5 c is 0.80
K1 is 3 K2 is 5 c is 0.82
K1 is 3 K2 is 5 c is 0.84
K1 is 3 K2 is 5 c is 0.86
K1 is 3 K2 is 5 c is 0.88
K1 is 3 K2 is 5 c is 0.90
K1 is 3 K2 is 5 c is 0.92
K1 is 3 K2 is 5 c is 0.94
K1 is 3 K2 is 5 c is 0.96
K1 is 3 K2 is 5 c is 0.98
K1 is 3 K2 is 6 c is 0.80
K1 is 3 K2 is 6 c is 0.82
K1 is 3 K2 is 6 c is 0.84
K1 is 3 K2 is 6 c is 0.86
K1 is 3 K2 is 6 c is 0.88
K1 is 3 K2 is 6 c is 0.90
K1 is 3 K2 is 6 c is 0.92
K1 is 3 K2 is 6 c is 0.94
K1 is 3 K2 is 6 c is 0.96
K1 is 3 K2 is 6 c is 0.98
K1 is 4 K2 is 2 c is 0.80
K1 is 4 K2 is 2 c is 0.82
K1 is 4 K2 is 2 c is 0.84
K1 is 4 K2 is 2 c is 0.86
K1 is 4 K2 is 2 c is 0.88
K1 is 4 K2 is 2 c is 0.90
K1 is 4 K2 is 2 c is 0.92
K1 is 4 K2 is 2 c is 0.94
K1 is 4 K2 is 2 c is 0.96
K1 is 4 K2 is 2 c is 0.98
K1 is 4 K2 is 3 c is 0.80
K1 is 4 K2 is 3 c is 0.82
K1 is 4 K2 is 3 c is 0.84
K1 is 4 K2 is 3 c is 0.86
K1 is 4 K2 is 3 c is 0.88
K1 is 4 K2 is 3 c is 0.90
K1 is 4 K2 is 3 c is 0.92
K1 is 4 K2 is 3 c is 0.94
K1 is 4 K2 is 3 c is 0.96
K1 is 4 K2 is 3 c is 0.98
K1 is 4 K2 is 4 c is 0.80
K1 is 4 K2 is 4 c is 0.82
K1 is 4 K2 is 4 c is 0.84
K1 is 4 K2 is 4 c is 0.86
K1 is 4 K2 is 4 c is 0.88
K1 is 4 K2 is 4 c is 0.90
K1 is 4 K2 is 4 c is 0.92
K1 is 4 K2 is 4 c is 0.94
K1 is 4 K2 is 4 c is 0.96
K1 is 4 K2 is 4 c is 0.98
K1 is 4 K2 is 5 c is 0.80
K1 is 4 K2 is 5 c is 0.82
K1 is 4 K2 is 5 c is 0.84
K1 is 4 K2 is 5 c is 0.86
K1 is 4 K2 is 5 c is 0.88
K1 is 4 K2 is 5 c is 0.90
K1 is 4 K2 is 5 c is 0.92
K1 is 4 K2 is 5 c is 0.94
K1 is 4 K2 is 5 c is 0.96
K1 is 4 K2 is 5 c is 0.98
K1 is 4 K2 is 6 c is 0.80
K1 is 4 K2 is 6 c is 0.82
K1 is 4 K2 is 6 c is 0.84
K1 is 4 K2 is 6 c is 0.86
K1 is 4 K2 is 6 c is 0.88
K1 is 4 K2 is 6 c is 0.90
K1 is 4 K2 is 6 c is 0.92
K1 is 4 K2 is 6 c is 0.94
K1 is 4 K2 is 6 c is 0.96
K1 is 4 K2 is 6 c is 0.98
K1 is 5 K2 is 2 c is 0.80
K1 is 5 K2 is 2 c is 0.82
K1 is 5 K2 is 2 c is 0.84
K1 is 5 K2 is 2 c is 0.86
K1 is 5 K2 is 2 c is 0.88
K1 is 5 K2 is 2 c is 0.90
K1 is 5 K2 is 2 c is 0.92
K1 is 5 K2 is 2 c is 0.94
K1 is 5 K2 is 2 c is 0.96
K1 is 5 K2 is 2 c is 0.98
K1 is 5 K2 is 3 c is 0.80
K1 is 5 K2 is 3 c is 0.82
K1 is 5 K2 is 3 c is 0.84
K1 is 5 K2 is 3 c is 0.86
K1 is 5 K2 is 3 c is 0.88
K1 is 5 K2 is 3 c is 0.90
K1 is 5 K2 is 3 c is 0.92
K1 is 5 K2 is 3 c is 0.94
K1 is 5 K2 is 3 c is 0.96
K1 is 5 K2 is 3 c is 0.98
K1 is 5 K2 is 4 c is 0.80
K1 is 5 K2 is 4 c is 0.82
K1 is 5 K2 is 4 c is 0.84
K1 is 5 K2 is 4 c is 0.86
K1 is 5 K2 is 4 c is 0.88
K1 is 5 K2 is 4 c is 0.90
K1 is 5 K2 is 4 c is 0.92
K1 is 5 K2 is 4 c is 0.94
K1 is 5 K2 is 4 c is 0.96
K1 is 5 K2 is 4 c is 0.98
K1 is 5 K2 is 5 c is 0.80
K1 is 5 K2 is 5 c is 0.82
K1 is 5 K2 is 5 c is 0.84
K1 is 5 K2 is 5 c is 0.86
K1 is 5 K2 is 5 c is 0.88
K1 is 5 K2 is 5 c is 0.90
K1 is 5 K2 is 5 c is 0.92
K1 is 5 K2 is 5 c is 0.94
K1 is 5 K2 is 5 c is 0.96
K1 is 5 K2 is 5 c is 0.98
K1 is 5 K2 is 6 c is 0.80
K1 is 5 K2 is 6 c is 0.82
K1 is 5 K2 is 6 c is 0.84
K1 is 5 K2 is 6 c is 0.86
K1 is 5 K2 is 6 c is 0.88
K1 is 5 K2 is 6 c is 0.90
K1 is 5 K2 is 6 c is 0.92
K1 is 5 K2 is 6 c is 0.94
K1 is 5 K2 is 6 c is 0.96
K1 is 5 K2 is 6 c is 0.98
K1 is 6 K2 is 2 c is 0.80
K1 is 6 K2 is 2 c is 0.82
K1 is 6 K2 is 2 c is 0.84
K1 is 6 K2 is 2 c is 0.86
K1 is 6 K2 is 2 c is 0.88
K1 is 6 K2 is 2 c is 0.90
K1 is 6 K2 is 2 c is 0.92
K1 is 6 K2 is 2 c is 0.94
K1 is 6 K2 is 2 c is 0.96
K1 is 6 K2 is 2 c is 0.98
K1 is 6 K2 is 3 c is 0.80
K1 is 6 K2 is 3 c is 0.82
K1 is 6 K2 is 3 c is 0.84
K1 is 6 K2 is 3 c is 0.86
K1 is 6 K2 is 3 c is 0.88
K1 is 6 K2 is 3 c is 0.90
K1 is 6 K2 is 3 c is 0.92
K1 is 6 K2 is 3 c is 0.94
K1 is 6 K2 is 3 c is 0.96
K1 is 6 K2 is 3 c is 0.98
K1 is 6 K2 is 4 c is 0.80
K1 is 6 K2 is 4 c is 0.82
K1 is 6 K2 is 4 c is 0.84
K1 is 6 K2 is 4 c is 0.86
K1 is 6 K2 is 4 c is 0.88
K1 is 6 K2 is 4 c is 0.90
K1 is 6 K2 is 4 c is 0.92
K1 is 6 K2 is 4 c is 0.94
K1 is 6 K2 is 4 c is 0.96
K1 is 6 K2 is 4 c is 0.98
K1 is 6 K2 is 5 c is 0.80
K1 is 6 K2 is 5 c is 0.82
K1 is 6 K2 is 5 c is 0.84
K1 is 6 K2 is 5 c is 0.86
K1 is 6 K2 is 5 c is 0.88
K1 is 6 K2 is 5 c is 0.90
K1 is 6 K2 is 5 c is 0.92
K1 is 6 K2 is 5 c is 0.94
K1 is 6 K2 is 5 c is 0.96
K1 is 6 K2 is 5 c is 0.98
K1 is 6 K2 is 6 c is 0.80
K1 is 6 K2 is 6 c is 0.82
K1 is 6 K2 is 6 c is 0.84
K1 is 6 K2 is 6 c is 0.86
K1 is 6 K2 is 6 c is 0.88
K1 is 6 K2 is 6 c is 0.90
K1 is 6 K2 is 6 c is 0.92
K1 is 6 K2 is 6 c is 0.94
K1 is 6 K2 is 6 c is 0.96
K1 is 6 K2 is 6 c is 0.98
Histogram of number of reference contigs
8 +------------------------------------------------------------------------------------------------------------+
| * + + + + |
| * 'plot.kopt.data' using (bin($1,binwidth)):(1.0) ******* |
7 |-+***** *** +-|
| ***** *** |
| ***** *** |
6 |-+******* **** * ** +-|
| ******* **** * ** |
| ******* **** * ** |
5 |-+******* ***** * *** ** ** ** +-|
| ******* ***** * *** ** ** ** |
4 |-+******* ***** * ***** ** ***** ** ** ** ** +-|
| ******* ***** * ***** ** ***** ** ** ** ** |
| ******* ***** * ***** ** ***** ** ** ** ** |
3 |-+******* ******* ******** ******** ** ****** ** +-|
| ******* ******* ******** ******** ** ****** ** |
| ******* ******* ******** ******** ** ****** ** |
2 |-+******* ******* ************ ************ ********** *** +-|
| ******* ******* * ********* ************ ********** *** |
| ******* ******* * ********* ************ ********** *** |
1 |-+*********************** *********************************************************************************|
| ******** * ********* * ********** * ************** * * ************** * * *** * * * |
| ******** * ********* * ********** * ************** * * +************** * + * *** * * * |
0 +------------------------------------------------------------------------------------------------------------+
10000 15000 20000 25000 30000 35000
Number of reference contigs
Average contig number = 18333.3
The top three most common number of contigs
X Contig number
1 34817
1 34145
1 33632
The top three most common number of contigs (with values rounded)
X Contig number
7 11600
7 11300
7 11000
The plot shows 0.9 as the optimal similarity threshold.
- RefMapOpt.sh
RefMapOpt.sh 2 6 2 6 0.9 PE 20
❯ column -t -s $'\t' mapping.results
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM Mapped SUM Properly Mean Mapped Mean Properly MisMatched
136.944 185.375 32866 24160.7 2 2 31506644 29224835 4.50095e+06 4.17498e+06 94539.9
190.862 217.077 22239 19499 2 3 29713432 27847038 4244776 3.97815e+06 54490.1
218.583 234.285 17991 16752.1 2 4 27529237 25865733 3.93275e+06 3.6951e+06 37203.7
242.109 252.244 14812 14200.4 2 5 25104475 23604313 3.58635e+06 3.37204e+06 31076.9
262.57 269.217 11980 11673.6 2 6 22020948 20586322 3.14585e+06 2.9409e+06 26010.3
166.223 204.911 26999 21809.3 3 2 31416156 29208349 4.48802e+06 4.17262e+06 86443.6
200.035 224.409 21130 18784.4 3 3 29588554 27627641 4.22694e+06 3.94681e+06 64797.1
225.696 240.927 17433 16297.9 3 4 27543510 25895090 3.93479e+06 3.6993e+06 37901.9
246.189 255.928 14351 13788.6 3 5 24733147 23243978 3.53331e+06 3.32057e+06 32408.1
266.188 272.636 11592 11308.1 3 6 21601434 20224808 3.08592e+06 2.88926e+06 27398.9
170.772 208.716 26177 21325.7 4 2 31293296 29124020 4.47047e+06 4.16057e+06 86572.6
203.857 227.851 20693 18459.7 4 3 29530371 27602551 4.21862e+06 3.94322e+06 61493.1
229.289 244.259 17085 16003.9 4 4 27423400 25795048 3.91763e+06 3.68501e+06 35259.4
250.082 259.66 14047 13512 4 5 24592124 23115844 3.51316e+06 3.30226e+06 30427.6
269.912 276.206 11325 11056.4 4 6 21399183 19996565 3.05703e+06 2.85665e+06 26775.1
173.256 210.955 25796 21089.7 5 2 31286323 29015273 4.46947e+06 4145039 101309
205.16 228.747 20398 18238.9 5 3 29295390 27444116 4.18506e+06 3920588 56877.9
231.117 245.767 16776 15741.1 5 4 27142198 25538579 3.87746e+06 3.64837e+06 33773.6
253.216 262.57 13748 13239.9 5 5 24370302 22902690 3.48147e+06 3.27181e+06 29392.9
273.423 279.688 11029 10772.1 5 6 21111002 19724281 3.01586e+06 2.81775e+06 25454.4
175.59 212.935 25458 20890.4 6 2 31292522 28969496 4.47036e+06 4.1385e+06 104483
208.173 231.435 20092 18013.7 6 3 29279778 27382110 4.18283e+06 3911730 53734.1
233.964 248.522 16499 15497.1 6 4 27022859 25409205 3.86041e+06 3.62989e+06 31701.7
255.553 264.653 13462 12979.1 6 5 24083598 22583410 3440514 3.2262e+06 28968.3
276.529 282.736 10760 10513.9 6 6 20830094 19421794 2.97573e+06 2774542 28466.9
❯ sort -k 10 -n -r mapping.results | column -t -s $'\t'
173.256 210.955 25796 21089.7 5 2 31286323 29015273 4.46947e+06 4145039 101309
205.16 228.747 20398 18238.9 5 3 29295390 27444116 4.18506e+06 3920588 56877.9
208.173 231.435 20092 18013.7 6 3 29279778 27382110 4.18283e+06 3911730 53734.1
276.529 282.736 10760 10513.9 6 6 20830094 19421794 2.97573e+06 2774542 28466.9
136.944 185.375 32866 24160.7 2 2 31506644 29224835 4.50095e+06 4.17498e+06 94539.9
166.223 204.911 26999 21809.3 3 2 31416156 29208349 4.48802e+06 4.17262e+06 86443.6
170.772 208.716 26177 21325.7 4 2 31293296 29124020 4.47047e+06 4.16057e+06 86572.6
175.59 212.935 25458 20890.4 6 2 31292522 28969496 4.47036e+06 4.1385e+06 104483
190.862 217.077 22239 19499 2 3 29713432 27847038 4244776 3.97815e+06 54490.1
200.035 224.409 21130 18784.4 3 3 29588554 27627641 4.22694e+06 3.94681e+06 64797.1
203.857 227.851 20693 18459.7 4 3 29530371 27602551 4.21862e+06 3.94322e+06 61493.1
225.696 240.927 17433 16297.9 3 4 27543510 25895090 3.93479e+06 3.6993e+06 37901.9
218.583 234.285 17991 16752.1 2 4 27529237 25865733 3.93275e+06 3.6951e+06 37203.7
229.289 244.259 17085 16003.9 4 4 27423400 25795048 3.91763e+06 3.68501e+06 35259.4
231.117 245.767 16776 15741.1 5 4 27142198 25538579 3.87746e+06 3.64837e+06 33773.6
233.964 248.522 16499 15497.1 6 4 27022859 25409205 3.86041e+06 3.62989e+06 31701.7
242.109 252.244 14812 14200.4 2 5 25104475 23604313 3.58635e+06 3.37204e+06 31076.9
246.189 255.928 14351 13788.6 3 5 24733147 23243978 3.53331e+06 3.32057e+06 32408.1
250.082 259.66 14047 13512 4 5 24592124 23115844 3.51316e+06 3.30226e+06 30427.6
253.216 262.57 13748 13239.9 5 5 24370302 22902690 3.48147e+06 3.27181e+06 29392.9
255.553 264.653 13462 12979.1 6 5 24083598 22583410 3440514 3.2262e+06 28968.3
262.57 269.217 11980 11673.6 2 6 22020948 20586322 3.14585e+06 2.9409e+06 26010.3
266.188 272.636 11592 11308.1 3 6 21601434 20224808 3.08592e+06 2.88926e+06 27398.9
269.912 276.206 11325 11056.4 4 6 21399183 19996565 3.05703e+06 2.85665e+06 26775.1
273.423 279.688 11029 10772.1 5 6 21111002 19724281 3.01586e+06 2.81775e+06 25454.4
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM Mapped SUM Properly Mean Mapped Mean Properly MisMatched
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM Mapped SUM Properly Mean Mapped Mean Properly MisMatched
136.944 185.375 32866 24160.7 2 2 31506644 29224835 4.50095e+06 4.17498e+06 94539.9
166.223 204.911 26999 21809.3 3 2 31416156 29208349 4.48802e+06 4.17262e+06 86443.6
170.772 208.716 26177 21325.7 4 2 31293296 29124020 4.47047e+06 4.16057e+06 86572.6
175.59 212.935 25458 20890.4 6 2 31292522 28969496 4.47036e+06 4.1385e+06 104483
Top choice is K1=3 and K2=2 and used for assembly. These parameters were used for the reference assembly that was built using dDocent.
Step 7: Reference assembly per cluster
Cluster 1:
Cluster 2:
Number of Processors
20
Maximum Memory
0
Trimming
no
Assembly?
yes
Type_of_Assembly
PE
Clustering_Similarity%
0.9
Minimum within individual coverage level to include a read for assembly (K1)
3
Minimum number of individuals a read must be present in to include for assembly (K2)
2
Mapping_Reads?
no
Mapping_Match_Value
Mapping_MisMatch_Value
Mapping_GapOpen_Penalty
Calling_SNPs?
no
Email
xxxx@uri.edu
Filtering reference assembly for coral contigs only
Cluster 1:
blastn -db /home/tejashree/Moorea/genomes/coral -query /home/tejashree/Moorea/ddocent/mod_pipeline/cluster_variation/clust1/reference_assembly/reference.fasta -evalue 0.001 -outfmt 6 -out /home/tejashree/Moorea/blast/mod_pipeline/cluster_variation/clust1/coral_hits_clust1 -num_threads 20
cut -f1 coral_hits_clust1 | uniq > uniq_coral_hits_clust1.txt
python3 subset_fasta.py -i /home/tejashree/Moorea/ddocent/mod_pipeline/cluster_variation/clust1/reference_assembly/reference.fasta -f /home/tejashree/Moorea/blast/mod_pipeline/cluster_variation/clust1/uniq_coral_hits_clust1.txt -o /home/tejashree/Moorea/ddocent/mod_pipeline/cluster_variation/clust1/reference_assembly/coral_reference.fasta
Cluster 2:
blastn -db /home/tejashree/Moorea/genomes/coral -query /home/tejashree/Moorea/ddocent/mod_pipeline/cluster_variation/clust2/reference_assembly/reference.fasta -evalue 0.001 -outfmt 6 -out /home/tejashree/Moorea/blast/mod_pipeline/cluster_variation/clust2/coral_hits_clust2 -num_threads 20
cut -f1 coral_hits_clust2 | uniq > uniq_coral_hits_clust2.txt
python3 subset_fasta.py -i /home/tejashree/Moorea/ddocent/mod_pipeline/cluster_variation/clust2/reference_assembly/reference.fasta -f /home/tejashree/Moorea/blast/mod_pipeline/cluster_variation/clust2/uniq_coral_hits_clust2.txt -o /home/tejashree/Moorea/ddocent/mod_pipeline/cluster_variation/clust2/reference_assembly/coral_reference.fasta
This was moved to respective cluster dir and renamed as reference.fasta since dDocent will look for the specific file name.
Step8: Mapping reads to filtered reference:
I created symlinks for all samples and their trimmed files to here and used dDocent for mapping to reference.
Cluster 1:
Number of Processors
20
Maximum Memory
0
Trimming
no
Assembly?
no
Type_of_Assembly
PE
Clustering_Similarity%
0.9
Minimum within individual coverage level to include a read for assembly (K1)
Minimum number of individuals a read must be present in to include for assembly (K2)
Mapping_Reads?
yes
Mapping_Match_Value
1
Mapping_MisMatch_Value
3
Mapping_GapOpen_Penalty
5
Calling_SNPs?
yes
Email
xxx@uri.edu
Cluster 2:
Number of Processors
20
Maximum Memory
0
Trimming
no
Assembly?
no
Type_of_Assembly
PE
Clustering_Similarity%
0.9
Minimum within individual coverage level to include a read for assembly (K1)
3
Minimum number of individuals a read must be present in to include for assembly (K2)
2
Mapping_Reads?
yes
Mapping_Match_Value
1
Mapping_MisMatch_Value
3
Mapping_GapOpen_Penalty
5
Calling_SNPs?
yes
Email
Step 9: SNP calling per cluster
A new directory was made for this purpose housed in ddr_snp_calling inside clust1 or clust2 for their respective analysis. All related files required for SNP calling for all the ddRAD samples per cluster were symlinked to this directory and dDocent was used for SNP calling.
CLuster 1:
Number of Processors
20
Maximum Memory
0
Trimming
no
Assembly?
no
Type_of_Assembly
Clustering_Similarity%
Minimum within individual coverage level to include a read for assembly (K1)
Minimum number of individuals a read must be present in to include for assembly (K2)
Mapping_Reads?
no
Mapping_Match_Value
Mapping_MisMatch_Value
Mapping_GapOpen_Penalty
Calling_SNPs?
yes
Email
Cluster 2:
Number of Processors
20
Maximum Memory
0
Trimming
no
Assembly?
no
Type_of_Assembly
Clustering_Similarity%
Minimum within individual coverage level to include a read for assembly (K1)
Minimum number of individuals a read must be present in to include for assembly (K2)
Mapping_Reads?
no
Mapping_Match_Value
Mapping_MisMatch_Value
Mapping_GapOpen_Penalty
Calling_SNPs?
yes
Email
SNP Filtering
Housed in /home/tejashree/Moorea/ddocent/cluster_variaiton/clust1/ddr_snp_calling/snp_filtering for cluster 1 and /home/tejashree/Moorea/ddocent/cluster_variaiton/clust2/ddr_snp_calling/snp_filtering/ for cluster 2. Generated a symlink to the final vcf output from SNP calling: ln -s ../TotalRawSNPs.vcf in this dir for filtration. All steps are the same for both clusters and cluster specific results are shown below for each step.
- Step1: 50% of individuals, a minimum quality score of 30, and a minor allele count of 3
vcftools --vcf TotalRawSNPs.vcf --max-missing 0.5 --mac 3 --minQ 30 --recode --recode-INFO-all --out raw.g5mac3
Cluster 1:
After filtering, kept 26 out of 26 Individuals
Outputting VCF file...
After filtering, kept 130368 out of a possible 400358 Sites
Cluster 2:
After filtering, kept 18 out of 18 Individuals
Outputting VCF file...
After filtering, kept 99445 out of a possible 286505 Sites
- Step2: Minimum mean depth: minDP 5
vcftools --vcf raw.g5mac3.recode.vcf --minDP 5 --recode --recode-INFO-all --out raw.g5mac3dp5
Cluster 1:
After filtering, kept 26 out of 26 Individuals
Outputting VCF file...
After filtering, kept 130368 out of a possible 130368 Sites
Cluster 2:
After filtering, kept 18 out of 18 Individuals
Outputting VCF file...
After filtering, kept 99445 out of a possible 99445 Sites
- Filter missing indiv post minDP =5
./filter_missing_ind.sh raw.g5mac3dp5.recode.vcf raw.g5mac3dplm
Cluster 1:
After filtering, kept 26 out of 26 Individuals
Outputting Individual Missingness
After filtering, kept 130368 out of a possible 130368 Sites
Run Time = 1.00 seconds
Histogram of % missing data per individual
14 +---------------------------------------------------------------------------------------------------------+
| + + + + + + + |
| **************************** 'totalmissing' using (bin($1,binwidth)):(1.0) ******* |
12 |-+ * * +-|
| * * |
| * * |
| * * |
10 |-+ * * +-|
| * * |
| * * |
8 |-+ * * +-|
| * * |
| * * |
6 |-+ * * +-|
| * * |
| * * *************|
4 |-+ * * **************************** +-|
| * * * * |
| * *************************** * |
| * * * * |
2 |-+ * * * * +-|
|************* * * * |
| * + * + * + * |
0 +---------------------------------------------------------------------------------------------------------+
0.115 0.12 0.125 0.13 0.135 0.14 0.145 0.15 0.155
% of missing data
The 85% cutoff would be 0.15118
Would you like to set a different cutoff, yes or no
yes
Please enter new cutoff
0.155
All individuals with more than 15.5% missing data will be removed.
After filtering, kept 26 out of 26 Individuals
Outputting VCF file...
After filtering, kept 130368 out of a possible 130368 Sites
Cluster 2:
After filtering, kept 18 out of 18 Individuals
Outputting Individual Missingness
After filtering, kept 99445 out of a possible 99445 Sites
Run Time = 1.00 seconds
Histogram of % missing data per individual
6 +---------------------------------------------------------------------------------------------------------+
| * * + + + + |
| * * 'totalmissing' using (bin($1,binwidth)):(1.0) ******* |
| * * |
5 |-+ * * ***************** +-|
| * * * * |
| * * * * |
| * * * * |
4 |-+ * * * * +-|
| * * * * |
| * * * * |
3 |-+ * **************** * +-|
| * * * * |
| * * * * |
| * * * * |
2 |*************** * * * +-|
| * * * * |
| * * * * |
| * * * * |
1 |-+ * * * ************************************** +-|
| * * * * * * |
| * * * * * * |
| * * * * + * + * |
0 +---------------------------------------------------------------------------------------------------------+
0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17
% of missing data
The 85% cutoff would be 0.137332
Would you like to set a different cutoff, yes or no
0.165
All individuals with more than 16.5% missing data will be removed.
After filtering, kept 18 out of 18 Individuals
Outputting VCF file...
After filtering, kept 99445 out of a possible 99445 Sites
- Restrict the data to variants called in a high percentage of individuals and filter by mean depth of genotypes
This applied a genotype call rate (95%) across all individuals. Setting maf = 0.001
vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.001 --recode --recode-INFO-all --out DP3g95maf001 --min-meanDP 20
Cluster1:
After filtering, kept 26 out of 26 Individuals
Outputting VCF file...
After filtering, kept 74505 out of a possible 130368 Sites
Cluster 2:
After filtering, kept 18 out of 18 Individuals
Outputting VCF file...
After filtering, kept 52142 out of a possible 99445 Sites
- Filtering by population specific call rate when multiple localities are present
Cluster 1:
mawk '$2 == "EOB"' popmap > 1.keep && mawk '$2 == "PBF"' popmap > 2.keep && mawk '$2 == "WOF"' popmap > 3.keep && mawk '$2 == "WOB"' popmap > 4.keep
vcftools --vcf DP3g95maf001.recode.vcf --keep 1.keep --missing-site --out 1
vcftools --vcf DP3g95maf001.recode.vcf --keep 2.keep --missing-site --out 2
vcftools --vcf DP3g95maf001.recode.vcf --keep 3.keep --missing-site --out 3
vcftools --vcf DP3g95maf001.recode.vcf --keep 4.keep --missing-site --out 4
cat 1.lmiss 2.lmiss 3.lmiss 4.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci
vcftools --vcf DP3g95maf001.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf001
After filtering, kept 26 out of 26 Individuals
Outputting VCF file...
After filtering, kept 69108 out of a possible 74505 Sites
Cluster 2:
mawk '$2 == "EOB"' popmap > 1.keep && mawk '$2 == "WOB"' popmap > 2.keep
vcftools --vcf DP3g95maf001.recode.vcf --keep 1.keep --missing-site --out 1
vcftools --vcf DP3g95maf001.recode.vcf --keep 2.keep --missing-site --out 2
cat 1.lmiss 2.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci
vcftools --vcf DP3g95maf001.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf001
After filtering, kept 18 out of 18 Individuals
Outputting VCF file...
After filtering, kept 52142 out of a possible 52142 Sites
- Used dDocent_filters script
Cluster 1:
./dDocent_filters DP3g95p5maf001.recode.vcf dDocent_filters_out
This script will automatically filter a FreeBayes generated VCF file using criteria related to site depth,
quality versus depth, strand representation, allelic balance at heterzygous individuals, and paired read representation.
The script assumes that loci and individuals with low call rates (or depth) have already been removed.
Contact Jon Puritz (jpuritz@gmail.com) for questions and see script comments for more details on particular filters
Number of sites filtered based on allele balance at heterozygous loci, locus quality, and mapping quality / Depth
9446 of 69108
Are reads expected to overlap? In other words, is fragment size less than 2X the read length? Enter yes or no.
no
Number of additional sites filtered based on overlapping forward and reverse reads
13590 of 59662
Is this from a mixture of SE and PE libraries? Enter yes or no.
no
Number of additional sites filtered based on properly paired status
1683 of 46072
Number of sites filtered based on high depth and lower than 2*DEPTH quality score
3655 of 44389
Histogram of mean depth per site
600 +---------------------------------------------------------------------------------------------------------+
| + + + + + + + + + + + + + + + + + + + + |
| ** 'meandepthpersite' using (bin($1,binwidth)):(1.0) ******* |
| **** |
500 |-+ ******* +-|
| ******** |
| ** ********* * |
| ************ * |
400 |-+ **************** +-|
| ****************** |
| ******************* * |
300 |-+ ******************* * +-|
| ******************** * * |
| ***************************** |
| ***************************** ** |
200 |-+ ******************************** *** +-|
| * ********************************* ****** |
| ********************************************* ** |
| ************************************************** |
100 |-+ *************************************************** *** +-|
| *************************************************************** ** |
| ***************************************************************** ************ ****** *** * |
| + + ************************************************************************************************ |
0 +---------------------------------------------------------------------------------------------------------+
15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 255 270 285 300
Mean Depth
If distrubtion looks normal, a 1.645 sigma cutoff (~90% of the data) would be 7688.64105
The 95% cutoff would be 278
Would you like to use a different maximum mean depth cutoff than 278, yes or no
no
Number of sites filtered based on maximum mean depth
2408 of 44389
Number of sites filtered based on within locus depth mismatch
45 of 41981
Total number of sites filtered
27172 of 69108
Remaining sites
41936
Filtered VCF file is called dDocent_filters_out.FIL.recode.vcf
Filter stats stored in dDocent_filters_out.filterstats
Cluster 2:
./dDocent_filters DP3g95p5maf001.recode.vcf dDocent_filters_out
This script will automatically filter a FreeBayes generated VCF file using criteria related to site depth,
quality versus depth, strand representation, allelic balance at heterzygous individuals, and paired read representation.
The script assumes that loci and individuals with low call rates (or depth) have already been removed.
Contact Jon Puritz (jpuritz@gmail.com) for questions and see script comments for more details on particular filters
Number of sites filtered based on allele balance at heterozygous loci, locus quality, and mapping quality / Depth
5890 of 52142
Are reads expected to overlap? In other words, is fragment size less than 2X the read length? Enter yes or no.
no
Number of additional sites filtered based on overlapping forward and reverse reads
10712 of 46252
Is this from a mixture of SE and PE libraries? Enter yes or no.
no
Number of additional sites filtered based on properly paired status
1322 of 35540
Number of sites filtered based on high depth and lower than 2*DEPTH quality score
2417 of 34218
Histogram of mean depth per site
450 +---------------------------------------------------------------------------------------------------------+
| + + + + + + * * * + + + + + + + + + + + + + |
| * ** *** 'meandepthpersite' using (bin($1,binwidth)):(1.0) ******* |
400 |-+ ******** +-|
| * ******** * |
350 |-+ * *********** +-|
| ************* |
| ************** |
300 |-+ **************** ** +-|
| ***************** *** |
250 |-+ *********************** * +-|
| *********************** * |
| ************************ * |
200 |-+ ****************************** +-|
| * ********************************* |
150 |-+ ************************************* +-|
| ************************************* * |
| ****************************************** |
100 |-+ ********************************************** +-|
| ********************************************** ** ** |
50 |-+ *********************************************************** *** +-|
| ************************************************************************ ** ** |
| + + ********************************************************************************************+***|
0 +---------------------------------------------------------------------------------------------------------+
15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 255 270 285 300
Mean Depth
If distrubtion looks normal, a 1.645 sigma cutoff (~90% of the data) would be 5552.7585
The 95% cutoff would be 283
Would you like to use a different maximum mean depth cutoff than 283, yes or no
no
Number of sites filtered based on maximum mean depth
1825 of 34218
Number of sites filtered based on within locus depth mismatch
29 of 32393
Total number of sites filtered
19778 of 52142
Remaining sites
32364
Filtered VCF file is called dDocent_filters_out.FIL.recode.vcf
Filter stats stored in dDocent_filters_out.filterstats
- Convert our variant calls to SNPs
vcfallelicprimitives dDocent_filters_out.FIL.recode.vcf --keep-info --keep-geno > DP3g95p5maf001.prim.vcf
vcftools --vcf DP3g95p5maf001.prim.vcf --remove-indels --recode --recode-INFO-all --out SNP.DP3g95p5maf001
Cluster 1:
After filtering, kept 26 out of 26 Individuals
Outputting VCF file...
After filtering, kept 44838 out of a possible 47509 Sites
Cluster 2:
After filtering, kept 18 out of 18 Individuals
Outputting VCF file...
After filtering, kept 34905 out of a possible 37139 Sites
- HWE filter
Setting -h 0.001
./filter_hwe_by_pop.pl -v SNP.DP3g95p5maf001.recode.vcf -p popmap -o SNP.DP3g95p5maf001.HWE -h 0.001
Cluster 1:
Processing population: EOB (6 inds)
Processing population: PBF (6 inds)
Processing population: WOB (1 inds)
Processing population: WOF (13 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 44838 of a possible 44838 loci (filtered 0 loci)
CLuster 2:
Processing population: EOB (8 inds)
Processing population: WOB (10 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 34905 of a possible 34905 loci (filtered 0 loci)
- rad_haplotyper
Made symlinks for bam and their indexes from ddr_snp_calling to snp_filtering since they are required by rad_haplotyper
Cluster 1:
rad_haplotyper.pl -v SNP.DP3g95p5maf001.HWE.recode.vcf -x 20 -mp 1 -u 20 -ml 4 -n -r reference.fasta
Removed 77 loci (1898 SNPs) with more than 20 SNPs at a locus
Building haplotypes for EOB_174_ddr
Building haplotypes for EOB_175_ddr
Building haplotypes for EOB_185_ddr
Building haplotypes for EOB_189_ddr
Building haplotypes for EOB_190_ddr
Building haplotypes for EOB_192_ddr
Building haplotypes for PBF_158_ddr
Building haplotypes for PBF_161_ddr
Building haplotypes for PBF_164_ddr
Building haplotypes for PBF_169_ddr
Building haplotypes for PBF_171_ddr
Building haplotypes for PBF_172_ddr
Building haplotypes for WOB_35_ddr
Building haplotypes for WOF_224_ddr
Building haplotypes for WOF_225_ddr
Building haplotypes for WOF_232_ddr
Building haplotypes for WOF_234_ddr
Building haplotypes for WOF_236_ddr
Building haplotypes for WOF_237_ddr
Building haplotypes for WOF_238_ddr
Building haplotypes for WOF_239_ddr
Building haplotypes for WOF_240_ddr
Building haplotypes for WOF_241_ddr
Building haplotypes for WOF_243_dd
Building haplotypes for WOF_244_ddr
Building haplotypes for WOF_245_ddr
Filtered 170 loci below missing data cutoff
Filtered 1183 possible paralogs
Filtered 258 loci with low coverage or genotyping errors
Filtered 0 loci with an excess of haplotypes
head stats.out
rad_haplotyper -v SNP.DP3g95p5maf001.HWE.recode.vcf -x 20 -mp 1 -u 20 -ml 4 -n -r reference.fasta
Locus Sites Haplotypes Inds_Haplotyped Total_Inds Prop_Haplotyped Status Poss_Paralog Low_Cov/Geno_Err Miss_Geno Comment
dDocent_Contig_1 1 2 26 26 1.000 PASSED 0 0 0
dDocent_Contig_100 2 3 26 26 1.000 PASSED 0 0 0
dDocent_Contig_10000 5 8 26 26 1.000 PASSED 0 0 0
dDocent_Contig_10001 1 2 25 26 0.962 PASSED 0 0 1
dDocent_Contig_10003 3 3 26 26 1.000 PASSED 0 0 0
dDocent_Contig_10006 9 8 26 26 1.000 PASSED 0 0 0
dDocent_Contig_10009 3 4 26 26 1.000 PASSED 0 0 0
dDocent_Contig_10011 2 3 25 26 0.962 PASSED 0 1 0
grep FILTERED stats.out | mawk '!/Complex/' | cut -f1 > loci.to.remove
./remove.bad.hap.loci.sh loci.to.remove SNP.DP3g95p5maf001.HWE.recode.vcf
mawk '!/#/' SNP.DP3g95p5maf001.HWE.filtered.vcf | wc -l
30167
Number of loci post filtration in cluster 1: 30167
Cluster 2:
rad_haplotyper.pl -v SNP.DP3g95p5maf001.HWE.recode.vcf -x 20 -mp 1 -u 20 -ml 4 -n -r reference.fasta
Removed 57 loci (1377 SNPs) with more than 20 SNPs at a locus
Building haplotypes for EOB_177_ddr
Building haplotypes for EOB_178_ddr
Building haplotypes for EOB_182_ddr
Building haplotypes for EOB_183_ddr
Building haplotypes for EOB_184_ddr
Building haplotypes for EOB_186_ddr
Building haplotypes for EOB_188_ddr
Building haplotypes for EOB_191_ddr
Building haplotypes for WOB_41_ddr
Building haplotypes for WOB_42_ddr
Building haplotypes for WOB_47_ddr
Building haplotypes for WOB_48_ddr
Building haplotypes for WOB_59_ddr
Building haplotypes for WOB_49_ddr
Building haplotypes for WOB_50_ddr
Building haplotypes for WOB_60_ddr
Building haplotypes for WOB_62_ddr
Building haplotypes for WOB_65_ddr
Filtered 173 loci below missing data cutoff
Filtered 842 possible paralogs
Filtered 155 loci with low coverage or genotyping errors
Filtered 0 loci with an excess of haplotypes
head stats.out
rad_haplotyper -v SNP.DP3g95p5maf001.HWE.recode.vcf -x 20 -mp 1 -u 20 -ml 4 -n -r reference.fasta
Locus Sites Haplotypes Inds_Haplotyped Total_Inds Prop_Haplotyped Status Poss_Paralog Low_Cov/Geno_Err Miss_Geno Comment
dDocent_Contig_1 1 2 18 18 1.000 PASSED 0 0 0
dDocent_Contig_10000 1 2 18 18 1.000 PASSED 0 0 0
dDocent_Contig_10001 4 6 18 18 1.000 PASSED 0 0 0
dDocent_Contig_10003 18 10 14 18 0.778 FILTERED 0 4 0 Missing data
dDocent_Contig_10004 5 6 18 18 1.000 PASSED 0 0 0
dDocent_Contig_10006 2 3 18 18 1.000 PASSED 0 0 0
dDocent_Contig_10007 1 2 18 18 1.000 PASSED 0 0 0
dDocent_Contig_10008 4 3 17 18 0.944 PASSED 1 0 0
grep FILTERED stats.out | mawk '!/Complex/' | cut -f1 > loci.to.remove
./remove.bad.hap.loci.sh loci.to.remove SNP.DP3g95p5maf001.HWE.recode.vcf
mawk '!/#/' SNP.DP3g95p5maf001.HWE.filtered.vcf | wc -l
24135
Number of loci post filtration in cluster 2: 24135
Statistical analysis was performed using the script genetic_variation.R that takes the above vcf as input.