4_estimateEvolTimes module
Reads all PCS size distributions across the windows of a designated chromosome in
a given species (stored in one of the outputs files of 2_computeWindows.py
)
and, along with the setup files produced by the script 3_setupEvolTimes.py
,
estimates the evolutionary times for each window.
Use:
python3 4_estimateEvolTimes.py -sp_ucsc_name [UCSC name] -chr [chromosome name] -alpha [any number > 1] -cores [nb. of cores] [--overwrite, optional]
Example of Usage (human (reference genome) and mouse, all windows mapping to chromosome 16 in humans):
python3 ~/code/4_estimateEvolTimes.py -sp_ucsc_name mm39 -chr chr16 -alpha 1.1 -cores 1 --overwrite
Input Parameter (mandatory):
- -sp_ucsc_name:
UCSC name of the species that is being aligned with the reference species (e.g. mm39 for mouse).
- -chr:
Chromosome name in the reference species (e.g. chr1, chr2, …, chrX, chrY).
- -alpha:
It determines how frequent longer indels can occur. The parameter alpha can take any value above 1 (1 is not included): (1, ∞). If alpha is near 1, larger indels are more likely to occur. If alpha is above 5 (a hard upper limit set internally with
max_alpha
), the model turns into a substitution only model.- -cores:
If
cores=1
, the script will execute serially, which may take several hours to complete. It is advisable to utilize as many available cores as possible.- –overwrite:
Optional flag. If this flag is specified, any existing output files will be overwritten during the run.
Other Parameters taken from
dataset.py
:
- refsp_ucscName:
UCSC name of the reference species that is being aligned (e.g. hg38 for human).
- dirWindows:
Directory where windows and their PCS size distributions are saved (input files).
- dirSetupEvolTimes:
Directory where setup files are saved (input files).
- dirEstEvolTimes:
Directory where estimates of evolutionary time will be saved (output files).
Note
Make sure that the required parameters described above are correctly defined in the file utils/dataset.py
.
Output:
One
.pickle
file is created as output, detailing the posterior evolutionary times for each window. All windows mapped to the specified chromosome of the reference genome (human in our study) are included. The PCS size distributions from these windows, representing perfectly conserved sequences between the reference genome (e.g.hg38
) and the species specified in the input (e.g.mm39
), are used as input for the evolutionary time estimation method.The output
.pickle
file is structured as follows:A dictionary where each key represents a window identifier, defined as a tuple containing the start and end coordinates within the chromosome of the reference genome. The corresponding values are NumPy arrays of size
n
, where each indexi
corresponds to an evolutionary time, and the value at that index indicates the likelihood of the window being at evolutionary timei
.A dictionary that maps window identifiers to their related reference window sizes and reference sample values.
A dictionary that maps reference window sizes to selected evolutionary times.
Note
If you are interested in finding the evolutionary time at position
i
for a window, first check the reference window size in the second dictionary, then use it to get the evolutionary time values from the third dictionary. This indirect way of accessing information avoids saving the same data multiple times.
Pre-requisites
Before using this script, make sure all the required files were pre-computed:
a) Window file for the chromosome specified in the input
Make sure to run 2_computeWindows.py
for the chromosome specified in the
input parameter -chr
.
b) Setup files for the estimation method
Make sure to run 3_setupEvolTimes.py
for the alpha value specified in the
input parameter -alpha
.
Cluster resources
In case you want to run this Python script stand-alone in a cluster that uses Slurm to manage jobs:
srun -p compute -t 2:00:00 --mem 20G --nodes=1 --ntasks=1 --cpus-per-task=1 --pty bash
Otherwise you can use the script ../cluster/4_estimateEvolTimes_runAll.py
to run this script for all 40 species used in this study (whole genome).
Time, Memory & Disk space
For reference, here we include a run example, with runtime, memory usage, and disk space required for running this script on each pairwise alignment of the 40 vertebrate dataset examined in our study. 30 cores were used in these runs.
Desc. |
Parameter α=1.1 |
Parameter α=10 |
||||
---|---|---|---|---|---|---|
UCSC name |
Time |
Memory |
Disk |
Time |
Memory |
Disk |
panPan3 |
02:12:16 |
27GB |
4.030GB |
02:08:24 |
22GB |
3.710GB |
panTro6 |
02:07:07 |
29GB |
4.160GB |
02:08:34 |
23GB |
3.830GB |
gorGor6 |
02:12:49 |
31GB |
4.150GB |
02:14:49 |
26GB |
3.810GB |
ponAbe3 |
02:22:49 |
33GB |
3.880GB |
02:08:45 |
25GB |
3.560GB |
papAnu4 |
02:02:03 |
33GB |
3.620GB |
02:04:07 |
27GB |
3.320GB |
macFas5 |
01:50:55 |
33GB |
3.590GB |
01:44:34 |
26GB |
3.300GB |
rhiRox1 |
02:04:24 |
31GB |
3.830GB |
02:06:27 |
26GB |
3.520GB |
chlSab2 |
02:11:06 |
33GB |
3.850GB |
01:58:09 |
27GB |
3.530GB |
nasLar1 |
02:04:28 |
31GB |
3.110GB |
01:58:45 |
25GB |
2.860GB |
rheMac10 |
01:59:27 |
32GB |
3.640GB |
02:08:37 |
27GB |
3.340GB |
calJac4 |
02:02:40 |
27GB |
3.310GB |
02:00:49 |
24GB |
3.040GB |
tarSyr2 |
01:52:27 |
25GB |
3.240GB |
01:57:15 |
21GB |
2.970GB |
micMur2 |
01:56:30 |
25GB |
2.690GB |
01:51:04 |
20GB |
2.470GB |
galVar1 |
01:59:18 |
25GB |
3.220GB |
02:00:21 |
21GB |
2.950GB |
mm39 |
01:38:46 |
24GB |
2.140GB |
01:47:34 |
18GB |
1.960GB |
oryCun2 |
01:53:04 |
25GB |
2.570GB |
01:48:55 |
19GB |
2.360GB |
rn7 |
01:56:31 |
24GB |
2.120GB |
01:45:28 |
18GB |
1.940GB |
vicPac2 |
01:58:22 |
25GB |
2.990GB |
01:58:28 |
20GB |
2.740GB |
bisBis1 |
01:43:00 |
23GB |
2.710GB |
01:56:04 |
20GB |
2.480GB |
felCat9 |
01:28:46 |
25GB |
3.000GB |
01:56:15 |
20GB |
2.750GB |
manPen1 |
02:02:00 |
24GB |
2.820GB |
01:53:13 |
20GB |
2.580GB |
bosTau9 |
01:40:54 |
24GB |
2.260GB |
01:46:43 |
19GB |
2.070GB |
canFam6 |
02:06:19 |
25GB |
2.870GB |
01:57:34 |
20GB |
2.630GB |
musFur1 |
02:04:22 |
25GB |
3.080GB |
01:57:49 |
20GB |
2.830GB |
neoSch1 |
02:06:12 |
26GB |
3.220GB |
01:59:31 |
20GB |
2.960GB |
equCab3 |
02:06:21 |
25GB |
3.250GB |
01:59:49 |
21GB |
2.980GB |
myoLuc2 |
01:54:06 |
24GB |
2.290GB |
01:29:52 |
19GB |
2.100GB |
susScr11 |
02:01:27 |
25GB |
2.720GB |
01:41:52 |
19GB |
2.500GB |
enhLutNer1 |
02:03:34 |
25GB |
3.080GB |
01:54:00 |
20GB |
2.830GB |
triMan1 |
02:01:40 |
25GB |
2.990GB |
01:46:48 |
20GB |
2.740GB |
macEug2 |
01:40:15 |
20GB |
0.730GB |
01:24:21 |
16GB |
0.660GB |
ornAna2 |
01:40:16 |
18GB |
0.540GB |
01:30:07 |
16GB |
0.500GB |
aptMan1 |
01:35:15 |
18GB |
0.410GB |
01:34:37 |
16GB |
0.380GB |
galGal6 |
01:40:10 |
18GB |
0.330GB |
01:37:57 |
13GB |
0.300GB |
thaSir1 |
01:42:28 |
18GB |
0.260GB |
01:27:48 |
15GB |
0.240GB |
aquChr2 |
01:35:58 |
16GB |
0.390GB |
01:27:22 |
15GB |
0.350GB |
melGal5 |
01:40:43 |
19GB |
0.340GB |
01:19:10 |
15GB |
0.310GB |
xenLae2 |
01:40:43 |
18GB |
0.240GB |
01:22:27 |
13GB |
0.220GB |
xenTro10 |
01:36:51 |
19GB |
0.230GB |
01:30:52 |
15GB |
0.210GB |
danRer11 |
01:36:48 |
20GB |
0.170GB |
01:30:24 |
15GB |
0.160GB |
Time per Run: Details
Stats on time of a single run (human-mouse alignment, chromosome 16, 1 core): ~5 minutes Details on computational time are available in the log of the run.
Step |
Time (s) |
---|---|
Read windows |
14.10 |
Read setup |
91.50 |
Compute evol. times |
135.49 |
Total time |
241.34 |
Storage per Run: Details
Size of output files with PCSs for the human and mouse alignment, chromosome 16: ~61 MB.
Function details
Only relevant functions have been documented below. For more details on any function, check the comments in the souce code.
- 4_estimateEvolTimes.estimateEvolTimes_winRefSize(parallelInput)
- 4_estimateEvolTimes.getSampleSizeRef(sampleSizeRef_lst, pcsSizeDistrib, maxDiffSampSize)
- 4_estimateEvolTimes.initParallelInputs(my_dataset, winSizeRef_obslst, winSizeRef_filepaths, pcs_distrib_allwin, printStep)
- 4_estimateEvolTimes.printIgnoredWindows(pcs_distrib_allwin, minDiffPcsSizes)
- 4_estimateEvolTimes.processIteration(result, winSizeRefs_ts, winIds_refs, winIds_ests)
- 4_estimateEvolTimes.readSampleSizeRefs(winSizeRef_filepaths)
- 4_estimateEvolTimes.readSetup(my_dataset, pcs_distrib_allwin, alpha)
This function retrieves all reference window/sample sizes that were pre-calculated using the script
3_setupEvolTimes.py
and matches each window from the dataset with a corresponding reference window/sample size from the collected data.- Returns:
The function returns: 1. A dictionary mapping reference window/sample sizes to a list of windows identified by their IDs; 2. A dictionary mapping reference window sizes to the file paths where their setup details are stored.