4_estimateEvolTimes module

Reads all PCS size distributions across the windows of a designated chromosome in a given species (stored in one of the outputs files of 2_computeWindows.py) and, along with the setup files produced by the script 3_setupEvolTimes.py, estimates the evolutionary times for each window.

Use:

python3 4_estimateEvolTimes.py -sp_ucsc_name [UCSC name] -chr [chromosome name] -alpha [any number > 1] -cores [nb. of cores] [--overwrite, optional]

Example of Usage (human (reference genome) and mouse, all windows mapping to chromosome 16 in humans):

python3 ~/code/4_estimateEvolTimes.py -sp_ucsc_name mm39 -chr chr16 -alpha 1.1 -cores 1 --overwrite

Input Parameter (mandatory):

-sp_ucsc_name:: UCSC name of the species that is being aligned with the reference species (e.g. mm39 for mouse).
-chr:: Chromosome name in the reference species (e.g. chr1, chr2, …, chrX, chrY).
-alpha:: It determines how frequent longer indels can occur. The parameter alpha can take any value above 1 (1 is not included): (1, ∞). If alpha is near 1, larger indels are more likely to occur. If alpha is above 5 (a hard upper limit set internally with max_alpha), the model turns into a substitution only model.
-cores:: If cores=1, the script will execute serially, which may take several hours to complete. It is advisable to utilize as many available cores as possible.
–overwrite:: Optional flag. If this flag is specified, any existing output files will be overwritten during the run.

Other Parameters taken from dataset.py:

refsp_ucscName:: UCSC name of the reference species that is being aligned (e.g. hg38 for human).
dirWindows:: Directory where windows and their PCS size distributions are saved (input files).
dirSetupEvolTimes:: Directory where setup files are saved (input files).
dirEstEvolTimes:: Directory where estimates of evolutionary time will be saved (output files).

Note

Make sure that the required parameters described above are correctly defined in the file utils/dataset.py.

Output:
One .pickle file is created as output, detailing the posterior evolutionary times for each window. All windows mapped to the specified chromosome of the reference genome (human in our study) are included. The PCS size distributions from these windows, representing perfectly conserved sequences between the reference genome (e.g. hg38) and the species specified in the input (e.g. mm39), are used as input for the evolutionary time estimation method.

The output .pickle file is structured as follows:
1. A dictionary where each key represents a window identifier, defined as a tuple containing the start and end coordinates within the chromosome of the reference genome. The corresponding values are NumPy arrays of size n, where each index i corresponds to an evolutionary time, and the value at that index indicates the likelihood of the window being at evolutionary time i.
2. A dictionary that maps window identifiers to their related reference window sizes and reference sample values.
3. A dictionary that maps reference window sizes to selected evolutionary times.
Note

If you are interested in finding the evolutionary time at position i for a window, first check the reference window size in the second dictionary, then use it to get the evolutionary time values from the third dictionary. This indirect way of accessing information avoids saving the same data multiple times.

Pre-requisites

Before using this script, make sure all the required files were pre-computed:

a) Window file for the chromosome specified in the input

Make sure to run 2_computeWindows.py for the chromosome specified in the input parameter -chr.

b) Setup files for the estimation method

Make sure to run 3_setupEvolTimes.py for the alpha value specified in the input parameter -alpha.

Cluster resources

In case you want to run this Python script stand-alone in a cluster that uses Slurm to manage jobs:

srun -p compute -t 2:00:00 --mem 20G --nodes=1 --ntasks=1 --cpus-per-task=1 --pty bash

Otherwise you can use the script ../cluster/4_estimateEvolTimes_runAll.py to run this script for all 40 species used in this study (whole genome).

Time, Memory & Disk space

For reference, here we include a run example, with runtime, memory usage, and disk space required for running this script on each pairwise alignment of the 40 vertebrate dataset examined in our study. 30 cores were used in these runs.

Desc.	Parameter α=1.1			Parameter α=10
UCSC name	Time	Memory	Disk	Time	Memory	Disk
panPan3	02:12:16	27GB	4.030GB	02:08:24	22GB	3.710GB
panTro6	02:07:07	29GB	4.160GB	02:08:34	23GB	3.830GB
gorGor6	02:12:49	31GB	4.150GB	02:14:49	26GB	3.810GB
ponAbe3	02:22:49	33GB	3.880GB	02:08:45	25GB	3.560GB
papAnu4	02:02:03	33GB	3.620GB	02:04:07	27GB	3.320GB
macFas5	01:50:55	33GB	3.590GB	01:44:34	26GB	3.300GB
rhiRox1	02:04:24	31GB	3.830GB	02:06:27	26GB	3.520GB
chlSab2	02:11:06	33GB	3.850GB	01:58:09	27GB	3.530GB
nasLar1	02:04:28	31GB	3.110GB	01:58:45	25GB	2.860GB
rheMac10	01:59:27	32GB	3.640GB	02:08:37	27GB	3.340GB
calJac4	02:02:40	27GB	3.310GB	02:00:49	24GB	3.040GB
tarSyr2	01:52:27	25GB	3.240GB	01:57:15	21GB	2.970GB
micMur2	01:56:30	25GB	2.690GB	01:51:04	20GB	2.470GB
galVar1	01:59:18	25GB	3.220GB	02:00:21	21GB	2.950GB
mm39	01:38:46	24GB	2.140GB	01:47:34	18GB	1.960GB
oryCun2	01:53:04	25GB	2.570GB	01:48:55	19GB	2.360GB
rn7	01:56:31	24GB	2.120GB	01:45:28	18GB	1.940GB
vicPac2	01:58:22	25GB	2.990GB	01:58:28	20GB	2.740GB
bisBis1	01:43:00	23GB	2.710GB	01:56:04	20GB	2.480GB
felCat9	01:28:46	25GB	3.000GB	01:56:15	20GB	2.750GB
manPen1	02:02:00	24GB	2.820GB	01:53:13	20GB	2.580GB
bosTau9	01:40:54	24GB	2.260GB	01:46:43	19GB	2.070GB
canFam6	02:06:19	25GB	2.870GB	01:57:34	20GB	2.630GB
musFur1	02:04:22	25GB	3.080GB	01:57:49	20GB	2.830GB
neoSch1	02:06:12	26GB	3.220GB	01:59:31	20GB	2.960GB
equCab3	02:06:21	25GB	3.250GB	01:59:49	21GB	2.980GB
myoLuc2	01:54:06	24GB	2.290GB	01:29:52	19GB	2.100GB
susScr11	02:01:27	25GB	2.720GB	01:41:52	19GB	2.500GB
enhLutNer1	02:03:34	25GB	3.080GB	01:54:00	20GB	2.830GB
triMan1	02:01:40	25GB	2.990GB	01:46:48	20GB	2.740GB
macEug2	01:40:15	20GB	0.730GB	01:24:21	16GB	0.660GB
ornAna2	01:40:16	18GB	0.540GB	01:30:07	16GB	0.500GB
aptMan1	01:35:15	18GB	0.410GB	01:34:37	16GB	0.380GB
galGal6	01:40:10	18GB	0.330GB	01:37:57	13GB	0.300GB
thaSir1	01:42:28	18GB	0.260GB	01:27:48	15GB	0.240GB
aquChr2	01:35:58	16GB	0.390GB	01:27:22	15GB	0.350GB
melGal5	01:40:43	19GB	0.340GB	01:19:10	15GB	0.310GB
xenLae2	01:40:43	18GB	0.240GB	01:22:27	13GB	0.220GB
xenTro10	01:36:51	19GB	0.230GB	01:30:52	15GB	0.210GB
danRer11	01:36:48	20GB	0.170GB	01:30:24	15GB	0.160GB

Time per Run: Details

Stats on time of a single run (human-mouse alignment, chromosome 16, 1 core): ~5 minutes Details on computational time are available in the log of the run.

Step	Time (s)
Read windows	14.10
Read setup	91.50
Compute evol. times	135.49
Total time	241.34

Storage per Run: Details

Size of output files with PCSs for the human and mouse alignment, chromosome 16: ~61 MB.

Function details

Only relevant functions have been documented below. For more details on any function, check the comments in the souce code.

4_estimateEvolTimes.estimateEvolTimes_winRefSize(parallelInput)

4_estimateEvolTimes.getSampleSizeRef(sampleSizeRef_lst, pcsSizeDistrib, maxDiffSampSize)

4_estimateEvolTimes.initParallelInputs(my_dataset, winSizeRef_obslst, winSizeRef_filepaths, pcs_distrib_allwin, printStep)

4_estimateEvolTimes.printIgnoredWindows(pcs_distrib_allwin, minDiffPcsSizes)

4_estimateEvolTimes.processIteration(result, winSizeRefs_ts, winIds_refs, winIds_ests)

4_estimateEvolTimes.readSampleSizeRefs(winSizeRef_filepaths)

4_estimateEvolTimes.readSetup(my_dataset, pcs_distrib_allwin, alpha)

This function retrieves all reference window/sample sizes that were pre-calculated using the script 3_setupEvolTimes.py and matches each window from the dataset with a corresponding reference window/sample size from the collected data.

Returns:: The function returns: 1. A dictionary mapping reference window/sample sizes to a list of windows identified by their IDs; 2. A dictionary mapping reference window sizes to the file paths where their setup details are stored.