2_computeWindows module
Computes the PCS size distribution for every window in a chromosome, considering all species present in the dataset.
Use:
python3 2_computeWindows.py -chr [chromosome name]
Example of Usage (vertebrate dataset with 40 species):
python3 ~/code/2_computeWindows.py -chr chr16
Input Parameter (mandatory):
- -chr:
Chromosome name in the reference species (e.g. chr1, chr2, …, chrX, chrY).
Other Parameters taken from dataset.py:
- -refsp_ucscname:
UCSC name of the reference species that is being aligned (e.g. hg38 for human).
- -win_size:
Size of windows in base pairs.
- -species_UCSC_names:
List of UCSC names of all species included in the dataset (40 vertebrates in our study).
- -pcs_dir:
Directory where the PCSs for each pairwise alignment included in the dataset are saved.
- -win_dir:
Directory where windows and their PCS size distributions will be saved.
Note
Make sure that the required parameters described above are correctly defined in the file utils/dataset.py.
- Output:
One .pickle file for the specified chromosome, containing a dictionary. In this dictionary, the keys are the UCSC names of the species, while the values are dictionaries that map window coordinates on the chromosome to their PCS size distributions. These distributions are also organized in a dictionary format, where the keys indicate size and the values denote counts.
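The nested layout described above can be illustrated with a small, made-up example. This is only a sketch: the species names, window coordinates, counts, and file name below are placeholders chosen for illustration, and the helper `total_pcs_count` is hypothetical, not part of the module.

```python
import os
import pickle
import tempfile

# Toy stand-in for the output dictionary (all values are made up):
# species UCSC name -> window coordinates -> {PCS size: count}
windows_by_species = {
    "mm39": {
        (10000, 11000): {8: 3, 15: 1},
        (11000, 12004): {5: 2},
    },
    "galGal6": {
        (10000, 11000): {6: 1},
        (11000, 12004): {},
    },
}


def total_pcs_count(size_distribution):
    """Hypothetical helper: total number of PCSs in one window."""
    return sum(size_distribution.values())


# Round-trip through a pickle file, the way a reader of the output would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.windows.pickle")  # placeholder name
    with open(path, "wb") as handle:
        pickle.dump(windows_by_species, handle)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

# Number of PCSs per window for each species.
counts = {
    species: {win: total_pcs_count(dist) for win, dist in wins.items()}
    for species, wins in loaded.items()
}
```

To read a real output file, only the `pickle.load` step is needed, pointed at the file written to the directory configured as win_dir.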
Pre-requisites
Before using this script, make sure all the required files were computed:
a) PCSs from each pairwise alignment in the dataset
Make sure to run 1_extractPCS.py for every species included in the attribute speciesUCSCnames declared in file dataset.py.
In our study, this list includes the UCSC names of the 40 vertebrate species
used in our analysis, whose data (fasta files and chain files) was retrieved
from the UCSC website.
Cluster resources
In case you want to run this Python script stand-alone:
srun -p compute -t 2:00:00 --mem 20G --nodes=1 --ntasks=1 --cpus-per-task=1 --pty bash
Otherwise you can use the script ../cluster/2_computeWindows_runAll.py
to run this script for all 24 chromosomes (chr1, chr2, …, chrX, chrY).
Time, Memory & Disk space
For reference, here we include an upper limit on the runtime, memory usage, and disk space required to run this script on the 40-vertebrate dataset examined in our study.
| Chromosome | Time (hh:mm:ss) | Memory | Disk |
|---|---|---|---|
| chr1 | 01:09:13 | 9 GB | 0.340 GB |
| chr2 | 01:29:36 | 11 GB | 0.360 GB |
| chr3 | 00:59:03 | 7 GB | 0.280 GB |
| chr4 | 01:01:40 | 7 GB | 0.280 GB |
| chr5 | 00:54:51 | 7 GB | 0.260 GB |
| chr6 | 01:00:43 | 5 GB | 0.230 GB |
| chr7 | 00:51:09 | 6 GB | 0.190 GB |
| chr8 | 00:47:49 | 5 GB | 0.200 GB |
| chr9 | 00:32:51 | 3 GB | 0.150 GB |
| chr10 | 00:36:37 | 4 GB | 0.170 GB |
| chr11 | 00:29:28 | 5 GB | 0.190 GB |
| chr12 | 00:30:16 | 5 GB | 0.190 GB |
| chr13 | 00:20:57 | 3 GB | 0.140 GB |
| chr14 | 00:19:24 | 3 GB | 0.140 GB |
| chr15 | 00:17:15 | 3 GB | 0.120 GB |
| chr16 | 00:17:42 | 3 GB | 0.120 GB |
| chr17 | 00:18:00 | 3 GB | 0.100 GB |
| chr18 | 00:21:40 | 3 GB | 0.110 GB |
| chr19 | 00:12:39 | 2 GB | 0.080 GB |
| chr20 | 00:18:12 | 3 GB | 0.100 GB |
| chr21 | 00:10:03 | 2 GB | 0.060 GB |
| chr22 | 00:07:28 | 1 GB | 0.050 GB |
| chrX | 00:35:28 | 5 GB | 0.200 GB |
| chrY | 00:04:18 | 1 GB | 0.030 GB |
Time per Run: Details
Stats on time of a single run (chr16, 40 vertebrate species): ~15 minutes
More details on computational time can be found in the log of the run.
| Step | Time (s) |
|---|---|
| Merging PCSs | 715.09147525 |
| Computing windows | 0.57280803 |
| Computing PCS size distribution | 172.84002709 |
| Total time | 902.55868173 |
Storage per Run: Details
Size of output files with all windows of one chromosome (chr16): ~134 MB.
| Output files | Size |
|---|---|
| hg38.chr16.1000.windows.pickle | 119M |
| hg38.chr16.mergedPCSs.pickle | 15M |
Note
The temporary file hg38.chr16.mergedPCSs.pickle can be removed after the output file hg38.chr16.1000.windows.pickle is successfully computed.
Function details
Only the most relevant functions are documented below. For more details on any function, check the comments in the source code.
- 2_computeWindows.checkInputFiles(qChrom, my_dataset)
- 2_computeWindows.computeWindows(pcs_lst, windowSize)
This method computes coordinates for windows in the human/reference genome. Window boundaries are chosen so that no PCS found in any species of the dataset is split across two windows.
- returns:
A list of window coordinates, each approximately equal to the specified windowSize (either exactly that size or slightly larger). The coordinates are formatted as [begPos, endPos), where endPos is excluded from the interval.
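One way to obtain windows that never cut a PCS is to place a tentative boundary every windowSize positions and push it to the end of any PCS it would otherwise split. The sketch below illustrates that idea only and is not the module's code: the `Pcs` fields, the function name, and the `chrom_beg`/`chrom_end` parameters are assumptions, and the input is assumed to be a sorted list of non-overlapping, half-open intervals.

```python
from collections import namedtuple

# Hypothetical PCS record; the field names are assumptions.
Pcs = namedtuple("Pcs", ["begPos", "endPos"])


def compute_windows_sketch(pcs_lst, window_size, chrom_beg, chrom_end):
    """Return half-open windows [begPos, endPos) that never split a PCS.

    Assumes pcs_lst is sorted by position, non-overlapping, and lies
    entirely inside [chrom_beg, chrom_end).
    """
    windows = []
    idx = 0  # index of the first PCS not yet fully covered by a window
    beg = chrom_beg
    while beg < chrom_end:
        end = min(beg + window_size, chrom_end)
        # Skip every PCS that ends at or before the tentative boundary.
        while idx < len(pcs_lst) and pcs_lst[idx].endPos <= end:
            idx += 1
        # If the next PCS straddles the boundary, extend the window to its end.
        if idx < len(pcs_lst) and pcs_lst[idx].begPos < end:
            end = pcs_lst[idx].endPos
            idx += 1
        windows.append((beg, end))
        beg = end
    return windows


example_pcs = [Pcs(5, 12), Pcs(18, 25), Pcs(30, 33)]
example_windows = compute_windows_sketch(example_pcs, 10, 0, 40)
# → [(0, 12), (12, 25), (25, 35), (35, 40)]
```

In this sketch the last window can be shorter than windowSize because it is clipped at the end of the chromosome; how the actual function treats the chromosome ends is not stated here.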
- 2_computeWindows.consistencyCheck(pcs_lst)
- 2_computeWindows.distribPCS_all(qChrom, win_lst, my_dataset)
This method computes the PCS size distribution for every window in a chromosome, considering all species present in the dataset.
- 2_computeWindows.distribPCS_single(win_lst, pcs_lst)
This method computes the PCS size distribution for each window given in a list.
- Parameters:
win_lst – List of positions representing non-overlapping consecutive windows. Each position is a tuple in the format (begPos, endPos), where begPos and endPos are positive integers indicating absolute positions on the chromosome. The list is sorted in ascending order by these positions.
pcs_lst (list of named tuples Pcs) – List of PCSs, containing their positions in the chromosome. The list should be sorted by position.
- Returns:
A dictionary that maps tuples of window coordinates (begPos, endPos) on the chromosome to another dictionary. This inner dictionary contains the distribution of PCS sizes, with PCS sizes as keys and their respective counts as values.
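Because both lists are sorted by position, the distribution can be built in a single pass with two indices. The following is a minimal sketch under stated assumptions, not the module's implementation: the `Pcs` fields are hypothetical, the PCS size is taken to be `endPos - begPos`, and every PCS is assumed to lie entirely inside one window.

```python
from collections import Counter, namedtuple

# Hypothetical PCS record; the field names are assumptions.
Pcs = namedtuple("Pcs", ["begPos", "endPos"])


def distrib_pcs_single_sketch(win_lst, pcs_lst):
    """Map each window (begPos, endPos) to a {PCS size: count} dictionary."""
    distrib = {}
    idx = 0  # next PCS to assign to a window
    for beg, end in win_lst:
        counts = Counter()
        # Ignore PCSs that start before this window.
        while idx < len(pcs_lst) and pcs_lst[idx].begPos < beg:
            idx += 1
        # Count every PCS that ends inside this window.
        while idx < len(pcs_lst) and pcs_lst[idx].endPos <= end:
            pcs = pcs_lst[idx]
            counts[pcs.endPos - pcs.begPos] += 1  # assumed size definition
            idx += 1
        distrib[(beg, end)] = dict(counts)
    return distrib


example_distrib = distrib_pcs_single_sketch(
    [(0, 12), (12, 25)],
    [Pcs(2, 7), Pcs(7, 12), Pcs(18, 25)],
)
# → {(0, 12): {5: 2}, (12, 25): {7: 1}}
```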
- 2_computeWindows.isContiguous(pcs1, pcs2)
- 2_computeWindows.isOverlap(pcs1, pcs2)
- 2_computeWindows.mergePCS_all(qChrom, my_dataset)
- 2_computeWindows.mergePCS_pairwise(pcs_lst_cur, pcs_lst_new)
This function adds a list of new PCSs to the current list of already-merged PCSs.
A new PCS can be:
Discarded, if it is encompassed by another PCS in the list;
Added, if its position does not overlap any PCS position in the list;
Merged, if its position overlaps one or more PCSs in the list.
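The three cases above can be sketched on plain `(begPos, endPos)` tuples. This is an illustration only: it assumes half-open intervals and assumes that merging means replacing the overlapping PCSs with their union, which may differ from what the actual function does (for example in how contiguous PCSs are treated). The function name is hypothetical.

```python
def merge_pcs_pairwise_sketch(pcs_lst_cur, pcs_lst_new):
    """Add new intervals to a sorted, non-overlapping list of intervals."""
    merged = list(pcs_lst_cur)
    for beg, end in pcs_lst_new:
        # Indices of current intervals overlapping [beg, end); they are
        # contiguous because the list is sorted and non-overlapping.
        overlapping = [
            i for i, (b, e) in enumerate(merged) if b < end and beg < e
        ]
        if not overlapping:
            # Added: no overlap with any interval in the list.
            merged.append((beg, end))
            merged.sort()
            continue
        first, last = overlapping[0], overlapping[-1]
        if merged[first][0] <= beg and end <= merged[first][1]:
            # Discarded: fully encompassed by an existing interval.
            continue
        # Merged: replace the overlapping run with its union (assumption).
        union = (min(beg, merged[first][0]), max(end, merged[last][1]))
        merged[first:last + 1] = [union]
    return merged


example_merged = merge_pcs_pairwise_sketch(
    [(0, 10), (20, 30)],
    [(2, 5), (12, 15), (8, 22), (40, 45)],
)
# → [(0, 30), (40, 45)]
```

The sketch scans the whole list for each new interval, which keeps it short; a sorted-list search would be the natural choice for chromosome-scale inputs.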
- 2_computeWindows.mergePCS_pairwise_check(pcs_lst_cur, idx_to_add)
This function checks if the new PCS was properly added to the list. Two constraints are checked:
The new PCS should not overlap the previous or the next PCS;
The list must keep its property of being sorted by position.
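Both constraints can be expressed as a small boolean check. The sketch below assumes half-open `(begPos, endPos)` tuples and a hypothetical function name; the actual function may report violations differently.

```python
def check_insertion_sketch(pcs_lst, idx):
    """Return True if the PCS at position idx respects both constraints."""
    beg, end = pcs_lst[idx]
    # Constraint 1: no overlap with the previous or the next PCS.
    if idx > 0 and pcs_lst[idx - 1][1] > beg:
        return False
    if idx + 1 < len(pcs_lst) and end > pcs_lst[idx + 1][0]:
        return False
    # Constraint 2: the list is still sorted by begin position.
    return all(
        pcs_lst[i][0] <= pcs_lst[i + 1][0] for i in range(len(pcs_lst) - 1)
    )


ok = check_insertion_sketch([(0, 10), (12, 15), (20, 30)], 1)   # True
bad = check_insertion_sketch([(0, 10), (8, 15), (20, 30)], 1)   # False
```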
- 2_computeWindows.mergePCS_pairwise_findPos(p_new, pcs_lst_cur, idx_to_add)
This function determines the index to insert the new PCS while keeping the list sorted and the PCSs non-overlapping.
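For a list sorted by begin position, a binary search is one simple way to locate such an index. The sketch below uses Python's standard `bisect` module on plain `(begPos, endPos)` tuples; it only finds the sorted position and does not resolve overlaps, and it is not necessarily how the module's function works (the function name is hypothetical, and the `idx_to_add` argument is not modeled here).

```python
import bisect


def find_insert_index_sketch(pcs_lst, p_new):
    """Index at which p_new keeps pcs_lst sorted by begin position."""
    begin_positions = [beg for beg, _ in pcs_lst]
    return bisect.bisect_left(begin_positions, p_new[0])


current = [(0, 10), (20, 30)]
middle_idx = find_insert_index_sketch(current, (12, 15))  # 1
end_idx = find_insert_index_sketch(current, (40, 45))     # 2
```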
- 2_computeWindows.mergePCS_pairwise_updLst(p_new, pcs_lst_cur, idx_to_add, nb_pcs_merged)
This function updates the PCS list with a new PCS (if needed).
- 2_computeWindows.printPCS(p)