1_extractPCS module
Reads a chain file (extension ‘.all.chain’) downloaded from the UCSC website and outputs one pickle file per chromosome with all PCSs of a pair of species.
To make sure that PCSs are being properly extracted, this script also needs the FASTA files of both species, which can be downloaded in the UCSC website.
Use:
python3 1_extractPCS.py -sp_ucsc_name [UCSC name]
Example of Usage (human (reference genome) and mouse):
python3 ~/code/1_extractPCS.py -sp_ucsc_name mm39
Input Parameter (mandatory):
- -sp_ucsc_name:
UCSC name of the species that is being aligned with the reference species (e.g. mm39 for mouse).
Other Parameters taken from
dataset.py
:
- refsp_ucscName:
UCSC name of the reference species that is being aligned (e.g. hg38 for human).
- chain_file:
file with extension
.all.chain
downloaded from UCSC (e.g.: hg38.papAnu4.all.chain.gz)- refsp_fastadir:
directory with FASTA files following the format
genomeName.chrName.bXXXeXXX.fa
(e.g. mm39.chrY.b51040000e51060000.fa). FASTA files can be downloaded in the UCSC website, and then split into smaller files for more efficient reading.- othsp_fastadir:
directory with FASTA files following the format
genomeName.chrName.bXXXeXXX.fa
(e.g. mm39.chrY.b51040000e51060000.fa). FASTA files can be downloaded in the UCSC website, and then split into smaller files for more efficient reading.- pcs_dir:
Directory where files with PCSs are going to be saved.
- pcs_minsize:
minimum size (nb. of base pairs) for the PCS to be considered.
Note
Make sure that the required parameters described above are correctly defined in the file utils/dataset.py
.
- Output:
A separate
.pickle
file for each chromosome, each including all PCSs positioned on the corresponding chromosome of the reference genome. Thepickle
file saves a list ofPcs
named tuples, where each tuple has the size of a PCS and its coordinates in both genomes.
Note
Required Directory Structure for FASTA files: a single FASTA file
from UCSC (e.g. mm39.fa
) must be split in many files,
following the pattern genomeName.chrName.bXXXeXXX.fa
(see Pre-Requisites below).
Pre-requisites
Before using this script, make sure all the required files were downloaded:
a) Chain files from UCSC
1. Download the “chain” file for the analyzed pairwise species (chained lastz alignments) in the UCSC website. For example, for human and mouse (hg38 and mm39), go to:
https://hgdownload.soe.ucsc.edu/goldenPath/mm39/vsHg38/
and download the file mm39.hg38.all.chain.gz
.
Unzip the download file:
gzip -d mm39.hg38.all.chain.gz
3. The full directory of the unziped chain file (e.g. mm39.hg38.all.chain
)
will be given to this script through the parameter -chain_file
.
b) FASTA files from UCSC
4. Download the FASTA files for the analyzed pairwise species in the UCSC website (e.g. mm39.fa.gz
).
The FASTA files are needed to validate the PCS coordinates obtained from the chain file.
5. Run the script breakFasta.py
using the unziped FASTA file (e.g. mm39.fa
).
This script will break the unziped FASTA file into smaller files, following the pattern:
genomeName.chrName.bXXXeXXX.fa
These files must be saved in a directory structure organized as follows:
genomeName/chr/chrName/genomeName.chrName.bXXXeXXX.fa
Cluster resources
In case you want to run this Python script stand-alone in a cluster that uses Slurm to manage jobs:
srun -p compute -t 2:00:00 --mem 20G --nodes=1 --ntasks=1 --cpus-per-task=1 --pty bash
Otherwise you can use the script ../cluster/1_extractPCS_runAll.py
to run this script for all 40 species used in this study.
Time, Memory & Disk space
For reference, here we include an upper limit on runtime, memory usage, and disk space required for running this script on each pairwise alignment of the 40 vertebrate dataset examined in our study.
UCSC name |
Time |
Memory |
Disk |
---|---|---|---|
panPan3 |
01:01:19 |
5GB |
0.95GB |
panTro6 |
00:59:09 |
6GB |
0.98GB |
gorGor6 |
00:50:02 |
5GB |
1.14GB |
ponAbe3 |
01:20:13 |
5GB |
1.98GB |
papAnu4 |
06:27:13 |
38GB |
2.76GB |
macFas5 |
02:42:13 |
20GB |
2.71GB |
rhiRox1 |
03:29:32 |
23GB |
2.98GB |
chlSab2 |
00:59:52 |
6GB |
2.99GB |
nasLar1 |
04:26:24 |
15GB |
2.25GB |
rheMac10 |
01:01:55 |
6GB |
2.80GB |
calJac4 |
02:50:19 |
36GB |
3.14GB |
tarSyr2 |
04:57:09 |
42GB |
2.62GB |
micMur2 |
02:24:55 |
17GB |
2.25GB |
galVar1 |
09:48:31 |
40GB |
2.56GB |
mm39 |
00:58:45 |
9GB |
1.10GB |
oryCun2 |
01:23:21 |
19GB |
1.67GB |
rn7 |
01:27:01 |
24GB |
1.07GB |
vicPac2 |
01:23:54 |
17GB |
2.19GB |
bisBis1 |
02:50:09 |
24GB |
1.86GB |
felCat9 |
01:57:19 |
28GB |
2.19GB |
manPen1 |
03:18:57 |
18GB |
1.99GB |
bosTau9 |
01:10:49 |
15GB |
1.59GB |
canFam6 |
01:00:37 |
18GB |
2.07GB |
musFur1 |
01:00:55 |
17GB |
2.21GB |
neoSch1 |
01:20:42 |
19GB |
2.46GB |
equCab3 |
01:30:38 |
17GB |
2.60GB |
myoLuc2 |
01:46:55 |
20GB |
1.60GB |
susScr11 |
01:20:55 |
18GB |
1.93GB |
enhLutNer1 |
01:27:55 |
18GB |
2.26GB |
triMan1 |
01:20:32 |
20GB |
2.16GB |
macEug2 |
05:58:51 |
18GB |
0.22GB |
ornAna2 |
01:20:04 |
13GB |
0.17GB |
aptMan1 |
00:45:15 |
3GB |
0.12GB |
galGal6 |
00:45:15 |
3GB |
0.09GB |
thaSir1 |
01:30:08 |
5GB |
0.07GB |
aquChr2 |
00:30:03 |
3GB |
0.12GB |
melGal5 |
01:15:00 |
3GB |
0.09GB |
xenLae2 |
01:15:46 |
6GB |
0.06GB |
xenTro10 |
02:40:30 |
40GB |
0.06GB |
danRer11 |
04:08:26 |
14GB |
0.04GB |
Time per Run: Details
Stats on time of a single run (human-mouse alignment, whole-genome): ~65 minutes Details on computational time are available in the log of the run.
Step |
Time (s) |
---|---|
Reading chains |
164.45332742 |
Compatible chains |
236.70236850 |
PCSs from chr1 |
246.49562716 |
PCSs from chr2 |
283.03650451 |
PCSs from chr3 |
218.50386620 |
PCSs from chr4 |
198.70389938 |
PCSs from chr5 |
206.89367223 |
PCSs from chr6 |
199.71864271 |
PCSs from chr7 |
204.75429463 |
PCSs from chr8 |
199.65774012 |
PCSs from chr9 |
150.57288957 |
PCSs from chr10 |
173.72448373 |
PCSs from chr11 |
156.16801929 |
PCSs from chr12 |
168.79578829 |
PCSs from chr13 |
113.74493265 |
PCSs from chr14 |
93.78934121 |
PCSs from chr15 |
99.19000816 |
PCSs from chr16 |
91.97542357 |
PCSs from chr17 |
85.33002925 |
PCSs from chr18 |
83.99309540 |
PCSs from chr19 |
102.53493857 |
PCSs from chr20 |
72.90905333 |
PCSs from chr21 |
37.42282581 |
PCSs from chr22 |
44.24664927 |
PCSs from chrX |
223.61548066 |
PCSs from chrY |
21.73898244 |
Total time |
3949.23640776 |
Storage per Run: Details
Size of output files with PCSs for the human and mouse alignment (whole-genome): ~1085 MB.
Output file |
Size |
---|---|
hg38.mm39.chr1.pcs.pickle |
98M |
hg38.mm39.chr2.pcs.pickle |
104M |
hg38.mm39.chr3.pcs.pickle |
72M |
hg38.mm39.chr4.pcs.pickle |
59M |
hg38.mm39.chr5.pcs.pickle |
74M |
hg38.mm39.chr6.pcs.pickle |
66M |
hg38.mm39.chr7.pcs.pickle |
56M |
hg38.mm39.chr8.pcs.pickle |
55M |
hg38.mm39.chr9.pcs.pickle |
44M |
hg38.mm39.chr10.pcs.pickle |
46M |
hg38.mm39.chr11.pcs.pickle |
65M |
hg38.mm39.chr12.pcs.pickle |
50M |
hg38.mm39.chr13.pcs.pickle |
33M |
hg38.mm39.chr14.pcs.pickle |
39M |
hg38.mm39.chr15.pcs.pickle |
34M |
hg38.mm39.chr16.pcs.pickle |
30M |
hg38.mm39.chr17.pcs.pickle |
24M |
hg38.mm39.chr18.pcs.pickle |
30M |
hg38.mm39.chr19.pcs.pickle |
14M |
hg38.mm39.chr20.pcs.pickle |
28M |
hg38.mm39.chr21.pcs.pickle |
12M |
hg38.mm39.chr22.pcs.pickle |
11M |
hg38.mm39.chrX.pcs.pickle |
41M |
hg38.mm39.chrY.pcs.pickle |
387K |
Genome Browser
Here there are instructions in case you wish to check the corresponding DNA sequences of a PCS in UCSC’s Genome Browser. Take for example the following PCS found in the pairwise alignment between human (hg38) and bonobo (panPan3):
Pcs(size=45, tChrom='chr10', tStrand='+', tPosBeg=87682, qChrom='chr16', qStrand='-', qPosBeg=21852)
Here, qChrom
, qStrand
, and qPosBeg
correspond to the chromosome, strand,
and starting coordinate in the query genome. In our dataset, the query genome is
always the human genome (hg38). Similarly, tChrom
, tStrand
, and tPosBeg
refer to the same information in the target genome (panPan3 in this example).
To compute the coordinates, first add +1
to tPosBeg
and qPosBeg
, as the
coordinates in Genome Browser start at 1, whereas here they start at 0.
Next, by summing the size of the PCS (size=45) to
the initial coordinates (tPosBeg+1
and qPosBeg+1
), it is possible
to retrieve the DNA sequence in Genome Browser for both genomes.
For this example, by accessing:
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38
and typing chr16:21853-21897
, we obtain:
TAATGAGTGTACTGCTCCTAAACAAGGAGTATCTGCATTTATTTA
If it is indicated that sequence should be read in the negative strand (qStrand=’-‘), the reverse complement should be computed:
TAAATAAATGCAGATACTCCTTGTTTAGGAGCAGTACACTCATTA
Similarly for the bonobo genome, by accessing:
https://genome.ucsc.edu/cgi-bin/hgTracks?db=panPan3
and typing chr10:87683-87727
, we obtain:
TAAATAAATGCAGATACTCCTTGTTTAGGAGCAGTACACTCATTA
Function details
Only relevant functions have been documented below. For more details on any function, check the comments in the souce code.
- class 1_extractPCS.PcsSingleCheck(desc, cntUB, pcsSingleCheckFunc)
Bases:
object
This class makes sure that PCSs are being correctly processed. For every new PCS, it determines whether the PCS is a good candidate to be checked into details. Good candidates are long PCSs, PCSs in the reverse strand of one of the genomes, among others. It keeps a history of how many PCSs have been checked so far.
- Parameters:
desc (str) – Brief description of which type of check is being performed.
cntUB (int) – Maximum number of PCSs that will be checked in details.
pcsSingleCheckFunc (function) – Function that receives a named tuple PCS and returns True (PCS should be checked) or False.
- checkPCS(pcs)
- print()
- reset()
- 1_extractPCS.checkPCSrevQ(pcs)
- 1_extractPCS.checkPCSrevT(pcs)
- 1_extractPCS.checkPCSsize(pcs)
- 1_extractPCS.checkSpecies(dataset, ucscName)
- 1_extractPCS.computePCScoords(begAbs, posRel, isRevComp, sizePCS, sizeMis)
- 1_extractPCS.computePos(strand, size, start, end)
- 1_extractPCS.initializePCSchecks(cntUB=100, prefixTarget='tar', prefixQuery='qry')
- 1_extractPCS.isCompatible(pos, pos_lst)
Check if a sequence is compatible (non-intersecting) to a list of sequences.
- 1_extractPCS.performPCSchecks(pcs, checks)
- 1_extractPCS.readChains(chainsFilename)
This function parses the chains from a chain file (check the UCSC website for more information on the format).
- 1_extractPCS.readDNA(dirFASTA, prefix, chrName, strand, posBeg, posEnd, debug=False)
This function reads a DNA segment in a FASTA file.
- 1_extractPCS.reverseComplement(seq)
This function returns a reverse complement of a DNA.