Implementation¶
vc_prep.py¶
This CLI estimates the prior lambdas, which are needed for all subsequent analyses.
usage: vc_prep [-h] [-name NAME] -data DATA [-cv CV] a l
Positional Arguments¶
a  alphabet size
l  sequence length
Named Arguments¶
-name  name of output folder
-data  path to input data
-cv  estimate lambdas using regularization with regularization parameter chosen with 10-fold crossvalidation
Default: True
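To make the usage line concrete, here is a minimal sketch that assembles a vc_prep argument list in Python. The helper name `build_vc_prep_command`, the file name `data.csv`, and the folder name `my_project` are hypothetical; only the flags and the positional order come from the usage line. It assumes the executable is invoked as `vc_prep`, as the usage line suggests.

```python
import shlex


def build_vc_prep_command(a, l, data, name=None, cv=None):
    """Assemble an argument list that follows the vc_prep usage line."""
    cmd = ["vc_prep"]
    if name is not None:
        cmd += ["-name", name]
    cmd += ["-data", data]          # -data is required
    if cv is not None:
        cmd += ["-cv", str(cv)]
    # Positional arguments come last: alphabet size, then sequence length.
    cmd += [str(a), str(l)]
    return cmd


# Hypothetical run: 4-letter alphabet, sequences of length 8.
cmd = build_vc_prep_command(4, 8, data="data.csv", name="my_project")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the tool is installed.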
Output files¶
Running this command will output a .csv file lambdas.csv in the output directory. This file contains the optimal lambdas. Here is an example lambdas.csv file:
| order_0 | 3356027.47 |
| order_1 | 153814.18 |
| order_2 | 44970.01 |
| order_3 | 3453.46 |
| order_4 | 363.17 |
| order_5 | 318.42 |
| order_6 | 68.85 |
| order_7 | 43.17 |
| order_8 | 29.17 |
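For downstream scripting, a file like this could be loaded into a dictionary keyed by order. The sketch below assumes that lambdas.csv holds comma-separated rows of the form `order_k,value` with no header row; that layout is inferred from the example, not confirmed. `read_lambdas` is a hypothetical helper.

```python
import csv
import io

# Stand-in for the contents of lambdas.csv: the first three example rows.
# The comma-separated, header-free layout is an assumption.
example = "order_0,3356027.47\norder_1,153814.18\norder_2,44970.01\n"


def read_lambdas(handle):
    """Return {order: lambda} from rows such as 'order_2,44970.01'."""
    lambdas = {}
    for label, value in csv.reader(handle):
        order = int(label.split("_")[1])
        lambdas[order] = float(value)
    return lambdas


lambdas = read_lambdas(io.StringIO(example))
print(lambdas[2])
```

To read a real file, pass `open("lambdas.csv", newline="")` instead of the `io.StringIO` stand-in.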
This CLI also outputs a variance_component.txt file. Here is an example:
| order_1 | 0.127 |
| order_2 | 0.390 |
| order_3 | 0.180 |
| order_4 | 0.071 |
| order_5 | 0.149 |
| order_6 | 0.048 |
| order_7 | 0.026 |
| order_8 | 0.006 |
vc_map_estimate.py¶
This CLI calculates the maximum a posteriori (MAP) estimate.
Output file¶
Executing this command outputs the MAP estimate to map.csv. Here are the first 10 rows of an example:
| sequence | phenotype |
| AAAAAAAA | 2.038 |
| AAAAAAAC | 1.887 |
| AAAAAAAG | 3.017 |
| AAAAAAAU | 3.021 |
| AAAAAACA | 4.438 |
| AAAAAACC | 2.237 |
| AAAAAACG | 4.232 |
| AAAAAACU | 2.520 |
| AAAAAAGA | 34.346 |
| AAAAAAGC | 34.681 |
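The example rows are consistent with listing every sequence in lexicographic order over the alphabet A, C, G, U. Whether map.csv always uses this ordering is an assumption. A minimal sketch that enumerates sequences in that order (`all_sequences` is a hypothetical helper):

```python
from itertools import product


def all_sequences(alphabet, length):
    """Yield every sequence of the given length, in the order of `alphabet`."""
    for letters in product(alphabet, repeat=length):
        yield "".join(letters)


seqs = list(all_sequences("ACGU", 8))
print(len(seqs))    # number of possible sequences: 4 ** 8
print(seqs[:3])     # first few sequences in the enumeration
```

With an alphabet of size a and sequence length l, the enumeration contains a ** l sequences, so the full table grows quickly with l.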
vc_pos_var.py¶
This CLI calculates the analytical posterior variance for a list of sequences.
input sequences¶
The user needs to specify the sequences for which to estimate the posterior variance. The sequences need to be provided in a .csv file with one sequence per row. For example:
| AAAUGAUA |
| CAUUCGUC |
| UAGCGUCU |
| GCGCGAUC |
| AACCACGU |
| CGACUCGA |
| UCCCGUUU |
| CAACGCAA |
| CGCUAGGA |
| AAAUCGAG |
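A file in this one-sequence-per-row layout can be produced with the standard `csv` module. The sketch below is illustrative only: `write_sequences` is a hypothetical helper, and the alphabet and length checks assume the RNA alphabet and length 8 used in the examples.

```python
import csv
import io

sequences = ["AAAUGAUA", "CAUUCGUC", "UAGCGUCU"]


def write_sequences(handle, seqs, alphabet="ACGU", length=8):
    """Write one sequence per row after checking alphabet and length."""
    writer = csv.writer(handle, lineterminator="\n")
    for seq in seqs:
        if len(seq) != length or set(seq) - set(alphabet):
            raise ValueError(f"invalid sequence: {seq!r}")
        writer.writerow([seq])


# In-memory buffer for demonstration; to write a real file, use
# open("sequences.csv", "w", newline="") instead (file name is hypothetical).
buffer = io.StringIO()
write_sequences(buffer, sequences)
print(buffer.getvalue())
```

Validating before writing catches sequences of the wrong length or with unexpected letters before they reach the CLI.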
Output file¶
Executing this command will output a .csv file named varpos.txt. Here is an example:
| sequence | variance |
| AAAUGAUA | 0.082 |
| CAUUCGUC | 0.576 |
| UAGCGUCU | 0.023 |
| GCGCGAUC | 108.691 |
| AACCACGU | 1.879 |
| CGACUCGA | 9.012 |
| UCCCGUUU | 0.370 |
| CAACGCAA | 0.290 |
| CGCUAGGA | 0.087 |
| AAAUCGAG | 88.807 |
vc_hmc.py¶
This CLI is used to perform posterior sampling using the Hamiltonian Monte Carlo (HMC) method.
usage: vc_hmc [-h] [-name NAME] -data DATA -lambdas LAMBDAS [-MAP MAP] [-step_size STEP_SIZE] -n_steps N_STEPS -n_samples N_SAMPLES [-n_tunes N_TUNES] [-starting_position {mode,random,custom}] [-starting_position_path QPATH] [-intermediate_output INTERMEDIATE_OUTPUT] [-sample_name SAMPLE_NAME] a l
Positional Arguments¶
a  alphabet size
l  sequence length
Named Arguments¶
-name  name of output folder
-data  path to input data
-lambdas  path to lambdas
-MAP  path to MAP estimate
-step_size  initial leapfrog step size
Default: 1e-06
-n_steps  number of leapfrog steps per iteration
-n_samples  number of samples to draw from the posterior
-n_tunes  number of HMC steps used for tuning the step_size parameter
Default: 100
-starting_position  starting position
Possible choices: mode, random, custom
Default: “mode”
-starting_position_path  path to starting position vector if using custom starting position
-intermediate_output  output intermediate samples and variance
Default: False
-sample_name  name of the HMC sample
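To illustrate what the step size and the number of leapfrog steps control, here is a textbook HMC sketch on a one-dimensional standard normal target. It is not the implementation used by vc_hmc: every function name is hypothetical, the target is a toy, and step-size tuning is omitted.

```python
import math
import random


def leapfrog(q, p, grad_u, step_size, n_steps):
    """Integrate position q and momentum p for n_steps leapfrog steps."""
    p = p - 0.5 * step_size * grad_u(q)           # initial half step in momentum
    for i in range(n_steps):
        q = q + step_size * p                     # full step in position
        if i < n_steps - 1:
            p = p - step_size * grad_u(q)         # full step in momentum
    p = p - 0.5 * step_size * grad_u(q)           # final half step in momentum
    return q, p


def hmc_sample(q0, u, grad_u, step_size, n_steps, n_samples, rng):
    """Draw n_samples with HMC; u is the negative log posterior."""
    samples, q = [], q0
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)                  # resample the momentum
        q_new, p_new = leapfrog(q, p0, grad_u, step_size, n_steps)
        h_old = u(q) + 0.5 * p0 * p0              # total energy before
        h_new = u(q_new) + 0.5 * p_new * p_new    # total energy after
        # Metropolis correction for the integration error.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


# Toy target: standard normal, so u(q) = q^2 / 2 and its gradient is q.
samples = hmc_sample(
    0.0, lambda q: 0.5 * q * q, lambda q: q,
    step_size=0.1, n_steps=20, n_samples=500, rng=random.Random(0),
)
print(len(samples))
```

A smaller step size lowers the integration error, and therefore the rejection rate, but needs more leapfrog steps to travel the same distance. That trade-off is the usual motivation for tuning the step size before sampling.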
Output files¶
This command outputs three files:
Summary file¶
This is a file with the suffix _hmc_summary.txt containing parameters for the HMC run.
Here is an example of a file’s content:
| initial_step_size | 1e-05 |
| final_step_size | 0.04301 |
| ntunes | 100 |
| n_steps | 100 |
| n_samples | 200 |
| start_time | 2020-07-28-16-27 |
| finish_time | 2020-07-28-16-59 |
| total_time | 0:32:00 |
Sample file¶
This is a file with the suffix _hmc_sample.txt containing all HMC samples of the run. The number of columns is equal to the number of HMC samples specified by the user and the number of rows is equal to the total number of possible sequences.
Variance file¶
This is a file with the suffix _hmc_variances.txt containing the variance of all sequences calculated using the HMC samples.
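Given a sample matrix with one row per sequence and one column per HMC sample, a per-sequence variance is a row-wise variance. The sketch below uses toy numbers, not real output. It assumes the sample variance with an n - 1 denominator; which denominator vc_hmc uses is not stated here.

```python
from statistics import variance

# Rows = sequences, columns = HMC samples (toy numbers for illustration).
samples = [
    [2.0, 2.2, 1.8, 2.0],
    [34.0, 35.0, 33.0, 34.0],
]


def row_variances(matrix):
    """Sample variance (n - 1 denominator) of each row."""
    return [variance(row) for row in matrix]


variances = row_variances(samples)
print(variances)
```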
vc_hmc_diagnosis.py¶
This CLI calculates the potential scale reduction factors for all sequences given multiple HMC samples output by vc_hmc.py.
usage: vc_hmc_diagnosis.py [-h] -samples SAMPLEPATHS [SAMPLEPATHS ...] [-name NAME]
Named Arguments¶
-samples  tuple of paths to HMC samples
-name  project name
This command outputs R_hat.txt which contains the potential scale reduction factors for all sequences.
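For reference, the textbook Gelman-Rubin potential scale reduction factor for a single quantity can be sketched as follows. This is the standard formula, not necessarily the exact variant computed by vc_hmc_diagnosis.py, and `potential_scale_reduction` is a hypothetical name. Each chain corresponds to the samples for one sequence taken from one HMC run.

```python
from math import sqrt
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Gelman-Rubin R-hat for one quantity from equal-length chains."""
    n = len(chains[0])
    if len(chains) < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length")
    w = mean([variance(c) for c in chains])        # mean within-chain variance
    b = n * variance([mean(c) for c in chains])    # between-chain variance
    var_hat = (n - 1) / n * w + b / n              # pooled variance estimate
    return sqrt(var_hat / w)


# Toy chains for one sequence: the first pair overlaps, the second does not.
r_mixed = potential_scale_reduction([[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]])
r_apart = potential_scale_reduction([[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]])
print(r_mixed, r_apart)
```

Values close to 1 suggest the runs agree with one another, while values well above 1 indicate that the runs have not converged to the same distribution.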