Type: | Package |
Title: | Protein Feature Extraction from Profile Hidden Markov Models |
Version: | 0.1.1 |
Maintainer: | Shayaan Emran <shayaan.emran@gmail.com> |
Description: | Calculates a comprehensive list of features from profile hidden Markov models (HMMs) of proteins. Adapts and ports features for use with HMMs instead of Position Specific Scoring Matrices, in order to take advantage of more accurate multiple sequence alignment by programs such as 'HHBlits' Remmert et al. (2012) <doi:10.1038/nmeth.1818> and 'HMMer' Eddy (2011) <doi:10.1371/journal.pcbi.1002195>. Features calculated by this package can be used for protein fold classification, protein structural class prediction, sub-cellular localization and protein-protein interaction, among other tasks. Some examples of features extracted are found in Song et al. (2018) <doi:10.3390/app8010089>, Jin & Zhu (2021) <doi:10.1155/2021/8629776>, Lyons et al. (2015) <doi:10.1109/tnb.2015.2457906> and Saini et al. (2015) <doi:10.1016/j.jtbi.2015.05.030>. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.3 |
Imports: | gtools, utils, stats, phonTools |
Suggests: | covr, knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
VignetteBuilder: | knitr |
URL: | https://semran9.github.io/protHMM/ |
BugReports: | https://github.com/semran9/protHMM/issues |
NeedsCompilation: | no |
Packaged: | 2023-07-05 03:26:50 UTC; shayaanemran |
Author: | Shayaan Emran [aut, cre, cph] |
Repository: | CRAN |
Date/Publication: | 2023-07-05 18:20:02 UTC |
IM_psehmm
Description
The first twenty numbers of this feature correspond to the means of each column of the HMM matrix
H
. The rest of the features in the feature vector are found in matrix T[i,j]
, where T[i,j] =
\frac{1}{L-i}\sum_{n = 1}^{20-i} [H_{m,n}-H_{m, n+i}]^2, m = 1:L,\space i = 1:d\space and\space j = 1:20
.
Usage
IM_psehmm(hmm, d = 13)
Arguments
hmm |
The name of a profile hidden markov model file. |
d |
The maximum distance between residues column-wise. |
Value
A vector of length 20+20\times d-d\times\frac{d+1}{2}
Note
d must be less than 20.
References
Ruan, X., Zhou, D., Nie, R., & Guo, Y. (2020). Predictions of Apoptosis Proteins by Integrating Different Features Based on Improving Pseudo-Position-Specific Scoring Matrix. BioMed Research International, 2020, 1–13.
Examples
h<- IM_psehmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
chmm
Description
This feature begins by creating a CHMM, which is created by constructing 4 matrices, A, B, C, D
from
the original HMM H
. A
contains the first 75 percent of the original matrix H
row-wise, B
the
last 75 percent, C
the middle 75 percent and D
the entire original matrix. These are then merged to create the new
CHMM Z
. From there, the Bigrams feature is calculated with a flattened 20 x 20 matrix B
, in which B[i, j] = \sum_{a = 1}^{L-1} Z_{a, i} \times Z_{a+1, j}
.
H
corresponds to the original HMM matrix, and L
is the number of rows in Z
. Local Average Group,
or LAG is then calculated by splitting up the CHMM into 20 groups along the length of the protein sequence and calculating
the sums of each of the columns of each group, making a 1 x 20 vector per group, and a length 20 x 20 vector for all groups. These features are then fused.
Usage
chmm(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A fusion vector of length 800.
A LAG vector of length 400.
A Bigrams vector of length 400.
References
An, J., Zhou, Y., Zhao, Y., & Yan, Z. (2019). An Efficient Feature Extraction Technique Based on Local Coding PSSM and Multifeatures Fusion for Predicting Protein-Protein Interactions. Evolutionary Bioinformatics, 15, 117693431987992.
Examples
h<- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
fp_hmm
Description
This feature consists of two vectors, d, s
. Vector d
corresponds to the sums across the sequence
for each of the 20 amino acid columns. Vector s
corresponds to a flattened matrix S[i, j] = \sum_{k = 1}^{L}
H[k, j] \times \delta[k, i]
in which \delta[k, i] = 1
when A_i = H[k, j]
. A
refers to a list of
all possible amino acids, i, j
span from 1:20
.
Usage
fp_hmm(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A vector of length 20.
A vector of length 400.
References
Zahiri, J., Yaghoubi, O., Mohammad-Noori, M., Ebrahimpour, R., & Masoudi-Nejad, A. (2013). PPIevo: Protein–protein interaction prediction from PSSM based evolutionary information. Genomics, 102(4), 237–242.
Examples
h<- fp_hmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_GA
Description
This feature calculates the Geary autocorrelation of each amino acid type for each distance d less than or equal to the lag value and greater than or equal to 1.
Usage
hmm_GA(hmm, lg = 9)
Arguments
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
Value
A vector of length lg \times
20, by default this is 180.
Note
The lag value must be less than the length of the protein sequence
References
Liang, Y., Liu, S., & Zhang. (2015). Prediction of Protein Structural Class Based on Different Autocorrelation Descriptors of Position–Specific Scoring Matrix. MATCH: Communications in Mathematical and in Computer Chemistry, 73(3), 765–784.
Examples
h<- hmm_GA(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_GSD
Description
This feature initially creates a grouping matrix G
by assigning each position a number 1:3
based on
the value at each position of HMM matrix H
; 1
represents the low probability group, 2
the medium and 3
the high probability group.
The number of total points in each group for each column is then calculated, and the sequence is then split
based upon the the positions of the 1st, 25th, 50th, 75th and 100th percentile (last) points for each of the three groups,
in each of the 20 columns of the grouping matrix. Thus for column j
, S(k, j, z) = \sum_{i = 1}^{(z)*.25*N} |G[i, j] = k|
,
where k
is the group number, z = 1:4
and N
corresponds to number of rows in matrix G
.
Usage
hmm_GSD(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A vector of length 300.
References
Jin, D., & Zhu, P. (2021). Protein Subcellular Localization Based on Evolutionary Information and Segmented Distribution. Mathematical Problems in Engineering, 2021, 1–14.
Examples
h<- hmm_GSD(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_LBP
Description
This feature uses local binary pattern with a neighborhood of radius 1 and 8 sample points to extract features from the HMM. A 256 bin histogram is extracted as a 256 length feature vector.
Usage
hmm_LBP(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A vector of length 256.
References
Li, Y., Li, L., Wang, L., Yu, C., Wang, Z., & You, Z. (2019). An Ensemble Classifier to Predict Protein–Protein Interactions by Combining PSSM-based Evolutionary Information with Local Binary Pattern Model. International Journal of Molecular Sciences, 20(14), 3511.
Examples
h<- hmm_LBP(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_LPC
Description
This feature uses linear predictive coding (LPC) to map each HMM to a 20 \times 14 = 280
dimensional vector,
where for each of the 20 columns of the HMM, LPC is used to extract a 14 dimensional vector D_n
Usage
hmm_LPC(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A vector of length 280.
References
Qin, Y., Zheng, X., Wang, J., Chen, M., & Zhou, C. (2015). Prediction of protein structural class based on Linear Predictive Coding of PSI-BLAST profiles. Central European Journal of Biology, 10(1).
Examples
h<- hmm_LPC(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_MA
Description
This feature calculates the normalized Moran autocorrelation of each amino acid type, for each distance d less than or equal to the lag value and greater than or equal to 1.
Usage
hmm_MA(hmm, lg = 9)
Arguments
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
Value
A vector of length lg \times
20, by default this is 180.
Note
The lag value must be less than the length of the protein sequence
References
Liang, Y., Liu, S., & Zhang. (2015). Prediction of Protein Structural Class Based on Di fferent Autocorrelation Descriptors of Position–Specific Scoring Matrix. MATCH: Communications in Mathematical and in Computer Chemistry, 73(3), 765–784.
Examples
h<- hmm_MA(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_MB
Description
This feature calculates the normalized Moreau-Broto autocorrelation of each amino acid type, for each distance d less than or equal to the lag value and greater than or equal to 1.
Usage
hmm_MB(hmm, lg = 9)
Arguments
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
Value
A vector of length lg \times
20, by default this is 180.
Note
The lag value must be less than the length of the protein sequence
References
Liang, Y., Liu, S., & Zhang. (2015). Prediction of Protein Structural Class Based on Different Autocorrelation Descriptors of Position–Specific Scoring Matrix. MATCH: Communications in Mathematical and in Computer Chemistry, 73(3), 765–784.
Examples
h<- hmm_MB(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_SCSH
Description
This feature returns the 2 and 3-mer compositions of the protein sequence. This is done by first
finding all possible 2 and 3-mers for any protein (20^2
and 20^3
permutations for 2 and 3-mers respectively).
With those permutations, vectors of length 400 and 8000 are created, each point corresponding to one 2 or 3-mer.
Then, the protein sequence that corresponds to the HMM scores is extracted, and put into a bipartite graph with the protein sequence.
Each possible path of length 1 or 2 is found, and the corresponding vertices on the graph are noted as 2 and 3-mers.
For each 2 or 3-mer found from these paths, 1 is added to the position that responds to that 2/3-mer in the
2-mer and 3-mer vectors , which are the length 400 and 8000 vectors created previously. The vectors are then returned.
Usage
hmm_SCSH(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A vector of length 400.
A vector of length 8000.
References
Mohammadi, A. M., Zahiri, J., Mohammadi, S., Khodarahmi, M., & Arab, S. S. (2022). PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biology Methods and Protocols, 7(1).
Examples
h_400<- hmm_SCSH(system.file("extdata", "1DLHA2-7", package="protHMM"))[[1]]
h_8000<- hmm_SCSH(system.file("extdata", "1DLHA2-7", package="protHMM"))[[2]]
hmm_SepDim
Description
This feature calculates the probabilistic expression of amino acid dimers that are spatially separated by a distance l
.
Mathematically, this is done with a 20 x 20 matrix F
, in which F[m, n] = \sum_{i = 1}^{L-l} H_{i, m}H_{i+k, n}
.
H
corresponds to the original HMM matrix, and L
is the number of rows in H
. Matrix F
is then flattened to
a feature vector of length 400, and returned.
Usage
hmm_SepDim(hmm, l = 7)
Arguments
hmm |
The name of a profile hidden markov model file. |
l |
Spatial distance between dimer residues. |
Value
A vector of length 400
References
Saini, H., Raicar, G., Sharma, A., Lal, S. K., Dehzangi, A., Lyons, J., Paliwal, K. K., Imoto, S., & Miyano, S. (2015). Probabilistic expression of spatially varied amino acid dimers into general form of Chou's pseudo amino acid composition for protein fold recognition. Journal of Theoretical Biology, 380, 291–298.
Examples
h<- hmm_SepDim(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_Single_Average
Description
This feature groups together rows that are related to the same amino acid. This is done using a vector
SA(k)
, in which k
spans 1:400
and SA(k) = avg_{i = 1, 2... L}H[i, j] \times \delta(P(i), A(z))
,
in which H
is the HMM matrix, P
in the protein sequence, A
is an ordered set of amino acids,
the variables j, z = 1:20
, the variable k = j + 20 \times (z-1)
when creating the vector,
and \delta()
represents Kronecker's delta.
Usage
hmm_Single_Average(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A vector of length 400.
References
Nanni, L., Lumini, A., & Brahnam, S. (2014). An Empirical Study of Different Approaches for Protein Classification. The Scientific World Journal, 2014, 1–17.
Examples
h<- hmm_Single_Average(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_ac
Description
This feature calculates the covariance between two residues separated by a lag value within the same amino acid emission frequency column along the protein sequence.
Usage
hmm_ac(hmm, lg = 4)
Arguments
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
Value
A vector of length 20 \times
the lag value; by default this is a vector of length 80.
Note
The lag value must be less than the length of the protein sequence
References
Dong, Q., Zhou, S., & Guan, J. (2009). A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics, 25(20), 2655–2662.
Examples
h<- hmm_ac(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_bigrams
Description
This feature is calculated with a 20 x 20 matrix B
, in which B[i, j] = \sum_{a = 1}^{L-1} H_{a, i}H_{a+1, j}
.
H
corresponds to the original HMM matrix, and L
is the number of rows in H
. Matrix B
is then flattened to
a feature vector of length 400, and returned.
Usage
hmm_bigrams(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A vector of length 400
References
Lyons, J., Dehzangi, A., Heffernan, R., Yang, Y., Zhou, Y., Sharma, A., & Paliwal, K. K. (2015). Advancing the Accuracy of Protein Fold Recognition by Utilizing Profiles From Hidden Markov Models. IEEE Transactions on Nanobioscience, 14(7), 761–772.
Examples
h<- hmm_bigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_cc
Description
The feature calculates the covariance between different residues separated along the protein sequences by a lag value across different amino acid emission frequency columns.
Usage
hmm_cc(hmm, lg = 4)
Arguments
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
Value
A vector of length 20 x 19 x the lag value; by default this is a vector of length 1520.
Note
The lag value must less than the length of the amino acid sequence.
References
Dong, Q., Zhou, S., & Guan, J. (2009). A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics, 25(20), 2655–2662.
Examples
h<- hmm_cc(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_distance
Description
This feature calculates the cosine distance matrix between two HMMs A
and B
before dynamic time warp is applied to
the distance matrix calculate the cumulative distance between the HMMs, which acts as a measure of similarity,
The cosine distance matrix D
is found to be D[a_i, b_j] = 1 - \frac{a_ib_j^{T}}{a_ia_i^Tb_jb_j^T}
,
in which a_i
and a_i
refer to row vectors of A
and B
respectively.
This in turn means that D
is of dimensions nrow(A), nrow(b)
. Dynamic time warp then calculates the
cumulative distance by calculating matrix C[i, j] = min(C[i-1, j], C[i, j-1], C[i-1, j-1]) + D[i, j]
,
where C_{i,j}
is 0 when i
or j
are less than 1. The lower rightmost point of the matrix C
is then returned as the cumulative distance between proteins.
Usage
hmm_distance(hmm_1, hmm_2)
Arguments
hmm_1 |
The name of a profile hidden markov model file. |
hmm_2 |
The name of another profile hidden markov model file. |
Value
A double that indicates distance between the two proteins.
References
Lyons, J., Paliwal, K. K., Dehzangi, A., Heffernan, R., Tsunoda, T., & Sharma, A. (2016). Protein fold recognition using HMM–HMM alignment and dynamic programming. Journal of Theoretical Biology, 393, 67–74.
Examples
h<- hmm_distance(system.file("extdata", "1DLHA2-7", package="protHMM"),
system.file("extdata", "1TEN-7", package="protHMM"))
hmm_read
Description
Reads in the amino acid emission frequency columns of a profile hidden markov model matrix and converts each position to frequencies.
Usage
hmm_read(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A 20 x L matrix, in which L is the sequence length.
Examples
h<- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_smooth
Description
This feature smooths the HMM matrix H
by using sliding window of length sw
to incorporate information
from up and downstream residues into each row of the HMM matrix. Each HMM row r_i
is made into the summation
of r_{i-(sw/2)}+... r_i...+r_{i+(sw/2)}
, for i = 1:L
, where L
is the number of rows in H
.
For rows such as the beginning and ending rows, 0
matrices of dimensions sw/2, 20
are appended to the
original matrix H
.
Usage
hmm_smooth(hmm, sw = 7)
Arguments
hmm |
The name of a profile hidden markov model file. |
sw |
The size of the sliding window. |
Value
A matrix of dimensions L \times
20.
References
Fang, C., Noguchi, T., & Yamana, H. (2013). SCPSSMpred: A General Sequence-based Method for Ligand-binding Site Prediction. IPSJ Transactions on Bioinformatics, 6(0), 35–42.
Examples
h<- hmm_smooth(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_svd
Description
This feature uses singular value decomposition (SVD) to reduce the dimensionality of the inputted hidden
markov model matrix. SVD factorizes a matrix C of dimensions i, j
to U[i, r] \times \Sigma[r, r] \times V[r, j]
.
The diagonal values of \Sigma
are known as the singular values of matrix C, and are what are returned with this function.
Usage
hmm_svd(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A vector of length 20.
References
Song, X., Chen, Z., Sun, X., You, Z., Li, L., & Zhao, Y. (2018). An Ensemble Classifier with Random Projection for Predicting Protein–Protein Interactions Using Sequence and Evolutionary Information. Applied Sciences, 8(1), 89.
Examples
h<- hmm_svd(system.file("extdata", "1DLHA2-7", package="protHMM"))
hmm_trigrams
Description
This feature is calculated with a 20 x 20 x 20 block B
, in which B[i, j, k] = \sum_{a = 1}^{L-2} H_{a, i}H_{a+1, j}H_{a+2, k}
.
H
corresponds to the original HMM matrix, and L
is the number of rows in H
. Matrix B
is then flattened to
a feature vector of length 8000, and returned.
Usage
hmm_trigrams(hmm)
Arguments
hmm |
The name of a profile hidden markov model file. |
Value
A vector of length 8000
References
Lyons, J., Dehzangi, A., Heffernan, R., Yang, Y., Zhou, Y., Sharma, A., & Paliwal, K. K. (2015). Advancing the Accuracy of Protein Fold Recognition by Utilizing Profiles From Hidden Markov Models. IEEE Transactions on Nanobioscience, 14(7), 761–772.
Examples
h<- hmm_trigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
pse_hmm
Description
The first twenty numbers of this feature correspond to the means of each column of the HMM matrix
H
. The rest of the features in the feature vector are given by correlation of the ith
most
contiguous values along the chain per each amino acid column, where 0<i<g+1
. This creates a vector of 20 \times
g, and this combines with the first 20 features to create the final feature vector.
Usage
pse_hmm(hmm, g = 15)
Arguments
hmm |
The name of a profile hidden markov model file. |
g |
The contiguous distance between residues. |
Value
A vector of length 20 + g \times 20
, by default this is 320.
Note
g must be less than the length of the protein sequence
References
Chou, K., & Shen, H. (2007). MemType-2L: A Web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochemical and Biophysical Research Communications, 360(2), 339–345.
Examples
h<- pse_hmm(system.file("extdata", "1DLHA2-7", package="protHMM"))